Sitemap Crawler is a web service that fetches the sitemap of a given domain. It's built using Express.js, Cheerio, and React.
- Fetches and displays the sitemap of a specified domain
- Customizable crawl depth
- Option to include parent paths
- Web-based user interface
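The crawl-depth option can be thought of as a breadth-first traversal capped at N levels from the starting URL. A minimal sketch in plain Node.js of what that looks like (the project itself uses Express.js and Cheerio; the `crawl` function and in-memory link map below are illustrative assumptions, not the project's actual code):

```javascript
// Illustrative sketch: breadth-first crawl limited by depth.
// `fetchLinks` stands in for real page fetching + link extraction.
function crawl(startUrl, maxDepth, fetchLinks) {
  const visited = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next = [];
    for (const url of frontier) {
      for (const link of fetchLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // only the newly discovered URLs form the next level
  }
  return [...visited];
}

// Example with an in-memory site graph instead of HTTP requests.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a/1'],
  '/b': [],
  '/a/1': [],
};
const links = (url) => site[url] || [];
console.log(crawl('/', 1, links)); // [ '/', '/a', '/b' ]
console.log(crawl('/', 2, links)); // [ '/', '/a', '/b', '/a/1' ]
```

The `visited` set both deduplicates URLs and prevents infinite loops on cyclic link graphs.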
- Node.js (LTS version recommended)
- npm (Node Package Manager)
- Docker (optional, for running the containerized version)
- Clone the repository:

  git clone https://github.com/athrvk/sitemap-crawler-redhat.git
  cd sitemap-crawler-redhat
- Install dependencies:

  npm install
You can also run Sitemap Crawler using Docker. Pull the image from Docker Hub:
docker pull athrvk/sitemap-crawler:latest
- Start the server:

  npm start
- Open your browser and navigate to http://localhost:3000
Run the Docker container:
docker run --pull=always -p 3000:3000 athrvk/sitemap-crawler:latest
Then access the application at http://localhost:3000
You can test the application using the demo website: https://demo.cyotek.com/
- The "include parent paths" checkbox logic is not a perfect implementation; it has issues when the root path involves redirects.
- The depth condition needs improvement. If only one link is returned at any depth level, enable the "include parent paths" checkbox.
- Parallel processing is not yet implemented, which may lead to longer processing times for larger sitemaps.
- Crawling depths greater than 3 on popular websites may take an extended amount of time (3+ minutes).
- This project was developed on Linux and has not been tested on Windows.
- Contributions to address known issues or add new features are welcome.
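The missing parallelism noted above could be approached with batched concurrent fetches. A hedged sketch, assuming a stand-in `fetchPage` function and an arbitrary batch size (neither is part of the project's code):

```javascript
// Sketch: process URLs in batches of `limit` concurrent requests.
// Each batch runs in parallel via Promise.all; batches run sequentially,
// which caps the number of simultaneous open connections.
async function mapWithConcurrency(urls, limit, fetchPage) {
  const results = [];
  for (let i = 0; i < urls.length; i += limit) {
    const batch = urls.slice(i, i + limit);
    results.push(...(await Promise.all(batch.map(fetchPage))));
  }
  return results;
}

// Usage with a fake fetcher that just echoes the URL.
const fakeFetch = async (url) => `fetched ${url}`;
mapWithConcurrency(['/a', '/b', '/c'], 2, fakeFetch).then((out) =>
  console.log(out) // [ 'fetched /a', 'fetched /b', 'fetched /c' ]
);
```

Results stay in input order because each batch's promises are awaited together and spread into the array in sequence.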