
This is a generic website crawler created by ATHENA R.C.

Given a website, it collects all HTML data from that domain.
The crawler operates in a breadth-first search (BFS) manner and stops after a specified number of crawled pages.
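
The breadth-first strategy with a page limit can be sketched as follows. This is a minimal illustration, not the code in crawler.py: the names `LinkParser` and `bfs_crawl` and the injectable `fetch` callable are assumptions made for the example, and it assumes "the domain" means pages sharing the start URL's host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def bfs_crawl(start_url, fetch, max_pages):
    """Visit pages on start_url's host in breadth-first order.

    fetch is an assumed callable: url -> HTML string, or None on failure.
    Returns a dict mapping each crawled URL to its HTML.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}           # URLs already queued, to avoid revisits
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing a dictionary lookup as `fetch` (for example `site.get` on a dict of URL to HTML) lets the traversal order be checked without network access; a real run would substitute a function that downloads the page.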

In order to run the crawler:

python3.6 crawler.py --inpath=data2crawl.json --out_dir=./output/ --max_pages_to_visit=1000

The data will be collected in files under the directory "output".
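
For orientation, the three flags in the command above could be parsed along these lines. This is a sketch only: the flag names come from the command, but the function name `parse_args`, the help texts, and the default values are assumptions, and the actual crawler.py may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the run command (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Generic website crawler")
    parser.add_argument("--inpath", required=True,
                        help="JSON file describing the websites to crawl")
    # Defaults below are assumptions, not taken from the repository.
    parser.add_argument("--out_dir", default="./output/",
                        help="directory where collected data is written")
    parser.add_argument("--max_pages_to_visit", type=int, default=1000,
                        help="stop after this many crawled pages")
    return parser.parse_args(argv)
```

Calling `parse_args(["--inpath=data2crawl.json", "--max_pages_to_visit=1000"])` yields a namespace whose `max_pages_to_visit` attribute is the integer 1000.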


IntelCompH2020/GenericWebPageCrawler
