IntelCompH2020/GenericWebPageCrawler

This is a generic website crawler created by ATHENA R.C.

Given a website, it collects all HTML data from that domain.
The crawler operates in a breadth-first-search manner and stops after a specified number of crawled pages.
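
The breadth-first traversal with a page limit can be sketched as below. This is an illustrative sketch only, not the code in crawler.py: the function name `crawl_bfs` and the `get_links` callable (which would return the absolute URLs found on a page) are hypothetical, and the same-domain check by exact host match is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_bfs(start_url, get_links, max_pages_to_visit):
    """Visit pages breadth-first, staying on the start URL's domain.

    get_links is a hypothetical callable: URL -> list of absolute URLs
    found on that page. It stands in for the real download-and-parse step.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}           # URLs already queued, to avoid revisits
    visited = []                 # URLs in the order they were crawled

    while queue and len(visited) < max_pages_to_visit:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Assumption: "same domain" means an identical host name.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the link extractor as a parameter keeps the traversal logic separate from networking, so the stop condition can be checked against a small in-memory link graph.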

In order to run the crawler:

python3.6 crawler.py --inpath=data2crawl.json --out_dir=./output/ --max_pages_to_visit=1000
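
The three flags in the command above could be parsed along these lines. This is a sketch under assumptions: only the flag names come from the command; the help texts, the defaults, and the `parse_args` helper are hypothetical and may differ from what crawler.py actually does.

```python
import argparse


def parse_args(argv=None):
    """Parse the crawler's command-line flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Generic website crawler")
    # Flag names are taken from the example command; everything else is assumed.
    parser.add_argument("--inpath", required=True,
                        help="JSON file describing what to crawl")
    parser.add_argument("--out_dir", default="./output/",
                        help="directory where collected data is written")
    parser.add_argument("--max_pages_to_visit", type=int, default=1000,
                        help="stop after this many crawled pages")
    return parser.parse_args(argv)
```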

The collected data is written to files under the "output" directory.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004870. H2020-SC6-GOVERNANCE-2018-2019-2020 / H2020-SC6-GOVERNANCE-2020
