This is a generic website crawler created by ATHENA R.C.

Given a website, it collects all HTML data from that domain.
The crawler operates in a breadth-first search (BFS) manner and stops after a specified number of pages has been crawled.
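
The breadth-first strategy described above can be sketched roughly as follows. This is a minimal illustration and not the code in crawler.py: the names `crawl_bfs`, `LinkExtractor`, and the injected `fetch` callable are hypothetical, and the real crawler may handle link extraction, domain matching, and errors differently.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_bfs(start_url, fetch, max_pages_to_visit):
    """Breadth-first crawl restricted to the start URL's domain.

    `fetch` is a hypothetical callable mapping a URL to its HTML text
    (or None on failure); it is injected so the sketch runs offline.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    seen = {start_url}          # URLs already queued, to avoid revisits
    pages = {}                  # URL -> HTML, in visit order

    while queue and len(pages) < max_pages_to_visit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because the queue is first-in, first-out, every page linked from the start page is visited before any page two links away, and the page limit bounds the crawl even on very large sites.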

In order to run the crawler:

python3.6 crawler.py --inpath=data2crawl.json --out_dir=./output/ --max_pages_to_visit=1000

The collected data will be written to files under the "output" directory.
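
For reference, the three flags in the command above could be parsed along these lines. This is only a sketch: the flag names come from the command shown, but the help texts, the defaults, and the `parse_args` helper are assumptions, and crawler.py may define its options differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used in the example invocation."""
    parser = argparse.ArgumentParser(description="Generic website crawler")
    # Help texts and defaults below are assumptions, not taken from crawler.py.
    parser.add_argument("--inpath", required=True,
                        help="JSON file describing what to crawl")
    parser.add_argument("--out_dir", default="./output/",
                        help="directory where collected data is written")
    parser.add_argument("--max_pages_to_visit", type=int, default=1000,
                        help="stop after this many crawled pages")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.inpath, args.out_dir, args.max_pages_to_visit)
```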

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004870. H2020-SC6-GOVERNANCE-2018-2019-2020 / H2020-SC6-GOVERNANCE-2020