Forked from Web-Crawler. Visit the above link for detailed information about the project.
- Downloading files with libcURL: One of the bugs in the previous project was the crawler stalling near the end of its execution. In some cases `https` requests would not get any response and would wait forever (a very long time). Because of the use of OpenSSL, it was not possible to implement a timeout. By using libcURL, we can add a timeout feature. Hurray! Now the crawler won't get stuck. (A minimal sketch of setting such a timeout follows this list.)
- Saving and resuming working state: Once the crawler ran and finished its execution, it dumped its data into output files. I have added a saving feature: the crawler's state now gets saved in a `.zip` file, and at initialization time it recovers that state.
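The sketch below is not the project's actual code; it only illustrates how a hard timeout can be attached to a libcurl transfer so that a stalled request cannot block the crawler. The URL and the timeout values are placeholders.

```c
/* Minimal sketch: fetch one page with libcurl under a hard timeout.
 * Build with: gcc fetch.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

/* Discard the response body; a real crawler would store it for parsing. */
static size_t discard(void *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr; (void)userdata;
    return size * nmemb;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/"); /* placeholder URL */
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, 10L); /* seconds to establish the connection */
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L);        /* seconds for the whole transfer */
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```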
Install the libcURL development package before building:

```sh
sudo apt install libcurl4-openssl-dev
```
- Use `make` to compile and run makDa; it will run with its default arguments.
- Supported arguments (an illustrative invocation follows this list):
  - `maxlinks`: maximum links extracted from a single website
  - `pagelimit`: maximum pages to process
  - `threads`: maximum concurrent threads to create
  - `maxfilesize`: maximum file size to download
  - `timeout`: maximum time to wait for a website response
  - `restore_data`: restore the crawler state from a previous run
  - `save_data`: save the current crawler state for the next run
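For illustration only, a run might be configured as shown below. The binary name, the `key=value` syntax, and every number are placeholders; the real syntax depends on makDa's argument parser.

```sh
# Build and run with default arguments.
make

# Hypothetical invocation with explicit arguments; the actual syntax may differ.
./crawler maxlinks=100 pagelimit=1000 threads=8 maxfilesize=5000000 \
          timeout=30 restore_data=1 save_data=1
```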