Crawl Clean and Learn is a basic data crawling, data cleaning, and Hadoop project. It has three parts.
In Part 1, data is crawled from the WikiCFP website (http://www.wikicfp.com/cfp/) for conferences on Data Mining, Machine Learning, Databases, and Artificial Intelligence. For each of these categories, the conference acronym, conference name, and conference location are obtained for further processing, and up to 20 listing pages are crawled per category.
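The crawling step can be sketched in Python with only the standard library. This is a minimal sketch, not the project's actual crawler: it assumes the WikiCFP listing pages present each conference as a plain HTML table row whose cells hold the acronym, name, and location, which may not match the site's real markup exactly.

```python
from html.parser import HTMLParser


class CfpTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    Sketch only: assumes conference listings are plain HTML table rows
    (acronym, name, location); the real WikiCFP markup may differ.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row:
            self._row[-1] += data.strip()


def parse_listing(html):
    """Return the table rows of one listing page as lists of cell texts."""
    parser = CfpTableParser()
    parser.feed(html)
    return parser.rows
```

In the full crawler one would fetch each of the up-to-20 pages per category (e.g. with `urllib.request`) and run `parse_listing` on every response; the exact query-string parameters for paging are an implementation detail of the site.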
In Part 2, data is cleaned using the OpenRefine tool and various inconsistencies are removed.
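OpenRefine's cluster-and-edit feature groups near-duplicate values with a "fingerprint" key (trim, lowercase, strip punctuation, tokenize, dedupe, sort). A minimal Python sketch of that same idea, to illustrate the kind of inconsistency the cleaning step removes:

```python
import re


def fingerprint(value):
    """OpenRefine-style fingerprint key for clustering near-duplicates:
    trim, lowercase, drop punctuation, split into tokens, dedupe, sort.
    Two messy spellings of the same location get the same key."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))
```

Values whose fingerprints collide (e.g. "New York, NY" and "new york  NY") are candidates to be merged into one canonical spelling.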
In Part 3, data is analyzed and different statistics are computed to gain insight into the data, using Hadoop MapReduce. In total, four different computations are performed in this part.
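The four computations themselves are described in the report. As an illustration of the MapReduce pattern they follow, here is a hypothetical example (not necessarily one of the project's four jobs): counting conferences per location, written as Hadoop Streaming-style mapper and reducer functions over tab-separated records of acronym, name, and location.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit (location, 1) for each record
    formatted as 'acronym<TAB>name<TAB>location'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield fields[2], 1


def reducer(pairs):
    """Reduce phase: sum the counts for each location.
    Expects pairs sorted by key, as Hadoop delivers them."""
    for location, group in groupby(pairs, key=lambda kv: kv[0]):
        yield location, sum(count for _, count in group)
```

Under Hadoop Streaming the two functions would run as separate scripts reading stdin and writing stdout, with the framework handling the sort-and-shuffle between them; locally they can be chained with an explicit `sorted()` for testing.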
A report is also included with the project code. More details can be found there.
The source code does not come with any kind of warranty or support, though you can contact me if you need help.
Sharmistha Bardhan