Crawl Clean and Learn

Description

Crawl Clean and Learn is a basic data mining project covering data crawling, data cleaning, and Hadoop-based analysis. It has three parts.

In Part 1, data is crawled from the WikiCFP website (http://www.wikicfp.com/cfp/) for conferences on Data Mining, Machine Learning, Databases, and Artificial Intelligence. For each of these categories, the conference acronym, conference name, and conference location are collected for further processing, and up to 20 listing pages per category are crawled.
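The extraction step above can be sketched as follows. This is a hypothetical illustration only: the sample HTML fragment and the regular expressions are assumptions about WikiCFP-style listing markup, not the project's actual crawler code, and a real run would fetch pages 1 through 20 per category over HTTP before parsing.

```python
import re

# Simplified stand-in for one entry of a WikiCFP category listing page;
# the real page structure may differ from this assumed layout.
SAMPLE_ROW = '''
<tr><td><a href="/cfp/servlet/event.showcfp?eventid=1">KDD 2024</a></td></tr>
<tr><td>30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</td>
<td>Barcelona, Spain</td></tr>
'''

def parse_listing(html):
    """Extract (acronym, name, location) triples from a listing fragment."""
    acronyms = re.findall(r'event\.showcfp\?eventid=\d+">([^<]+)</a>', html)
    details = re.findall(r'<td>([^<]+)</td>\s*<td>([^<]+)</td>', html)
    return [(a, n, loc) for a, (n, loc) in zip(acronyms, details)]

records = parse_listing(SAMPLE_ROW)
```

Each triple can then be written out as one row of the dataset that Parts 2 and 3 operate on.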

In Part 2, the data is cleaned with the OpenRefine tool and various inconsistencies are removed.
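The project performs this step interactively in OpenRefine, so the transforms below (trimming, whitespace collapsing, case normalization) are only assumed examples of the kind of inconsistencies such a cleaning pass removes, not the project's actual recipe.

```python
import re

def clean_location(raw):
    """Normalize a raw location string: trim, collapse runs of
    whitespace, and title-case, so case/spacing variants merge."""
    value = re.sub(r'\s+', ' ', raw.strip())
    return value.title()

# Three spellings of the same location collapse to a single value.
rows = ['  new york,  usa ', 'New York, Usa', 'NEW YORK,  USA']
cleaned = {clean_location(r) for r in rows}
```

Deduplicating the cleaned values mirrors OpenRefine's clustering feature, which groups near-identical cell values for merging.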

In Part 3, the data is analyzed and different statistics are computed to gain insight into the data, using Hadoop MapReduce. Four different computations are performed in this part.
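The four actual computations are described in the project report; as one assumed example in Hadoop Streaming style, the sketch below counts conferences per location, simulating the shuffle-and-sort phase locally with a sort plus groupby.

```python
import itertools

def mapper(line):
    """Map phase: emit (location, 1) for one 'acronym\\tname\\tlocation' line."""
    acronym, name, location = line.rstrip('\n').split('\t')
    return (location, 1)

def reducer(key, values):
    """Reduce phase: sum all counts seen for one location key."""
    return (key, sum(values))

lines = ['KDD\tKnowledge Discovery\tUSA',
         'ICML\tMachine Learning\tAustria',
         'VLDB\tVery Large Databases\tUSA']

# Sorting the mapper output stands in for Hadoop's shuffle-and-sort,
# which delivers all values for one key to the same reducer.
pairs = sorted(mapper(l) for l in lines)
counts = dict(reducer(k, (v for _, v in group))
              for k, group in itertools.groupby(pairs, key=lambda kv: kv[0]))
```

In a real Hadoop Streaming job, `mapper` and `reducer` would run as separate scripts reading stdin and writing tab-separated key/value pairs to stdout.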

A report is also included with the project code. More details can be found there.

Disclaimer

The source code comes without any kind of warranty or support, though you can contact me if needed.

Copyright

Sharmistha Bardhan
