The aim of this project was to manage, analyse and enrich a large amount of data stored on Hadoop and processed through parallelized procedures (using Dask). We started by scraping the entire book catalog from Feltrinelli, then enriched it with the open-source OpenLibrary dataset and with the Mondadori and Hoepli book catalogs. Due to their large size, the analyzed datasets are not included in the repository: you can download the OpenLibrary dataset directly from its site, while the scripts must be run to collect the data for the books in the Feltrinelli, Mondadori and Hoepli catalogs.
Summary report of the work performed: LINK
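To illustrate the kind of parallelized processing the project relies on, here is a minimal Dask sketch: it splits a large catalog file into partitions and aggregates them in parallel. The file path and column names (`publisher`, `title`) are hypothetical placeholders, not the project's actual schema.

```python
# Minimal sketch of chunked, parallel processing with Dask.
# Path and column names below are illustrative assumptions.
import dask.dataframe as dd

# Dask splits the file into partitions and processes them in parallel.
books = dd.read_csv(
    "path/to/books_catalog.csv",  # hypothetical input file
    blocksize="64MB",             # size of each partition
    dtype=str,
)

# Lazy aggregation, only materialized by .compute().
counts = books.groupby("publisher")["title"].count().compute()
print(counts.head())
```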
First, you need to install the pipenv library:

```
$ pip install pipenv
```
then go to the main directory of the project:

```
$ cd path/data-management-project-main
```
and install the virtual environment with all dependencies:

```
$ pipenv install
```
or:
```
$ pipenv install -r path/to/requirements.txt
```
Next, activate the Pipenv shell:

```
$ pipenv shell
```
and run the `main.py` script:

```
$ python main.py
```
The central part of the project is the `main.py` script, in which you can set which actions the program should perform.
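For orientation, here is a hedged sketch of what such an action switch might look like; the boolean flags and stage functions are hypothetical illustrations, not the actual contents of `main.py`.

```python
# Hypothetical sketch of how main.py can toggle pipeline stages;
# flag and function names are illustrative, not the repository's code.

RUN_SCRAPING = True      # scrape the Feltrinelli/Mondadori/Hoepli catalogs
RUN_ENRICHMENT = True    # enrich the scraped data with OpenLibrary
RUN_ANALYSIS = False     # run the Dask-based analysis

def scrape_catalogs():
    print("scraping catalogs...")   # placeholder for the real stage

def enrich_books():
    print("enriching books...")     # placeholder for the real stage

def analyse_books():
    print("analysing books...")     # placeholder for the real stage

if __name__ == "__main__":
    if RUN_SCRAPING:
        scrape_catalogs()
    if RUN_ENRICHMENT:
        enrich_books()
    if RUN_ANALYSIS:
        analyse_books()
```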
All of the project's configuration (input and output file paths, chunk size, etc.) is located in the `config.py` script. These settings can be edited directly, without making any other changes to the program.
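As a rough illustration, a configuration module of this kind might look like the sketch below; the variable names and values are assumptions, not the contents of the real `config.py`.

```python
# Hypothetical sketch of config.py-style settings; names and values
# are illustrative assumptions, not the project's actual configuration.

INPUT_PATH = "data/input/"     # where the raw catalog files are read from
OUTPUT_PATH = "data/output/"   # where the enriched results are written
CHUNKSIZE = 100_000            # rows processed per chunk for large files
```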
- Afify Andrea | @AndreaAfify
- Mingolla Daniele | @danielemingolla