The aim of this project was to manage, analyse and enrich a large amount of data stored on Hadoop and processed through parallelized procedures (using Dask). We started by scraping the entire book catalog from Feltrinelli, then enriched it with the open-source OpenLibrary dataset and with the Mondadori and Hoepli book catalogs. Due to their large size, the analyzed datasets are not included in the repository: you can download the OpenLibrary dataset directly from its site, while the scripts must be run to collect the data for the books in the Feltrinelli, Mondadori and Hoepli catalogs.
Summary report of the work performed: LINK
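To illustrate the kind of parallelized processing the project relies on, here is a minimal Dask sketch: it splits a large catalog file into partitions and aggregates them in parallel. The file path and column names (`publisher`, `title`) are hypothetical placeholders, not the project's actual schema.

```python
# Minimal sketch of chunked, parallel processing with Dask.
# Path and column names below are illustrative assumptions.
import dask.dataframe as dd

# Dask splits the file into partitions and processes them in parallel.
books = dd.read_csv(
    "path/to/books_catalog.csv",  # hypothetical input file
    blocksize="64MB",             # size of each partition
    dtype=str,
)

# Lazy aggregation, only materialized by .compute().
counts = books.groupby("publisher")["title"].count().compute()
print(counts.head())
```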
First, you need to install the pipenv library:

```
$ pip install pipenv
```
then go to the main directory of the project:

```
$ cd path/data-management-project-main
```
and install the virtual environment with all dependencies:

```
$ pipenv install
```
or:
```
$ pipenv install -r path/to/requirements.txt
```
Next, activate the Pipenv shell:

```
$ pipenv shell
```
and run the `main.py` script:

```
$ python main.py
```
The central part of the project is the `main.py` script, in which you can set which actions the program should perform.
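For orientation, here is a hedged sketch of what such an action switch might look like; the boolean flags and stage functions are hypothetical illustrations, not the actual contents of `main.py`.

```python
# Hypothetical sketch of how main.py can toggle pipeline stages;
# flag and function names are illustrative, not the repository's code.

RUN_SCRAPING = True      # scrape the Feltrinelli/Mondadori/Hoepli catalogs
RUN_ENRICHMENT = True    # enrich the scraped data with OpenLibrary
RUN_ANALYSIS = False     # run the Dask-based analysis

def scrape_catalogs():
    print("scraping catalogs...")   # placeholder for the real stage

def enrich_books():
    print("enriching books...")     # placeholder for the real stage

def analyse_books():
    print("analysing books...")     # placeholder for the real stage

if __name__ == "__main__":
    if RUN_SCRAPING:
        scrape_catalogs()
    if RUN_ENRICHMENT:
        enrich_books()
    if RUN_ANALYSIS:
        analyse_books()
```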
All of the project's configuration (input and output file paths, chunk size, etc.) is located in the `config.py` script. These settings can be edited directly, without making any other changes to the program.
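As a rough illustration, a configuration module of this kind might look like the sketch below; the variable names and values are assumptions, not the contents of the real `config.py`.

```python
# Hypothetical sketch of config.py-style settings; names and values
# are illustrative assumptions, not the project's actual configuration.

INPUT_PATH = "data/input/"     # where the raw catalog files are read from
OUTPUT_PATH = "data/output/"   # where the enriched results are written
CHUNKSIZE = 100_000            # rows processed per chunk for large files
```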
- Afify Andrea | @AndreaAfify
- Mingolla Daniele | @danielemingolla