An end-to-end data collection and analysis pipeline for Docker Hub.

cshubhamrao/docker-hub-data

Software Engineering Trends on Docker Hub

An end-to-end framework that helps companies predict software engineering trends and helps developers learn more about a Docker image.

Our goal is to provide different companies with a dynamic dataset through which meaningful inferences can be made.

We gather data from Docker Hub and analyse the trends in it. Docker Hub is a cloud-based repository in which Docker users and partners create, test, store and distribute container images.

This project was developed as part of coursework for Data-X at Berkeley.

Link to supporting presentation

Requirements

We use Conda to manage the environment and packages.

We use the following packages (among many others):

  • Python 3.6 or above
  • Pandas
  • Matplotlib
  • Plotly
  • Seaborn
  • boto3

To fetch new .json files from the AWS S3 bucket (this requires the AWS CLI):

cd data/
aws s3 sync s3://docker-recent recent-data
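
Once the sync has finished, the downloaded files can be read with the Python standard library. The following is a minimal sketch, assuming each file under recent-data is a standalone JSON document; the function name load_json_records and the choice to skip unreadable files are illustrative assumptions and are not taken from this repository's scripts.

```python
import json
from pathlib import Path


def load_json_records(directory):
    """Parse every .json file under `directory` (searched recursively).

    Returns a dict mapping each file's path, relative to `directory`,
    to its parsed JSON content. Hypothetical helper, for illustration only.
    """
    root = Path(directory)
    records = {}
    for path in sorted(root.rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                records[path.relative_to(root).as_posix()] = json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Assumption: files that are not valid JSON (for example a
            # partial download) are skipped rather than treated as fatal.
            continue
    return records
```

Calling load_json_records("recent-data") from inside data/ returns one entry per readable file. The structure of each entry depends on what the scraper stored, so inspect a few records before building an analysis on them.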

Installation

1. Download this Repository

Start by downloading or cloning this repository.

git clone https://github.com/cshubhamrao/docker-hub-data.git
cd docker-hub-data

2. Create and Activate Environment

Create the conda environment from the environment.yml file:

conda env create -f environment.yml

Then activate the environment:

conda activate docker-hub
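
After activating the environment, a quick check can confirm that the packages listed under Requirements are visible to the interpreter. This is a sketch only: the helper names missing_packages and python_is_supported are hypothetical, and the import names are assumed to match the package names listed above.

```python
import sys
from importlib.util import find_spec

# Import names assumed for the packages listed under Requirements.
REQUIRED_PACKAGES = ["pandas", "matplotlib", "plotly", "seaborn", "boto3"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec looks a top-level module up without importing it.
    return [name for name in names if find_spec(name) is None]


def python_is_supported(minimum=(3, 6)):
    """Check the running interpreter against the minimum Python version."""
    return sys.version_info >= minimum
```

An empty list from missing_packages(REQUIRED_PACKAGES), together with python_is_supported() returning True, suggests the environment was created correctly. It does not verify package versions.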

3. Run Jupyter Lab

jupyter lab

Contents

  1. Data - Contains all data-related files and folders that are generated or stored for later use. The plots generated by analytics.ipynb and other scripts are also saved here.
  2. Misc - Contains the miscellaneous scripts required for this project.
  3. Scripts - The main folder. Contains the scripts we used to scrape the data, clean it, select the data required for analysis, and finally analyse it and derive inferences.

Team Members

System Architecture:

[Architecture diagram]
