ETL pipeline πŸͺˆ for scraping, transforming, and loading YTS movie data 🎞️ into PostgreSQL πŸ›’οΈ Container using Docker🐳


YTS MovieData ETL Pipeline

Built with: Python, PostgreSQL, Docker, Pandas, BeautifulSoup

Overview

The MovieData ETL Pipeline extracts, transforms, and loads movie data from the YTS website into a PostgreSQL database. The project is built with scalability and efficiency in mind, using Docker for containerization, Python for scripting, and PostgreSQL for storage.

Project Structure

MovieData-ETL
β”œβ”€β”€ elt
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ elt_pipeline.py
β”‚   └── requirements.txt
β”œβ”€β”€ source_db
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ create_movies.sql
β”‚   └── insert_movies.sql
β”œβ”€β”€ worker.py
β”œβ”€β”€ output.csv
β”œβ”€β”€ output.xlsx
β”œβ”€β”€ report.md
β”œβ”€β”€ yts.txt
β”œβ”€β”€ docker-compose.yaml
└── README.md

elt Directory

  • Dockerfile: Defines the environment for running the ETL script.
  • elt_pipeline.py: Contains the ETL logic to load data into PostgreSQL.
  • requirements.txt: Lists the Python dependencies.

source_db Directory

  • Dockerfile: Defines the environment for the PostgreSQL database.
  • create_movies.sql: SQL script to create the movies table.
  • insert_movies.sql: SQL script to insert movie data.

Root Directory

  • worker.py: Script to scrape movie data from the web and generate SQL insert statements.
  • output.csv: CSV file containing scraped movie data.
  • output.xlsx: Excel file containing scraped movie data.
  • report.md: Markdown file for logging the ETL process.
  • yts.txt: Raw HTML content of the scraped web page.
  • docker-compose.yaml: Docker Compose configuration to orchestrate the ETL pipeline.

Prerequisites

Ensure you have the following installed:

  • Docker and Docker Compose
  • Python 3 (only needed to run worker.py outside Docker)

Setup and Installation

  1. Clone the Repository:

    git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres
    cd MovieData-ETL
  2. Build and Run the Docker Containers:

    docker-compose up --build

    This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.

Usage

  1. Scrape Movie Data:

    The worker.py script scrapes the latest movie data from YTS and generates SQL insert statements.

    python worker.py

    This will generate output.csv, output.xlsx, and insert_movies.sql files with the scraped movie data.

  2. Run the ETL Pipeline:

    The elt_pipeline.py script will wait for the PostgreSQL database to be ready, then load the movie data from the SQL script into the database.

    docker-compose up elt_script

ETL Process Details

Extract

The extraction step involves scraping movie data from the YTS website using the worker.py script. BeautifulSoup is used to parse the HTML and extract relevant movie information.
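
As a sketch of that extraction step, the parsing can be factored into a pure function so it can be exercised against saved HTML such as yts.txt. The CSS class names below are assumptions about the YTS listing markup, not necessarily the selectors worker.py uses.

```python
# Hypothetical extraction sketch; selectors are assumed, not taken from worker.py.
import requests
from bs4 import BeautifulSoup


def fetch_page(url: str) -> str:
    """Fetch the raw HTML of a YTS listing page."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


def parse_movies(html: str) -> list[dict]:
    """Extract title/year pairs from the listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for wrap in soup.select("div.browse-movie-wrap"):
        title = wrap.select_one("a.browse-movie-title")
        year = wrap.select_one("div.browse-movie-year")
        movies.append({
            "title": title.get_text(strip=True) if title else None,
            "year": year.get_text(strip=True) if year else None,
        })
    return movies
```

Keeping parse_movies separate from the network call means the same code can parse the cached yts.txt during development.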

Transform

The extracted data is transformed into a structured format using Pandas. The data is saved into CSV and Excel files for reference and further processing.
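
A minimal sketch of that transform step, assuming illustrative column names (title, year, rating) rather than worker.py's exact schema:

```python
# Hypothetical transform sketch; column names are assumptions.
import pandas as pd


def to_dataframe(movies: list[dict]) -> pd.DataFrame:
    """Normalize scraped records into a typed DataFrame."""
    df = pd.DataFrame(movies, columns=["title", "year", "rating"])
    # coerce malformed values to NaN instead of raising
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    return df.dropna(subset=["title"])


def save_outputs(df: pd.DataFrame) -> None:
    """Write the reference files produced by the pipeline."""
    df.to_csv("output.csv", index=False)
    # to_excel needs an Excel writer such as openpyxl installed
    df.to_excel("output.xlsx", index=False)
```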

Load

The transformed data is loaded into the PostgreSQL database. The elt_pipeline.py script uses the psycopg2 library to connect to the database and load the data using SQL insert statements generated by the worker.py script.
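
A hedged sketch of that load step: the connection settings mirror docker-compose.yaml as shown under Configuration, but the actual elt_pipeline.py may differ in detail. psycopg2 is imported lazily so the file-parsing helper can run without the driver installed.

```python
# Hypothetical load sketch; connection values are taken from docker-compose.yaml.
def load_sql_script(path: str) -> list[str]:
    """Read the generated SQL file and split it into executable statements."""
    with open(path) as f:
        script = f.read()
    # a naive split on ';' works for simple generated INSERTs but would
    # break on titles that themselves contain semicolons
    return [s.strip() for s in script.split(";") if s.strip()]


def run_load(path: str) -> None:
    """Execute every statement from the SQL script inside one transaction."""
    import psycopg2  # imported lazily; only needed when actually loading

    conn = psycopg2.connect(
        host="movies_postgres", dbname="movies_db",
        user="postgres", password="secret",
    )
    try:
        with conn, conn.cursor() as cur:
            for stmt in load_sql_script(path):
                cur.execute(stmt)
    finally:
        conn.close()
```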

Script Breakdown

worker.py

  • Step 1: Sends a GET request to the specified URL to fetch movie data.
  • Step 2: Parses the HTML content using BeautifulSoup.
  • Step 3: Extracts movie details and stores them in a list.
  • Step 4: Creates a Pandas DataFrame from the extracted data.
  • Step 5: Saves the DataFrame to CSV and Excel files.
  • Step 6: Generates SQL insert statements and saves them to a .sql file.
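
Step 6 can be sketched as follows; the movies(title, year) schema here is an assumption standing in for whatever create_movies.sql actually defines:

```python
# Hypothetical sketch of generating insert_movies.sql; schema is assumed.
def to_insert_statements(rows: list[dict]) -> str:
    """Render one INSERT per scraped movie, doubling single quotes for SQL."""
    stmts = []
    for row in rows:
        title = str(row["title"]).replace("'", "''")  # SQL-escape quotes
        stmts.append(
            f"INSERT INTO movies (title, year) VALUES ('{title}', {int(row['year'])});"
        )
    return "\n".join(stmts)


def write_sql_file(rows: list[dict], path: str = "insert_movies.sql") -> None:
    """Save the statements to the file the load step consumes."""
    with open(path, "w") as f:
        f.write(to_insert_statements(rows) + "\n")
```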

elt_pipeline.py

  • Database Connection: Connects to the PostgreSQL database using psycopg2.
  • Data Loading: Loads the movie data from the SQL script into the database.
  • Retry Logic: Implements retry logic to handle transient connection issues.
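
The retry logic can be sketched as below; the attempt count and delay are assumptions, not the values elt_pipeline.py actually uses:

```python
# Hypothetical retry sketch; attempts/delay values are assumptions.
import time


def wait_for(connect, attempts: int = 10, delay: float = 3.0):
    """Call `connect` until it succeeds or the attempts are exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:  # psycopg2.OperationalError in practice
            last_error = exc
            time.sleep(delay)
    raise RuntimeError("database never became ready") from last_error
```

Passed a zero-argument callable such as `lambda: psycopg2.connect(...)`, this keeps the ETL container from crashing while the database container is still starting.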

Configuration

The PostgreSQL database configuration and credentials are set in the docker-compose.yaml file and passed as environment variables to the Docker containers. Ensure the following values are correctly set:

services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
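
On the Python side, the pipeline can read these variables with os.environ. The POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD names match the compose file above; POSTGRES_HOST and the fallback defaults are assumptions for illustration.

```python
# Hypothetical config sketch; POSTGRES_HOST and the defaults are assumed.
import os


def db_config() -> dict:
    """Assemble psycopg2 connection settings from the environment."""
    return {
        "host": os.environ.get("POSTGRES_HOST", "movies_postgres"),
        "dbname": os.environ.get("POSTGRES_DB", "movies_db"),
        "user": os.environ.get("POSTGRES_USER", "postgres"),
        "password": os.environ.get("POSTGRES_PASSWORD", "secret"),
    }
```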

Future Improvements

  • Automated Scheduling: Integrate a scheduler like cron or Airflow to automate the ETL process at regular intervals.
  • Data Validation: Implement data validation and cleaning steps to ensure data quality.
  • Enhanced Logging: Add more detailed logging to track the ETL process and handle errors more effectively.

Contributing

Contributions are welcome! Please read the contributing guidelines for more details on how to get started.

License

This project is licensed under the MIT License. See the LICENSE file for details.


For any inquiries or support, please open an issue on our GitHub repository.
