The MovieData ETL Pipeline project is a comprehensive solution for extracting, transforming, and loading movie data from web sources into a PostgreSQL database. This project is designed with scalability and efficiency in mind, leveraging Docker for containerization, Python for scripting, and PostgreSQL for data storage.
## Table of Contents

- Project Structure
- Prerequisites
- Setup and Installation
- Usage
- ETL Process Details
- Configuration
- Future Improvements
- Contributing
- License
## Project Structure

```
MovieData-ETL
├── elt
│   ├── Dockerfile
│   ├── elt_pipeline.py
│   └── requirements.txt
├── source_db
│   ├── Dockerfile
│   ├── create_movies.sql
│   └── insert_movies.sql
├── worker.py
├── output.csv
├── output.xlsx
├── report.md
├── yts.txt
├── docker-compose.yaml
└── README.md
```
- `elt/Dockerfile`: Defines the environment for running the ETL script.
- `elt/elt_pipeline.py`: Contains the ETL logic that loads data into PostgreSQL.
- `elt/requirements.txt`: Lists the Python dependencies.
- `source_db/Dockerfile`: Defines the environment for the PostgreSQL database.
- `source_db/create_movies.sql`: SQL script that creates the movies table (a sketch of a plausible schema follows this list).
- `source_db/insert_movies.sql`: SQL script that inserts the movie data.
- `worker.py`: Scrapes movie data from the web and generates SQL insert statements.
- `output.csv`: CSV file containing the scraped movie data.
- `output.xlsx`: Excel file containing the scraped movie data.
- `report.md`: Markdown file for logging the ETL process.
- `yts.txt`: Raw HTML content of the scraped web page.
- `docker-compose.yaml`: Docker Compose configuration that orchestrates the ETL pipeline.
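For orientation, `create_movies.sql` might look roughly like the sketch below. The column names and types are assumptions, since the real schema depends on which fields `worker.py` scrapes from YTS:

```sql
-- A plausible shape for create_movies.sql (columns and types are assumptions).
CREATE TABLE IF NOT EXISTS movies (
    id     SERIAL PRIMARY KEY,
    title  TEXT NOT NULL,
    year   INTEGER,
    rating NUMERIC(3, 1),
    genre  TEXT,
    link   TEXT
);
```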
## Prerequisites

Ensure you have the following installed:

- Docker: For containerization.
- Docker Compose: To manage multi-container applications.
- Python 3.8: For running the worker script.
## Setup and Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres MovieData-ETL
   cd MovieData-ETL
   ```

2. Build and run the Docker containers:

   ```bash
   docker-compose up --build
   ```

   This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.
## Usage

1. Scrape movie data:

   The `worker.py` script scrapes the latest movie data from YTS and generates SQL insert statements:

   ```bash
   python worker.py
   ```

   This generates the `output.csv`, `output.xlsx`, and `insert_movies.sql` files containing the scraped movie data.

2. Run the ETL pipeline:

   The `elt_pipeline.py` script waits for the PostgreSQL database to be ready, then loads the movie data from the SQL script into the database:

   ```bash
   docker-compose up elt_script
   ```
## ETL Process Details

### Extraction

The extraction step scrapes movie data from the YTS website using the `worker.py` script. BeautifulSoup parses the HTML and extracts the relevant movie information.

### Transformation

The extracted data is transformed into a structured format using Pandas, then saved to CSV and Excel files for reference and further processing.

### Loading

The transformed data is loaded into the PostgreSQL database. The `elt_pipeline.py` script uses the `psycopg2` library to connect to the database and load the data using the SQL insert statements generated by the `worker.py` script.
### `worker.py`

- Step 1: Sends a GET request to the specified URL to fetch movie data.
- Step 2: Parses the HTML content using BeautifulSoup.
- Step 3: Extracts movie details and stores them in a list.
- Step 4: Creates a Pandas DataFrame from the extracted data.
- Step 5: Saves the DataFrame to CSV and Excel files.
- Step 6: Generates SQL insert statements and saves them to a `.sql` file (see the sketch after this list).
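A minimal sketch of those six steps follows. The URL, the CSS class names, and the two columns extracted are assumptions; the real selectors depend on the page markup `worker.py` targets:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = "https://yts.mx/browse-movies"  # assumed endpoint

# Step 1: fetch the page.
response = requests.get(URL, timeout=30)
response.raise_for_status()

# Step 2: parse the HTML.
soup = BeautifulSoup(response.text, "html.parser")

# Step 3: extract movie details (class names are assumptions).
movies = []
for wrap in soup.find_all("div", class_="browse-movie-wrap"):
    title = wrap.find("a", class_="browse-movie-title")
    year = wrap.find("div", class_="browse-movie-year")
    if title and year:
        movies.append({"title": title.get_text(strip=True),
                       "year": year.get_text(strip=True)})

# Step 4: build a DataFrame.
df = pd.DataFrame(movies)

# Step 5: save to CSV and Excel (Excel export needs openpyxl installed).
df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)

# Step 6: generate SQL insert statements (assumes year is numeric text).
with open("insert_movies.sql", "w") as f:
    for row in df.itertuples(index=False):
        safe_title = row.title.replace("'", "''")  # escape single quotes
        f.write(f"INSERT INTO movies (title, year) "
                f"VALUES ('{safe_title}', {row.year});\n")
```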
### `elt_pipeline.py`

- Database Connection: Connects to the PostgreSQL database using `psycopg2`.
- Data Loading: Loads the movie data from the SQL script into the database.
- Retry Logic: Implements retry logic to handle transient connection issues (see the sketch after this list).
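A minimal sketch of that connect-retry-load flow, assuming the environment variable names from the Compose file in the Configuration section, the `movies_postgres` service name as the host, and a hypothetical retry budget:

```python
import os
import time
import psycopg2

MAX_RETRIES = 10  # hypothetical retry budget

def connect_with_retry():
    """Retry until PostgreSQL is ready to accept connections."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return psycopg2.connect(
                host="movies_postgres",  # Compose service name
                dbname=os.environ.get("POSTGRES_DB", "movies_db"),
                user=os.environ.get("POSTGRES_USER", "postgres"),
                password=os.environ.get("POSTGRES_PASSWORD", "secret"),
            )
        except psycopg2.OperationalError:
            print(f"Database not ready ({attempt}/{MAX_RETRIES}); retrying...")
            time.sleep(5)
    raise RuntimeError("Could not connect to PostgreSQL")

conn = connect_with_retry()
with conn, conn.cursor() as cur:
    # Execute the insert statements generated by worker.py.
    with open("insert_movies.sql") as f:
        cur.execute(f.read())
conn.close()
```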
## Configuration

The PostgreSQL database configuration and credentials are set in the `docker-compose.yaml` file and passed as environment variables to the Docker containers. Ensure the following values are set correctly:

```yaml
services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
```
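The ETL service typically receives the same credentials. A sketch of how that part of `docker-compose.yaml` might look, where the `elt_script` service name matches the command in the Usage section and the `build` path and `depends_on` wiring are assumptions:

```yaml
  elt_script:
    build: ./elt          # assumed build context (the elt/ directory)
    depends_on:
      - movies_postgres   # assumed: start the database first
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
```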
## Future Improvements

- Automated Scheduling: Integrate a scheduler such as `cron` or Airflow to automate the ETL process at regular intervals.
- Data Validation: Implement data validation and cleaning steps to ensure data quality.
- Enhanced Logging: Add more detailed logging to track the ETL process and handle errors more effectively.
## Contributing

Contributions are welcome! Please read the contributing guidelines for more details on how to get started.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
For any inquiries or support, please open an issue on our GitHub repository.