# Webscraping Pipelines

This repo contains two data pipelines that gather and preprocess job data from Glassdoor and Stack Overflow and save it to PostgreSQL.

## 1. Instructions - Glassdoor

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Add a `.env` file in the `glassdoorscraper` directory with the following variables:

```
# Specify Postgres Pipeline
MY_POSTGRES_HOST = "POSTGRES_HOST"
MY_POSTGRES_USER = "POSTGRES_USER"
MY_POSTGRES_PASSWORD_WIN = "POSTGRES_PASSWORD"

# Fake User Agent API https://scrapeops.io/docs/intro/
MY_SCRAPEOPS_API_KEY = "YOUR_API_KEY"
```
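
As a quick sanity check, the minimal sketch below (not part of the repo; it assumes the variables are loaded into the environment with `python-dotenv`) verifies that the `.env` file is picked up. Run it from the `glassdoorscraper` directory so `load_dotenv()` finds the file:

```python
# Sanity check: confirm the .env variables are visible to Python.
# Assumes python-dotenv is installed; it loads .env into the environment.
import os

from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory

for var in ("MY_POSTGRES_HOST", "MY_POSTGRES_USER",
            "MY_POSTGRES_PASSWORD_WIN", "MY_SCRAPEOPS_API_KEY"):
    print(var, "->", "set" if os.getenv(var) else "MISSING")
```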

### Save Data to PostgreSQL

Next, create the database the pipeline writes to (by default, `postgres_db_name="jobglassdoor"`). `init.py` saves the collected data to the `job_listings` table by default, so create that table by running the following query:

```sql
CREATE TABLE job_listings (
    job_id BIGINT PRIMARY KEY,
    job_title VARCHAR(255),
    job_description TEXT,
    company_id VARCHAR(50),
    company_name VARCHAR(255),
    job_url VARCHAR(255),
    job_location VARCHAR(100),
    company_size VARCHAR(100),
    founded_year VARCHAR(4),
    company_sector VARCHAR(100),
    country VARCHAR(100)
);
```
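
If you'd rather do this from Python than from `psql`, the sketch below (not part of the repo; it assumes `psycopg2` is installed and reuses the `MY_POSTGRES_*` variables from the `.env` file above) creates the database and the table in one go:

```python
# Sketch: create the jobglassdoor database and the job_listings table
# with psycopg2, using the credentials from the .env file above.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

CREDS = dict(
    host=os.environ["MY_POSTGRES_HOST"],
    user=os.environ["MY_POSTGRES_USER"],
    password=os.environ["MY_POSTGRES_PASSWORD_WIN"],
)

# Connect to the default "postgres" database first; CREATE DATABASE
# cannot run inside a transaction, so autocommit is required.
conn = psycopg2.connect(dbname="postgres", **CREDS)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE jobglassdoor;")
conn.close()

# Now connect to the new database and create the table from the query above.
conn = psycopg2.connect(dbname="jobglassdoor", **CREDS)
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS job_listings (
            job_id BIGINT PRIMARY KEY,
            job_title VARCHAR(255),
            job_description TEXT,
            company_id VARCHAR(50),
            company_name VARCHAR(255),
            job_url VARCHAR(255),
            job_location VARCHAR(100),
            company_size VARCHAR(100),
            founded_year VARCHAR(4),
            company_sector VARCHAR(100),
            country VARCHAR(100)
        );
    """)
conn.close()
```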

### Save Data to CSV file

If you prefer to save the data to a CSV file instead, uncomment line 113 and delete the old `data_pipeline`.
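
For reference, a CSV pipeline can be as small as the sketch below (not from the repo; it assumes the scraper collects each listing as a dict keyed by the column names from the table above):

```python
# Sketch: write scraped job listings to a CSV file with the stdlib csv module.
import csv

FIELDS = [
    "job_id", "job_title", "job_description", "company_id", "company_name",
    "job_url", "job_location", "company_size", "founded_year",
    "company_sector", "country",
]

def save_to_csv(listings, path="job_listings.csv"):
    """Write an iterable of job-listing dicts to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(listings)
```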

Finally, run `init.py`:

```bash
python init.py
```