This project is an in-depth analysis designed to derive insights from LinkedIn job posts within the Swedish market. It uses a robust architecture integrating various technologies and services to capture, store, process, and analyze jobs posted on LinkedIn on an hourly basis. Additionally, it incorporates population data at the city level to provide a comprehensive view of Swedish job market dynamics.
The architecture consists of two data pipelines. The first flow involves LinkedIn job data, which is ingested on an hourly basis into the bronze lakehouse. The second flow involves data from the Swedish Central Statistics Office (SCB), which is ingested into the bronze lakehouse on a monthly basis.
Both datasets are curated and transformed, then loaded into new tables in the silver lakehouse. Here, we use Azure OpenAI to enrich the job data by extracting important features from the job descriptions.
Finally, the data is aggregated in the default semantic model of the SQL endpoint of the silver lakehouse, and a Power BI report is built on top of it.
The overall architecture looks as follows:
The solution leverages a medallion architecture with two main lakehouses, `bronze` and `silver`, in the Fabric workspace. The workspace also contains two main pipelines that orchestrate the data ingestion and transformations to and from the lakehouses:
- `_jobs_data_hourly`: Fetches LinkedIn jobs data in Sweden on an hourly basis, with a sliding window of one month, meaning that data older than one month is automatically deleted. It also fetches each company's follower count to show the popularity of the company posting the jobs. After that, it transforms and cleans the data, then upserts it into the silver lakehouse based on job IDs. It finishes by running the `AI_Inrich_Jobs_Silver_Hourly` notebook, which feeds the job descriptions into Azure OpenAI to extract key information (see the sketches after this list):
  - Tools: A list of relevant tools mentioned in the job description.
  - Requirements: A list of required skills, qualifications, or experience (excluding tools listed in Tools).
  - Offer: A list of benefits provided by the employer.
  - WorkType: This key can only be one of four values: Remote, Hybrid, In-Office, Null.
- `_population_data_monthly`: Fetches the population data from the Swedish Central Statistics Office (SCB) API using a copy activity that lands the data in the `bronze` lakehouse in CSV format, then transforms and overwrites the data in the `silver` lakehouse, since it is monthly data and we are only interested in the current month's population stats (a sketch of this step also follows below).
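To make the hourly flow concrete, here is a minimal PySpark sketch of the upsert and sliding-window cleanup, assuming a cleaned hourly DataFrame `hourly_batch_df` and hypothetical table and column names (`silver.linkedin_jobs`, `job_id`, `posted_date`); it illustrates the pattern, not the project's exact code:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hypothetical table name: adjust to the actual silver jobs table.
silver_jobs = DeltaTable.forName(spark, "silver.linkedin_jobs")

# Upsert the latest hourly batch, matching existing rows on job ID.
(
    silver_jobs.alias("target")
    .merge(hourly_batch_df.alias("source"), "target.job_id = source.job_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Enforce the one-month sliding window: drop rows older than 30 days.
silver_jobs.delete(F.col("posted_date") < F.date_sub(F.current_date(), 30))
```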
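The enrichment call itself might look like the following sketch, assuming the notebook uses the `openai` Python SDK against an Azure OpenAI chat deployment; the deployment name, API version, and prompt wording are illustrative assumptions:

```python
import json
from openai import AzureOpenAI

# ENDPOINT and API_KEY come from the notebook configuration (see the setup
# steps below); the API version here is an illustrative assumption.
client = AzureOpenAI(azure_endpoint=ENDPOINT, api_key=API_KEY, api_version="2024-02-01")

SYSTEM_PROMPT = (
    "From the job description, return a JSON object with exactly these keys: "
    "'Tools' (list of relevant tools mentioned), "
    "'Requirements' (list of required skills, qualifications, or experience, "
    "excluding the tools), 'Offer' (list of benefits provided by the employer), "
    "'WorkType' (exactly one of: Remote, Hybrid, In-Office, Null)."
)

def enrich_job(description: str) -> dict:
    """Feed one job description to Azure OpenAI and parse the extracted keys."""
    response = client.chat.completions.create(
        model="gpt-4o",  # your Azure OpenAI deployment name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    return json.loads(response.choices[0].message.content)
```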
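For the monthly population flow, the transform-and-overwrite step might look like this sketch; the file path, column renames, and table name are hypothetical:

```python
from pyspark.sql import functions as F

# The copy activity lands the SCB CSV in the bronze lakehouse (hypothetical path).
population_df = (
    spark.read.option("header", True)
    .csv("Files/scb/population.csv")
)

# Illustrative cleanup: rename and cast columns to match the silver schema.
cleaned_df = (
    population_df
    .withColumnRenamed("region", "city")
    .withColumn("population", F.col("population").cast("int"))
)

# Overwrite rather than append: only the current month's stats are kept.
cleaned_df.write.mode("overwrite").saveAsTable("silver.population_monthly")
```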
The Power BI report shows insights based on the hourly LinkedIn jobs ingested, broken down at the Swedish city level. It consists of five pages:
- LinkedIn jobs key measures: Provides insights into job counts based on seniority level, popularity, titles, employment type, and company name. Additionally, it includes a heat map of job counts across cities in Sweden.
- Sweden population insights: Insights into the total population of Swedish cities, broken down by gender (men and women).
- LinkedIn job descriptions key measures: Insights derived from LinkedIn job descriptions, showcasing job counts based on job offers, key requirements, tools, and work type. It also highlights company popularity by the number of LinkedIn followers.
- Seniority level requirements: Analyzes job requirements broken down by seniority level.
- Seniority level tools: Analyzes the tools required at each seniority level.
Microsoft Copilot played a pivotal role across various phases of the project. In the experimenting phase, Copilot assisted with notebooks, providing guidance on different Spark dialects, whether Spark SQL or PySpark. This support streamlined the data exploration and transformation processes. As the team transitioned to developing the Power BI report, Copilot helped create various measures, enhancing the report’s analytical depth. Finally, in the documentation phase, Copilot ensured comprehensive and clear documentation, capturing all critical aspects of the project.
To recreate the solution, follow the steps below:
- Fork the `Linkedin_jobs_datalake` GitHub repo.
- Create your own Fabric workspace.
- Follow Microsoft's Learn documentation on how to connect a workspace to a Git repo, using `MS-Hackathon` as the Git root folder and `main` as the branch, as follows:
- Connect and sync the artefacts into your new Fabric workspace.
- Add `ENDPOINT` and `API_KEY` to the `AI_Inrich_Jobs_Silver_Hourly` notebook (see the sketch after this list).
- Done, you should now be able to use the report. 🎉
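A minimal sketch of what those two values might look like at the top of the notebook (placeholders only; in practice, a secret store such as Azure Key Vault is preferable to hard-coding the key):

```python
# Replace the placeholders with your Azure OpenAI resource values.
ENDPOINT = "https://<your-resource-name>.openai.azure.com/"
API_KEY = "<your-azure-openai-api-key>"
```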
This project started as part of the Microsoft Fabric Hackathon 2024. More info can be found here.
- Mohammad Raja - Initial work and maintenance.
- Anas Mofleh - Initial work and maintenance.