The soccer database contains data for over 25,000 football matches played from 2008 to 2016 by 299 European football clubs.
The dataset contains seven tables, namely:
- Country (id, name)
- League (id, country_id, name)
- Match (id, country_id, league_id, home_team_goal, away_team_goal, and 100 other columns)
- Player
- Team
- Player Attributes
- Team Attributes
- Pandas: For storing and manipulating structured data. Pandas functionality is built on NumPy (upgrade to version 0.25.1)
- Numpy: For multi-dimensional array, matrix data structures and, performing mathematical operations
- Matplotlib: For all visualizations (including maps and graphs)
- PandaSQL: For querying pandas DataFrame using SQL syntax
The main steps for this project are as follows:
- Data Wrangling:
- Data Gathering
- Data Assessment
- Data Cleaning
- Exploratory Analysis
- Conclusions/Results
Based on the data and analysis carried out, I found that:
-
The best teams in each of the top 5 leagues, from 2008 to 2015, were:
- Based on matches won: Man Utd (EPL), Real Madrid and Barcelona (La Liga), Bayern Munich (Bundesliga), Juventus (Serie A), PSG (Ligue 1)
- Based on goals scored: Chelsea, Real Madrid, Bayern, Juventus, and Paris Saint-Germain
-
The teams that improved their goalscoring the most in each of the top 5 leagues (across the period) were Tottenham Hotspur, Real Madrid CF, Borussia Monchengladbach, Napoli, and Paris Saint-Germain.
-
I found zero to no correlation between the team attributes and goals scored or matches won.
Certain teams didn't have goal data in certain seasons.