Spotify Tracks Dataset

📚 Final project for the first module (Foundations) of the Data Mining course, Università di Pisa.

Dataset

📊 The Spotify Tracks Dataset used in this study was provided by the lecturers and contains information about audio tracks available in the Spotify catalog. These tracks span 20 different genres, such as chicago-house, black-metal and breakbeat. Each track is described by essential details (track’s name, artist, album name, ...) and other features like its level of popularity within the Spotify catalog. The dataset also contains audio-derived features representing various aspects like danceability, energy, key, and loudness.

Tasks

Data Understanding and Preparation: exploratory data analysis with the analytical tools studied in the course, covering data semantics, data quality assessment, variable distributions and pairwise correlations.
Clustering: explore the dataset using centroid-based methods, density-based clustering and hierarchical clustering.
Classification: choice of at least one target variable and classification with decision tree, KNN and Naive Bayes models. The final discussion must evaluate quantitative performance in terms of confusion matrix, accuracy, precision, recall, F1-score and ROC curves.
Regression and Pattern Mining: univariate and multivariate regression on two or more continuous variables, using the different regressors studied; frequent pattern extraction and association rule extraction, with discussion of the results.

Language and Packages

Language: Python 3.10
IDE: Google Colab, a cloud-based platform that provides free computational resources, such as CPUs and GPUs, along with tools for writing, running, and sharing Python code.

Below are the Python packages and modules used in the project, categorized by their primary use, with their modules listed:

General Purpose:

numpy
random
statistics
warnings

Data Manipulation:

pandas
scipy: stats, spatial.distance, spatial
sklearn: preprocessing, feature_selection, model_selection, neighbors, tree, linear_model, naive_bayes, metrics

Data Visualization:

matplotlib: pyplot, colors, cm, font_manager
seaborn
plotly: graph_objects, express, subplots
mpl_toolkits: mplot3d
scikitplot: metrics
dtreeviz
graphviz
treeinterpreter

Machine Learning:

sklearn: cluster, decomposition, metrics.pairwise, neighbors, tree, model_selection, linear_model, naive_bayes, metrics
kneed

File and Data Management:

pickle
joblib
google.colab
tqdm: notebook

Other:

fim: apriori, fpgrowth

Final report (PDF) -> ProjectReport

Results

Clustering

DBSCAN did not produce a useful clustering, despite tests with several choices of eps and minPts: it mostly yields large clusters that include almost the entire dataset, leaving only noise points outside them. The hierarchical methods also produce highly unbalanced clusters. K-Means, applied to a selected subset of features, was the only algorithm able to separate some clusters in a balanced way, with an acceptable silhouette value (0.51).
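
The silhouette value quoted above can be illustrated with a minimal sketch. It is written in plain Python rather than with the sklearn functions used in the project, and the six 2-D points and both labelings are invented for illustration; they are not taken from the Spotify data.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of another cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (assumption): two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(f"good labeling: {good:.2f}, bad labeling: {bad:.2f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned points, which is why 0.51 is described as acceptable rather than strong.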

Classification

Target variable genre:

KNN: [accuracy: 0.48, roc auc: 0.89, precision/recall auc: 0.44]. Although tuning improved on the basic model and the accuracy is well above the chance level of 1/20 for 20 genres, the model is acceptable for purely analytical purposes but not usable in a real-world context, given the high error rate: about half of the data are not classified correctly.
Naive Bayes: In this case the error increases to about 60%, which means that the two models, Gaussian and Categorical (applied to different feature groups, continuous and categorical respectively), perform worse than KNN.
DecisionTree: The accuracy of the model does not exceed 0.46, even after appropriate parameter tuning. However, we can still study the behavior of the model and how it was able to capture relationships between variables based on the importance given in the training phase.

Target variable popularity:

The optimal configuration appears to be {'splitter': 'best', 'min_samples_split': 94, 'min_samples_leaf': 58, 'max_depth': 5, 'criterion': 'entropy'} with an average accuracy of 0.87. However, there is a class imbalance problem: tracks with low popularity cover almost the entire dataset, medium-popularity tracks number in the hundreds, and high-popularity tracks only a few dozen.
We can conclude that although the results looked promising at first, once the weights of the various classes are taken into account (because of the imbalance), the model loses much of its ability to generalize.
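
The effect of class imbalance on accuracy can be sketched as follows. The class counts are hypothetical, chosen only to mimic the "almost all low, some medium, very few high" shape described above, and balanced accuracy (the mean of per-class recall) is used here as one possible class-aware metric; the report may have weighted the classes differently.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted samples / samples in the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical class counts (assumption, not the project's real numbers).
y_true = ["low"] * 900 + ["medium"] * 80 + ["high"] * 20

# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * len(y_true)

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {acc:.2f}")            # 0.90
print(f"balanced accuracy: {bal:.2f}")   # 0.33
print(per_class_recall(y_true, y_pred))
```

In this toy setting a model that never predicts medium or high still reaches 0.90 accuracy, which shows why a high plain accuracy on an imbalanced target has to be read together with per-class metrics.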

Regression

Simple: The best performing model is the one between duration_min and n_bars. This confirms what we expected, because the length of the song increases as the number of bars it contains increases. One of the worst performing models is the one between n_bars and tempo. In fact, the number of bars in a song does not significantly influence its tempo, at least not in a linear way.
Multiple: the three best target/model combinations are n_bars/DecisionTree (with R2=0.92, MAE=9.15), tempo/DecisionTree and n_bars/Lasso.
Multivariate: the best performance (R2=0.49, MAE=4.49) was achieved for the targets [popularity, danceability, energy] with the KNN model.
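
As a reminder of how the R2 and MAE figures above are computed, here is a minimal sketch of a simple linear regression in plain Python. The five (n_bars, duration_min) pairs are invented for illustration, and predicting duration_min from n_bars is an assumed direction; the project itself used the sklearn regressors listed earlier.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2_score(y, y_hat):
    """Coefficient of determination: 1 - residual SS / total SS."""
    my = fmean(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def mae(y, y_hat):
    """Mean absolute error between observed and predicted values."""
    return fmean([abs(yi - hi) for yi, hi in zip(y, y_hat)])

# Made-up values (assumption): longer tracks contain more bars.
n_bars = [40, 80, 120, 160, 200]
duration_min = [1.4, 2.6, 4.1, 5.3, 6.6]

slope, intercept = fit_line(n_bars, duration_min)
predicted = [slope * x + intercept for x in n_bars]

r2 = r2_score(duration_min, predicted)
err = mae(duration_min, predicted)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, R2={r2:.3f}, MAE={err:.3f}")
```

An R2 close to 1 with a small MAE corresponds to the nearly linear relationship described for duration_min and n_bars; a pair such as n_bars and tempo would give an R2 close to 0 under the same procedure.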

Pattern Mining

Our dataset predominantly consists of non-explicit tracks with high volume, low speechiness, a duration below 22 minutes, and a low number of bars. Tracks with medium tempo and low speechiness also have a significant presence. The presence of the liveness feature among the most frequent patterns also suggests that the dataset contains many tracks with a low “live” feel. From association rule extraction we can see the following information:

Consequents: in our case they are mainly related to energy.
Confidence: measures how often the rule has been found to be true. For example, a confidence of 0.72 for the first rule means that about 72% of the transactions containing [High_acousticness, Low_valence, Low_liveness, Non-Explicit, Low_speechiness, Low_n_bars] also have a low energy level.
Lift: ratio of the observed support to the support expected if the antecedent and the consequent were independent. In our case, the highest lifts are around 5, well above the value of 1 expected under independence, which means that the rules are quite significant.
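
The definitions of confidence and lift above can be checked by hand on a small example. The sketch below uses plain Python sets instead of the fim package used in the project, and the eight transactions are invented; only the item naming style (High_acousticness, Low_energy, ...) follows the rules quoted above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent (1.0 = independence)."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

# Toy transactions (assumption): one set of discretized features per track.
transactions = [
    {"High_acousticness", "Low_valence", "Low_energy"},
    {"High_acousticness", "Low_valence", "Low_energy"},
    {"High_acousticness", "Low_valence", "High_energy"},
    {"Low_acousticness", "High_valence", "High_energy"},
    {"Low_acousticness", "Low_valence", "High_energy"},
    {"Low_acousticness", "High_valence", "High_energy"},
    {"Low_acousticness", "High_valence", "High_energy"},
    {"Low_acousticness", "High_valence", "Low_energy"},
]

antecedent = {"High_acousticness", "Low_valence"}
consequent = {"Low_energy"}

conf = confidence(transactions, antecedent, consequent)
rule_lift = lift(transactions, antecedent, consequent)
print(f"confidence={conf:.2f}, lift={rule_lift:.2f}")  # confidence=0.67, lift=1.78
```

Here the antecedent appears in 3 of 8 transactions and 2 of those also contain Low_energy, giving a confidence of 2/3; dividing by the overall support of Low_energy (3/8) gives a lift of 16/9, meaning the consequent is about 1.8 times more frequent than it would be under independence.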

Authors

Acknowledgements

Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python. Download page
Scikit-learn: python library with tools for data mining and data analysis Documentation page
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
Header image: AI generated using Bing and Adobe Firefly.
