Authors: Anh Tran, Amelia Oo
This project was completed as part of the course DSC 80 at UC San Diego.
Music popularity is difficult to explain with one simple factor. A song may become popular because of the artist, the genre, playlist exposure, social trends, or the way the song sounds. In this project, we focus on the measurable side of music: the audio and metadata features available for Spotify tracks. Our goal is to explore whether these features can help us understand and predict which tracks are more likely to be popular.
We use the Spotify Music Tracks dataset, which contains 114,000 rows and 22 columns. Each row represents a single Spotify track, and the columns contain metadata and audio features associated with that track, such as the artist name, album name, release date, duration, explicit status, genre, and Spotify popularity score. The dataset also includes audio features such as danceability, energy, valence, acousticness, instrumentalness, loudness, speechiness, liveness, and tempo. These variables allow us to compare songs not only by genre or popularity, but also by musical characteristics.
Our main data science question is: Can we predict whether a Spotify track becomes popular using its audio features, genre, and track metadata?
This question is relevant because music platforms, artists, and playlist curators often rely on data to understand listener behavior and identify patterns behind successful songs. While popularity is influenced by many outside factors that are not captured in this dataset, the available audio features still give us a way to study whether certain types of songs are more likely to perform well.
The dataset contains many variables; however, our analysis focuses on the following columns:
| Column | Description |
|---|---|
| `track_id` | Spotify’s unique identifier for the track. |
| `popularity` | Spotify popularity score ranging from 0 to 100. |
| `track_genre` | Genre of the track as labeled by Spotify. |
| `danceability` | Measures how suitable a track is for dancing based on tempo, rhythm stability, and beat strength (0–1). |
| `energy` | Measures the perceptual intensity and activity level of a track (0–1). |
| `loudness` | Measures the overall loudness of a track in decibels (dB). |
| `speechiness` | Measures the presence of spoken words within a track. |
| `acousticness` | Estimates the confidence that the track is acoustic rather than electronically produced (0–1). |
| `instrumentalness` | Estimates the probability that the track contains little to no vocals (0–1). |
| `liveness` | Estimates the probability that the track was recorded with a live audience (0–1). |
| `valence` | Measures the musical positivity of a track, where higher values indicate happier or more cheerful sounds (0–1). |
| `tempo` | Estimated tempo of the track in beats per minute (BPM). |
| `explicit` | Indicates whether the track contains explicit content (True = explicit, False = not explicit). |
| `artists` | Name(s) of the artist(s). |
| `duration_ms` | Duration of the track in milliseconds. |
| `release_date` | Release date of the track. |
To focus on our research question, we kept only the relevant columns described above and removed the remaining columns from the dataset.
- We started by examining the dataset for missing values. We found that the `tempo` column contained over 22,000 missing values, which could affect analyses involving tempo. Since tempo is an important audio feature, we chose to preserve these missing values for later missingness analysis rather than immediately removing the affected observations.
- Next, we filtered the `track_genre` column to include only six genres: Classical, Hip-Hop, Country, Electronic, Metal, and Pop. We selected these genres because they represent a diverse range of musical styles, differing in characteristics such as instrumentation, energy, rhythm, and production techniques. The original dataset contains many genre categories, which can make analysis and visualization difficult to interpret. By focusing on a smaller set of representative genres, we were able to perform clearer genre-based comparisons while keeping the scope of the project manageable.
- Then, we converted the `release_date` column to a datetime format and extracted the release year into a new column, `release_year`, which allows us to analyze how a track's release period may relate to its popularity.
- We also created a new column by converting `duration_ms` into a more interpretable feature, `duration_min`, which represents track duration in minutes.
- Finally, after examining the distribution of Spotify popularity scores, we created a binary target variable, `is_popular`, where tracks with a popularity score of 55 or greater were labeled as popular and tracks below 55 were labeled as not popular. We chose a threshold of 55 because, within our filtered dataset, approximately 37.7% of tracks meet or exceed this popularity level. This threshold provides a reasonably balanced class distribution while maintaining a meaningful definition of popularity for our prediction task.
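The cleaning steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows are made up, the helper name `clean_tracks` is hypothetical, and the lowercase genre labels are an assumption based on how genres appear in the cleaned table.

```python
import pandas as pd

# Made-up sample rows standing in for the raw tracks table (illustrative values only).
raw = pd.DataFrame({
    "track_genre": ["pop", "classical", "metal", "jazz"],
    "popularity": [72, 54, 55, 80],
    "release_date": ["2019-05-17", "1984-01-01", "2003-11-20", "1999-07-04"],
    "duration_ms": [210000, 180000, 245500, 300000],
})

# Assumption: genre labels are stored in lowercase with a hyphen.
SELECTED_GENRES = ["classical", "hip-hop", "country", "electronic", "metal", "pop"]
POPULARITY_THRESHOLD = 55


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Filter to the selected genres and add the derived columns."""
    out = df[df["track_genre"].isin(SELECTED_GENRES)].copy()
    # Parse the release date and keep only the year.
    out["release_year"] = pd.to_datetime(out["release_date"]).dt.year
    # 60,000 milliseconds per minute.
    out["duration_min"] = out["duration_ms"] / 60_000
    # Binary target: popular means a score of 55 or greater.
    out["is_popular"] = out["popularity"] >= POPULARITY_THRESHOLD
    return out


cleaned = clean_tracks(raw)
print(cleaned[["track_genre", "release_year", "duration_min", "is_popular"]])
```

Note that the comparison uses `>=`, so a track scoring exactly 55 counts as popular while one scoring 54 does not.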
After filtering the selected genres and engineering additional features, the final dataset consisted of 6,000 tracks and 19 variables. This dataset was used throughout our exploratory analysis and predictive modeling process.
Below is the head of our cleaned DataFrame.
| track_id | popularity | track_genre | danceability | acousticness | release_year | tempo | explicit | is_popular |
|---|---|---|---|---|---|---|---|---|
| 7wrYBASu0OoxoDErd4Edxd | 58 | classical | 0.643 | 0.593 | 2001 | nan | False | True |
| 72HdutlIHBZJ7WT1xVAAZT | 59 | classical | 0.484 | 0.365 | 2005 | nan | False | True |
| 7JGgKHHDgJCJkQCQxyHHdl | 54 | classical | 0.608 | 0.581 | 1984 | 140.109 | False | False |
| 3YRj4jmwois2ctPnhwSwFo | 68 | classical | 0.695 | 0.596 | 1972 | nan | False | True |
| 3tp3ij9dtY3CacQgd1OvRf | 59 | classical | 0.583 | 0.581 | 1987 | 118.226 | False | True |
Note: For readability, only a subset of the columns most relevant to our analysis is shown.
First, we explored the distribution of Spotify popularity scores.
<iframe src="assets/popularity_distribution.html" width="900" height="400" frameborder="0"> </iframe>

The histogram above shows a right-skewed distribution, with most tracks receiving relatively low to moderate popularity scores and fewer tracks achieving very high popularity. This suggests that highly popular songs are less common in the dataset, which may make popularity prediction more challenging since there are fewer examples of highly popular tracks for the model to learn from.

We then examined the distribution of danceability scores.
<iframe src="assets/danceability_distribution.html" width="900" height="400" frameborder="0"> </iframe>

We observed that most songs have moderate-to-high danceability scores, indicating that Spotify tracks in this dataset tend to have rhythmically engaging music.

Lastly, we examined the distribution of explicit and non-explicit tracks.
<iframe src="assets/explicit_distribution.html" width="900" height="400" frameborder="0"> </iframe>

This chart reveals a large difference between the numbers of explicit and non-explicit tracks in the dataset. Non-explicit songs make up the vast majority of tracks, with over 100,000 entries, while explicit songs account for a much smaller portion of the dataset. This suggests that mainstream music in the dataset is still largely dominated by non-explicit content. This imbalance may also affect later analysis or predictive modeling, where patterns associated with non-explicit tracks could have a stronger influence on the model due to their much larger representation in the dataset.

To better understand factors associated with popularity, we examined how popularity varies across different track characteristics.
First, we examined how popularity varies across selected music genres included in our analysis.
<iframe src="assets/genre_popularity.html" width="800" height="400" frameborder="0"> </iframe>

The boxplot shows differences in popularity across the six selected genres: classical, country, electronic, hip-hop, metal, and pop. Among these genres, pop and hip-hop tracks generally exhibit higher popularity scores, while classical and country tracks tend to have lower popularity on average. Electronic and metal tracks fall between these extremes. The observed differences suggest that genre may be a meaningful predictor of track popularity and could improve the performance of our popularity classification model.
Next, we compared popularity distributions between explicit and non-explicit tracks.
<iframe src="assets/explicit_popularity.html" width="900" height="400" frameborder="0"> </iframe>

Interestingly, although non-explicit tracks make up the majority of the dataset, explicit tracks appear to have a slightly higher median popularity. This may suggest that explicit content is relatively common among more popular or commercially successful tracks. However, both groups still show a wide spread in popularity, indicating that explicitness alone is not enough to determine whether a song becomes popular.

We also looked at how audio features differ between popular and non-popular songs.
| is_popular | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|
| False | 0.56 | 0.64 | -8.36 | 0.09 | 0.32 | 0.17 | 0.22 | 0.48 | 123.43 |
| True | 0.59 | 0.64 | -7.82 | 0.08 | 0.29 | 0.1 | 0.18 | 0.47 | 121.79 |
The table above compares the average audio features between popular and non-popular songs. Popular songs tend to have slightly higher average danceability scores and higher loudness values, suggesting that more rhythmically engaging and louder tracks may perform better on Spotify. Popular songs also show lower average acousticness and instrumentalness, indicating that mainstream songs in the dataset are generally less acoustic and more vocal-focused.
However, many of the differences between the two groups remain relatively small, especially for features such as energy, speechiness, and valence. This suggests that no single audio feature alone strongly determines popularity, and that song popularity is likely influenced by a combination of multiple audio characteristics and genre.
We believe the missing values in the tempo column are unlikely to be Not Missing At Random (NMAR). Instead, the missingness may be related to limitations in Spotify's audio analysis process or metadata collection rather than the tempo value itself.
If additional information were available about how Spotify computes audio features, whether a track failed audio processing, or why certain metadata could not be extracted, this information could help explain why tempo values are missing. Since the missingness is likely associated with factors other than the tempo itself, we believe the missingness mechanism is more consistent with Missing At Random (MAR) than NMAR.
We next examined the missingness of the tempo column by performing permutation tests to determine whether its missingness depends on other observed variables in the dataset. Specifically, we investigated whether the missingness of tempo depends on track_genre or release_year.
To conduct these tests, we created a Boolean indicator column, tempo_missing, which records whether the tempo value is missing for each track.
Null Hypothesis: The missingness of tempo does not depend on track_genre.
Alternative Hypothesis: The missingness of tempo depends on track_genre.
Test Statistic: Total Variation Distance (TVD) between the distribution of genres for tracks with missing tempo values and the distribution of genres for tracks with non-missing tempo values.
Significance level: 0.05
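A permutation test with TVD as the statistic can be sketched as below. This is an illustrative outline under stated assumptions rather than the project's code: the function names `tvd` and `missingness_permutation_test` are hypothetical, the toy data is made up so that missingness clearly depends on genre, and 500 trials is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def tvd(categories, group_flags):
    """Total variation distance between the category distributions of two groups."""
    # Each column of `dist` is the category distribution within one group.
    dist = pd.crosstab(categories, group_flags, normalize="columns")
    return float((dist.iloc[:, 0] - dist.iloc[:, 1]).abs().sum() / 2)


def missingness_permutation_test(categories, is_missing, n_trials=500, seed=0):
    """Shuffle the missingness flags to simulate the null of no dependence."""
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    is_missing = np.asarray(is_missing)
    observed = tvd(categories, is_missing)
    simulated = np.array([
        tvd(categories, rng.permutation(is_missing)) for _ in range(n_trials)
    ])
    # p-value: share of shuffles with a TVD at least as large as the observed one.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Toy data (made up): tempo is missing far more often for classical than for pop.
genres = ["classical"] * 100 + ["pop"] * 100
tempo_missing = [True] * 60 + [False] * 40 + [True] * 10 + [False] * 90

observed_tvd, p_value = missingness_permutation_test(genres, tempo_missing)
print(f"observed TVD = {observed_tvd:.3f}, p-value = {p_value}")
```

With a finite number of shuffles, a reported p-value of 0.0 means that none of the simulated statistics reached the observed one, so the true p-value is best read as smaller than one over the number of trials.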
<iframe src="assets/permutation_test_1.html" width="900" height="400" frameborder="0"> </iframe>

After performing the permutation test, we obtained a p-value of 0.0. Since the p-value is less than 0.05, we reject the null hypothesis. This suggests that the missingness of `tempo` does depend on `track_genre`.

Null Hypothesis: The missingness of tempo does not depend on a track's release_year.
Alternative Hypothesis: The missingness of tempo depends on a track's release_year.
Test Statistic: Absolute Difference in the mean release_year between tracks with missing tempo values and tracks with non-missing tempo values.
Significance level: 0.05
<iframe src="assets/permutation_test_2.html" width="900" height="400" frameborder="0"> </iframe>

After performing the permutation test, we obtained a p-value of 0.0. Since the p-value is less than 0.05, we reject the null hypothesis. This suggests that the missingness of `tempo` does depend on `release_year`.

To further validate our findings, we also explored the relationship between tempo missingness and several other observed variables in the dataset. Across these additional permutation tests, the observed test statistics were consistently much larger than those generated under the null distributions. These results align with the two tests presented above and provide further support for our conclusion that the missingness of tempo is more consistent with a MAR mechanism.
To investigate whether genre is associated with song popularity, we conducted a permutation test comparing the genre distributions of popular and non-popular songs.
Null Hypothesis: The distribution of track_genre is the same for popular and non-popular songs.
Alternative Hypothesis: The distribution of track_genre differs between popular and non-popular songs.
Test Statistic: Total Variation Distance (TVD) between the genre distributions of popular and non-popular songs.
Significance level: 0.05
P-value: 0.0
<iframe src="assets/hypothesis_test.html" width="900" height="400" frameborder="0"> </iframe>

We chose a permutation test because we wanted to determine whether the observed association between genre and popularity could have occurred by chance. Under the null hypothesis, popularity labels are independent of genre, so randomly permuting the popularity labels simulates what we would expect to see if no relationship existed. We used TVD as our test statistic because it measures the overall difference between two categorical distributions and is therefore well-suited for comparing genre proportions across groups.
The observed TVD (0.437) is substantially larger than the values generated under the null distribution, resulting in a p-value that is effectively 0. Since the p-value is well below our significance level of 0.05, we reject the null hypothesis. This suggests that the distribution of genres differs between popular and non-popular songs and that genre may be an important factor associated with song popularity in our dataset.
The prediction problem is: Can we predict whether a Spotify track is popular using its audio features, genre, and track metadata?
This is a binary classification problem. The response variable is is_popular, created by thresholding the popularity score at 55: tracks scoring 55 or above are labeled popular (1), and the rest not popular (0). This threshold labels the top 37.7% of tracks as popular, giving a reasonable class balance.
At the time of prediction, all audio features, genre, and metadata (duration, explicit status, release year, number of artists) are known properties of the track itself, not outcomes that happen after release.
We chose F1 score as our evaluation metric instead of accuracy because the classes are not perfectly balanced: approximately 37.7% of tracks are labeled as popular. A model that simply predicts "not popular" for every track would reach roughly 62% accuracy while identifying no popular tracks at all. F1 balances precision and recall, giving a more honest measure of how well the model identifies popular tracks.
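To make the accuracy-versus-F1 argument concrete, here is a small worked example in plain Python. It uses a made-up set of 1,000 labels with the same 37.7% positive rate and scores a trivial predictor that always answers "not popular". The helper `classification_metrics` is hypothetical, and returning 0 for an undefined precision or recall is a convention chosen for this sketch.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Convention for this sketch: undefined ratios are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 377 popular tracks out of 1,000 (37.7%).
y_true = [True] * 377 + [False] * 623
always_not_popular = [False] * 1000

metrics = classification_metrics(y_true, always_not_popular)
print(metrics)
```

The trivial predictor is right on every non-popular track, so its accuracy equals the majority-class share, yet its F1 is zero because it never finds a popular track.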
For the baseline model, we used a RandomForestClassifier inside a single sklearn Pipeline. The goal of this baseline was to establish an initial performance benchmark using a small set of features before adding more complexity in the final model.
The baseline model used two features:
| Feature | Type | Encoding / Processing |
|---|---|---|
| `danceability` | Quantitative | Scaled using `StandardScaler` |
| `track_genre` | Nominal | Encoded using `OneHotEncoder` |
The model was trained on 80% of the data and evaluated on the remaining 20% test set.
| Metric | Score |
|---|---|
| Accuracy | 0.739 |
| Precision | 0.686 |
| Recall | 0.582 |
| F1-score | 0.630 |
The baseline model provides a reasonable starting point, but there is still room for improvement. Our primary metric, F1 score (0.630), reflects this; in particular, the model only identifies about 58% of popular tracks (recall), meaning it misses roughly 42% of the tracks it should flag as popular. This suggests that using only danceability and track_genre is not enough to fully capture the patterns behind track popularity.
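A baseline of the kind described above might look like the following scikit-learn sketch. It is not the project's actual code. The data is synthetic with a made-up labeling rule, and the step names, `random_state`, `stratify`, and `handle_unknown="ignore"` are all assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the Spotify dataset).
rng = np.random.default_rng(42)
n = 600
tracks = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "track_genre": rng.choice(["classical", "country", "pop"], n),
})
# Made-up rule so that genre carries signal about the label.
pop_rate = tracks["track_genre"].map({"classical": 0.05, "country": 0.15, "pop": 0.65})
tracks["is_popular"] = (rng.uniform(0, 1, n) < pop_rate).astype(int)

# Scale the quantitative feature and one-hot encode the nominal one.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["danceability"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["track_genre"]),
])
baseline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(random_state=42)),
])

X = tracks[["danceability", "track_genre"]]
y = tracks["is_popular"]
# 80/20 split; stratifying on the label is an assumption, not stated in the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline.fit(X_train, y_train)
test_f1 = f1_score(y_test, baseline.predict(X_test))
print(f"F1 on the synthetic test split: {test_f1:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking test-set information into the model.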
For the final model, we plan to improve performance by adding more Spotify audio features, such as energy, valence, acousticness, loudness, tempo, and other track-level audio characteristics. We also plan to tune model hyperparameters to better capture more complex relationships between audio features, genre, and popularity.
We engineered three new features on top of the baseline:
- `loudness_energy_ratio` (`loudness / energy`): captures the relationship between loudness and energy that neither feature alone conveys. A song can be loud but low energy, or high energy but quiet.
- `tempo_filled`: filled 22,114 missing tempo values (19.4% of rows) using the genre median rather than the overall median, because a missing tempo in "classical" is likely different from a missing tempo in "hip-hop".
- `num_artists`: counts the number of artists on a track by splitting the `artists` column on semicolons.
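The three engineered features can be sketched in pandas as shown below. The toy rows are made up and the helper name `add_engineered_features` is hypothetical, so treat this as one plausible implementation rather than the project's code.

```python
import numpy as np
import pandas as pd

# Made-up rows covering two genres, each with one missing tempo.
toy = pd.DataFrame({
    "track_genre": ["classical", "classical", "classical", "hip-hop", "hip-hop", "hip-hop"],
    "tempo": [80.0, np.nan, 100.0, 140.0, 90.0, np.nan],
    "loudness": [-20.0, -18.0, -25.0, -6.0, -5.0, -7.0],
    "energy": [0.2, 0.3, 0.1, 0.8, 0.9, 0.7],
    "artists": ["A", "A;B", "C", "D;E;F", "G", "H;I"],
})


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Ratio of loudness (dB) to energy; an energy of exactly 0 would yield +/-inf.
    out["loudness_energy_ratio"] = out["loudness"] / out["energy"]
    # Median tempo within each genre, broadcast back to every row of that genre.
    genre_median = out.groupby("track_genre")["tempo"].transform("median")
    out["tempo_filled"] = out["tempo"].fillna(genre_median)
    # Artists are separated by semicolons, so count the pieces.
    out["num_artists"] = out["artists"].str.split(";").str.len()
    return out


features = add_engineered_features(toy)
print(features[["track_genre", "tempo_filled", "num_artists", "loudness_energy_ratio"]])
```

If this imputation is used for modeling, computing the genre medians from the training split only would keep test-set information out of the features.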
Before selecting features, we evaluated each one using two methods:
1. Correlation — measures the linear relationship between each feature and popularity.
<iframe src="assets/feature_correlation.html" width="800" height="600" frameborder="0" ></iframe>

Not all features relate to popularity in the same way. Features like release_year, loudness, and energy show positive correlations: songs that are louder, more energetic, or more recently released tend to score higher in popularity. On the other side, instrumentalness and acousticness are negatively correlated: songs that are heavily instrumental or acoustic (think classical or folk) tend to be less popular on Spotify's mainstream charts. It's worth noting that correlation only captures linear relationships, so some features may still be useful even if their correlation looks small.
2. Permutation Importance — measures how much the model's F1 score drops when a feature is shuffled.
<iframe src="assets/permutation_importance.html" width="800" height="600" frameborder="0" ></iframe>

To get a fuller picture of which features actually matter to our model, we used permutation importance by shuffling each feature one at a time and measuring how much the F1 score drops. track_genre stands out as the single most important feature by a wide margin, which makes sense given how dramatically popularity varies across genres (pop at 64% vs classical at 4.9%). release_year and acousticness follow as the next most impactful. Even features like num_artists that showed weak linear correlation still contributed positively, a reminder that tree-based models can pick up on patterns that simple correlation misses.
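One way to compute permutation importance with F1 as the score is scikit-learn's `permutation_importance` helper, sketched below on synthetic data. Whether the project used this helper or a hand-written loop is not stated, so this is an assumption. The feature names `signal` and `noise` and the labeling rule are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each column several times and record the mean drop in F1.
result = permutation_importance(
    model, X_test, y_test, scoring="f1", n_repeats=10, random_state=0
)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

Computing the importances on held-out data, as here, measures how much each feature contributes to generalization rather than to fitting the training set.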
For the final model, we used GridSearchCV with 5-fold cross validation to tune the RandomForestClassifier. We tuned:
• n_estimators — number of trees (too few → underfitting)
• max_depth — depth of each tree (too deep → overfitting)
The best parameters were max_depth=20 and n_estimators=200.
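A grid search of this shape can be sketched as follows. The data is synthetic and the candidate values in `param_grid` are hypothetical: the write-up reports only the winning combination (`max_depth=20`, `n_estimators=200`), not the full grid that was searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with a made-up nonlinear labeling rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 + 0.5 * rng.normal(size=300) > 0.5).astype(int)

# Hypothetical candidate values; only 200 and 20 are taken from the write-up.
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [5, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="f1",  # select on the same metric used for evaluation
    cv=5,          # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

When the forest is a step inside a `Pipeline`, the grid keys need the step name as a prefix, for example `forest__max_depth` for a step named `forest` (a hypothetical name here).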
Our primary evaluation metric is F1 score — all model comparisons are based on F1. Additional metrics are included for reference.
| Metric | Baseline Model | Final Model |
|---|---|---|
| F1-score (primary) | 0.630 | 0.745 |
| Accuracy | 0.739 | 0.820 |
| Precision | 0.686 | 0.808 |
| Recall | 0.582 | 0.691 |
Our F1 score improved from 0.630 to 0.745, confirming that feature engineering and hyperparameter tuning meaningfully improved the model's ability to identify popular tracks. Every other metric also improved, further validating this conclusion.
<iframe src="assets/confusion_matrix.html" width="800" height="600" frameborder="0" ></iframe>

The confusion matrix above shows how our final model performed on the 1,200 test tracks. Out of 457 truly popular tracks, the model correctly identified 317 (69.1% recall) while missing 140. Out of 743 truly non-popular tracks, the model correctly classified 667 (89.8%) while falsely labeling 76 as popular.
The model performs significantly better at identifying non-popular tracks than popular ones. This is expected given the class imbalance: with only 37.7% of tracks labeled popular, the model has seen fewer examples of what makes a track popular during training. Despite this challenge, an F1 score of 0.745 suggests the model has learned meaningful patterns from audio features, genre, and metadata to distinguish popular from non-popular tracks.
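The relationship between the confusion-matrix counts and the summary metrics can be checked with a few lines of arithmetic. The sketch below uses only the four counts quoted above; the variable names are illustrative.

```python
# Counts quoted from the confusion matrix on the 1,200 test tracks.
tp, fn = 317, 140  # truly popular: correctly flagged / missed
tn, fp = 667, 76   # truly non-popular: correctly rejected / falsely flagged

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of the tracks flagged popular, how many are
recall = tp / (tp + fn)         # of the truly popular tracks, how many are found
specificity = tn / (tn + fp)    # of the truly non-popular tracks, how many are rejected
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("specificity", specificity),
    ("f1", f1),
]:
    print(f"{name:>11}: {value:.3f}")
```

Recomputed this way, the values agree with the reported metrics to within about 0.003. The gap between recall and specificity is the asymmetry described above: the model is much better at rejecting non-popular tracks than at finding popular ones.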
Our EDA revealed something striking: popularity varies dramatically across genres. Pop tracks are popular 64% of the time, while classical tracks are popular only 4.9% of the time. This raised an important question: if our model was trained on data where some genres are rarely popular, does it struggle to identify popular tracks within those genres? To investigate, we split tracks into two groups: lower-popularity genres (classical and country) and higher-popularity genres (electronic, hip-hop, metal, and pop), and tested whether our model performs equally well for both.
Groups:
Group X: Lower-popularity genres — classical, country
Group Y: Higher-popularity genres — electronic, hip-hop, metal, pop
Evaluation metric: F1 score
Null hypothesis: The model is fair. Its F1 scores for lower-popularity and higher-popularity genres are roughly the same, and any difference is due to random chance.
Alternative hypothesis: The model is unfair. Its F1 score for lower-popularity genres is lower than for higher-popularity genres.
Test statistic: Difference in F1 scores (lower genres minus higher genres)
Significance level: 0.05
Results: The observed difference in F1 scores was -0.681. After running 1000 permutation trials, the p-value was 0.0. Since 0.0 < 0.05, we reject the null hypothesis.
<iframe src="assets/fairness_permutation.html" width="800" height="600" frameborder="0" ></iframe>

The plot above shows that the observed difference of -0.681 falls far outside the distribution of simulated differences, confirming that this result is not due to random chance (p-value = 0.0). Our model performs significantly worse on classical and country tracks than on pop, hip-hop, metal, and electronic tracks. This is likely due to class imbalance: lower-popularity genres have very few popular tracks in the training data (classical at 4.9%, country at 15.7%), making it harder for the model to learn what makes them popular compared to genres like pop (64.4%).
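A fairness permutation test on the difference in F1 scores can be outlined as below. This is a sketch under assumptions, not the project's code: the labels and predictions are made up so that Group X is clearly worse, the function names are hypothetical, and the one-sided p-value counts shuffles whose difference is at least as negative as the observed one.

```python
import numpy as np


def f1(y_true, y_pred):
    """F1 for 0/1 arrays; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def f1_gap(y_true, y_pred, in_group_x):
    """F1 of Group X minus F1 of Group Y."""
    return f1(y_true[in_group_x], y_pred[in_group_x]) - f1(y_true[~in_group_x], y_pred[~in_group_x])


def fairness_permutation_test(y_true, y_pred, in_group_x, n_trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = f1_gap(y_true, y_pred, in_group_x)
    # Shuffling group membership breaks any link between group and model quality.
    simulated = np.array([
        f1_gap(y_true, y_pred, rng.permutation(in_group_x)) for _ in range(n_trials)
    ])
    # One-sided: how often is the shuffled gap at least as negative as observed?
    return observed, float(np.mean(simulated <= observed))


# Made-up results: Group X (200 tracks) has few positives and is predicted poorly.
y_true = np.array([1] * 20 + [0] * 180 + [1] * 240 + [0] * 160)
y_pred = np.array([1] * 4 + [0] * 16 + [1] * 4 + [0] * 176
                  + [1] * 200 + [0] * 40 + [1] * 30 + [0] * 130)
in_group_x = np.array([True] * 200 + [False] * 400)

observed_gap, p_value = fairness_permutation_test(y_true, y_pred, in_group_x)
print(f"observed F1 gap = {observed_gap:.3f}, p-value = {p_value}")
```

Only the group labels are shuffled. The model's predictions and the true labels stay paired, so each simulated gap reflects what the statistic would look like if group membership were unrelated to model quality.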
Not all songs on Spotify are created equal: some carry an "E" badge indicating explicit content like strong language or adult themes. Since explicit tracks make up a smaller portion of our dataset, we wondered: does our model treat them fairly, or does it struggle to predict their popularity compared to clean tracks? To test this, we ran a permutation test comparing the F1 score of our model on explicit tracks vs non-explicit tracks.
Groups:
Group X: Explicit tracks (explicit=True)
Group Y: Non-explicit tracks (explicit=False)
Null hypothesis: The model is fair. Its F1 scores for explicit and non-explicit tracks are roughly the same, and any difference is due to random chance.
Alternative hypothesis: The model is unfair. Its F1 score for explicit tracks is lower than for non-explicit tracks.
Test statistic: Difference in F1 scores (explicit minus non-explicit)
Significance level: 0.05
Results: The observed difference in F1 scores was -0.017. After running 1000 permutation trials, the p-value was 0.401. Since 0.401 > 0.05, we fail to reject the null hypothesis. The observed statistic falls well within the simulated distribution, suggesting the model performs roughly equally for explicit and non-explicit tracks; any difference is likely due to random chance.
<iframe src="assets/fairness_explicit.html" width="800" height="600" frameborder="0" ></iframe>

The plot above shows the distribution of simulated F1 score differences across 1000 permutation trials. The dashed line represents our observed difference of -0.017, which falls well within the center of the simulated distribution. This visually confirms that the difference between explicit and non-explicit tracks is well within what we'd expect by random chance (p-value = 0.401), supporting our conclusion that the model is fair across these two groups.
In this project, we built a binary classifier to predict whether a Spotify track is popular using its audio features, genre, and metadata. Starting from a simple baseline model using only danceability and track_genre (F1: 0.630), we improved performance significantly through feature engineering, adding more audio features, and hyperparameter tuning, reaching a final F1 score of 0.745.
Our analysis revealed that track_genre is by far the strongest predictor of popularity: pop and hip-hop tracks are dramatically more likely to be popular than classical or country tracks. This finding also surfaced an important limitation: our model performs significantly worse on lower-popularity genres, which is likely a consequence of class imbalance in the training data.
We also attempted to address this genre-based unfairness by setting class_weight='balanced' in our RandomForestClassifier, which instructs the model to pay more attention to underrepresented classes during training. However, this reduced the overall F1 score from 0.745 to 0.727, highlighting a common tension in machine learning: improving fairness for minority groups often comes at the cost of overall model performance.
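For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The plain-Python sketch below applies that formula to a made-up label set with a 37.7% positive rate; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels: 377 popular (1) and 623 not popular (0).
labels = [1] * 377 + [0] * 623
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Under this scheme a popular track weighs about 1.65 times as much as a non-popular one during training. Note that these weights balance the two target classes overall and do not depend on genre, which may be one reason the genre-level gap is hard to close with this setting alone.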
While our model shows promise, popularity on Spotify is influenced by many factors beyond audio features: artist fame, playlist placement, and social trends all play a role that our dataset cannot capture. Future work could explore more sophisticated fairness techniques such as SMOTE oversampling or genre-specific models, as well as incorporating artist-level features or streaming data to better model popularity dynamics.