Popularity of Music Explained

Member 1: Katya Kiryutin, Contribution: 95% (Did not contribute to H)

Member 2: Areeb Malik, Contribution: 90% (Did not contribute to A,E)

Member 3: Arjun Sharma, Contribution: 95% (did not contribute to H)

We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial. Areeb Malik, Katya Kiryutin, Arjun Sharma.

Katya Kiryutin --> Delivered the Chi-Squared test portion of the data analysis. Made the graphs for the ML analysis and assisted with testing the ML algorithm.

Areeb Malik --> Wrote paragraphs describing the flow of the tutorial. Assisted with the T-test analysis and with the graphs.

Arjun Sharma --> Wrote the majority of the ML algorithm and did the Pearson test. Assisted with the graphs and with training the ML data.

Spring 2024 Data Science Project By: Areeb Malik, Arjun Sharma, Katya Kiryutin

1: Introduction

Music is a fascinating thing. It comes in many different forms, and many people have different tastes when it comes to music. Every day, songs gain attention across various platforms—whether it's trending on TikTok, dominating radio waves, or sparking conversations on social media. The abundance of music poses an interesting question. What drives a song's popularity? Given the vast diversity of genres and styles, how do certain tracks resonate so profoundly with audiences worldwide?

Our project seeks to reveal the mysteries behind the trends in popular music. We aim to identify the specific elements that contribute to a song's success. Is it the BPM, the danceability, the genre, or none of them? By analyzing these factors, we hope to discover patterns or perhaps even a 'formula' that could predict a song's popularity.

2: Managing our Data

The data we used is from Kaggle: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset We downloaded it as a CSV to utilize in our analysis.

After importing our libraries, let's load our dataset from Google Drive. Mounting the drive means we don't have to re-upload the dataset every time we open Google Colab.

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
In [17]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive

2.1: Cleaning our Data

First we need to clean our data. Some track_id values appear more than once, so we drop the duplicate rows. For good practice we also call dropna() on the columns we use in later cells, so that missing values do not interfere with our tests and models.

Dropping the duplicates removed 24,259 rows (114,000 before, 89,741 after). Had we kept them, the repeated tracks would have been counted more than once and could have skewed our results toward their attributes.

In [18]:
df_1 = pd.read_csv('/content/drive/MyDrive/CMSC320 Final Project/dataset.csv')
df = df_1.drop_duplicates(subset=['track_id'])
df = df.rename(columns={"Unnamed: 0": "ID"})
df = df.drop(columns=['ID'])
display(len(df_1))
display(len(df))
display("Number of Duplicates: " + str(len(df_1) - len(df)))
display(df)
114000
89741
'Number of Duplicates: 24259'
track_id artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 1 -6.746 0 0.1430 0.0322 0.000001 0.3580 0.7150 87.917 4 acoustic
1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 1 -17.235 1 0.0763 0.9240 0.000006 0.1010 0.2670 77.489 4 acoustic
2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 0 -9.734 1 0.0557 0.2100 0.000000 0.1170 0.1200 76.332 4 acoustic
3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 0 -18.515 1 0.0363 0.9050 0.000071 0.1320 0.1430 181.740 3 acoustic
4 5vjLSffimiIP26QG5WcN2K Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 2 -9.681 1 0.0526 0.4690 0.000000 0.0829 0.1670 119.949 4 acoustic
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113995 2C3TZjDRiAzdyViavDJ217 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Sleep My Little Boy 21 384999 False 0.172 0.2350 5 -16.393 1 0.0422 0.6400 0.928000 0.0863 0.0339 125.995 5 world-music
113996 1hIz5L4IB9hN3WRYPOCGPw Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Water Into Light 22 385000 False 0.174 0.1170 0 -18.318 0 0.0401 0.9940 0.976000 0.1050 0.0350 85.239 4 world-music
113997 6x8ZfSoqDjuNa5SVP5QjvX Cesária Evora Best Of Miss Perfumado 22 271466 False 0.629 0.3290 0 -10.895 0 0.0420 0.8670 0.000000 0.0839 0.7430 132.378 4 world-music
113998 2e6sXL2bYv4bSz6VTdnfLs Michael W. Smith Change Your World Friends 41 283893 False 0.587 0.5060 7 -10.889 1 0.0297 0.3810 0.000000 0.2700 0.4130 135.960 4 world-music
113999 2hETkH7cOfqmz3LqZDHZf5 Cesária Evora Miss Perfumado Barbincor 22 241826 False 0.526 0.4870 1 -10.204 0 0.0725 0.6810 0.000000 0.0893 0.7080 79.198 4 world-music

89741 rows × 20 columns
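The cleaning steps above can be tried on a small example. This is a minimal sketch on a made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the Kaggle file); it shows how `drop_duplicates` and `dropna` change the row count.

```python
import pandas as pd

# Toy stand-in for the Spotify CSV; the values are made up for illustration.
toy = pd.DataFrame({
    "track_id":    ["a1", "a1", "b2", "c3", "d4"],
    "track_name":  ["Song A", "Song A", "Song B", None, "Song D"],
    "popularity":  [73, 73, 55, 40, 82],
    "track_genre": ["acoustic", "pop", "acoustic", "pop", "rock"],
})

# Keep only the first row for each track_id (pandas default: keep="first").
deduped = toy.drop_duplicates(subset=["track_id"])

# Count missing values per column, then drop any row that contains one.
missing_per_column = deduped.isna().sum()
cleaned = deduped.dropna()

print(len(toy), len(deduped), len(cleaned))
print(missing_per_column)
```

Because `drop_duplicates` keeps the first occurrence, a track that appears under two genre labels keeps whichever label comes first in the file, which is worth remembering when reading per-genre results.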

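Now, before we delve deep into analyzing our data, let's take a look at some of the most popular genres! We group the songs by genre, take the mean popularity of each group, and sort from highest to lowest. The reset_index() call turns the result back into a DataFrame so we can set track_genre as its index.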

3: Exploratory analysis: Finding popular and least popular Genres

In [19]:
df_mean = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
df2 = df_mean.reset_index()
df2.set_index('track_genre', inplace=True)
display(df2.head(30))
popularity
track_genre
k-pop 59.358779
pop-film 59.096933
metal 56.422414
chill 53.738683
latino 51.788945
sad 51.109929
grunge 50.587007
indian 49.765348
anime 48.776884
emo 48.500000
reggaeton 48.270270
sertanejo 47.860775
piano 46.608312
progressive-house 46.537748
hard-rock 45.744711
pagode 45.585799
deep-house 45.573045
mandopop 45.071019
british 44.768889
metalcore 44.708914
brazil 44.645678
electronic 44.234940
ambient 44.208208
singer-songwriter 43.592030
acoustic 42.483000
hip-hop 42.429929
pop 41.944712
punk 41.884956
forro 41.831663
world-music 41.536295
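A mean on its own hides how many tracks each genre contributes. As a hedged sketch (the `toy` data below is invented, and `mean_popularity` / `n_tracks` are names chosen here for illustration), the same groupby can report the track count next to the mean:

```python
import pandas as pd

# Made-up popularity scores for three genres.
toy = pd.DataFrame({
    "track_genre": ["k-pop", "k-pop", "metal", "metal", "metal", "jazz"],
    "popularity":  [60, 58, 50, 62, 56, 10],
})

# Named aggregation: one output column per (name=function) pair.
genre_stats = (
    toy.groupby("track_genre")["popularity"]
       .agg(mean_popularity="mean", n_tracks="count")
       .sort_values("mean_popularity", ascending=False)
)
print(genre_stats)
```

Genres with few tracks after deduplication have noisier means, so the count column helps judge how much weight a ranking deserves.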

WOW! We see that k-pop is the most popular genre in this dataset, with pop-film and metal right behind it. This is interesting because, according to an article by YouGov, the 2 most popular genres right now in 2024 are pop and metal. https://business.yougov.com/content/48874-what-are-the-most-popular-music-genres-around-the-world In our dataset metal sits close behind k-pop, but plain pop is further down the list with a mean popularity of about 41.9; it is pop-film, a separate genre label, that takes second place. K-pop has been a rapidly rising genre over the past few years, so seeing it at the top is not a big surprise. Let's see this data in a graph!

In [20]:
plt.figure(figsize=(10, 6))
df2.head(30)['popularity'].plot(kind='bar', color='green')
plt.title('Top 30 Genres by Mean Popularity')
plt.xlabel('Genre')
plt.ylabel('Mean Popularity')
plt.show()

Let's see what the 10 least popular genres are!

In [21]:
display(df2.tail(10))
popularity
track_genre
idm 15.522222
kids 14.770791
grindcore 14.521827
classical 13.362168
chicago-house 12.333667
detroit-techno 11.130753
latin 9.855072
jazz 9.790076
romance 3.549779
iranian 2.224696

Very interesting to see that kids ranks as the 9th least popular genre. 5 of the 10 most viewed YouTube videos are kids' songs, with Baby Shark amassing over 14 billion views, 6 billion more than the next video, Despacito. https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos This could be because most kids do not use Spotify and instead watch YouTube Kids on their parents' phones.

Now let's see the graph for the bottom 10!

In [22]:
plt.figure(figsize=(10, 6))
df2.tail(10)['popularity'].plot(kind='bar', color='green')
plt.title('Bottom 10 Genres by Mean Popularity')
plt.xlabel('Genre')
plt.ylabel('Mean Popularity')
plt.show()

Poor Iranian music :(

3.1: Testing our Data

Now we want to try different testing methods on our data, using different elements that define a song, such as genre, danceability, tempo, etc.

For the three hypothesis tests below, we will use a significance level of $\alpha$ = 0.05

Pearson Test

H0: danceability and popularity are not linearly correlated

HA: danceability and popularity are linearly correlated

Because we want to find what correlates with a song's popularity, we will look at the danceability of the song. We will use the Pearson correlation coefficient and its p-value to decide.

In [23]:
pop_data = df[df['track_genre'] == 'pop'][['danceability', 'popularity']].dropna()
hiphop_data = df[df['track_genre'] == 'hip-hop'][['danceability', 'popularity']].dropna()
print(pop_data)

pop_corr, pop_p = stats.pearsonr(pop_data['danceability'], pop_data['popularity'])

print(f'pop correlation: {pop_corr}, pop p value: {pop_p}')
       danceability  popularity
81000         0.514          91
81004         0.679          90
81006         0.724          74
81009         0.772          76
81012         0.410          90
...             ...         ...
81989         0.599          64
81990         0.668          64
81992         0.766          63
81993         0.741          64
81994         0.876          64

[416 rows x 2 columns]
pop correlation: 0.10307501506514821, pop p value: 0.03558825695481653

For pop music, the p-value of 0.0356 is lower than the significance level of 0.05, so we reject the null hypothesis and conclude that danceability and popularity are correlated. This makes sense, as one would expect pop to be catchy and easy to jam to. The correlation coefficient of 0.103 is positive but small, so the relationship is weak: danceability accounts for only about 1% of the variance in popularity (0.103² ≈ 0.011).
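To make the coefficient less of a black box, here is a small sketch that computes Pearson's r from its definition and checks it against `stats.pearsonr`. The data is synthetic (the arrays `x` and `y` are stand-ins generated here, not the pop subset), so only the method carries over, not the numbers.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins: x plays the role of danceability, y of popularity.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=400)
y = 60 + 5 * x + rng.normal(0, 12, size=400)  # weak linear link plus noise

# Pearson r: covariance divided by the product of the standard deviations.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same quantity from SciPy, together with the two-sided p-value.
r_scipy, p_value = stats.pearsonr(x, y)

print(f"r (manual) = {r_manual:.4f}, r (scipy) = {r_scipy:.4f}, p = {p_value:.4g}")
print(f"r^2 = {r_scipy ** 2:.4f}")

# For the r reported in the text: 0.103 squared is about 0.0106.
r_squared_reported = 0.103 ** 2
print(f"0.103^2 = {r_squared_reported:.4f}")
```

A small p-value only says the correlation is unlikely to be exactly zero; r² is what indicates how much of the spread a straight line captures.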

In [24]:
plt.scatter(pop_data['danceability'], pop_data['popularity'])
plt.title('Effect of Danceability on Popularity for Pop Music')
plt.xlabel('Danceability')
plt.ylabel('Popularity')
Out[24]:
Text(0, 0.5, 'Popularity')

T-Test

H0: there is no difference in mean liveness (or in mean popularity) between pop and hip hop music

HA: there is a difference in mean liveness (or in mean popularity) between pop and hip hop music

We use a T-test to compare the means of the two groups. Because we pass both the liveness and popularity columns, the test is run once per column.

In [30]:
# Select the two columns of interest for each genre and drop missing values.
pop_data = df[df['track_genre'] == 'pop'][['liveness', 'popularity']].dropna()
hiphop_data = df[df['track_genre'] == 'hip-hop'][['liveness', 'popularity']].dropna()

# Welch's t-test (equal_var=False), run column by column:
# the first statistic/p-value is for liveness, the second for popularity.
print(stats.ttest_ind(pop_data, hiphop_data, equal_var=False))

plt.scatter(pop_data['liveness'], pop_data['popularity'], label="Pop")
plt.scatter(hiphop_data['liveness'], hiphop_data['popularity'], label="Hip-Hop")
plt.title('Effect of Liveness on Popularity for Pop and Hip-Hop Music')
plt.legend()
plt.xlabel('Liveness')
plt.ylabel('Popularity')
TtestResult(statistic=array([-3.9950668 , -0.23588592]), pvalue=array([6.89804197e-05, 8.13586897e-01]), df=array([1099.34611991,  735.13326543]))
Out[30]:
Text(0, 0.5, 'Popularity')

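The test returns one p-value per column. For liveness, p ≈ 6.9e-05 is lower than the significance level of 0.05, so we conclude that mean liveness differs between pop and hip hop (the negative statistic means pop has the lower mean). For popularity, p ≈ 0.81 is higher than 0.05, so we find no significant difference in mean popularity between the two genres. Although the graph shows two clusters of popularity, in both of them the more popular songs tend to sit at the lower end of the liveness scale.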
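The following sketch shows Welch's t-test on a single column, so that exactly one statistic and one p-value come back. The two arrays are randomly generated stand-ins for the liveness values of each genre (their means and spreads are assumptions), and the manual formula is included only to show what the statistic measures.

```python
import numpy as np
import scipy.stats as stats

# Made-up liveness samples for two genres with unequal sizes and variances.
rng = np.random.default_rng(1)
pop_liveness = rng.normal(0.17, 0.10, size=400)
hiphop_liveness = rng.normal(0.21, 0.15, size=600)

# Welch's t-test: does not assume the two groups share a variance.
result = stats.ttest_ind(pop_liveness, hiphop_liveness, equal_var=False)

# The same statistic by hand: difference in means over its standard error.
n1, n2 = len(pop_liveness), len(hiphop_liveness)
v1, v2 = pop_liveness.var(ddof=1), hiphop_liveness.var(ddof=1)
t_manual = (pop_liveness.mean() - hiphop_liveness.mean()) / np.sqrt(v1 / n1 + v2 / n2)

print(f"t = {result.statistic:.3f} (manual {t_manual:.3f}), p = {result.pvalue:.4g}")
```

The sign of t follows the order of the arguments: a negative value means the first group has the smaller mean.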

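One-way ANOVA

Tempo is a continuous variable, so we compare its mean across genres with a one-way ANOVA (stats.f_oneway).

H0: mean tempo is the same for pop, rock, and jazz

HA: mean tempo differs for at least one of the three genres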

In [26]:
pop = df[df['track_genre'] == 'pop']['tempo'].dropna()
rock = df[df['track_genre'] == 'rock']['tempo'].dropna()
jazz = df[df['track_genre'] == 'jazz']['tempo'].dropna()
# Rows for all three genres combined (only used by the commented-out scatter plot below).
allthree = df[df['track_genre'].isin(['pop', 'rock', 'jazz'])]

print(stats.f_oneway(pop, rock, jazz))

# plt.scatter(allthree['track_genre'], allthree['tempo'])
# plt.xlabel('Genres')
# plt.ylabel('Tempo')
plt.hist(df[df['track_genre'] == 'jazz']['tempo'], label="Jazz")
plt.hist(df[df['track_genre'] == 'pop']['tempo'], label="Pop")
plt.hist(df[df['track_genre'] == 'rock']['tempo'], label="Rock")
plt.legend()
plt.title('Distribution of Tempo for Jazz, Pop, and Rock')
plt.xlabel('Tempo')
plt.ylabel('Number of occurrences')
F_onewayResult(statistic=5.031362739035586, pvalue=0.006659749277153722)
Out[26]:
Text(0, 0.5, 'Number of occurrences')

The p-value of 0.0067 is lower than the significance level of 0.05, so we reject the null hypothesis: mean tempo differs for at least one of the three genres. We can see in the graph that jazz has many more songs at its peak tempo bar, while pop and rock have a more even spread with most of their songs at the lower end of their tempo ranges.
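The cell above calls `stats.f_oneway`, which compares mean tempo across the genres. If a chi-squared test of independence is wanted instead, tempo first has to be turned into categories. The sketch below does that on synthetic data; the bin edges (100 and 130 BPM), the `tempo_band` column, and the generated tempos are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic tempos for three genres, 200 tracks each.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "track_genre": np.repeat(["pop", "rock", "jazz"], 200),
    "tempo": np.concatenate([
        rng.normal(120, 25, 200),
        rng.normal(125, 25, 200),
        rng.normal(110, 30, 200),
    ]),
})

# Bin the continuous tempo into three bands (edges chosen arbitrarily here).
toy["tempo_band"] = pd.cut(
    toy["tempo"],
    bins=[-np.inf, 100, 130, np.inf],
    labels=["slow", "medium", "fast"],
)

# Contingency table of genre x tempo band, then the chi-squared test on it.
table = pd.crosstab(toy["track_genre"], toy["tempo_band"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4g}")
```

The result of a binned test depends on where the edges are placed, which is one reason to prefer ANOVA when the variable is naturally continuous.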

4: Primary Analysis and Visualization

We used linear regression to determine whether there were any relationships between the columns that could be explored.

In [27]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


df_pt = df[['track_genre', 'popularity', 'tempo']]
df_filtered = df_pt[df_pt['track_genre'] == 'k-pop']
df_pt_filtered = df_filtered[['popularity', 'tempo']]
# df_encoded = pd.get_dummies(df_p, columns=['track_genre'])
# df_encoded

X = df_pt_filtered.drop('popularity', axis=1)
y = df_pt_filtered['popularity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

plt.scatter(X_train, y_train)
plt.title('Effect of Tempo on Popularity of K-pop')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

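# Fit on the training split and predict it back (in-sample predictions).
y_train_pred = LinearRegression().fit(X_train, y_train).predict(X_train)
# Note: this fits a second model on the test split and predicts that same split,
# so the "test" metrics below are in-sample too, not a held-out evaluation.
y_test_pred = LinearRegression().fit(X_test, y_test).predict(X_test)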

plt.scatter(X_train, y_train)
plt.plot(X_train, y_train_pred, color='green', label='Linear Regression')
plt.title('Effect of Tempo on Popularity of K-pop With Linear Regression')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

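# cross_val_score uses the estimator's default score, which is R^2 for LinearRegression.
# StratifiedKFold treats each integer popularity value as a class label, which triggers
# the "least populated class" warning in the output; KFold is the usual splitter for regression.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=skf)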

print(f"Cross-validation accuracy (mean): {cv_scores.mean()}")
print(f"Cross-validation accuracy (std): {cv_scores.std()}\n")

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"MSE Train: {mse_train}")
print(f"MSE Test: {mse_test}")
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")


# using a linear regression, we can see that there is not much of a correlation
# between popularity and tempo, as shown with k-pop as an example
Cross-validation accuracy (mean): -0.0011768987242749551
Cross-validation accuracy (std): 0.003464192059949629

MSE Train: 133.5647178732994
MSE Test: 192.24413045740053
R2 Train: 0.0014804277957974898
R2 Test: 0.008167105607016056
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_split.py:700: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
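For comparison, here is a hedged sketch of a held-out evaluation: one model fitted on the training rows only, scored on the untouched test rows, with `KFold` and an explicit `scoring="r2"` for cross-validation. The data is synthetic (a nearly flat relation between a tempo-like feature and a popularity-like target), so the printed numbers say nothing about the Spotify dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-ins: X plays the role of tempo, y of popularity.
rng = np.random.default_rng(3)
X = rng.uniform(60, 200, size=(500, 1))
y = 55 + 0.01 * X[:, 0] + rng.normal(0, 12, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit once, on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# KFold is the standard splitter for a continuous target.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=kf, scoring="r2")

print(f"CV R2 (mean): {cv_r2.mean():.4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.2f}")
print(f"Test R2: {r2_score(y_test, y_test_pred):.4f}")
```

With a held-out split, R² can come out negative when the model predicts worse than simply guessing the mean, which is itself a useful signal that the feature carries little information.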
In [28]:
df_pd = df[['track_genre', 'popularity', 'danceability']]
df_filtered = df_pd[df_pd['track_genre'] == 'k-pop']
df_pd_filtered = df_filtered[['popularity', 'danceability']]

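# Note: X and y are built from df_pt_filtered (tempo), not df_pd_filtered,
# so this cell repeats the tempo fit and its metrics match the previous cell.
X = df_pt_filtered.drop('popularity', axis=1)
y = df_pt_filtered['popularity']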

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

plt.scatter(X_train, y_train)
plt.title('Effect of Danceability on Popularity for K-pop')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

y_train_pred = LinearRegression().fit(X_train, y_train).predict(X_train)
y_test_pred = LinearRegression().fit(X_test, y_test).predict(X_test)

plt.scatter(X_train, y_train)
plt.plot(X_train, y_train_pred, color='green', label='Linear Regression')
plt.title('Effect of Danceability on Popularity for K-pop With Linear Regression')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=skf)

print(f"Cross-validation accuracy (mean): {cv_scores.mean()}")
print(f"Cross-validation accuracy (std): {cv_scores.std()}\n")

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"MSE Train: {mse_train}")
print(f"MSE Test: {mse_test}")
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")
Cross-validation accuracy (mean): -0.0011768987242749551
Cross-validation accuracy (std): 0.003464192059949629

MSE Train: 133.5647178732994
MSE Test: 192.24413045740053
R2 Train: 0.0014804277957974898
R2 Test: 0.008167105607016056
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_split.py:700: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
In [29]:
df_pd = df[['popularity', 'danceability']]
df_pt = df[['popularity', 'tempo']]

X1 = df_pd.drop('popularity', axis=1)
y1 = df_pd['popularity']

X2 = df_pt.drop('popularity', axis=1)
y2 = df_pt['popularity']

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)

plt.scatter(X1_train, y1_train)
plt.title('Effect of Danceability on Popularity with all genres')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

plt.scatter(X2_train, y2_train)
plt.title('Effect of Tempo on Popularity with all genres')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()

5: Insights and Conclusions

So far, these results are not very promising. After fitting the models, the cross-validation scores and the train and test R2 values are all close to zero, while the mean squared errors are large (about 134 on the training split and 192 on the test split). For k-pop, the most popular genre, this means that tempo explains almost none of the variation in popularity. The danceability cell for k-pop reports identical numbers because it reuses the tempo features (df_pt_filtered), so our evidence on danceability rests mainly on the Pearson test for pop, where the correlation was weak (r ≈ 0.103). In general this is due to the wide spread of values found in these columns, as seen in the final plots. Our final conclusion is that there is no reliable way to determine a song's popularity from its tempo or danceability alone, nor from the other factors we attempted making graphs for. The range of possibilities for songs is too broad to make accurate predictions of popularity, though we were able to analyze other relationships, such as how tempo differs across genres.