Member 1: Katya Kiryutin, Contribution: 95% (Did not contribute to H)
Member 2: Areeb Malik, Contribution: 90% (Did not contribute to A, E)
Member 3: Arjun Sharma, Contribution: 95% (Did not contribute to H)
We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial. Areeb Malik, Katya Kiryutin, Arjun Sharma.
Katya Kiryutin --> Delivered the Chi-Squared test portion of the data analysis. She made the graphs for the ML analysis and assisted with testing the ML algorithm.
Areeb Malik --> Wrote paragraphs describing the flow of the tutorial. Assisted with the T-test analysis and also assisted with graphs.
Arjun Sharma --> Wrote the majority of the ML algorithm and did the Pearson test. Assisted with graphs and training ML data.
Music is a fascinating thing. It comes in many different forms, and many people have different tastes when it comes to music. Every day, songs gain attention across various platforms—whether it's trending on TikTok, dominating radio waves, or sparking conversations on social media. The abundance of music poses an interesting question. What drives a song's popularity? Given the vast diversity of genres and styles, how do certain tracks resonate so profoundly with audiences worldwide?
Our project seeks to reveal the mysteries behind the trends in popular music. We aim to identify the specific elements that contribute to a song's success. Is it the BPM, the danceability, the genre, or none of them? By analyzing these factors, we hope to discover patterns or perhaps even a 'formula' that could predict a song's popularity.
The data we used is from Kaggle: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset We downloaded it as a CSV to utilize in our analysis.
After importing our libraries, let's load our dataset from Google Drive. Mounting the drive means we don't have to re-upload the dataset every time we open Google Colab.
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
First we need to clean our data. We have duplicate track_id values that we need to get rid of. For good practice we also call dropna() on the columns we select in each analysis below, so that missing values cannot interfere with our tests and models later on.
Dropping the duplicates removes 24,259 rows, as the counts below show. If we had kept them, songs that appear more than once would have been counted several times, which could have skewed our results toward the attributes of those songs.
df_1 = pd.read_csv('/content/drive/MyDrive/CMSC320 Final Project/dataset.csv')
df = df_1.drop_duplicates(subset=['track_id'])
df = df.rename(columns={"Unnamed: 0": "ID"})
df = df.drop(columns=['ID'])
display(len(df_1))
display(len(df))
display("Number of Duplicates: " + str(len(df_1) - len(df)))
display(df)
114000
89741
'Number of Duplicates: 24259'
| track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | 1 | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | 1 | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | 0 | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | 2 | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | 2C3TZjDRiAzdyViavDJ217 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | 5 | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | 1hIz5L4IB9hN3WRYPOCGPw | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | 0 | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | 6x8ZfSoqDjuNa5SVP5QjvX | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | 0 | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | 2e6sXL2bYv4bSz6VTdnfLs | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | 7 | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | 2hETkH7cOfqmz3LqZDHZf5 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | 1 | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
89741 rows × 20 columns
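To make the deduplication step concrete, here is a minimal sketch on a few made-up rows (the track IDs, genres, and popularity values are hypothetical, not taken from the dataset). It shows that drop_duplicates(subset=['track_id']) keeps the first row it sees for each track_id, so a track listed under several genres keeps only the first genre label.

```python
import pandas as pd

# Hypothetical toy rows: track "t1" is listed under two genres.
toy = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t3"],
    "track_genre": ["pop", "dance", "rock", "jazz"],
    "popularity": [80, 80, 55, 10],
})

# keep="first" is the default: the first row seen for each track_id survives.
deduped = toy.drop_duplicates(subset=["track_id"])

print(len(toy) - len(deduped))          # number of rows dropped
print(deduped["track_genre"].tolist())  # genre labels that remain
```

This matters for the per-genre averages later on: the genre a deduplicated track is counted under depends on row order in the CSV.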
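Now before we delve deep into analyzing our data, let's just take a look at some of the most popular genres! We compute the mean popularity of each genre and sort it from highest to lowest. We then apply the reset_index() function to turn the result into a DataFrame and set track_genre back as its index so the genre names label each row.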
df_mean = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
df2 = df_mean.reset_index()
df2.set_index('track_genre', inplace=True)
display(df2.head(30))
track_genre | popularity |
---|---|
k-pop | 59.358779 |
pop-film | 59.096933 |
metal | 56.422414 |
chill | 53.738683 |
latino | 51.788945 |
sad | 51.109929 |
grunge | 50.587007 |
indian | 49.765348 |
anime | 48.776884 |
emo | 48.500000 |
reggaeton | 48.270270 |
sertanejo | 47.860775 |
piano | 46.608312 |
progressive-house | 46.537748 |
hard-rock | 45.744711 |
pagode | 45.585799 |
deep-house | 45.573045 |
mandopop | 45.071019 |
british | 44.768889 |
metalcore | 44.708914 |
brazil | 44.645678 |
electronic | 44.234940 |
ambient | 44.208208 |
singer-songwriter | 43.592030 |
acoustic | 42.483000 |
hip-hop | 42.429929 |
pop | 41.944712 |
punk | 41.884956 |
forro | 41.831663 |
world-music | 41.536295 |
WOW! We see that k-pop is the most popular genre in this dataset, with pop-film and metal right behind it. This is interesting because, according to an article by YouGov, the 2 most popular genres right now in 2024 are pop and metal. https://business.yougov.com/content/48874-what-are-the-most-popular-music-genres-around-the-world In our dataset metal is indeed close behind k-pop, but plain pop only ranks 27th by mean popularity (41.9); second place goes to pop-film, which is a separate genre label. Still, it would be no surprise if k-pop overtook pop and metal, as it has been a rapidly rising genre over the past few years. Let's see this data in a graph!
plt.figure(figsize=(10, 6))
df2.head(30)['popularity'].plot(kind='bar', color='green')
plt.title('Top 30 Genres by Mean Popularity')
plt.xlabel('Genre')
plt.ylabel('Mean Popularity')
plt.show()
Let's see what the 10 least popular genres are!
display(df2.tail(10))
track_genre | popularity |
---|---|
idm | 15.522222 |
kids | 14.770791 |
grindcore | 14.521827 |
classical | 13.362168 |
chicago-house | 12.333667 |
detroit-techno | 11.130753 |
latin | 9.855072 |
jazz | 9.790076 |
romance | 3.549779 |
iranian | 2.224696 |
Very interesting to see that kids is ranked as the 9th least popular genre. 5 of the 10 most viewed YouTube videos are kids' songs, with Baby Shark amassing over 14 billion views, 6 billion more than the next video, which is Despacito. https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos This could be because most kids do not use Spotify and instead watch YouTube Kids on their parents' phones.
Now let's see the graph for the bottom 10!
plt.figure(figsize=(10, 6))
df2.tail(10)['popularity'].plot(kind='bar', color='green')
plt.title('Bottom 10 Genres by Mean Popularity')
plt.xlabel('Genre')
plt.ylabel('Mean Popularity')
plt.show()
Poor Iranian music :(
Now we want to run different statistical tests on our data, using the elements that define a song, such as its genre, how danceable it is, its tempo, etc.
For the three hypothesis tests below, we will use a significance level of $\alpha$ = 0.05
Pearson Test
H0: there is no linear correlation between danceability and popularity
HA: there is a linear correlation between danceability and popularity
Because we want to find what makes a song popular, we will start with the danceability of pop songs. We will use the correlation coefficient and its p-value to decide between the two hypotheses.
pop_data = df[df['track_genre'] == 'pop'][['danceability', 'popularity']].dropna()
hiphop_data = df[df['track_genre'] == 'hip-hop'][['danceability', 'popularity']].dropna()
print(pop_data)
pop_corr, pop_p = stats.pearsonr(pop_data['danceability'], pop_data['popularity'])
print(f'pop correlation: {pop_corr}, pop p value: {pop_p}')
       danceability  popularity
81000         0.514          91
81004         0.679          90
81006         0.724          74
81009         0.772          76
81012         0.410          90
...             ...         ...
81989         0.599          64
81990         0.668          64
81992         0.766          63
81993         0.741          64
81994         0.876          64

[416 rows x 2 columns]
pop correlation: 0.10307501506514821, pop p value: 0.03558825695481653
For pop music, the p-value of 0.0356 is lower than the significance level of 0.05, so we reject the null hypothesis and conclude that danceability and popularity are correlated. This makes sense, as one would expect pop to be catchy and easy to jam to. The correlation coefficient of 0.103 is positive but small, so the relationship is weak: danceability alone accounts for only about 1% of the variation in popularity (0.103 squared is roughly 0.011). A correlation also does not tell us that danceability causes popularity.
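To show what the correlation coefficient measures, here is a minimal from-scratch sketch of the sample Pearson formula in plain Python. The helper name pearson_r and the toy values are our own assumptions for illustration; stats.pearsonr additionally returns the p-value, which this sketch does not compute.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the dataset.
danceability = [0.2, 0.4, 0.6, 0.8]
popularity = [30, 50, 45, 70]
r = pearson_r(danceability, popularity)
print(round(r, 3))
```

The coefficient is always between -1 and 1; values near 0, like the 0.103 found for pop, mean that a straight line describes the relationship poorly.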
plt.scatter(pop_data['danceability'], pop_data['popularity'])
plt.title('Effect of Danceability on Popularity for Pop Music')
plt.xlabel('Danceability')
plt.ylabel('Popularity')
T-Test
H0: for each column tested (liveness and popularity), pop and hip hop songs have the same mean
HA: for that column, pop and hip hop songs have different means
We pass both the liveness and popularity columns to a two-sample T-test, which compares the means of the two genres for each column separately. Note that this compares the genres with each other; on its own it does not measure how liveness relates to popularity.
# Select liveness and popularity for each genre, dropping rows with missing values
pop_data = df[df['track_genre'] == 'pop'][['liveness', 'popularity']].dropna()
hiphop_data = df[df['track_genre'] == 'hip-hop'][['liveness', 'popularity']].dropna()
# Welch's t-test (equal_var=False), run column by column:
# the first statistic compares mean liveness, the second compares mean popularity
print(stats.ttest_ind(pop_data, hiphop_data, equal_var=False))
plt.scatter(pop_data['liveness'], pop_data['popularity'], label="Pop")
plt.scatter(hiphop_data['liveness'], hiphop_data['popularity'], label="Hip-Hop")
plt.title('Effect of Liveness on Popularity for Pop and Hip-Hop Music')
plt.legend()
plt.xlabel('Liveness')
plt.ylabel('Popularity')
TtestResult(statistic=array([-3.9950668 , -0.23588592]), pvalue=array([6.89804197e-05, 8.13586897e-01]), df=array([1099.34611991, 735.13326543]))
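The test returns one result per column. The first p-value (liveness, about 0.00007) is lower than the significance level of 0.05, so we reject the null hypothesis for liveness: pop and hip hop differ in mean liveness, and the negative statistic tells us that pop has the lower mean. The second p-value (popularity, about 0.81) is well above 0.05, so we fail to reject the null hypothesis for popularity: we have no evidence that the two genres differ in mean popularity. In the graph, both genres form two clusters of popularity, and in both of them the more popular songs tend to sit at the lower end of the liveness scale.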
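For readers curious what the test statistic is, here is a small plain-Python sketch of Welch's t statistic, the variant selected by equal_var=False, together with its Welch-Satterthwaite degrees of freedom. The function name welch_t and the liveness values are hypothetical; turning the statistic into a p-value needs the t distribution, which we leave to scipy.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical liveness values for two genres.
pop_liveness = [0.10, 0.12, 0.15, 0.11, 0.30]
hiphop_liveness = [0.20, 0.35, 0.18, 0.40, 0.25]
t_stat, dof = welch_t(pop_liveness, hiphop_liveness)
print(t_stat, dof)
```

The sign of the statistic follows the order of the arguments: it is negative when the first sample has the smaller mean.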
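ANOVA test
Tempo is a continuous variable, so instead of a chi-squared test (which compares two categorical variables) we compare the genres with a one-way ANOVA, using scipy's f_oneway.
H0: pop, rock, and jazz songs all have the same mean tempo
HA: at least one of the three genres has a different mean tempo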
pop = df[df['track_genre'] == 'pop']['tempo'].dropna()
rock = df[df['track_genre'] == 'rock']['tempo'].dropna()
jazz = df[df['track_genre'] == 'jazz']['tempo'].dropna()
# Rows for all three genres (only needed for the commented-out scatter plot below)
allthree = df[df['track_genre'].isin(['pop', 'rock', 'jazz'])]
print(stats.f_oneway(pop, rock, jazz))
# plt.scatter(allthree['track_genre'], allthree['tempo'])
# plt.xlabel('Genres')
# plt.ylabel('Tempo')
plt.hist(df[df['track_genre'] == 'jazz']['tempo'], label="Jazz")
plt.hist(df[df['track_genre'] == 'pop']['tempo'], label="Pop")
plt.hist(df[df['track_genre'] == 'rock']['tempo'], label="Rock")
plt.legend()
plt.title('Distribution of Tempo for Jazz, Pop, and Rock')
plt.xlabel('Tempo')
plt.ylabel('Number of occurrences')
F_onewayResult(statistic=5.031362739035586, pvalue=0.006659749277153722)
The p-value (about 0.0067) is lower than the significance level of 0.05, so we reject the null hypothesis: mean tempo is not the same across pop, rock, and jazz. We can see in the graph that jazz has many more songs in its peak tempo bar, while pop and rock have a more even spread, with most of their songs at the lower end of their tempo ranges.
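The F statistic reported by f_oneway is the ratio of the variation between the genre means to the variation within each genre. A minimal plain-Python sketch is below; the function name f_statistic and the tempo values are made up for illustration, and the p-value (which needs the F distribution) is again left to scipy.

```python
def f_statistic(*groups):
    """One-way ANOVA F statistic for two or more groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each value around its own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical tempos (BPM) for three genres.
pop_tempo = [118, 120, 124]
rock_tempo = [125, 130, 128]
jazz_tempo = [100, 140, 90]
print(f_statistic(pop_tempo, rock_tempo, jazz_tempo))
```

A large F means the genre means are far apart relative to the spread inside each genre, which is what leads to a small p-value.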
We used a linear regression to check whether any of the columns has a relationship with popularity that could be explored further.
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
df_pt = df[['track_genre', 'popularity', 'tempo']]
df_filtered = df_pt[df_pt['track_genre'] == 'k-pop']
df_pt_filtered = df_filtered[['popularity', 'tempo']]
# df_encoded = pd.get_dummies(df_p, columns=['track_genre'])
# df_encoded
X = df_pt_filtered.drop('popularity', axis=1)
y = df_pt_filtered['popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
plt.scatter(X_train, y_train)
plt.title('Effect of Tempo on Popularity of K-pop')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()
# Fit once on the training split only, then predict on both splits
model = LinearRegression().fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
plt.scatter(X_train, y_train)
plt.plot(X_train, y_train_pred, color='green', label='Linear Regression')
plt.title('Effect of Tempo on Popularity of K-pop With Linear Regression')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.legend()
plt.tight_layout()
plt.show()
# Popularity is a continuous target, so use plain KFold rather than StratifiedKFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# For a regressor, cross_val_score reports R2 by default (not accuracy).
# Ordinary least squares gives the same R2 with or without feature scaling,
# so the unscaled training data is used directly.
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=kf)
print(f"Cross-validation R2 (mean): {cv_scores.mean()}")
print(f"Cross-validation R2 (std): {cv_scores.std()}\n")
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"MSE Train: {mse_train}")
print(f"MSE Test: {mse_test}")
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")
# using a linear regression, we can see that there is not much of a correlation
# between popularity and tempo, as shown with k-pop as an example
MSE Train: 133.5647178732994
R2 Train: 0.0014804277957974898
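To make the regression metrics less of a black box, here is a small plain-Python sketch of a one-feature least-squares fit with MSE and R2 computed by hand. The helper names and the toy numbers are hypothetical. The line is fitted on the training values only and then scored on separate held-out values, which is what makes a test score meaningful.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mse(ys, preds):
    """Mean squared error between true values and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def r2(ys, preds):
    """Coefficient of determination: 1 - residual SS / total SS."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical toy data: fit on the training values, score on held-out values.
x_tr, y_tr = [1, 2, 3, 4], [2, 4, 6, 8]
x_te, y_te = [5, 6], [10, 13]
slope, intercept = fit_line(x_tr, y_tr)
test_preds = [slope * x + intercept for x in x_te]
print(mse(y_te, test_preds), r2(y_te, test_preds))

# Taking the square root puts an MSE back on the popularity scale.
rmse = math.sqrt(133.5647178732994)
print(round(rmse, 2))
```

An R2 near 0 means the fitted line predicts no better than always guessing the mean popularity, and the square root of the training MSE above shows that a typical error is on the order of 11 to 12 popularity points.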
df_pd = df[['track_genre', 'popularity', 'danceability']]
df_filtered = df_pd[df_pd['track_genre'] == 'k-pop']
df_pd_filtered = df_filtered[['popularity', 'danceability']]
# Use the danceability frame here (not the tempo frame from the previous cell)
X = df_pd_filtered.drop('popularity', axis=1)
y = df_pd_filtered['popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
plt.scatter(X_train, y_train)
plt.title('Effect of Danceability on Popularity for K-pop')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()
# Fit once on the training split only, then predict on both splits
model = LinearRegression().fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
plt.scatter(X_train, y_train)
plt.plot(X_train, y_train_pred, color='green', label='Linear Regression')
plt.title('Effect of Danceability on Popularity for K-pop With Linear Regression')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.legend()
plt.tight_layout()
plt.show()
# Popularity is a continuous target, so use plain KFold rather than StratifiedKFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# For a regressor, cross_val_score reports R2 by default (not accuracy)
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=kf)
print(f"Cross-validation R2 (mean): {cv_scores.mean()}")
print(f"Cross-validation R2 (std): {cv_scores.std()}\n")
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"MSE Train: {mse_train}")
print(f"MSE Test: {mse_test}")
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")
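Cross-validation for a continuous target such as popularity is usually done with plain k-fold splits, since stratification is designed for class labels. As a rough illustration of the idea, here is a plain-Python sketch that shuffles the row indices and deals them into folds. The function kfold_indices is our own hypothetical helper, and its fold assignment is not the same as scikit-learn's; it only shows that every row is held out exactly once.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

# Each of the 10 rows lands in exactly one test fold.
for train_idx, test_idx in kfold_indices(10, n_splits=5):
    print(len(train_idx), len(test_idx))
```

The model is refitted on each training part and scored on the matching held-out part, and the mean of those scores is what cross_val_score summarizes.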
df_pd = df[['popularity', 'danceability']]
df_pt = df[['popularity', 'tempo']]
X1 = df_pd.drop('popularity', axis=1)
y1 = df_pd['popularity']
X2 = df_pt.drop('popularity', axis=1)
y2 = df_pt['popularity']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)
plt.scatter(X1_train, y1_train)
plt.title('Effect of Danceability on Popularity with all genres')
plt.xlabel('danceability')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()
plt.scatter(X2_train, y2_train)
plt.title('Effect of Tempo on Popularity with all genres')
plt.xlabel('tempo')
plt.ylabel('popularity')
plt.tight_layout()
plt.show()
So far, these results are not very promising. For k-pop, the most popular genre, the tempo regression has a training R2 of about 0.001, so tempo explains almost none of the variation in popularity, and the training MSE of about 134 corresponds to a typical error of roughly 11.6 points on Spotify's 0 to 100 popularity scale. Danceability looks no better: the Pearson test for pop found only a weak correlation (about 0.10), and the all-genre scatter plots of danceability and tempo against popularity show points spread across the whole range, as seen in the final plots.
Our final conclusion is that there is not a reliable way to determine a song's popularity based on its tempo or danceability alone, nor on the other factors we attempted making graphs for. The range of possibilities for songs is too broad to make accurate predictions for popularity from a single feature, though we were able to analyze other questions, such as comparing tempo across genres.