Predicting the Future (of Music)

Taylor Fogarty, Peter Worcester, Lata Goudel

May 24

In the era of Big Data, we have an unprecedented advantage when it comes to making predictions.

With websites like Kaggle.com and OurWorldinData.org, we have endless data at our fingertips: data we can use to predict things like who will win the NCAA tournament in 2020, how many avocados will be sold globally next month, or the crime rate in Chicago 10 years from now.

Of course, these are all still predictions, but they can’t be made without appropriate data.

For instance, we spent a beautiful summer day testing to see if you can predict a song’s placement on Spotify’s Top 100 List based on a given set of metrics.

From Kaggle.com, we obtained this list for 2018, which included 13 variables such as tempo, duration, acousticness, and valence (positivity) for each song, as well as identifiers (track name, track ID, and artist name).

For our purposes, we ignored the identifiers in order to focus on the numeric qualities of the songs since we already know that Post Malone releases the greatest hits of this generation.

Our main goal: find the most important factors in determining what makes one song more popular than another so we can predict how popular a new song may be.

To do this, we used three families of metrics for evaluating different linear models.

For the sake of simplicity, we did not include higher order or interaction terms, so there very well may be more to this than the variables we’re given (spoiler alert: there definitely is).

Error Analysis

Once a model has been developed, the next step is to see how well it performs.

To measure the success of the model, data scientists have developed error metrics to judge just how well the predicted and actual values match up.

The differences between the actual and predicted values are called the residuals: essentially, a measure of how far the data falls from the fitted regression line we’re using to predict.

While we can inspect the residual for each data point to measure the usefulness of the model, this becomes particularly tedious with large data sets.

There are many statistics that summarize the residuals, but we will stick to Mean Squared Error, Mean Absolute Error, Root Mean Square Error, and Mean Absolute Percentage Error.

To explain these, let’s take a look at our reduced model, found with a stepwise selection procedure*:

Popularity = loudness + liveness + tempo + key

The Mean Squared Error and Root Mean Squared Error are by far the most commonly used error metrics.

MSE is the average of squared residuals and is often referred to as the variance of a model.

The RMSE is the square root of the MSE, or the standard deviation of the model, and is most useful for comparing variation between different models since the units of RMSE are the same as the target variable.

Both can be easily found using the scikit-learn library. The MSE and RMSE for this model are 2196.52 and 46.87, respectively.
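As a minimal sketch of that calculation (the arrays here are placeholder values standing in for the real popularity scores and model predictions):

```python
# Sketch: MSE and RMSE via scikit-learn. The arrays are placeholders
# standing in for the actual and predicted popularity values.
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([100, 95, 80, 60, 30])
y_predicted = np.array([70, 88, 55, 65, 45])

mse = mean_squared_error(y_actual, y_predicted)  # mean of squared residuals
rmse = np.sqrt(mse)                              # same units as the target
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```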

This means that our predictions are far off on average, especially when you remember that the target only ranges from 1 to 100.

However, if you look at the residual graph below, while the predicted popularities differ considerably from the actual placements on the list, there also appears to be an outlier, or at least an unexpected value.

We should keep in mind that MSE and RMSE can be heavily affected by outliers, since the residuals are squared, so it is worth looking at the other metrics as well.

We can also use the Mean Absolute Error, which is less sensitive to outliers because it takes the absolute value of the residuals instead of squaring them.

This can also be reported as the Mean Absolute Percentage Error, the percentage equivalent.

MAE can also be found very easily using scikit-learn. Again, this value is high considering the data we have, so we can conclude that the model is struggling to accurately predict a song’s popularity given these parameters.
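A sketch of both, again with placeholder arrays (scikit-learn versions from 0.24 onward also ship a `mean_absolute_percentage_error`, but the percentage form is simple enough to compute by hand):

```python
# Sketch: MAE via scikit-learn, MAPE computed by hand.
# Placeholder arrays stand in for actual and predicted popularity.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_actual = np.array([100, 95, 80, 60, 30])
y_predicted = np.array([70, 88, 55, 65, 45])

mae = mean_absolute_error(y_actual, y_predicted)
# MAPE: mean of |residual / actual|, expressed as a percentage
mape = np.mean(np.abs((y_actual - y_predicted) / y_actual)) * 100
print(f"MAE: {mae:.2f}, MAPE: {mape:.1f}%")
```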

However, we still want to find a model to estimate a song’s popularity, so we will continue on to a few different prediction equations and some different evaluation metrics.

Determination

We can instead evaluate the model based on determination by looking at the R-squared and adjusted R-squared values.

The former, often called the coefficient of determination, is the percentage of variability that is explained by the model, or how close the data fits the regression line.

The goal is to have an R-squared of 1, meaning that 100% of the variation is explained by the model, but this rarely happens due to random error.

Most statistical analyses look at the R-squared value to evaluate how accurate the model is, so it is also easily found using scikit-learn.

R-squared for the complete model

Unfortunately, the R-squared for the complete model is not high.

The model only explains 16.4% of the variation, which means roughly 84% of the variation is coming from somewhere else, whether from a 14th variable that we don’t have or from the popularity of songs being partly random.
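The computation itself is a one-liner; here is a self-contained sketch on synthetic data (the 13 columns stand in for the real audio features):

```python
# Sketch: R-squared of a fitted linear model via scikit-learn.
# Synthetic data stands in for the 13 Spotify audio features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))            # 13 features, 100 songs
y = 2 * X[:, 0] + rng.normal(size=100)    # stand-in popularity target

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))        # equivalently: model.score(X, y)
print(f"R-squared: {r2:.3f}")
```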

To better understand how accurate this model is, we should also look at the adjusted R-squared which penalizes a model for including variables that don’t add to the explanation of variation.

Essentially, if we had a very reduced model, Popularity = tempo, and added energy to it, Popularity = tempo + energy, and the addition of energy did not bring the predictions any closer, the adjusted R-squared would decrease.

This is important to look at because the unadjusted R-squared always increases with the addition of more parameters, which can result in overfitting the model.
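scikit-learn does not report the adjusted R-squared directly, but it follows from the standard formula; a small sketch, plugging in our 16.4% figure and 13 predictors as an example:

```python
# Sketch: adjusted R-squared, which scikit-learn does not provide
# directly, using the standard formula.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R-squared of 0.164 with 13 predictors on a 100-song list;
# the result is noticeably lower than the 0.164 we started with.
print(adjusted_r2(0.164, 100, 13))
```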

We can see this play out below where we start with a reduced model of all the parameters except loudness and duration.

By adding these variables back in, the unadjusted R-squared increases minimally, but the adjusted R-squared visibly decreases.

Adjusted R-squared for the complete model

As you can see, the adjusted R-squared is much lower than the unadjusted, revealing that the model includes parameters that are not helping with predictions.

So, let’s look back at our prediction equation:

Popularity = tempo + key + liveness + loudness

R-squared values for reduced model

While the unadjusted R-squared decreased to 13.3%, the adjusted R-squared increased to 9.6%, which gives us confidence that this is a better model than the complete model.

Sadly, it’s still a poor model.

We can’t accurately make predictions with this model, but it’s the best we have…

Information Criteria

Now let’s say we want to compare how much better this model is than the complete model.

For this, we would use two metrics: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

These two evaluators are used less often than error statistics and the coefficients of determination, but they are very helpful when it comes to comparing models.

AIC and BIC both measure the quality of a model and account for both overfitting and underfitting.

The main difference between the two is that BIC favors parsimonious models, penalizing complexity, or the addition of more parameters, similarly to the adjusted R-squared.

For the complete model, with a random training set taken from the data, the AIC and BIC are 94.446 and 100.373, respectively.
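For least-squares regression, both criteria can be sketched from the residual sum of squares. Note this common formulation drops additive constants that are shared by all models, so absolute values may differ from other software, but comparisons between models remain valid. Synthetic data stands in for the real features here:

```python
# Sketch: AIC and BIC for an OLS model, up to additive constants
# shared by all candidate models. Synthetic data stands in for
# features like loudness, liveness, tempo, and key.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))                        # 4 candidate features
y = X @ np.array([1.0, 0.5, 0.2, 0.1]) + rng.normal(size=80)

model = LinearRegression().fit(X, y)
resid = y - model.predict(X)
n, k = len(y), X.shape[1] + 1                       # +1 for the intercept
sse = np.sum(resid ** 2)

aic = n * np.log(sse / n) + 2 * k                   # penalty: 2 per parameter
bic = n * np.log(sse / n) + k * np.log(n)           # penalty grows with ln(n)
print(f"AIC: {aic:.2f}, BIC: {bic:.2f}")
```

Because ln(n) exceeds 2 for any sample larger than about 7, BIC penalizes each extra parameter more heavily than AIC does.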

As you can guess, the BIC is slightly higher because of the penalizing factor for more parameters.

However, AIC and BIC are often reported in comparison to the best model.

In order to compare the models, we ran each model created through 10,000 iterations of these calculations in order to find the average AIC and BIC.

This was important since our sample size was so small that the values heavily depended on which observations were chosen for the training set.

From the graph, it’s clear that adding parameters increases both the AIC and BIC, and that the likely best model has 3 parameters (liveness, loudness, and tempo).

This model has an AIC of 78.04, a BIC of 75.97, an R-squared of 0.105, an adjusted R-squared of 0.077, an MAE of 43.9, and an MSE of 1933.7.

Overall, we can conclude that using these variables in the dataset, we cannot accurately predict a song’s popularity.

We can’t even really get close.

Even the best model predicts the song at position 100 to be up around 50.

Residual plot for final model

Looking at the data, there are quite obviously no trends in any of the variables.

Is this surprising? No.

Music popularity is a vastly complex thing, and variables like the artist’s reputation, lyrical meaning, genre, and the time of year a song is released can impact its popularity, beyond the randomness of humans’ quick obsession with already culturally popular songs.

*The stepwise selection procedure was done with an entry level of 0.1 and a stay level of 0.1. These are higher alpha levels than most procedures use, but because the variables are generally weak predictors, a standard procedure produced a model with zero variables included, which is not very helpful.
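The forward half of such a procedure can be sketched as follows; `pvalue_of_last_coef` and the synthetic demo data are our own illustration, and a full stepwise run would also re-check already-included variables against the stay level:

```python
# Sketch: forward stepwise selection by p-value (entry level 0.1).
# A full stepwise procedure would also drop included variables whose
# p-value rises above the stay level.
import numpy as np
from scipy import stats

def pvalue_of_last_coef(X, y):
    """Two-sided OLS p-value for the last column of X (intercept added)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, _, _, _ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    dof = len(y) - Xc.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xc.T @ Xc)
    t = beta[-1] / np.sqrt(cov[-1, -1])
    return 2 * stats.t.sf(abs(t), dof)

def forward_select(X, y, entry=0.1):
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value each candidate would have if added to the model
        pvals = {j: pvalue_of_last_coef(X[:, chosen + [j]], y)
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= entry:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Demo: features 0 and 1 carry the signal, 2 and 3 are noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.5 * rng.normal(size=200)
print(forward_select(X_demo, y_demo))
```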
