What makes a movie hit a “jackpot”? Learning from data with Multiple Linear Regression

The answer is pretty simple: by squaring the residual values, we treat positive and negative discrepancies in the same way.

There are two important things that you have to learn by heart:

Collinearity: two predictor variables are said to be collinear when they are correlated with each other. Remember, predictors are also called independent variables, so they should be independent of each other. Inclusion of collinear predictors complicates model estimation. Remember: independence matters!

Parsimony: avoid adding predictors that are associated with each other, because the addition of such variables often brings nothing new to the table. Simply remember the principle of Occam's razor: among models that are equally good, we want to select the one that has the fewest variables.

About the dataset

For this MLR, I will be using a dataset that includes information about 651 randomly sampled movies produced and released before 2016, taken from Rotten Tomatoes and IMDB.

My dataset comes in the .RData file format; to open the data, we first need to convert it into a Pandas DataFrame.

To that end, the “pyreadr” library comes in handy.
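As a minimal sketch of that conversion: pyreadr's read_r function returns a dict-like mapping of R object names to DataFrames, so we can pick the first object out of it. The helper names load_rdata and first_frame, as well as the file name in the usage comment, are my own placeholders and not part of the original analysis.

```python
import pandas as pd


def first_frame(result):
    """Return the first DataFrame from a name -> DataFrame mapping."""
    if not result:
        raise ValueError("no objects found in the file")
    name = next(iter(result))
    frame = result[name]
    if not isinstance(frame, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")
    return frame


def load_rdata(path):
    """Read an .RData file and return the first object stored in it."""
    # Imported here so first_frame can be used without pyreadr installed.
    import pyreadr

    # read_r returns a dict-like mapping: R object name -> DataFrame.
    return first_frame(pyreadr.read_r(path))


# Hypothetical usage; "movies.RData" is a placeholder file name:
# movies = load_rdata("movies.RData")
# print(movies.shape)
```

If the file stores several R objects, you would look the one you need up by its name instead of taking the first.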

Hint: when running "import statsmodels.api as sm" you can get stuck with "ImportError: cannot import name 'factorial'" with pip-installed packages.

It has to do with an incompatibility between the installed statsmodels version and a newer scipy release; to overcome this issue you need to downgrade your current version of scipy.

It can be done by typing: python3.6 -m pip install scipy==1.2 --upgrade in your Terminal, or, if you are using a Jupyter notebook, put "conda" instead of pip.

Exploratory Data Analysis: Feature selection

Obviously, we don't have to consider every one of the 32 variables for our model; it doesn't make any sense to include some of them in a statistical analysis, as they were given for informational purposes only.

The right question is: what variables should I consider in my model?

First, let's look at our dataset.

Figure: Codebook for Movies dataset

At first glance, it seems that the data contains a number of variables that are highly correlated.

If we see that variables are highly correlated with each other, we select one and drop the others.

To find the dependence, we use Pearson's correlation.

From the heatmap above, we see that there are two "groups" of variables that are highly correlated with each other:

['imdb_rating'], ['critics_score'], ['audience_score']
['dvd_rel_year'], ['thtr_rel_year']

Remember that MLR requires independent variables, hence we need to keep one variable from each group and drop the others.
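The correlation check described above can be sketched roughly as follows. The toy values, the helper name correlated_pairs and the 0.8 threshold are assumptions for illustration only; the real analysis runs on the full movies DataFrame.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    # Keep numeric columns only, then compute the Pearson correlation matrix.
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Toy stand-in for the movies data (values are made up for illustration).
toy = pd.DataFrame({
    "imdb_rating": [5.0, 6.0, 7.0, 8.0, 9.0],
    "audience_score": [48, 61, 70, 83, 90],
    "runtime": [120, 95, 130, 90, 110],
})
pairs = correlated_pairs(toy)
print(pairs)
```

The same correlation matrix can be passed to a plotting library to draw the heatmap; every pair the helper reports is a candidate group from which only one variable should stay.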

Personally, I will stay with ['imdb_rating'].

What else should we drop?

Note that I have to drop all three time-series variables (['thtr_rel_year', 'thtr_rel_month', 'thtr_rel_day']), as including them violates the classical assumption that the errors are independent of each other.

In other words, year, month and day are highly collinear, which would distort the model estimates.

Remember, we need to take care that all of our variables are independent and not highly correlated! Those are continuous variables, and for this reason we should treat them with more advanced statistical methods.

You can learn about this in minute detail in the question that I have created on StackExchange.

In light of these assumptions, I will keep a few variables and drop the others.

For my future model, I've chosen:

Feature matrix:
title_type (as the id of the film);
genre;
runtime;
best_pic_nom (as an outstanding distinction, it can bring new viewers);
top200_box;
director;
actor1.

Dependent variable: ['imdb_rating'].

For this step, I create one DataFrame for both the feature matrix and the dependent variable until I have dealt with missing values.

The reason is that, while dealing with missing values, we have to delete the affected rows across EVERY column to keep both of our future DataFrames (for the feature matrix and the dependent variable) in the same shape.

So we've come to the point where we start working with the DataFrame and dropping the variables that we won't consider in our model.

There are two methods from the Pandas library that can be used for feature selection:

Data.iloc[<row selection>, <column selection>]: works if you know the integer position of your variables. (Note that .iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected.)

Data.loc[<row selection>, <column selection>]: works with labeled data, by the name of rows/columns.

Hunting for missing values

As the regression function will simply raise an error if we ask it to build a regression model based on data that contains NaNs, we have to take care of them in advance.
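Here is a small sketch of both selection methods, followed by a count of the missing values per column. The toy DataFrame is made up and only borrows a few column names from the movies data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data (made-up values).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Drama", "Comedy", "Drama", "Horror"],
    "runtime": [101.0, np.nan, 95.0, 120.0],
    "director": ["X", "Y", None, "Z"],
    "imdb_rating": [7.1, 6.3, 5.9, 8.0],
})

# Select columns by label with .loc ...
by_label = movies.loc[:, ["genre", "runtime", "director", "imdb_rating"]]

# ... or by integer position with .iloc (columns 1 through 4 here).
by_position = movies.iloc[:, 1:5]

# Count the missing values in each selected column.
missing_per_column = by_label.isnull().sum()
print(missing_per_column)
```

Selecting by label is usually the safer choice, since it keeps working if the column order of the source file changes.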

From the given result we see that, in total, we have 5 missing values with the following distribution: one in ['runtime'], and two each in ['director'] and ['actor1'].

As the algorithm suggests, we simply drop them with the pandas dropna() function.

The point is: since we have more than enough observations, the missing values don't influence the overall picture that much, therefore we can drop them.

Note that it's really important to specify the 'inplace=True' argument (or to assign the result back to a variable); otherwise the original DataFrame is left unchanged and you will end up with the same number of missing values as before.
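To make the inplace point concrete, here is a short sketch on made-up data. It shows that dropna() without inplace=True returns a new DataFrame and leaves the original alone, and that dropping whole rows keeps the features and the target aligned.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "runtime": [101.0, np.nan, 95.0, 120.0],
    "director": ["X", "Y", None, "Z"],
    "imdb_rating": [7.1, 6.3, 5.9, 8.0],
})

# Without inplace=True the original is untouched; dropna returns a new frame.
cleaned_copy = data.dropna()
rows_before = len(data)

# With inplace=True the rows containing NaN are removed from `data` itself.
data.dropna(inplace=True)
rows_after = len(data)

# Features and target stay aligned because whole rows were dropped.
X = data[["runtime", "director"]]
y = data["imdb_rating"]
print(rows_before, rows_after)
```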

Treating categorical variables as dummy

Now it's time to split our dataset in two: the feature matrix and the dependent variable.

The feature matrix simply contains the variables that are used for modelling, and the dependent variable is the one that we try to predict.

There are two ways to perform the conversion from categorical variables to dummies.

One is using the pandas.get_dummies() function; however, in my experience it adds more variability to the data and performs poorly with scikit-learn (note, I am not saying that it doesn't work, you just get a higher Mean Squared Error in the end).

Instead, I use two other methods provided in the scikit-learn library.

Step 1. Label Encoding

The first step is to label all the levels of one categorical variable: LabelEncoder() encodes labels with values between 0 and n_classes-1.

This simply means that a categorical variable with 10 levels ("Drama", "Comedy" etc.) will be labeled from 0 to 9.

Hint: you can use this method to iterate over every column, one at a time!

Here we go, we don't have any words anymore, just numbers.
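A minimal sketch of that loop, on a made-up feature table: each categorical column gets its own LabelEncoder, which I keep in a dictionary (my own addition) so the labels can be mapped back to the original words later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy feature table (made-up values).
features = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Horror"],
    "best_pic_nom": ["no", "yes", "no", "no"],
})

encoders = {}
encoded = features.copy()
for column in ["genre", "best_pic_nom"]:
    # One encoder per column; classes are sorted alphabetically before labeling.
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(features[column])
    encoders[column] = encoder

print(encoded)
```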

You might ask: "Why don't we just stay with label-encoded variables?"

The answer is that the model would treat the labels as ordered magnitudes, so a category with a higher label (say 9 for Tragedy) would count for more than one with a lower label; it would turn out to be a disaster for your regression analysis.

Step 2. One-hot Encoding

As a corollary, the next step is to use a one-hot encoder to perform "binarization" of the category and include it as a feature to train the model.

With this step, you end up with a set of columns with 0 and 1 values.
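As a sketch, here is the one-hot step applied to a label-encoded genre column with three made-up levels. I rely on the encoder's default sparse output and convert it with toarray(), which should behave the same across scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Label-encoded genre column from Step 1, shaped (n_samples, 1).
genre_labels = np.array([1, 0, 1, 2]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
genre_onehot = encoder.fit_transform(genre_labels).toarray()
print(genre_onehot)
```

Each row now holds exactly one 1, in the column that belongs to its genre.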

Step 3. Standardisation

Finally, we are done with categorical features.

Now there is one last step before we plug everything into a model: dealing with the one continuous variable (['runtime']).

To standardise it, we will use the StandardScaler() class.

In other words, standardisation tells us how far each of our values is from the mean in terms of standard deviations.

Here is an important note: if you end your last line of code with .fit_transform() applied to an array of shape (1, n), you get a wrong result: all of the values turn out to be zeroes!

Intuitively, there is nothing wrong, but there is one problem behind it: each column then holds a single number, so from every number you subtract the mean of that column, which is equal to the number itself, and divide by the standard deviation of that single number.

To overcome this, you need to reshape the array from (1, n) to (n, 1).
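The shape pitfall is easy to reproduce. In this sketch (with made-up runtimes) the same values are scaled once as a single row and once as a single column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up runtimes in minutes.
runtime = np.array([90.0, 100.0, 110.0, 120.0, 130.0])

# Shape (1, n): every column holds one value, so each value equals its own mean.
as_row = StandardScaler().fit_transform(runtime.reshape(1, -1))

# Shape (n, 1): one column of n values, which is what we want to standardise.
as_column = StandardScaler().fit_transform(runtime.reshape(-1, 1))

print(as_row)
print(as_column.ravel())
```

The row version comes out as all zeroes, while the column version has mean 0 and standard deviation 1.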

After all, you need to concatenate all the features together (just like in SQL).

Finally, here comes your regression model.

Hint: when working with pandas encoding, there is a rule of thumb to drop one dummy variable, simply because one of your levels becomes the reference level.

While following along, you might notice we haven't done it, therefore we set our fit_intercept to False.

In case of pandas encoding, you should change it to "True".

More details are here.
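Putting the pieces together could look roughly like this. The arrays are made-up stand-ins for the outputs of Steps 2 and 3, and np.hstack is just one way to do the concatenation; the original code may have used a different call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy pieces standing in for the outputs of Steps 2 and 3 (made-up values).
genre_onehot = np.array([
    [0, 1, 0], [1, 0, 0], [0, 1, 0],
    [0, 0, 1], [1, 0, 0], [0, 0, 1],
], dtype=float)
runtime_scaled = np.array([[-1.2], [-0.4], [0.0], [0.3], [0.5], [0.8]])
y = np.array([6.1, 5.8, 6.9, 7.4, 6.2, 7.7])

# Column-wise concatenation: one row per movie, all features side by side.
X = np.hstack([genre_onehot, runtime_scaled])

# All one-hot columns are kept, so together they already play the role
# of the intercept; hence fit_intercept=False, as in the text.
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
predictions = model.predict(X)
print(model.coef_)
```

With one dummy column dropped per variable (the pandas rule of thumb), you would switch to fit_intercept=True so the reference level is absorbed by the intercept.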

To test how our model performs, we check the predicted values in comparison with the observed ones.

In our case MSE = 1.20.

One possible interpretation would be that, since the square root of the MSE is about 1.1, I estimate the IMDB rating for a movie with an error of ≈±1.

Hmm, interesting! How might we interpret that?

As we know from a basic STAT-101 course, the mean is too sensitive to outliers.

In our case this means that the films that have fewer scores from the audience have more variability in the data, resulting in poorly predicted values.

To that end, we compute the median squared error.

As you might check, it's less than 0.4 in our case.

This means that our model performs pretty well and can be improved with more data coming into the model.
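The gap between the two error measures is easy to see on a tiny made-up example, where one badly predicted movie dominates the mean but barely moves the median. The numbers below are for illustration and are not taken from the movies model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up ratings; the last movie is predicted very poorly on purpose.
observed = np.array([7.0, 6.5, 8.0, 5.5, 3.0])
predicted = np.array([6.8, 6.9, 7.7, 5.9, 6.0])

squared_errors = (observed - predicted) ** 2

mse = mean_squared_error(observed, predicted)  # mean of the squared errors
rmse = np.sqrt(mse)                            # back on the rating scale
median_se = np.median(squared_errors)          # robust to the single outlier

print(mse, rmse, median_se)
```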

Factors that may have contributed to this inaccuracy in the model

More data is needed: we need to include more variables to get the best possible prediction (for example, we could include the time-series variables, but, as we now know that the correlation is pretty low, it wouldn't influence our model that much).

Bad assumptions: to check whether all of our variables are not collinear, we could perform MLR diagnostics (there is a highly recommended video that will guide you through) and drop those that bring nothing new to the model.

Lack of features: the features we used don't have the highest correlation with the dependent variable. There are two possible solutions: we could surf the net for a more comprehensive dataset, or we might build a web scraper :)

For those who are more familiar with R coding, I've made a more detailed analysis using RStudio.

You can check it out by clicking the link.

The dataset can be found here.
