Predicting Subscribers on Youtube using Machine LearningNikola CiganovićBlockedUnblockFollowFollowingFeb 28IntroductionThis project is inspired by second chapter of the book “Hands on Machine Learning with Scikit-Learn & Tenserflow”.

Full project is published on my github:https://github.

com/nciganovic/Youtube-subscribers-predictionPoint of this project is to try to predict number of subscribers that channel have and calculate error based on data that we have.

Loading and cleaning dataFor this project i used “Top 5000 Youtube channels data from Socialblade” on Kaggle website.

So, like in every Machine Learning project first thing you should do is load your data:import pandas as pdcsv_file = “data.

csv”dt = pd.

read_csv(csv_file)This is how my dataset looks like:You can see that my dataset is not perfect and that i have lot of work to do.

First those double dashes are strings and i don’t want that.

So i will simply replace them with null values.

import numpy as npdt = dt.

replace(‘ — ‘, np.

nan, regex=True)Done.

Now we can notice that in Rank column all values have “th” and “ , ” so that means they are also strings.

This code will fix that.

dt[“Rank”] = dt[“Rank”].

replace(‘th’, “”, regex=True)dt[“Rank”] = dt[“Rank”].

replace(‘rd’, “”, regex=True)dt[“Rank”] = dt[“Rank”].

replace(‘nd’, “”, regex=True)dt[“Rank”] = dt[“Rank”].

replace(‘st’, “”, regex=True)dt[“Rank”] = dt[“Rank”].

replace(‘,’, “”, regex=True)Next thing that we need to see is with what kind of value are we dealing with.

I can check that easily with dt.

info()Rank 5000 non-null objectGrade 5000 non-null objectChannel name 5000 non-null objectVideo Uploads 4994 non-null objectSubscribers 4613 non-null objectVideo views 5000 non-null int64Here, we can see that only Video views have value int64 and all other have object value.

If we want to use them in the future we need to convert some of them to floats.

dt[“Subscribers”] = pd.

to_numeric(dt[“Subscribers”])dt[“Video Uploads”] = pd.

to_numeric(dt[“Video Uploads”])dt[“Video views”] = pd.

to_numeric(dt[“Video views”])dt[“Rank”] = pd.

to_numeric(dt[“Rank”])That will do a job.

Now we can run dt.

info() again and we can see some changes.

Rank 5000 non-null int64Grade 5000 non-null objectChannel name 5000 non-null objectVideo Uploads 4994 non-null float64Subscribers 4613 non-null float64Video views 5000 non-null int64We can also make additional column Average views by combining Video Views and Video Uploads.

dt[“Average views”] = dt[“Video views”] / dt[“Video Uploads”]Previously we replaced some double dashes with null values, but now we don’t want those null values anymore because it will be a problem in a future.

from sklearn.

impute import SimpleImputersample_incomplete_rows = dt[dt.

isnull().

any(axis=1)].

head()imputer = SimpleImputer(strategy="median")dt_num = dt.

drop(["Channel name","Grade"], axis = 1)imputer.

fit(dt_num)X = imputer.

transform(dt_num)dt_full_1 = pd.

DataFrame(X, columns=dt_num.

columns)dt_full_1.

loc[sample_incomplete_rows.

index.

values]dt_full_1 = pd.

DataFrame(X, columns=dt_num.

columns)We computed median and stored all of those values into dt_full.

We don’t have null values anymore.

So this was a part of cleaning the data, next thing will be to prepare data.

Preparing data for trainingIts possible to check correlation between different columns.

corr_matrix = dt.

corr()print(corr_matrix)This is reslult:From this table we can see that Subscribers and Video Views are highly correlated, and other columns are not .

This means that if number of subscribers are big also number of views is big, and if rank numbers is bigger then views and subscribers are lower and that is correct if you think about it.

Now we can plot some graphs:import matplotlib.

pyplot as pltdt_full_1.

hist(bins=10, figsize=(10,8))plt.

show()And we will get this:If you are curious le8 is scale factor of y axis.

So if maximum value of y axis is 3,5 that means that real value of y axis is 3,5 * 100.

000.

000.

As you can see those graphs wont tell us much but i just wanted to show you a way to visualise it.

Another way of displaying graph is using scatter matrix.

Scatter matrix will show you relationships between columns but in visual way.

Only relationship that we are interested in is between Subscribers and Video views because as we saw earlier those two columns have biggest correlations.

dt.

plot(kind=”scatter”, x=”Subscribers”, y=”Video views”,alpha=0.

4)Correlation between Video view and SubscribersBy looking at this graph we can see that the as values go smaller and smaller they became more and more correlated.

Next thing on the list is categorical data.

You can scroll up and see that in Grade columns we don’t have regular values.

By typing:print(dt[“Grade”].

value_counts())we can see all the values that are inside Grade column.

B+ 2956A- 1024A 963A+ 41A++ 10 6With code bellow we are going to transform those grades into array using OneHotEncoder:grade_cat = dt[[“Grade”]]from sklearn.

preprocessing import OneHotEncodercat_encoder = OneHotEncoder()dt_cat_1hot = cat_encoder.

fit_transform(grade_cat)dt_one_hot_array = dt_cat_1hot.

toarray()print(dt_one_hot_array)This is called one hot because there is one hot (1) and other are cold (0):array([[1.

, 0.

, 0.

, 0.

, 0.

], [1.

, 0.

, 0.

, 0.

, 0.

], [0.

, 0.

, 0.

, 0.

, 1.

], .

, [0.

, 1.

, 0.

, 0.

, 0.

], [1.

, 0.

, 0.

, 0.

, 0.

], [0.

, 0.

, 0.

, 1.

, 0.

]])Next thing to do is to split data into two pieces.

Training set and testing set.

Name of two them tells you everything.

Usually you can use train test split method for that but in my case it caused some problems so i decided to do it simple way.

Since data is not set in some order i can take dt_full and first 4000 samples use for train and last 1000 for test:dt_full_train = dt_full[:4000]dt_full_test = dt_full[4000:]Also, i will need separated Subscribers column because i am going to predict error on Subscribers.

subs_train = dt_full_train[“Subscribers”].

copy()After this we need to create pipelines.

They are useful because we need to do several data transformations in the right order.

First we will create small pipeline for numerical attributes.

from sklearn.

pipeline import Pipelinefrom sklearn.

preprocessing import StandardScalernum_pipeline = Pipeline([ (‘imputer’, SimpleImputer(strategy=”median”)), (‘std_scaler’, StandardScaler()), ])dt_num_tr = num_pipeline.

fit_transform(dt_full)Then we will create full pipeline.

Full pipeline is going to be combination of num_pipeline and OneHotEncoder that we created above.

from sklearn.

compose import ColumnTransformernum_attribs = list(dt_full)cat_attribs = ["Grade"]full_pipeline = ColumnTransformer([ ("num", num_pipeline, num_attribs), ("cat", OneHotEncoder(), cat_attribs) ])dt_prepared = full_pipeline.

fit_transform(dt[:4000])This will be crucial when for training and testing data.

Training dataNow it’s time to train data.

First we can try with Linear Regression:from sklearn.

linear_model import LinearRegressionlin_reg = LinearRegression()lin_reg.

fit(dt_prepared, subs_train)some_data = dt.

iloc[:10]some_labels = subs_train.

iloc[:10]some_data_prepared = full_pipeline.

transform(some_data)print(“Predictions:”, lin_reg.

predict(some_data_prepared))print(“Labels:”, list(some_labels))After that we can see how good our model was.

Output look like this:Predictions: [18759513.

92519344 61311159.

39693056 19176720.

16060103 31182532.

85734618 32873532.

67815986]Labels: [18752951.

0, 61196302.

0, 19238251.

0, 31180559.

0, 32852346.

0]It’s working very well but don’t forget that this is done with training set so results are better then they should be.

Another thing we can calculate is mean squared error.

That is just error between predicted and actual value.

from sklearn.

metrics import mean_squared_errordt_predictions = lin_reg.

predict(dt_prepared)lin_mse = mean_squared_error(subs_train, dt_predictions)lin_rmse = np.

sqrt(lin_mse)print(“Mean squared error: “,lin_rmse)Mean squared error: 62825.

03338323658There is also mean absolute error:from sklearn.

metrics import mean_absolute_errorlin_mae = mean_absolute_error(subs_train, dt_predictions)print(“Mean absolute error: “,lin_mae)Difference between those two is that mean absolute error corresponds with l1 norm and mean squared error with l2 norm.

I wont get in details of that here.

Decision tree is another way to train your model.

Here is how to implement it:from sklearn.

tree import DecisionTreeRegressortree_reg = DecisionTreeRegressor(random_state=42)tree_reg.

fit(dt_prepared, subs_train)dt_predictions = tree_reg.

predict(dt_prepared)tree_mse = mean_squared_error(subs_train, dt_predictions)tree_rmse = np.

sqrt(tree_mse)print(tree_rmse)Output: 0Strangely, result is zero.

Why is that a result?.Well, it’s impossible for model to be perfect so this means that model is massively overfitting.

That means that our hyperparameters are not set properly.

Fine tune modelNow it’s time to fine tune our model.

We are going to use cross validation to create many small training sets, train them, and evaluate them against the validation set.

from sklearn.

model_selection import cross_val_scorescores = cross_val_score(tree_reg, dt_prepared, subs_train, scoring=”neg_mean_squared_error”, cv=10)tree_rmse_scores = np.

sqrt(-scores)def display_scores(scores): print(“Scores:”, scores) print(“Mean:”, scores.

mean()) print(“Standard deviation:”, scores.

std())print(“.Decision tree cross validation:”, display_scores(tree_rmse_scores))and the results are:Scores: [2799980.

28056109 35764.

50879793 19583.

4858712 1329691.

50539461 14146.

57662007 40782.

54636131 55099.

69240801 5779.

98882979 7856.

0084394 4963.

85483873]Mean: 431364.

84481221513Standard deviation: 880562.

0065318999Then we can do same thing with Linear Regression so we can compare it later:lin_scores = cross_val_score(lin_reg, dt_prepared, subs_train, scoring=”neg_mean_squared_error”, cv=10)lin_rmse_scores = np.

sqrt(-lin_scores)print(“.Linear regression cross validation:”, display_scores(lin_rmse_scores))Results are:Scores: [72289.

15466674 53424.

79729356 69807.

56326175 69245.

51446183 56179.

0606429 71684.

14432671 63151.

90128564 63610.

71140885 55154.

54290604 62902.

22921384]Mean: 63744.

96194678508Standard deviation: 6645.

817945834862By doing observation on both results we can see that our model is much better with Linear Regression then with Decision tree.

We still have problem with overfitting Decision tree.

Our last model will be Random forest.

Code for implementation is here and it’s very similar with other two:from sklearn.

ensemble import RandomForestRegressorforest_reg = RandomForestRegressor(n_estimators=10, random_state=42)forest_reg.

fit(dt_prepared, subs_train)dt_predictions = forest_reg.

predict(dt_prepared)forest_mse = mean_squared_error(subs_train, dt_predictions)forest_rmse = np.

sqrt(forest_mse)print(“.Random forest: “, forest_rmse)Result: 136004.

41073016106Much better than Decision Tree but still worse then Linear Regression.

Adjusting parameters for Random ForestLast thing before we do a final test is we need to tweak hyperparameter values so we can get better results.

All the things that we did before was on training set, so now we have to adjust our model for final part.

Good way for that is using Grid Search CV.

He will combine different values of hyperparameters and tell us witch combination works best.

That is much faster then manually changing values.

from sklearn.

model_selection import GridSearchCVparam_grid = [ # try 12 (3×4) combinations of hyperparameters {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, # then try 6 (2×3) combinations with bootstrap set as False {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}, ]forest_reg = RandomForestRegressor(random_state=42)# train across 5 folds, that's a total of (12+6)*5=90 rounds of training grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)grid_search.

fit(dt_prepared, subs_train)print("Best parametars", grid_search.

best_params_)cvres = grid_search.

cv_results_for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]): print(np.

sqrt(-mean_score), params) table = pd.

DataFrame(grid_search.

cv_results_)Here, we set 3, 10 ,30 estimators, and they will be combined with max features parameter.

We will display all results so we can see what works and what doesn’t.

Best parametars {‘max_features’: 8, ‘n_estimators’: 10}1978261.

4858917266 {‘max_features’: 2, ‘n_estimators’: 3}1685714.

4678400287 {‘max_features’: 2, ‘n_estimators’: 10}1658009.

8383636787 {‘max_features’: 2, ‘n_estimators’: 30}1484044.

575928663 {‘max_features’: 4, ‘n_estimators’: 3}1112587.

6655714784 {‘max_features’: 4, ‘n_estimators’: 10}983287.

2964708938 {‘max_features’: 4, ‘n_estimators’: 30}1121878.

9198595826 {‘max_features’: 6, ‘n_estimators’: 3}1122086.

2536127297 {‘max_features’: 6, ‘n_estimators’: 10}804911.

1396580335 {‘max_features’: 6, ‘n_estimators’: 30}993486.

7045322978 {‘max_features’: 8, ‘n_estimators’: 3}725171.

6910869441 {‘max_features’: 8, ‘n_estimators’: 10}738994.

2465072221 {‘max_features’: 8, ‘n_estimators’: 30}1977632.

090672395 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 3}1641877.

9680815167 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 10}1692146.

6715286577 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 3}1126978.

013066327 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 10}2739415.

8541200855 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 3}1636701.

727875341 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 10}From this we can see that our model works best when we have max_features = 8 and 10 estimators so our error is lowest at that point.

There is another way to do that using Random Search CV.

As the name says parameters will be random.

Here is how to do that:from sklearn.

model_selection import RandomizedSearchCVfrom scipy.

stats import randintparam_distribs = { ‘n_estimators’: randint(low=1, high=200), ‘max_features’: randint(low=1, high=8), }forest_reg = RandomForestRegressor(random_state=42)rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs, n_iter=10, cv=5, scoring=’neg_mean_squared_error’, random_state=42)rnd_search.

fit(dt_prepared, subs_train)cvres = rnd_search.

cv_results_for mean_score, params in zip(cvres[“mean_test_score”], cvres[“params”]): print(np.

sqrt(-mean_score), params) feature_importances = grid_search.

best_estimator_.

feature_importances_Now our estimators can be between 1 and 200 and max features can be between 1 and 8.

Lets see result:758552.

498896288 {‘max_features’: 7, ‘n_estimators’: 180}887449.

6640592187 {‘max_features’: 5, ‘n_estimators’: 15}1254404.

498486916 {‘max_features’: 3, ‘n_estimators’: 72}854745.

4251753032 {‘max_features’: 5, ‘n_estimators’: 21}766778.

869808175 {‘max_features’: 7, ‘n_estimators’: 122}1243686.

9169564403 {‘max_features’: 3, ‘n_estimators’: 75}1217585.

4785197838 {‘max_features’: 3, ‘n_estimators’: 88}921306.

1734439975 {‘max_features’: 5, ‘n_estimators’: 100}1235597.

8754016897 {‘max_features’: 3, ‘n_estimators’: 150}2005778.

9717830971 {‘max_features’: 5, ‘n_estimators’: 2}First one is the best from this section but it’s not better then one from previous test were we had 30 estimators and 8 max features.

Final testIt’s time to test it on test set.

The set that we haven’t used before.

Now we can see how our model would work with new data.

final_model = grid_search.

best_estimator_X_test = dt_full_test.

drop(“Subscribers”, axis=1)y_test = dt_full_test[“Subscribers”].

copy()X_test_prepared = full_pipeline.

transform(X_test)final_predictions = final_model.

predict(X_test_prepared)final_mse = mean_squared_error(y_test, final_predictions)final_rmse = np.

sqrt(final_mse)print(final_rmse)Result:1519866.

0146370314Finally we did it.

To be fair this result is not very good, part of that is small database that i had and other and much important is lack of my experience since this is my first project.

I hope that you learned something new by reading this article, like i learned by doing it.

.