Finding the right model parameters
Karan Bhanot
Mar 24

If you’ve been reading about Data Science and/or Machine Learning, you must have come across articles and projects that work with the MNIST dataset.
The dataset includes a set of 70,000 images where each image is a handwritten digit from 0 to 9.
I decided to use the same dataset to understand how fine-tuning a Machine Learning model’s parameters can make a difference.
This article explains how I used GridSearchCV to find the best fit parameters for this dataset and used them to increase the accuracy and improve the confusion matrix.
You can find the code in the GitHub repository kb22/Digit-Recognition-with-Parameter-Tuning, which uses GridSearchCV to identify the best combination of estimator parameters.

Import libraries and dataset

I begin by importing the necessary libraries.
I used the training and testing data as .csv files from here.
Each row in the dataset consists of a label and 784 pixel values to represent the 28×28 image.
The training data consists of 60,000 images while the testing dataset includes 10,000 images.
Once I have the data, I extract the features and labels from it and store them in train_X, train_y, test_X and test_y.
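The loading step can be sketched as follows. This is a minimal illustration, not the repository's exact code: the DataFrames here are synthetic stand-ins for the downloaded CSVs, which have the label in the first column followed by 784 pixel columns.

```python
import numpy as np
import pandas as pd

# Stand-in for the downloaded CSVs (hypothetical data): first column is the
# label, the remaining 784 columns are pixel values of the 28x28 image.
rng = np.random.default_rng(0)

def make_df(n_rows):
    labels = rng.integers(0, 10, size=(n_rows, 1))
    pixels = rng.integers(0, 256, size=(n_rows, 784))
    return pd.DataFrame(np.hstack([labels, pixels]))

train_df = make_df(100)
test_df = make_df(20)

# Split each frame into features (784 pixel columns) and labels (first column)
train_X = train_df.iloc[:, 1:].values
train_y = train_df.iloc[:, 0].values
test_X = test_df.iloc[:, 1:].values
test_y = test_df.iloc[:, 0].values

print(train_X.shape, train_y.shape)  # (100, 784) (100,)
```

With the real files, the `make_df` helper would simply be replaced by two `pd.read_csv` calls.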
Exploring the dataset

Analysing class distribution

As I have discussed in my previous articles, the classes should be approximately equal in size to ensure the model trains without bias.
If we look at the plot, there is some variance in the counts across digits. However, the difference is small enough that the model will still be able to train on the data well.
Thus, we can proceed further.
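The per-class counts behind such a plot can be computed in one line with `np.bincount`; the result can then be drawn with `plt.bar`. A minimal sketch, using a random stand-in for the real label array:

```python
import numpy as np

# Hypothetical stand-in labels; in the article, train_y holds 60,000 digits
rng = np.random.default_rng(1)
train_y = rng.integers(0, 10, size=1000)

# Count the number of training examples per digit class 0-9
counts = np.bincount(train_y, minlength=10)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} images")
```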
Viewing the training images

Let’s also see what the images actually look like. I randomly select 10 images from the training data and display them using Matplotlib.
Something that stands out immediately in the 10 random images is the variation among digits of the same class. Take a look at the 4s in the images above: the first is bold and straight, the second bold and diagonal, and the third thin and diagonal. It would be remarkable if the model could learn from the data and detect all the different styles of 4.
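Displaying a random sample of digits can be sketched like this. The pixel data is a random stand-in, and the output filename is hypothetical; each 784-value row is reshaped to 28×28 before being drawn in grayscale.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical stand-in pixel data; each row is one flattened 28x28 image
rng = np.random.default_rng(2)
train_X = rng.integers(0, 256, size=(60, 784))

# Pick 10 random rows and draw each one as a 28x28 grayscale image
idx = rng.choice(len(train_X), size=10, replace=False)
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, i in zip(axes.ravel(), idx):
    ax.imshow(train_X[i].reshape(28, 28), cmap="gray")
    ax.axis("off")
fig.savefig("random_digits.png")
```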
Applying machine learning

I decided to use the Random Forest Classifier to train on the training data and predict on the test data.
I used the default values for all parameters.
Next, using the prediction, I calculated the accuracy and confusion matrix.
The model achieved an accuracy of 94.42%.
The confusion matrix shows that the model was able to predict a lot of images correctly.
Next, I decided to tweak the model parameters to try and improve the result.
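The baseline run above can be sketched as below. The data here is a small synthetic stand-in for the MNIST features, so the printed accuracy is meaningless; what the sketch shows is the shape of the workflow: fit with default parameters, predict, then score with `accuracy_score` and `confusion_matrix`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-in for the 784-feature MNIST data
rng = np.random.default_rng(3)
train_X = rng.random((200, 784))
train_y = rng.integers(0, 10, size=200)
test_X = rng.random((50, 784))
test_y = rng.integers(0, 10, size=50)

# Default-parameter Random Forest, as the article describes
clf = RandomForestClassifier(random_state=0)
clf.fit(train_X, train_y)
pred = clf.predict(test_X)

print("accuracy:", accuracy_score(test_y, pred))
print(confusion_matrix(test_y, pred))
```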
Parameter Tuning

To identify the best combination of parameter values for the model, I used GridSearchCV. It’s a method provided by the sklearn library that lets us define a set of candidate values for each parameter of a given model; it then trains on the data with every combination and identifies the best estimator.
In this particular case, I decided to select a range of values for a few parameters.
The number of estimators could be 100 or 200; the maximum depth 10, 50 or 100; the minimum samples per split 2 or 4; and the maximum features based on sqrt or log2.
GridSearchCV expects the estimator, which in our case is the random_forest_classifier. We pass the possible parameter values as param_grid and set the number of cross-validation folds to 5. Setting verbose to 5 logs progress to the console, and setting n_jobs to -1 makes the search use all cores on the machine.
Then, I fit this grid and use it to find the best estimator.
Finally, I use this best model to predict the test data.
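The search described above can be sketched as follows. The parameter grid matches the values named in the article, but the training data is a tiny synthetic stand-in so the search finishes quickly; on the real 60,000-image set the same call takes much longer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tiny hypothetical stand-in dataset so the grid search runs fast
rng = np.random.default_rng(4)
train_X = rng.random((60, 20))
train_y = rng.integers(0, 2, size=60)

# Candidate values from the article
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 50, 100],
    "min_samples_split": [2, 4],
    "max_features": ["sqrt", "log2"],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,        # 5-fold cross-validation
    verbose=5,   # log progress to the console
    n_jobs=-1,   # use all available cores
)
grid.fit(train_X, train_y)

# The refit best model is then used to predict on the test data
best_model = grid.best_estimator_
print(grid.best_params_)
```

Note that the 24 parameter combinations are each fit 5 times (once per fold), so the grid size multiplies training cost quickly.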
Taking a look at the accuracy above, we see that it improved to 97.08% from 94.42% just by changing the parameters of the model.
The confusion matrix also shows that more images were classified correctly.
Machine learning is not just about reading the data and applying multiple algorithms until we find a good model; it also involves fine-tuning the models to make them work best for the data at hand.
Identifying the right parameters is one of the essential steps in deciding which algorithm to use and making the most of it based on the data.
Conclusion

In this article, I discussed a project where I improved the accuracy of a Random Forest Classifier simply by selecting the best combination of parameter values using GridSearchCV. I used the MNIST dataset and improved the accuracy from 94.42% to 97.08%.
Read more articles:

Let’s build an Article Recommender using LDA: Recommend articles based on a search query (towardsdatascience.com)
Working with APIs using Flask, Flask RESTPlus and Swagger UI: An introduction to Flask and Flask-RESTPlus (towardsdatascience.com)
Predicting presence of Heart Diseases using Machine Learning: Application of Machine Learning in Healthcare (towardsdatascience.com)
Matplotlib — Making data visualization interesting: Using Matplotlib to create beautiful visualizations of Population Density across the world (towardsdatascience.com)

Please feel free to share your ideas and thoughts. You can also reach out to me on LinkedIn.