Data Science is Disrupting the Way We Look at Data

This is when machine learning comes into play.

A machine learning classifier can predict whether someone will purchase your car by comparing their information with that of hundreds or thousands of other people who did or did not buy it.

This might sound complicated now, but I will break down how this is done if you want to understand it further or if you want to make your own classifier!

How to make a Machine Learning Classifier (Python)

GitHub Repository: https://github.com/Vedant-Gupta523/randomforestclassification

Importing libraries and the data set

To start solving this problem, we begin by importing the information we collected from our consumers:

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Customer_Information.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
```

We imported 3 libraries which will help us with our ML model. We also imported our data set, "Customer_Information.csv".

This set contains information on 400 people: their gender, age, annual salary, and whether or not they bought the car.

X and y represent the independent and dependent variables respectively.

The dependent variable (y) is what we are trying to figure out and is influenced by the independent variables (X).

In our case, the dependent variable is whether or not the customer purchased the car.

The independent variables are the age and estimated salary.

dataset.iloc[].values selects the rows and columns we want from our data set. For the dependent variable we selected everything in column 5 (index 4). For the independent variables we took everything from columns 3 and 4 (indices 2 and 3).
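If the indexing is unclear, here is a minimal sketch of how .iloc works. The column names below are made up for illustration; the post doesn't show the CSV's exact headers:

```python
import pandas as pd

# A two-row stand-in for Customer_Information.csv (hypothetical column names)
df = pd.DataFrame({
    'UserID': [1, 2],
    'Gender': ['Male', 'Female'],
    'Age': [22, 40],
    'Salary': [50000, 35000],
    'Purchased': [0, 1],
})

print(df.iloc[:, [2, 3]].values)  # all rows, columns at index 2 and 3 (Age, Salary)
print(df.iloc[:, 4].values)       # all rows, column at index 4 (Purchased)
```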

Splitting the data set into the training and test sets

In school, teachers give us homework from which we learn various concepts.

After some time, we are given a test to see if we can apply what we learned from the homework to solve similar, but different problems.

When we are training our machine learning classifier to figure out who will buy our car, we follow a similar process.

We divide our data set into two different sets, a training set and a test set.

The model uses the training set to find correlations between the dependent variable and independent variables.

Then, we give it the test set (without the dependent variable) and it uses what it learned to make predictions on the dependent variable.

Afterwards, we can compare the results to see how accurate our model is.

```python
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
```

In the snippet above, we use the train_test_split function from the sklearn.model_selection library to divide our data set. (Older tutorials import it from sklearn.cross_validation, but that module has been removed from modern scikit-learn.) Our data set is stored in 4 variables: X_train (training independent variables), y_train (training dependent variables), X_test (test independent variables), and y_test (actual answers for the test independent variables). 25% of our complete data set was put into our test set (test_size = 0.25).
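As a quick sanity check (not in the original post), you can print the shapes of the four arrays. With the 400-person data set described above and test_size = 0.25, you would expect a 300/100 split:

```python
# Expected split for 400 rows with test_size = 0.25
print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
print(y_train.shape, y_test.shape)  # (300,) (100,)
```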

Feature Scaling

Feature scaling is a step that isn't always needed when creating your machine learning model, but it is in this case. If we put our numerical variables (e.g. age, salary) into our classifier's algorithm as they are, our results will become skewed. Even though age and salary represent two completely different things, the algorithm will plainly look at them as numbers. When it inputs the values 22 (age) and 50,000 (salary) into its formula, it won't take their vastly different scales into account.

Think of it as comparing millimeters to kilometers without converting the units.

The purpose of feature scaling is to take every numerical value and put it on the same scale.

This way the algorithm can use the values fairly.

```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

To do this, we use the StandardScaler class from the sklearn.preprocessing library to scale all of our independent variables. Note that the scaler is fitted on the training set only (fit_transform) and then reused on the test set (transform), so no information from the test set leaks into training.
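To see concretely what standardization does, here is a small sketch (the numbers are made up): each column is rescaled to mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical (age, salary) rows
values = np.array([[22, 50000], [40, 35000], [31, 79000]], dtype=float)
scaled = StandardScaler().fit_transform(values)

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
```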

Fitting the classifier to our Training set and making predictions

It's finally time to start training our algorithm and predicting the results of our test set!

```python
# Fitting Random Forest to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = "entropy", random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
```

Although there are many different classifiers available, for this problem I chose the Random Forest classifier. I started by importing the RandomForestClassifier class from sklearn.ensemble and fitting (training) it to our training set. In the second section of the snippet I created a new variable, y_pred, which holds the predictions the classifier makes for the test set (X_test).
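Once trained, the classifier can also score a brand-new customer. This isn't in the original post, and the age and salary below are invented; the important detail is that raw values must pass through the same scaler used in training:

```python
# Hypothetical new customer: 40 years old, $35,000 annual salary
new_customer = sc.transform([[40, 35000]])

print(classifier.predict(new_customer))        # 0 (won't buy) or 1 (will buy)
print(classifier.predict_proba(new_customer))  # class probabilities averaged over the 10 trees
```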

Evaluating our Results

It is important to know that no machine learning model is 100% accurate.

If your model appears to have perfect accuracy, it is likely due to over-fitting.

Over-fitting means that your model strictly follows the EXACT rules it found in the training set.

For example, if you are trying to predict whether a 40-year-old man with a $35,000 salary purchased a car and the classifier hasn't trained on that exact data point, it will likely default to saying he didn't purchase it, even though this might not be accurate.
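One quick way to spot over-fitting (an extra check, not from the original post) is to compare accuracy on the training set against accuracy on the test set; a near-perfect training score paired with a much lower test score is the classic symptom:

```python
from sklearn.metrics import accuracy_score

# Accuracy on data the model has already seen vs. data it hasn't
print(accuracy_score(y_train, classifier.predict(X_train)))  # often near 1.0 when over-fit
print(accuracy_score(y_test, classifier.predict(X_test)))    # the score that actually matters
```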

A quick way to check how many predictions were right/wrong is to use a confusion matrix:

[Figure: confusion matrix outline]

```python
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

Confused? Don't be! The confusion matrix is divided into four sections, as shown in the first image. With scikit-learn's default ordering (rows are actual classes, columns are predicted classes), the number in the top left counts the people we correctly predicted would not buy the car, and the number in the bottom right counts the people we correctly predicted would buy it. The other two cells are the mistakes: the top right counts people we predicted would buy the car but didn't, and the bottom left is the other way around. The important thing to note is that the sum of the top left and bottom right is how many predictions we got right!

Let's look at this possible confusion matrix for our problem:

[Figure: confusion matrix]

Our confusion matrix informs us that we got 92 out of the 100 test predictions correct (63 + 29).
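The overall accuracy can be read straight off the matrix: the diagonal sum divided by the total. Here is a minimal check using the counts quoted above (the split of the 8 errors between the two off-diagonal cells is assumed for illustration):

```python
import numpy as np

cm = np.array([[63, 5],
               [3, 29]])  # 63 + 29 = 92 correct is from the article; the 5/3 split is assumed

accuracy = np.trace(cm) / cm.sum()  # diagonal over total
print(accuracy)  # 0.92
```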

Visualizing Results

The last step is to visualize the results of our classifier on a graph!

```python
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train

# Build a fine grid covering the (scaled) age/salary plane
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

# Colour each grid point by the class the classifier predicts for it
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Overlay the actual data points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results (same plot, but with the test data)
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```

Using the matplotlib library, we can create beautiful graphs to visualize the correlations the model found during training and how well its predictions followed them.

Let's break down what we see on the graphs above.

Each red point represents someone who didn’t purchase the car and each green point represents someone who did.

If a point falls within a red region, the classifier predicts that that person didn't purchase the car, and vice versa.

The general trend we notice is that older people with higher salaries have a higher likelihood of purchasing this car.

This could be super valuable information for someone who is trying to improve their sales/marketing strategies!

Key Takeaways

Data science will greatly improve the efficiency and accuracy of our decision making and allow us to make analyses that humans wouldn't be able to do alone.

In the future, any time you face a difficult data-related problem, you can create your own handy machine learning model to help you make the best decisions possible ;)

"Information is the oil of the 21st century, and analytics is the combustion engine" — Eric Schmidt
