In this picture, the solid line represents the best hyperplane.

You can draw many different lines to separate the two classes of points, but this line is the best separator because it maximizes the margin, that is, the distance from the line to the nearest points of each class.

The points on the dotted lines are called Support Vectors.

The perpendicular distance between the two dotted lines is called the maximum margin.
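To make the margin concrete, the following sketch computes the perpendicular distance of a few points from a separating line and the resulting margin for that line. The weight vector `w`, the offset `b`, and the sample points are made-up values for illustration; they are not taken from the picture.

```python
import math

# Hypothetical separating line w.x + b = 0 (values chosen for illustration)
w = (1.0, 1.0)
b = -3.0

def signed_distance(point, w, b):
    """Perpendicular distance from point to the line; the sign gives the side."""
    norm = math.hypot(w[0], w[1])
    return (w[0] * point[0] + w[1] * point[1] + b) / norm

points = [(1.0, 1.0), (2.0, 2.0), (0.0, 1.0), (3.0, 2.0)]
distances = [signed_distance(p, w, b) for p in points]

# The support vectors are the points closest to the line on each side
closest_negative = max(d for d in distances if d < 0)
closest_positive = min(d for d in distances if d > 0)

# The margin is the gap between the two dotted lines through those points
margin = closest_positive - closest_negative
print("Distances:", [round(d, 3) for d in distances])
print("Margin:", round(margin, 3))
```

A Support Vector Machine searches for the line that makes this gap as large as possible.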

Classifying income data using Support Vector Machines

We will build a Support Vector Machine classifier to predict the income bracket of a given person based on 14 attributes.

Our goal is to see whether the income is higher or lower than $50,000 per year.

Hence this is a binary classification problem.

We will be using the census income dataset available at https://archive.ics.uci.edu/ml/datasets/Census+Income.

One thing to note in this dataset is that each datapoint is a mixture of words and numbers.

We cannot use the data in its raw format, because the algorithms don't know how to deal with words.

We cannot run every column through a label encoder either, because the numerical values are meaningful as they are and should be kept.

Hence we need to use a combination of label encoders and raw numerical data to build an effective classifier.
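The idea of encoding only the text columns can be sketched without any library. The rows below are made-up records, and `encode_column` is a hypothetical helper that mimics what a label encoder does by assigning each distinct string an integer in sorted order.

```python
# Made-up records mixing numeric and text attributes (for illustration only)
rows = [
    ['39', 'State-gov', 'Bachelors', '40'],
    ['50', 'Private', 'HS-grad', '13'],
    ['38', 'Private', 'Bachelors', '45'],
]

def encode_column(values):
    """Map each distinct string to an integer index, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

encoded_columns = []
mappings = {}  # column index -> label-to-integer mapping, kept for later reuse
for col_index, column in enumerate(zip(*rows)):
    if column[0].isdigit():
        # Numeric column: keep the values, just convert them to integers
        encoded_columns.append([int(v) for v in column])
    else:
        # Text column: replace each word with its integer code
        codes, mapping = encode_column(column)
        encoded_columns.append(codes)
        mappings[col_index] = mapping

encoded_rows = [list(row) for row in zip(*encoded_columns)]
print(encoded_rows)
```

Keeping one mapping per text column matters because a new data point has to be encoded with exactly the same codes that were used during training.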

Create a new Python file and import the following packages:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsOneClassifier
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # model_selection provides the same train_test_split and cross_val_score
    from sklearn import model_selection as cross_validation

We will be using the file income_data.txt to load the data. This file contains the income details:

    # Input file containing data
    input_file = 'income_data.txt'

In order to load the data from the file, we need to preprocess it so that we can prepare it for classification.

We will use at most 25,000 data points for each class:

    # Read the data
    X = []
    y = []
    count_class1 = 0
    count_class2 = 0
    max_datapoints = 25000

Open the file and start reading the lines:

    with open(input_file, 'r') as f:
        for line in f.readlines():
            if count_class1 >= max_datapoints and count_class2 >= max_datapoints:
                break

            # Skip records that contain a '?' entry
            if '?' in line:
                continue

Each line is comma separated, so we need to split it accordingly.

The last element in each line represents the label.

Depending on that label, we will assign it to a class:

            data = line[:-1].split(', ')

            if data[-1] == '<=50K' and count_class1 < max_datapoints:
                X.append(data)
                count_class1 += 1

            if data[-1] == '>50K' and count_class2 < max_datapoints:
                X.append(data)
                count_class2 += 1

Convert the list into a numpy array so that we can give it as an input to the sklearn function:

    # Convert to numpy array
    X = np.array(X)

If any attribute is a string, then we need to encode it.

If it is a number, we can keep it as it is.
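A quick way to tell the two cases apart in Python is `str.isdigit()`, which returns True only for strings made up entirely of digits. The sample values below are made up for illustration. That check is sufficient when numeric fields are non-negative integers; data with signs or decimal points would need a different test.

```python
# str.isdigit() is True only when every character is a digit
samples = ['37', '215646', 'Private', '-5', '3.14']
checks = {s: s.isdigit() for s in samples}

for value, is_number in checks.items():
    print(repr(value), '->', 'keep as number' if is_number else 'needs encoding')
```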

Note that we will end up with multiple label encoders and we need to keep track of all of them:

    # Convert string data to numerical data
    label_encoder = []
    X_encoded = np.empty(X.shape)
    for i, item in enumerate(X[0]):
        if item.isdigit():
            X_encoded[:, i] = X[:, i]
        else:
            label_encoder.append(preprocessing.LabelEncoder())
            X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])

    X = X_encoded[:, :-1].astype(int)
    y = X_encoded[:, -1].astype(int)

Create the SVM classifier with a linear kernel:

    # Create SVM classifier
    classifier = OneVsOneClassifier(LinearSVC(random_state=0))

Train the classifier:

    # Train the classifier
    classifier.fit(X, y)

Split the data 80/20 into training and testing sets, retrain the classifier on the training portion, and then predict the output for the test data:

    # Cross validation
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2, random_state=5)
    classifier = OneVsOneClassifier(LinearSVC(random_state=0))
    classifier.fit(X_train, y_train)
    y_test_pred = classifier.predict(X_test)

Compute the F1 score for the classifier:

    # Compute the F1 score of the SVM classifier
    f1 = cross_validation.cross_val_score(classifier, X, y, scoring='f1_weighted', cv=3)
    print("F1 score: " + str(round(100*f1.mean(), 2)) + "%")

Now that the classifier is ready, let’s see how to take a random input data point and predict the output.

Let’s define one such data point:

    # Predict output for a test datapoint
    input_data = ['37', 'Private', '215646', 'HS-grad', '9', 'Never-married', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', '0', '0', '40', 'United-States']

Before we can perform prediction, we need to encode this data point using the label encoders we created earlier:

    # Encode test datapoint
    input_data_encoded = [-1] * len(input_data)
    count = 0
    for i, item in enumerate(input_data):
        if item.isdigit():
            input_data_encoded[i] = int(input_data[i])
        else:
            # transform() expects an array-like, so wrap the value in a list
            input_data_encoded[i] = int(label_encoder[count].transform([input_data[i]])[0])
            count += 1

    input_data_encoded = np.array(input_data_encoded)

We are now ready to predict the output using the classifier:

    # Run classifier on encoded datapoint and print output
    # predict() expects a 2D array of shape (n_samples, n_features)
    predicted_class = classifier.predict(input_data_encoded.reshape(1, -1))
    print(label_encoder[-1].inverse_transform(predicted_class)[0])

If you run the code, it will take a few seconds to train the classifier.

Once it’s done, you will see the following printed on your Terminal:

    F1 score: 66.82%

You will also see the output for the test data point:

    <=50K

If you check the values in that data point, you will see that it closely corresponds to the data points in the less than 50K class.

You can change the performance of the classifier (F1 score, precision, or recall) by using different kernels and trying out different combinations of the parameters.
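To see what the weighted F1 score measures, here is a library-free sketch that computes precision, recall, and F1 per class and then averages the F1 scores weighted by how often each class occurs. The labels are made up, and the helper names `f1_per_class` and `f1_weighted` are hypothetical; this is meant to illustrate the standard definition rather than reproduce the library's implementation.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, from its precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_weighted(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class frequency in y_true."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        support = sum(1 for t in y_true if t == label)
        score += (support / total) * f1_per_class(y_true, y_pred, label)
    return score

# Made-up labels: 0 stands for '<=50K', 1 for '>50K'
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print("Weighted F1:", round(f1_weighted(y_true, y_pred), 4))
```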

What is Regression?

Regression is the process of estimating the relationship between input and output variables.

One thing to note is that the output variables are continuous-valued real numbers.

Hence there is an infinite number of possibilities.

This is in contrast with classification, where the number of output classes is fixed.

The classes belong to a finite set of possibilities.

In regression, it is assumed that the output variables depend on the input variables, so we want to see how they are related.

Consequently, the input variables are called independent variables, also known as predictors, and output variables are called dependent variables, also known as criterion variables.

It is not necessary that the input variables are independent of each other.

There are a lot of situations where there are correlations between input variables.

Regression analysis helps us in understanding how the value of the output variable changes when we vary some input variables while keeping other input variables fixed.
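As a small illustration of varying one input while holding the others fixed, consider a model with two inputs. The function `predict` and its coefficients are made up for this sketch; in a linear model, the change in output per unit change of an input is simply that input's coefficient.

```python
def predict(x1, x2):
    """Hypothetical linear model: coefficients are made up for illustration."""
    return 2.0 * x1 + 3.0 * x2 + 1.0

baseline = predict(4.0, 5.0)
x1_changed = predict(5.0, 5.0)  # x1 raised by one unit, x2 held fixed
x2_changed = predict(4.0, 6.0)  # x2 raised by one unit, x1 held fixed

print("Effect of x1:", x1_changed - baseline)
print("Effect of x2:", x2_changed - baseline)
```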

In linear regression, we assume that the relationship between input and output is linear.

This puts a constraint on our modelling procedure, but it’s fast and efficient.

Sometimes, linear regression is not sufficient to explain the relationship between input and output.

Hence we use polynomial regression, where we use a polynomial to explain the relationship between input and output.

This is more computationally complex, and it can fit the data more closely, although a high-degree polynomial may overfit.

Depending on the problem at hand, we use different forms of regression to extract the relationship.

Regression is frequently used for prediction of prices, economics, variations, and so on.
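The difference between a linear and a polynomial fit can be seen on a handful of points. This sketch uses NumPy's `polyfit` and `polyval` on made-up data that follows a quadratic trend; the data and the choice of degree 2 are assumptions made for illustration.

```python
import numpy as np

# Made-up data following a quadratic trend (for illustration only)
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2 + 1.0

# Fit a straight line (degree 1) and a quadratic (degree 2) by least squares
linear_coeffs = np.polyfit(x, y, 1)
quadratic_coeffs = np.polyfit(x, y, 2)

# Mean squared error of each fit on the same points
linear_mse = np.mean((np.polyval(linear_coeffs, x) - y) ** 2)
quadratic_mse = np.mean((np.polyval(quadratic_coeffs, x) - y) ** 2)

print("Linear MSE:", round(float(linear_mse), 4))
print("Quadratic MSE:", round(float(quadratic_mse), 4))
```

The straight line cannot follow the curvature, so its error stays large, while the quadratic matches these points almost exactly.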

Building a single variable regressor

Let’s see how to build a single variable regression model.

Create a new Python file and import the following packages:

    import pickle
    import numpy as np
    from sklearn import linear_model
    import sklearn.metrics as sm
    import matplotlib.pyplot as plt

We will use the file data_singlevar_regr.txt provided to you.

This is our source of data:

    # Input file containing data
    input_file = 'data_singlevar_regr.txt'

It’s a comma-separated file, so we can easily load it using a one-line function call:

    # Read data
    data = np.loadtxt(input_file, delimiter=',')
    X, y = data[:, :-1], data[:, -1]

Split it into training and testing:

    # Train and test split
    num_training = int(0.8 * len(X))
    num_test = len(X) - num_training

    # Training data
    X_train, y_train = X[:num_training], y[:num_training]

    # Test data
    X_test, y_test = X[num_training:], y[num_training:]

Create a linear regressor object and train it using the training data:

    # Create linear regressor object
    regressor = linear_model.LinearRegression()

    # Train the model using the training sets
    regressor.fit(X_train, y_train)

Predict the output for the testing dataset using the trained model:

    # Predict the output
    y_test_pred = regressor.predict(X_test)

Plot the output:

    # Plot outputs
    plt.scatter(X_test, y_test, color='green')
    plt.plot(X_test, y_test_pred, color='black', linewidth=4)
    plt.xticks(())
    plt.yticks(())
    plt.show()

Compute the performance metrics for the regressor by comparing the ground truth, which refers to the actual outputs, with the predicted outputs:

    # Compute performance metrics
    print("Linear regressor performance:")
    print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
    print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
    print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
    print("Explain variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
    print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

Once the model has been created, we can save it into a file so that we can use it later.
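Before moving on, it may help to see what the metrics above measure. The sketch below recomputes them by hand on a few made-up values. The formulas follow the standard definitions, and the variable names are chosen for illustration rather than taken from the library.

```python
import statistics

# Made-up ground truth and predictions (for illustration only)
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.5, 5.0, 3.0, 8.0]

errors = [t - p for t, p in zip(y_true, y_hat)]
n = len(errors)

mean_absolute_error = sum(abs(e) for e in errors) / n
mean_squared_error = sum(e ** 2 for e in errors) / n
median_absolute_error = statistics.median(abs(e) for e in errors)

# R2 compares the squared error with that of always predicting the mean
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Explained variance is like R2 but ignores a constant offset in the errors
mean_error = sum(errors) / n
var_error = sum((e - mean_error) ** 2 for e in errors) / n
var_true = ss_tot / n
explained_variance = 1 - var_error / var_true

print("Mean absolute error =", round(mean_absolute_error, 2))
print("Mean squared error =", round(mean_squared_error, 2))
print("Median absolute error =", round(median_absolute_error, 2))
print("Explained variance score =", round(explained_variance, 2))
print("R2 score =", round(r2, 2))
```

For the error metrics, lower is better; for explained variance and R2, values closer to 1 indicate a better fit.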

Python provides a module called pickle that lets us save the trained model to a file:

    # Model persistence
    output_model_file = 'model.pkl'

    # Save the model
    with open(output_model_file, 'wb') as f:
        pickle.dump(regressor, f)

Let’s load the model from the file on the disk and perform prediction:

    # Load the model
    with open(output_model_file, 'rb') as f:
        regressor_model = pickle.load(f)

    # Perform prediction on test data
    y_test_pred_new = regressor_model.predict(X_test)
    print("\nNew mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred_new), 2))

If you run the code, you will see a plot of the test data points (green) together with the fitted regression line (black).

You will see the following printed on your Terminal:

    Linear regressor performance:
    Mean absolute error = 0.59
    Mean squared error = 0.49
    Median absolute error = 0.51
    Explain variance score = 0.86
    R2 score = 0.86

    New mean absolute error = 0.59

Building a multivariable regression

In the previous section, we discussed how to build a regression model for a single variable.

In this section, we will deal with multidimensional data.

Create a new Python file and import the following packages:

    import numpy as np
    from sklearn import linear_model
    import sklearn.metrics as sm
    from sklearn.preprocessing import PolynomialFeatures

We will use the file data_multivar_regr.txt provided to you.

    # Input file containing data
    input_file = 'data_multivar_regr.txt'

This is a comma-separated file, so we can load it easily with a one-line function call:

    # Load the data from the input file
    data = np.loadtxt(input_file, delimiter=',')
    X, y = data[:, :-1], data[:, -1]

Split the data into training and testing:

    # Split data into training and testing
    num_training = int(0.8 * len(X))
    num_test = len(X) - num_training

    # Training data
    X_train, y_train = X[:num_training], y[:num_training]

    # Test data
    X_test, y_test = X[num_training:], y[num_training:]

Create and train the linear regressor model:

    # Create the linear regressor model
    linear_regressor = linear_model.LinearRegression()

    # Train the model using the training sets
    linear_regressor.fit(X_train, y_train)

Predict the output for the test dataset:

    # Predict the output
    y_test_pred = linear_regressor.predict(X_test)

Print the performance metrics:

    # Measure performance
    print("Linear Regressor performance:")
    print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
    print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
    print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
    print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
    print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

Create a polynomial regressor of degree 10.

Train the regressor on the training dataset.

Let’s take a sample data point and see how to perform prediction.

The first step is to transform it into a polynomial:

    # Polynomial regression
    polynomial = PolynomialFeatures(degree=10)
    X_train_transformed = polynomial.fit_transform(X_train)
    datapoint = [[7.75, 6.35, 5.56]]
    # Reuse the transformer that was fitted on the training data
    poly_datapoint = polynomial.transform(datapoint)

If you look closely, this data point is very close to the data point on line 11 in our data file, which is [7.66, 6.29, 5.66]. So, a good regressor should predict an output that’s close to 41.35.
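To get a feel for what transforming a data point into a polynomial means, here is a library-free sketch that expands a point into all products of its inputs up to a given degree. The helper `polynomial_terms` is hypothetical and written only for illustration; it shows the kind of expansion involved rather than the library's own code.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(point, degree):
    """All monomials of the inputs up to the given total degree, including the constant 1."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(point, d):
            terms.append(prod(combo))
    return terms

# Two inputs a=2, b=3 expanded to degree 2: 1, a, b, a*a, a*b, b*b
small_example = polynomial_terms([2.0, 3.0], 2)
print(small_example)

# Three inputs expanded to degree 10, as in the data point above
large_example = polynomial_terms([7.75, 6.35, 5.56], 10)
print("Number of terms:", len(large_example))
```

The number of terms grows quickly with the degree, which is why a degree 10 fit is considerably more expensive than a plain linear one.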

Create a linear regressor object and perform the polynomial fit.

Perform the prediction using both linear and polynomial regressors to see the difference:

    poly_linear_model = linear_model.LinearRegression()
    poly_linear_model.fit(X_train_transformed, y_train)
    print("\nLinear regression:\n", linear_regressor.predict(datapoint))
    print("\nPolynomial regression:\n", poly_linear_model.predict(poly_datapoint))

If you run the code, you will see the following printed on your Terminal:

    Linear Regressor performance:
    Mean absolute error = 3.58
    Mean squared error = 20.31
    Median absolute error = 2.99
    Explained variance score = 0.86
    R2 score = 0.86

You will see the following as well:

    Linear regression:
     [ 36.05286276]

    Polynomial regression:
     [ 41.46961676]

Estimating housing prices using a Support Vector Regressor

Let’s see how to use the SVM concept to build a regressor to estimate the housing prices.

We will use the dataset available in sklearn, where each data point is defined by 13 attributes.

Our goal is to estimate the housing prices based on these attributes.

Create a new Python file and import the following packages:

    import numpy as np
    from sklearn import datasets
    from sklearn.svm import SVR
    from sklearn.metrics import mean_squared_error, explained_variance_score
    from sklearn.utils import shuffle

Load the housing dataset:

    # Load housing data
    # Note: load_boston() was removed in scikit-learn 1.2, so this call needs an earlier release
    data = datasets.load_boston()

Let’s shuffle the data so that we don’t bias our analysis:

    # Shuffle the data
    X, y = shuffle(data.data, data.target, random_state=7)

Split the dataset into training and testing in an 80/20 format:

    # Split the data into training and testing datasets
    num_training = int(0.8 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]

Create and train the Support Vector Regressor using a linear kernel.

The C parameter represents the penalty for training error.

If you increase the value of C, the model is fitted more tightly to the training data.

But this might lead to overfitting, so the model generalizes less well to new data.

The epsilon parameter specifies a threshold; there is no penalty for training error if the predicted value is within this distance from the actual value:

    # Create Support Vector Regression model
    sv_regressor = SVR(kernel='linear', C=1.0, epsilon=0.1)

    # Train Support Vector Regressor
    sv_regressor.fit(X_train, y_train)

Evaluate the performance of the regressor and print the metrics:

    # Evaluate performance of Support Vector Regressor
    y_test_pred = sv_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_test_pred)
    evs = explained_variance_score(y_test, y_test_pred)
    print("\n#### Performance ####")
    print("Mean squared error =", round(mse, 2))
    print("Explained variance score =", round(evs, 2))

Let’s take a test data point and perform prediction:

    # Test the regressor on test datapoint
    test_data = [3.7, 0, 18.4, 1, 0.87, 5.95, 91, 2.5052, 26, 666, 20.2, 351.34, 15.27]
    print("\nPredicted price:", sv_regressor.predict([test_data])[0])

If you run the code, you will see the following printed on the Terminal:

    #### Performance ####
    Mean squared error = 15.41
    Explained variance score = 0.82

    Predicted price: 18.5217801073
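The role of the epsilon parameter described in this section can be illustrated with a few lines of plain Python. The function `epsilon_insensitive_loss` is a hypothetical helper that follows the usual definition of this loss, and the numbers are made up; it is a sketch of the idea, not the regressor's internal code.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

epsilon = 0.1

# Prediction within epsilon of the actual value: no penalty
loss_inside = epsilon_insensitive_loss(20.0, 20.05, epsilon)

# Prediction further away: only the part of the error beyond epsilon is penalized
loss_outside = epsilon_insensitive_loss(20.0, 21.0, epsilon)

print("Loss inside the tube:", loss_inside)
print("Loss outside the tube:", round(loss_outside, 2))
```

A larger epsilon tolerates bigger errors without penalty, which tends to give a simpler model, while C controls how strongly the remaining errors are penalized.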