Naive Bayes Classification from Scratch in Python

All together, the posterior probability in terms of the joint probability distribution (neglecting the denominator P(x)) is written as:

$$P(c \mid \mathbf{x}) \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)$$

Now, to calculate each feature's likelihood, we use a Gaussian distribution, which is parameterized by a mean and a sigma (standard deviation). We can also compute the likelihood of the entire dataset, but then the parameters will be a mean vector and a covariance matrix. So for a particular feature, say X1, the likelihood density for class c is modeled as

$$P(X_1 \mid c) = \frac{1}{\sqrt{2\pi\sigma_{1c}^{2}}}\exp\left(-\frac{(X_1-\mu_{1c})^{2}}{2\sigma_{1c}^{2}}\right)$$

Here X1 is the vector of feature values with class label c, and \(\mu_{1c}\) and \(\sigma_{1c}\) are its mean and standard deviation.

Finally, putting it all together, the steps involved in Naive Bayes classification for a two-class problem with class labels 0 and 1 are:

1. Divide the training dataset into subgroups based on class labels. Example: for a two-class problem with labels 0 and 1, all the training examples that belong to class 0 are separated from those that belong to class 1.
2. Compute the mean and standard deviation of each feature in each separated subset individually. Example: for a feature X1 from the original training set, there will be two sets of means and standard deviations, one for class 0 and one for class 1.
3. For each testing example Xtest, calculate the two joint probabilities P(c=0 | Xtest) and P(c=1 | Xtest).
4. Finally, assign Xtest to the class with the larger joint probability.

Now let's implement the above steps in Python. We are going to use the numpy, pandas and matplotlib libraries. The dataset we are going to use is "Social_Network_Ads.csv". Let's load the dataset and see the first ten rows:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# reading dataset
Data = pd.read_csv('Social_Network_Ads.csv')
Data.head(10)
"""output
    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0
5  15728773    Male   27            58000          0
6  15598044  Female   27            84000          0
7  15694829  Female   32           150000          1
8  15600575    Male   25            33000          0
9  15727311  Female   35            65000          0
"""
```

We take Age and EstimatedSalary as the independent features and Purchased as the dependent feature (the class labels are 0 and 1). Now let's decide the training and testing set sizes. We shall take 75% of the original data as training data and 25% as testing data.

```python
# training and testing set size
train_size = int(0.75 * Data.shape[0])
test_size = int(0.25 * Data.shape[0])
print("Training set size : " + str(train_size))
print("Testing set size : " + str(test_size))
"""output
Training set size : 300
Testing set size : 100
"""
```

Shuffle the dataset and extract the required features.

```python
# getting features from dataset
Data = Data.sample(frac=1)        # shuffle the rows
X = Data.iloc[:, [2, 3]].values   # Age and EstimatedSalary
y = Data.iloc[:, 4].values        # Purchased
X = X.astype(float)
```

It's useful to perform feature scaling as well, because the scales of the two features, age and salary, are entirely different. The FeatureScaling class is already implemented and you can get the code from the GitHub link; a minimal sketch of such a class is shown below.

```python
# feature scaling
from FeatureScaling import FeatureScaling
fs = FeatureScaling(X, y)
X = fs.fit_transform_X()
```
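The FeatureScaling implementation itself is not shown in this post. For reference, here is a minimal sketch of such a class, assuming it simply standardizes each feature column to zero mean and unit variance; the internals here are an assumption, not the actual code from the GitHub link.

```python
import numpy as np

# Minimal sketch of a FeatureScaling class (assumed behaviour: standardize
# each column of X to zero mean and unit variance). The actual class from
# the GitHub link may differ.
class FeatureScaling:
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def fit_transform_X(self):
        mean = self.X.mean(axis=0)    # per-feature mean
        std = self.X.std(axis=0)      # per-feature standard deviation
        return (self.X - mean) / std  # standardized features
```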
Now split the data into training and testing sets.

```python
# training set split
X_train = X[0:train_size, :]
y_train = y[0:train_size]

# testing set split
X_test = X[train_size:, :]
y_test = y[train_size:]
```

We are ready with our data, so let's proceed to the task of classifying the testing data using the Naive Bayes algorithm. Before that, let's visualize the training set.

```python
# visualize the training set
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j, marker='.')
plt.title('Training set')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```

The objective is to find a decision boundary which separates the red and green dots. So let's implement each step.

The generate_data function takes X_train and y_train as parameters and splits the dataset based on the class labels 0 and 1. The split data is stored in a dictionary named class_data_dic, which is indexed by 0 and 1; that is, class_data_dic[0] fetches all the input training examples that have label 0, and class_data_dic[1] fetches all the input training examples that have label 1.

```python
def generate_data(class_data_dic, X_train, y_train):
    first_one = True
    first_zero = True
    for i in range(y_train.shape[0]):
        # each training example is stored as a column vector
        X_temp = X_train[i, :].reshape(X_train[i, :].shape[0], 1)
        if y_train[i] == 1:
            if first_one == True:
                class_data_dic[1] = X_temp
                first_one = False
            else:
                class_data_dic[1] = np.append(class_data_dic[1], X_temp, axis=1)
        elif y_train[i] == 0:
            if first_zero == True:
                class_data_dic[0] = X_temp
                first_zero = False
            else:
                class_data_dic[0] = np.append(class_data_dic[0], X_temp, axis=1)
    return class_data_dic
```

The remaining steps, computing the per-class statistics and the joint probabilities, are sketched below.
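For step 2, we need the mean and standard deviation of each feature within each class subset. A minimal sketch of such a helper follows; the name mean_std_by_class is mine, not from the original post. Since generate_data stores one example per column, the statistics are taken along axis=1.

```python
# Sketch of step 2: per-class mean and standard deviation of each feature.
# class_data_dic[c] holds the examples of class c, one example per column.
def mean_std_by_class(class_data_dic):
    stats = {}
    for c, X_c in class_data_dic.items():
        mu = X_c.mean(axis=1)     # mean of each feature for class c
        sigma = X_c.std(axis=1)   # standard deviation of each feature for class c
        stats[c] = (mu, sigma)
    return stats
```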
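Steps 3 and 4 then evaluate the Gaussian likelihood of each feature for a test example, multiply the likelihoods together with the class prior, and pick the class with the larger joint probability. Again, this is a sketch consistent with the formulas above, not the post's own code; gaussian_pdf and predict are assumed names.

```python
# Gaussian density of a feature value x given that feature's mean and std.
def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Sketch of steps 3 and 4: compute P(c) * prod_i P(x_i | c) for c in {0, 1}
# and assign each test example to the class with the larger joint probability.
def predict(X_test, stats, priors):
    y_pred = []
    for x in X_test:
        best_class, best_joint = None, -1.0
        for c, (mu, sigma) in stats.items():
            joint = priors[c] * np.prod(gaussian_pdf(x, mu, sigma))
            if joint > best_joint:
                best_class, best_joint = c, joint
        y_pred.append(best_class)
    return np.array(y_pred)
```

Put together, the whole pipeline could then be run on the data prepared earlier:

```python
class_data_dic = generate_data({}, X_train, y_train)
stats = mean_std_by_class(class_data_dic)
priors = {c: np.mean(y_train == c) for c in (0, 1)}   # class priors P(c)
y_pred = predict(X_test, stats, priors)
print("Accuracy : " + str(np.mean(y_pred == y_test)))
```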
