Performing Classification in TensorFlow

Harshdeep Singh
Feb 25

In this article, I will explain how to perform classification using the TensorFlow library in Python.
We’ll be working with the California Census Data and will try to use various features of individuals to predict what class of income they belong in (>50k or <=50k).
The data can be accessed at my GitHub profile in the TensorFlow repository.
Here is the link to access the data.
My code and Jupyter notebook can be accessed in the HarshSingh16/Tensorflow repository on GitHub.

Importing the libraries and the dataset

Let’s begin by importing the necessary libraries and the dataset into our Jupyter Notebook.
Let’s look into our dataset.
So, there are 15 columns.
Of these 15 columns, 6 are numeric, while the remaining 9 are categorical.
The following image provides information regarding the type of columns and the respective descriptions.
Please note that we would not be using the variable “fnlwgt” in this example.
Looking at our target column:

We will now look at our target column “Income”.
As described earlier, we are trying to classify the income bracket of our individuals.
So, there are basically two classes: “≤50K” and “>50K”.
However, we cannot leave our target labels in the current string format.
This is because the classifier, as we set it up here, expects integer class labels rather than strings.
We will have to convert these strings into 0 and 1: 1 if the income bracket is greater than 50K, and 0 if the income bracket is less than or equal to 50K.
We can do so by creating a for loop, and then appending the labels to a list.
I have also updated the existing “Income” column directly with the new list that we just created.
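The conversion described above might look like the following minimal sketch. It works on a plain Python list; the helper name `encode_income` and the sample strings are assumptions made for illustration, and the exact label text in the dataset (including any surrounding whitespace) may differ.

```python
def encode_income(labels):
    """Map income-bracket strings to 1 (>50K) or 0 (<=50K) with a for loop."""
    encoded = []
    for label in labels:
        # strip() guards against stray whitespace around the raw label
        if label.strip() == ">50K":
            encoded.append(1)
        else:
            encoded.append(0)
    return encoded


# Hypothetical sample of raw target values
raw_income = [" <=50K", " >50K", " <=50K", " >50K"]
income = encode_income(raw_income)
print(income)  # [0, 1, 0, 1]
```

With a pandas DataFrame, the resulting list can be assigned back to the target column to overwrite the string labels.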
Normalizing our numeric features:

We now want to normalize our numeric features.
Normalization is the process of converting the actual range of values that a numerical feature can take into a standard range of values, typically the interval [−1, 1] or [0, 1].
Normalizing the data is not a strict requirement.
However, in practice, it can lead to an increased speed of learning.
Additionally, it’s useful to ensure that our inputs are roughly in the same relatively small range, to avoid the problems computers have when working with very small or very big numbers (known as numerical underflow and overflow).
We will use a lambda function to do this.
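As a rough illustration, min-max scaling to [0, 1] can be written as a lambda like the one below. This is a sketch on a plain Python list rather than the notebook's own code; the name `min_max` and the sample ages are assumptions, and the formula assumes the values are not all identical (otherwise the denominator is zero).

```python
# Min-max scaling written as a lambda: maps a list of numbers onto [0, 1]
min_max = lambda values: [
    (v - min(values)) / (max(values) - min(values)) for v in values
]

# Hypothetical ages, for illustration only
ages = [20, 35, 50, 80]
scaled_ages = min_max(ages)
print(scaled_ages)  # [0.0, 0.25, 0.5, 1.0]
```

With a pandas DataFrame, the same formula is typically applied column by column, for example `df[cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))`, where `df` and `cols` stand for the DataFrame and the list of numeric column names.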
Creating continuous and categorical features:

The next step is to create feature columns for our numeric and categorical data.
Think of feature columns as the intermediaries between raw data and Estimators.
Feature columns are very rich, enabling you to transform a diverse range of raw data into formats that Estimators can use, allowing easy experimentation.
Here is an example from the TensorFlow website that illustrates how feature columns work.
The data being discussed here is the famous Iris dataset.
As the following figure suggests, you specify the input to a model through the feature_columns argument of an Estimator (DNNClassifier for Iris).
Feature Columns bridge input data (as returned by input_fn) with your model.
To create feature columns, we have to call functions from the tf.feature_column module.
This image from TensorFlow’s website explains nine of the functions in that module.
As the following figure shows, all nine functions return either a Categorical-Column or a Dense-Column object, except bucketized_column, which inherits from both classes.

It’s now time to create feature columns for our dataset.
We will first tackle the numerical columns and convert them to features by using tf.feature_column.numeric_column.

Next, we will tackle the categorical features.
Here we have two options:

tf.feature_column.categorical_column_with_hash_bucket: Use this if you don’t know the set of possible values for a categorical column in advance and there are too many of them.
tf.feature_column.categorical_column_with_vocabulary_list: Use this if you know the set of all possible feature values of a column and there are only a few of them.

Since in our case we have too many feature values in each of our categorical columns, we will use the hash function.
Be sure to specify a hash bucket size that is larger than the total number of categories in a column. This reduces the chance of two different categories being assigned to the same hash value, although hashing cannot rule out such collisions entirely.
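To see why the bucket size matters, here is a small plain-Python sketch of hashing categories into buckets. It is not TensorFlow's implementation: md5 is used only as a deterministic stand-in for the library's internal hash, and the helper names and category values are assumptions made for illustration.

```python
import hashlib


def bucket_id(category, hash_bucket_size):
    """Assign a category string to a bucket in [0, hash_bucket_size)."""
    # md5 is only a deterministic stand-in for TensorFlow's internal hash
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def count_collisions(categories, hash_bucket_size):
    """Number of distinct categories that land in an already used bucket."""
    used_buckets = {bucket_id(c, hash_bucket_size) for c in categories}
    return len(categories) - len(used_buckets)


# Hypothetical category values, for illustration only
occupations = ["Sales", "Tech-support", "Craft-repair", "Exec-managerial",
               "Farming-fishing", "Adm-clerical"]

for size in (3, 10, 1000):
    print(size, count_collisions(occupations, size))
```

With only 3 buckets, at least three of the six categories must share a bucket; larger sizes make collisions less likely but never impossible.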
Next, we want to put all these feature columns into a single list named feat_columns.
Performing the training and test split

We will be using the sklearn library to perform our train-test split.
Thus we will have to separate our labels from the features.
This is because the train_test_split function from sklearn requires you to pass the features and their target column separately.
We will now import train_test_split.
We will keep 33% of the data in the test set.
This will give us a sufficient number of observations to accurately evaluate our model’s performance.
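The idea behind the split can be sketched in plain Python as follows. This is a simplified stand-in for sklearn's `train_test_split`, not its implementation; the helper name `holdout_split`, the seed value and the toy data are assumptions made for illustration.

```python
import random


def holdout_split(features, labels, test_size=0.33, seed=101):
    """Shuffle row indices and split features/labels into train and test parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data: 100 rows with one feature each
x_data = [[i] for i in range(100)]
y_labels = [i % 2 for i in range(100)]

x_train, x_test, y_train, y_test = holdout_split(x_data, y_labels)
print(len(x_train), len(x_test))  # 67 33
```

In the sklearn version, the equivalent call passes the features and labels together with `test_size=0.33`, and optionally a `random_state` to make the split reproducible.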
Defining the Input Function and the Linear Classifier:

We now create an input function that will feed a Pandas DataFrame into our classifier model.
The tf.estimator.inputs module provides a very easy way of doing this.
It requires you to specify the features, labels and batch size.
It also has a special argument called shuffle, which lets the model read the records in a random order; this can improve model performance.
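What batching and shuffling mean for the data can be shown with a small plain-Python generator. This is only a conceptual sketch, not how TensorFlow's input functions are implemented; the function name `batches`, the seed and the toy data are assumptions made for illustration.

```python
import random


def batches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (feature_batch, label_batch) pairs, optionally in random order."""
    order = list(range(len(features)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [features[i] for i in chunk], [labels[i] for i in chunk]


# Hypothetical toy data: 10 rows, batches of 4
rows = [[i] for i in range(10)]
targets = [i % 2 for i in range(10)]

batch_sizes = [len(x) for x, _ in batches(rows, targets, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Each feature row stays paired with its label, whatever order the rows are read in.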
Next, we will define our linear classifier.
Our linear classifier will train a linear model to classify instances into one of the two possible classes, i.e., 0 for incomes less than or equal to 50K, and 1 for incomes greater than 50K.
Again, tf.estimator.LinearClassifier allows us to do this with just a single line of code.
As arguments, we have to specify our feature columns and the number of classes.
Training the Model:

Finally, the exciting part! Let’s begin training our model.
As you would expect, we have to specify the input function.
The steps argument specifies the number of steps for which to train the model.
Predictions

It’s now time to generate our predictions.
Firstly, we need to redefine our input function.
While training the model needs you to specify the target labels along with the features, at the time of generating predictions, you do not specify the target labels.
The predictions will later be compared with the actual labels on the test data to evaluate the model.
So let’s begin! Let’s now feed the input function into the model.
Please note that I have wrapped the model.predict call in list() so that I can easily access the predicted classes in the next step.

Hurray! We now have our predictions.
Let’s have a look at the prediction for the first observation in the test data.
In the image below, we can see that our model predicts it to be of Class 0 (refer to class_ids).
We also have a number of other outputs, such as the class probabilities, logits, etc.
However, to evaluate our model, we only need the class_ids.
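For a binary classifier, these outputs are closely related: the logit is passed through a sigmoid to get the probability of class 1, and the class id is whichever class has the higher probability. The sketch below shows that relationship in plain Python; the function name, the simplified dictionary layout and the example logit are assumptions made for illustration, not the model's actual output.

```python
import math


def binary_prediction(logit):
    """Derive probabilities and class id from a single binary-classifier logit."""
    p_positive = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the logit
    probabilities = [1.0 - p_positive, p_positive]  # [P(class 0), P(class 1)]
    class_id = 1 if p_positive > 0.5 else 0
    return {"logits": [logit], "probabilities": probabilities,
            "class_ids": [class_id]}


# Hypothetical logit, for illustration only
example = binary_prediction(-1.2)
print(example["class_ids"])  # [0]
```

A negative logit gives a class 1 probability below 0.5, so the predicted class is 0.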
In the next step, we will create a list containing only the class_ids values from the list of prediction dictionaries. These are the predictions we will compare against the real y_test values.
Let’s look at the classes of the first 10 predictions.
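Collecting the class ids could look like the minimal sketch below. The prediction dictionaries here are made-up stand-ins reduced to two keys, and the name `final_preds` is an assumption; the real output has more keys and one dictionary per test observation.

```python
# Hypothetical prediction dictionaries, simplified to the keys used here
predictions = [
    {"class_ids": [0], "probabilities": [0.9, 0.1]},
    {"class_ids": [1], "probabilities": [0.2, 0.8]},
    {"class_ids": [0], "probabilities": [0.7, 0.3]},
]

# Keep only the predicted class of each observation
final_preds = [pred["class_ids"][0] for pred in predictions]
print(final_preds[:10])  # [0, 1, 0]
```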
Evaluation of the Model

We have now come to the final stage of the project.
We will now assess our model’s predictions by comparing them with the actual labels, using the sklearn library.

Here is our classification report:

I have also printed out some other evaluation metrics, which give us a clearer picture of our model’s performance.
Our model has an overall accuracy of 82.5% and an AUC of 86.
Good classifiers have larger areas under the ROC curve.
As these figures show, our model has achieved some really nice results.
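To make these metrics concrete, here is a plain-Python sketch of how accuracy, precision, recall and AUC can be computed from labels and predictions. It is a simplified stand-in for the sklearn functions, not their implementation; the function names and the toy values are assumptions, and they do not reproduce the figures reported above.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, predicted classes and positive-class probabilities
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(classification_metrics(y_true, y_pred))
print(auc(y_true, y_score))  # 0.75
```

The AUC here is the fraction of positive and negative pairs in which the positive example gets the higher score, which is why a larger area indicates a better ranking of the two classes.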
Final Remarks:

I hope that this article gives you a good understanding of how to perform classification tasks in TensorFlow.
I look forward to hearing your thoughts and comments.
Please feel free to reach me through LinkedIn.