We do have other alternatives when tackling NLP problems, such as Support Vector Machines (SVMs) and neural networks.
However, the simple design of Naive Bayes classifiers makes them very attractive for such problems.
Moreover, they have been demonstrated to be fast, reliable and accurate in a number of applications of NLP.

Data Pre-processing
For text classification, if you collect the data yourself via scraping, you may end up with a messy dataset and have to put a lot of effort into cleaning it and getting it into good shape before applying any model.
In our case, the dataset was not that messy, so we did not need to put much effort into this.
So, we performed the following very common but crucial data pre-processing steps:
- Lower case and removing stop words: convert the entire input description to lower case and remove the stop words, as they don't add anything to the categorization.
- Lemmatizing words: this groups together different inflections of the same word, like organize, organizes, organizing, etc.
- n-grams: using n-grams we can count sequences of words instead of counting single words.

To perform classification we have to represent the input description in the form of vectors using the bag-of-words technique.
There are two approaches to perform this.
- Counting the number of times each word appears in a document
- Calculating the frequency with which each word appears in a document out of all the words in the document

Vectorization (CountVectorizer)
It works on term frequency, i.e. counting the occurrences of tokens and building a sparse document-token matrix.

TF-IDF Transformer
TF-IDF stands for Term Frequency and Inverse Document Frequency.
TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Term frequency is the frequency of a word in a particular document.
Inverse Document Frequency gives us a measure of how rare a term is.
The rarer a term is, the higher its IDF score will be.
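
The pre-processing and weighting ideas above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list, the example tickets, and the function names are made up for the sketch, and the weight is the textbook tf × log(N / df). scikit-learn's TfidfTransformer uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the article uses NLTK's English list).
STOP_WORDS = {"the", "is", "my", "to", "a", "on", "not"}

def tokenize(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined with spaces) for n in [n_min, n_max]."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    counts = [Counter(ngrams(tokenize(d))) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({t: (k / total) * math.log(n_docs / df[t]) for t, k in c.items()})
    return weights

# Made-up ticket descriptions, for illustration only.
docs = [
    "The printer is not working",
    "My laptop is not working",
    "Reset my email password",
]
weights = tf_idf(docs)
```

In the first ticket, "printer" occurs in only one document while "working" occurs in two, so "printer" receives the higher weight, which is the behaviour the IDF term is meant to produce.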

Model Fitting
Now that the data is prepared, we fit a Multinomial Naive Bayes model to it.
We created a sklearn pipeline containing all the pre-processing steps, because a new incoming ticket description must be converted into vectors using the vocabulary learned during training, not a new vocabulary built from the new description.
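
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Unigram and bigram counts -> TF-IDF weights -> Multinomial Naive Bayes
    text_clf = Pipeline([
        ('vect', CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'))),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])

https://github. ipynb

After the model was created, we tested its performance on our test dataset and got a pretty good 92.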
Then we exported the model into a pickle file.
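
To make the fit-then-export step concrete, here is a minimal sketch that hand-rolls a multinomial Naive Bayes classifier with add-one smoothing and round-trips it through pickle. It is not the sklearn pipeline from the article: the functions `fit` and `predict`, the training tickets, and the labels are all hypothetical, chosen so the example runs with the standard library alone.

```python
import math
import pickle
from collections import Counter

def fit(docs, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = set().union(*word_counts.values())
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}

def predict(model, doc):
    """Return the label with the highest log posterior, using add-one smoothing."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / n_docs)  # log prior
        for word in doc.lower().split():
            if word in model["vocab"]:  # words unseen in training are ignored
                score += math.log((counts[word] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training tickets and categories, for illustration only.
train_docs = [
    "printer not printing",
    "printer paper jam",
    "cannot login to email",
    "email password reset",
]
train_labels = ["hardware", "hardware", "access", "access"]

model = fit(train_docs, train_labels)
blob = pickle.dumps(model)     # the article writes the equivalent bytes to a pickle file
restored = pickle.loads(blob)  # what the Lambda function would do at prediction time
prediction = predict(restored, "printer is out of paper")
```

With a fitted sklearn pipeline the same round trip would typically use `pickle.dump(text_clf, f)` and `pickle.load(f)` on a file opened in binary mode.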

Integration with AWS
Now comes the most important part of the project, which was to deploy the model on AWS and configure an AWS Lambda function to do real-time prediction.
This is where we spent a lot of time.
First, what is AWS Lambda and why are we using it?
AWS Lambda is a serverless computing platform provided by Amazon.
It lets us run code without having to worry about provisioning and managing servers and resources.
You define the event triggers that should run the Lambda function and upload your code; Lambda takes care of everything required to run and scale it.
One advantage of Lambda is that you pay only for compute time: you are charged while your code is running and pay nothing when it is not.
Let’s get started with the process.

Set up a fresh EC2 instance
First, set up a fresh EC2 instance where you will install all the libraries that your code (model) will use while running on Lambda.
The reason for setting up a new EC2 instance is that you will configure the Python environment from scratch and install all the required libraries.
Once that is done, zip the entire Python environment along with the code you will run on Lambda, download the zip to your local machine, and then upload it to an Amazon S3 bucket.
Whenever an event invokes the Lambda function, it will use the zip file you specified for execution.
Say that, for the event, I specified that sample.py, which is already inside the zip file, should run. The Lambda function will then look for everything it needs (the Python environment and the libraries required to run the code) in that zip file, and if something is missing, the execution of your code will fail.
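
The packaging step can be sketched as follows. This is an assumption-laden illustration rather than the exact procedure used in the project: `build_lambda_zip`, the directory layout, and the archive name are hypothetical, and a throwaway directory stands in for the environment prepared on EC2. The point it shows is that files are stored relative to the zip root, so the code file sits at the top level of the archive next to the libraries.

```python
import os
import tempfile
import zipfile

def build_lambda_zip(source_dir, zip_path):
    """Zip everything under source_dir with paths relative to the zip root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path

# Demo with a throwaway directory standing in for the environment built on EC2.
with tempfile.TemporaryDirectory() as workdir:
    package_dir = os.path.join(workdir, "package")
    os.makedirs(os.path.join(package_dir, "sklearn"))
    for rel in ("sample.py", os.path.join("sklearn", "__init__.py")):
        with open(os.path.join(package_dir, rel), "w") as f:
            f.write("# placeholder\n")
    zip_path = build_lambda_zip(package_dir, os.path.join(workdir, "lambda_package.zip"))
    with zipfile.ZipFile(zip_path) as archive:
        packaged_files = sorted(archive.namelist())
```

The resulting archive could then be uploaded through the S3 console, or programmatically with boto3's `upload_file(Filename, Bucket, Key)` on an S3 client, with a bucket and key of your choosing.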

Create a Lambda Function
Once you have the zip file ready in the S3 bucket, you can create a new Lambda function.
Click on Create Function, provide a meaningful name, choose Python 3.6 as the runtime, and choose the permissions that best suit your needs.
Go to your Lambda function; in the Function Code section you either upload a zip file or specify the address of the zip stored in the S3 bucket, as in the screenshot.
FYI: in the screenshot, Function is the code file name (Function.py) and handler is the name of the method defined in that file which will run.
Specify the path of S3 link for the zip file

Test your code
You can test whether your code runs correctly.
Click on the Test tab at the top right and configure a test event.
You can pass the input your code needs and then see whether your code runs.
As in the screenshot, we pass the description and run the code.
If you are trying this with our code, you have to uncomment line 29, which reads the test description, and comment out line 30, which reads the input from the AWS queue that I will explain now.
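
One way to avoid switching lines by hand is to let the handler accept both kinds of input. The sketch below is not the project's code. It assumes a console test event of the hypothetical form `{"description": "..."}`, while the `Records` list with a `body` field is the shape Lambda uses for SQS-triggered events. The `extract_descriptions` helper is made up for illustration.

```python
def extract_descriptions(event):
    """Return ticket descriptions from an SQS-triggered event or a manual test event."""
    if "Records" in event:
        # SQS trigger: each queued message arrives as a record with a "body" string.
        return [record["body"] for record in event["Records"]]
    # Hypothetical console test event of the form {"description": "..."}.
    return [event["description"]]

def handler(event, context):
    """Entry point; with this in Function.py the handler setting is Function.handler."""
    descriptions = extract_descriptions(event)
    # A real function would call the unpickled model here; this stub only echoes the input.
    return {"received": descriptions}

# Local dry run with both event shapes (context is unused, so None is enough here).
test_result = handler({"description": "printer paper jam"}, None)
queue_result = handler({"Records": [{"body": "reset email password"}]}, None)
```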

AWS SQS (Simple Queue Service)
Create an SQS service by selecting SQS from the designer selection box.
It is pretty straightforward.
In this SQS, we configured two queues.
The first passes the input to the Lambda function: it has an event trigger, so whenever a new message arrives in this queue, it triggers the Lambda function to run.
The second is the output queue, which shows the predicted value for the input message.
Input and output queue
Select the input queue and, from Queue Actions, select 'Send a Message'.
Once you click the submit button, it triggers the Lambda function; your code runs, makes the prediction, and writes it to the output queue.
As you can see in the screenshot, the output queue has 7 messages available and the input queue has 0, which means there is no new message in the input queue and your Lambda function is not running your code.
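
The input-queue-to-output-queue flow can be sketched like this. It is a simplified illustration under assumptions: `classify_records` and `make_sqs_sender` are hypothetical helper names, the JSON message layout is invented, and the dry run uses stand-ins for the model and the output queue so it runs without AWS access. The boto3 call shown, `send_message(QueueUrl=..., MessageBody=...)` on an SQS client, is the standard way to post a message.

```python
import json

def classify_records(event, predict, send):
    """Run `predict` on each SQS record body and pass the result to `send`."""
    categories = []
    for record in event.get("Records", []):
        description = record["body"]
        category = predict(description)
        send(json.dumps({"description": description, "category": category}))
        categories.append(category)
    return categories

def make_sqs_sender(queue_url):
    """Return a function that posts a message body to the given SQS queue."""
    import boto3  # imported lazily so the rest of the module works without AWS access
    sqs = boto3.client("sqs")
    return lambda body: sqs.send_message(QueueUrl=queue_url, MessageBody=body)

# Local dry run with stand-ins for the model and the output queue.
sent = []
event = {"Records": [{"body": "printer paper jam"}, {"body": "reset email password"}]}
categories = classify_records(
    event,
    predict=lambda text: "hardware" if "printer" in text else "access",
    send=sent.append,
)
```

Inside Lambda, `send` would be `make_sqs_sender(output_queue_url)` and `predict` would wrap the unpickled model. Passing both in as arguments keeps the core logic testable on a local machine.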

User feedback to retrain the model
We created a simple AngularJS UI for collecting user feedback.
The idea was to get the user's confirmation of whether the ticket had been classified correctly.
If the ticket was assigned the wrong category, the user can select the correct category from the drop-down and click Save, which saves the file to the S3 bucket.
We will use this file to retrain our model periodically.
For this, we set up another Lambda function which you can schedule to run every day or every week depending on the requirement.
It will load the model from the pickle file, retrain it, and write the updated model back to the pickle file.
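
A retraining step along these lines might look like the sketch below. The article does not describe the feedback file format, so the CSV columns (`description`, `predicted`, `corrected`), the helper names, and the stand-in `fit` function are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

def load_corrections(feedback_csv):
    """Return (description, category) pairs, preferring the user's correction."""
    rows = csv.DictReader(io.StringIO(feedback_csv))
    return [(row["description"], row["corrected"] or row["predicted"]) for row in rows]

def retrain(train_docs, train_labels, corrections, fit):
    """Append the feedback to the training data and fit a fresh model on all of it."""
    docs = list(train_docs) + [description for description, _ in corrections]
    labels = list(train_labels) + [category for _, category in corrections]
    return fit(docs, labels)

# Hypothetical feedback file content; an empty "corrected" cell means the prediction stood.
feedback_csv = """description,predicted,corrected
vpn keeps dropping,hardware,network
printer jam,hardware,
"""
corrections = load_corrections(feedback_csv)

# Stand-in for real training: it only counts labels, so the example runs anywhere.
model = retrain(["email login fails"], ["access"], corrections,
                fit=lambda docs, labels: Counter(labels))
```

In the scheduled Lambda function, `fit` would refit the real pipeline, and the result would be written back with `pickle.dump`. A schedule expression such as `rate(1 day)` or `rate(7 days)` is one way to express the daily or weekly cadence.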

Conclusion
AWS Lambda is a very good choice for scalable models, as you don't have to worry about provisioning and managing servers.
It is easy to deploy models on it, it automatically scales resources according to your requirements, and you pay only while your code is running, which makes it very cost effective.