Text Cleaning
We keep only the words and spaces and remove everything else before further feature processing. This step should come after extracting features such as hashtags and user tags, because it also removes ‘#’ and ‘@’.
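To make the ordering concrete, here is a minimal sketch of that idea using Python's standard `re` module. The function name `extract_then_clean` and the sample tweet are made up for illustration, and treating "words" as ASCII letters only is an assumption; digits or non-English characters may need to be kept for other datasets.

```python
import re

def extract_then_clean(tweet):
    """Pull out hashtags and mentions first, then keep only letters and spaces."""
    # Extract these before cleaning, because cleaning strips '#' and '@'.
    hashtags = re.findall(r"#(\w+)", tweet)
    mentions = re.findall(r"@(\w+)", tweet)
    # Replace every character that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", tweet)
    # Collapse the runs of whitespace left behind by the removed characters.
    cleaned = re.sub(r"\s+", " ", letters_only).strip()
    return cleaned, hashtags, mentions

cleaned, hashtags, mentions = extract_then_clean(
    "Loving the new phone!!! #happy @techguru :)"
)
print(cleaned)   # Loving the new phone happy techguru
print(hashtags)  # ['happy']
print(mentions)  # ['techguru']
```

Note that the tag text itself ("happy", "techguru") stays in the cleaned string; only the symbols are dropped, which is why the tags have to be captured beforehand.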
5) Average word length: Different kinds of text use different kinds of words; scientific reports usually have a higher average word length than everyday conversation or daily news.
Average word length can therefore help differentiate the type of text.
Average word length = (sum of the lengths of all words in the tweet or doc) / (total number of words in the tweet or doc)
6) Number of words: The basic idea is to count the words in each tweet and use that count as a feature.
The main intuition behind this technique is that some texts need more words than others to express themselves.
For example, people may express themselves at greater length when they are happy than when they are angry.
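Both the average word length and the word count can be sketched in a few lines of plain Python. This assumes words are separated by whitespace and that the text has already been cleaned; the function names are illustrative, and returning 0.0 for empty text is an assumption made to avoid a division by zero.

```python
def word_count(text):
    """Number of whitespace-separated words in the text."""
    return len(text.split())

def average_word_length(text):
    """Sum of the lengths of all words divided by the number of words."""
    words = text.split()
    if not words:
        return 0.0  # assumption: empty text scores 0 instead of raising an error
    return sum(len(word) for word in words) / len(words)

tweet = "I love this phone"
print(word_count(tweet))           # 4
print(average_word_length(tweet))  # 3.5
```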
7) Sentiment score calculation
Polarity, subjectivity, and intensity can be calculated from the tweets and used as features.
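As a rough illustration of what a polarity score is, here is a toy lexicon-based sketch. The word lists and function names are hypothetical and far too small for real use, and this is not necessarily how the scores in this project were produced; in practice a dedicated sentiment library with a full lexicon or a trained model would do this job.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "hate", "terrible"}

def polarity(text):
    """Score in [-1, 1]: (positive hits - negative hits) / (all hits)."""
    words = text.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)

def sentiment_label(score):
    """Map a polarity score to a coarse sentiment class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

score = polarity("i love this great phone")
print(score, sentiment_label(score))  # 1.0 positive
```

Both the numeric score and the label can then be added as columns alongside the other features.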
In this project I calculated the polarity and sentiment of each tweet.
8) Bag of Words and TF-IDF calculation
Bag of words is a technique that uses the frequency of each word in a doc/sentence/tweet as a feature.
TF-IDF is generally a more informative technique than bag of words because it weighs how important a word is, not just how often it appears: words that occur in many documents are down-weighted.
These features need more text preprocessing, so we will discuss them in detail in a future post.
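Until then, here is a bare-bones sketch of both ideas in plain Python. It assumes whitespace tokenization and the textbook weighting tf * log(N / df); the function names and sample documents are made up for illustration, and library implementations commonly use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Raw word counts for a single document."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """One {word: weight} dict per document, using tf * log(N / df)."""
    counts = [bag_of_words(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for count in counts for word in count)
    weights = []
    for count in counts:
        total = sum(count.values())
        weights.append({
            word: (n / total) * math.log(n_docs / df[word])
            for word, n in count.items()
        })
    return weights

docs = ["the phone is great", "the battery is bad", "great great camera"]
print(bag_of_words(docs[2]))
weights = tf_idf(docs)
print(weights[2])
```

In the third document "great" appears twice and "camera" once, so bag of words ranks "great" higher. TF-IDF ranks "camera" higher, because "great" also appears in another document while "camera" is unique to this one.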
Code can be found here.