A Machine Learning Approach to Author Identification of Horror Novels from Text Snippets

One thing is quite obvious: every author, whether Lovecraft, Mary Shelley, or Poe, had their own style of writing, including a signature fashion of using certain words, and this makes their literature unique and recognisable. So, let's use this fact to identify the author (Lovecraft, Mary Shelley, or Poe) from text snippets or quotes drawn from their horror novels.

Machine Learning, powered by Natural Language Processing (NLP), is an excellent fit for this problem.

So, let's state the problem clearly and get started!

Problem Statement: "Given text snippets/quotes from renowned novels of Edgar Allan Poe, Mary Shelley and HP Lovecraft, identify which of the three is the author of the snippet or quote."

For this purpose, the Spooky Author Identification dataset prepared by Kaggle is used.

So, let's commence the Machine Learning model development in Python using NLTK (Natural Language Toolkit) and Scikit-Learn!

I. Loading (Reading) the Dataset Using Pandas

import pandas as pd

df = pd.read_csv('train.csv')
df.head(5)  # for showing a snapshot of the dataset

Snapshot of the dataset, in which, under the label author, EAP -> Edgar Allan Poe, HPL -> HP Lovecraft and MWS -> Mary Wollstonecraft Shelley
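Before any processing, it also helps to check the size of the corpus and how the snippets are distributed across the three authors. A quick peek (column names as in Kaggle's train.csv):

print(df.shape)                     # (number of snippets, number of columns)
print(df['author'].value_counts())  # snippets per author (EAP, MWS, HPL)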

II. Text Processing Steps

Removal of Punctuation → All punctuation marks are removed from all the text snippets (instances or documents) in the dataset (corpus).

Lemmatisation → The root form of a word is known as its lemma, and its grammatical variants are known as inflected forms. For example, studying and studied are inflected forms of the word study, which is their lemma. The inflected forms of a word are grouped under this single root word, so that the vocabulary of the corpus contains only distinct words.

Removal of Stopwords → Stop-words are usually articles (a, an, the), prepositions (in, on, under, …) and other frequently occurring words that do not carry key information.

They are removed from all the text-snippets present in the dataset (corpus).

# Importing necessary libraries
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()

# Defining a module for Text Processing
def text_process(tex):
    # 1. Removal of Punctuation Marks
    nopunct = [char for char in tex if char not in string.punctuation]
    nopunct = ''.join(nopunct)
    # 2. Lemmatisation (treating each token as a verb)
    a = ''
    for word in nopunct.split():
        b = lemmatiser.lemmatize(word, pos="v")
        a = a + b + ' '
    # 3. Removal of Stopwords
    return [word for word in a.split() if word.lower() not in stopwords.words('english')]
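A quick sanity check on a made-up sentence (not from the dataset) shows what text_process produces. This assumes the NLTK wordnet and stopwords corpora have been fetched with nltk.download:

import nltk
nltk.download('wordnet')    # needed by WordNetLemmatizer
nltk.download('stopwords')  # needed for stop-word removal

print(text_process("The old man was studying the forbidden books."))
# Expected output along these lines: ['old', 'man', 'study', 'forbid', 'book']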

III. Label Encoding of Classes

As this is a classification problem, the classes here are the 3 authors mentioned above. But in the dataset, the labels are non-numeric (MWS, EAP and HPL). These are label encoded to make them numeric, starting from 0 and following the alphabetical order of the labels, i.e., (0 → EAP, 1 → HPL and 2 → MWS).

# Importing necessary libraries
from sklearn.preprocessing import LabelEncoder

y = df['author']
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
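The alphabetical mapping can be verified directly, since LabelEncoder sorts the class names it sees:

print(labelencoder.classes_)                          # ['EAP' 'HPL' 'MWS']
print(labelencoder.transform(['EAP', 'HPL', 'MWS']))  # [0 1 2]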

IV. Word Cloud Visualization

The model is being developed on the premise that each author has a unique style of using particular words in text. To eyeball this, the most-used to least-used words are visualized with a Word Cloud for one sample text snippet from each of the 3 authors.

# Importing necessary libraries
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

X = df['text']
wordcloud1 = WordCloud().generate(X[0])  # for EAP
wordcloud2 = WordCloud().generate(X[1])  # for HPL
wordcloud3 = WordCloud().generate(X[3])  # for MWS

print(X[0])
print(df['author'][0])
plt.imshow(wordcloud1, interpolation='bilinear')
plt.show()

print(X[1])
print(df['author'][1])
plt.imshow(wordcloud2, interpolation='bilinear')
plt.show()

print(X[3])
print(df['author'][3])
plt.imshow(wordcloud3, interpolation='bilinear')
plt.show()

Word Clouds for the 3 authors, taking one text-snippet sample from each
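Single snippets are short, so the clouds above are noisy. A variation (not part of the original walkthrough) that aggregates all snippets per author gives a steadier picture:

# Build one word cloud per author from all of that author's snippets
for code in ['EAP', 'HPL', 'MWS']:
    full_text = ' '.join(df[df['author'] == code]['text'])
    plt.imshow(WordCloud().generate(full_text), interpolation='bilinear')
    plt.axis('off')
    plt.title(code)
    plt.show()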

V. Feature Engineering using Bag-of-Words

Machine Learning algorithms work only on numeric data, but here the data is present as text only. So, by some means, the textual data needs to be transformed into numeric form. One approach for doing this is Feature Engineering, in which numeric features are extracted or engineered from the textual data. Many Feature Engineering techniques exist; in this problem, the Bag-of-Words technique is used.

=> Bag-of-Words: Here, a vocabulary of the words present in the corpus is maintained. These words serve as the features of each instance or document (here, text snippet). Against each word feature, its frequency in the current document (text snippet) is recorded. In this way, word features are engineered or extracted from the textual corpus.
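A tiny made-up example illustrates the idea; note that the vocabulary accessor is get_feature_names_out in scikit-learn >= 1.0 (get_feature_names on older releases):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the raven tapped at the door", "the door of the dark house"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)
print(cv.get_feature_names_out())  # ['at' 'dark' 'door' 'house' 'of' 'raven' 'tapped' 'the']
print(counts.toarray())            # [[1 0 1 0 0 1 1 2]
                                   #  [0 1 1 1 1 0 0 2]]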

# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# defining the bag-of-words transformer on the text-processed corpus,
# i.e., text_process() declared in II is executed on every snippet
bow_transformer = CountVectorizer(analyzer=text_process).fit(X)

# transforming into bag-of-words, and hence textual data to numeric
text_bow = bow_transformer.transform(X)
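As a quick check, the result is a sparse document-term matrix whose shape and sparsity can be inspected:

print(text_bow.shape)  # (number of snippets, vocabulary size)
print(text_bow.nnz)    # non-zero entries; most cells are zero, hence sparse storage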

VI. Training the Model

The Multinomial Naive Bayes algorithm (classifier) is used as the classification Machine Learning algorithm [1].

# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 80-20 splitting the dataset (80% -> Training and 20% -> Validation)
X_train, X_test, y_train, y_test = train_test_split(text_bow, y, test_size=0.2, random_state=1234)

# instantiating the model with Multinomial Naive Bayes
model = MultinomialNB()

# training the model
model.fit(X_train, y_train)

Here, the default value of alpha (the smoothing parameter), 1.0, is taken.
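The smoothing parameter can matter, though. A minimal sketch of tuning alpha with 5-fold cross-validation (the candidate values are arbitrary illustrations, not from the original walkthrough):

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 0.5, 1.0, 2.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)  # the alpha that scored best under cross-validation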

VII. Model Performance Analysis

=> Training Accuracy

model.score(X_train, y_train)

Obtained Training Accuracy

=> Validation Accuracy

model.score(X_test, y_test)

Obtained Validation Accuracy

=> Precision, Recall and F1-Score

# Importing necessary libraries
from sklearn.metrics import classification_report

# getting the predictions on the Validation Set
predictions = model.predict(X_test)
# getting the Precision, Recall and F1-Score
print(classification_report(y_test, predictions))

Classification Report of the Model

=> Confusion Matrix

# Importing necessary libraries
from sklearn.metrics import confusion_matrix
import numpy as np
import itertools
import matplotlib.pyplot as plt

# Defining a module for plotting the Confusion Matrix

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(y_test, predictions)
plt.figure()
plot_confusion_matrix(cm, classes=[0, 1, 2], normalize=True, title='Confusion Matrix')

Normalized Confusion Matrix

According to the performance analysis, the NLP-powered Machine Learning model classifies 84.19% of the unknown (Validation Set) examples correctly. In other words, for 84.19% of the text snippets, the model correctly identifies which of the three authors wrote them.

Based on the model's performance, can we conclude which author has the most unique style of writing? The answer is YES!

Looking at the Normalized Confusion Matrix, label 2 is the most correctly classified. As label 2 refers to Mary Wollstonecraft Shelley, it can be concluded that Mary Wollstonecraft Shelley has the most unique style of writing horror novels compared with Edgar Allan Poe and HP Lovecraft.

Also, in a different sense, can we say who is the most versatile author among Mary Shelley, Edgar Allan Poe and HP Lovecraft? Again the answer is YES!

Looking at the Confusion Matrix again, label 0 is the least correctly classified. As label 0 refers to Edgar Allan Poe, it can be concluded that Edgar Allan Poe is more versatile, in the sense of being harder to pin to a single style, than HP Lovecraft and Mary Shelley.

In this way, an author identification (text classification) model can be developed using Machine Learning and Natural Language Processing.

A related web application has also been developed by me, though with different methodologies and involving only 2 authors, Mary Shelley and Edgar Allan Poe, using PHP (PHP: Hypertext Preprocessor) as the back-end with the help of PHP-ML. The link to the Web-App is given below:

Author Identifier (navocommerce.in)

REFERENCES

[1] https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67

For personal contact regarding the article or the Web-App, or for discussions on Machine Learning or any department of Data Science, feel free to reach out to me on LinkedIn.

Navoneel Chakrabarty | LinkedIn

