that we expect in the corpus.
Then LDA randomly assigns a topic to each word in each document.
Next, it goes to each word in the corpus and checks how many times its assigned topic occurs in that document and how many times the word occurs in the assigned topic.
Based on these counts it assigns a new topic to the word, and this iteration keeps going until the assignments settle into topics that make sense.
In the end, it is up to the user to interpret the topics.
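The procedure described above closely resembles collapsed Gibbs sampling, so here is a minimal toy sketch of it. Everything in it is an assumption made for illustration: the function name `toy_lda_gibbs`, the smoothing values `alpha` and `beta`, the fixed number of iterations (instead of a convergence check), and the three tiny documents. It is not the code used for this analysis, and Gensim's `LdaModel` uses online variational Bayes rather than Gibbs sampling.

```python
import random

def toy_lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # count tables: topics per document, words per topic, total words per topic
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # step 1: randomly assign a topic to each word in each document
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # step 2: revisit every word and resample its topic from the current counts
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # take the word out of the counts before resampling it
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # (how often topic k occurs in this document)
                # * (how often this word occurs in topic k), both smoothed
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, topic_word

# hypothetical toy corpus
toy_docs = [
    ["cell", "cycle", "protein"],
    ["tumor", "cancer", "cell"],
    ["cancer", "tumor", "therapy"],
]
assignments, topic_word = toy_lda_gibbs(toy_docs, num_topics=2)
```

Each entry of `assignments` holds one topic index per word of the corresponding document, and `topic_word` holds the per-topic word counts that one would inspect to interpret a topic.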
This blog post will describe the following steps to extract latent topics from my corpus of retracted literature.
1. Data preprocessing
2. Building an LDA model
3. Visualization

As mentioned in my previous blog post, the data was obtained from PubMed and contains 6485 retracted publications.
For the current analysis I have used only the abstracts from these publications.
Keywords can indicate the theme of an abstract, but 84% of the abstracts in this dataset had no keywords, which motivated me to extract topics from the abstracts themselves.
Data Preprocessing:

This step involves data cleaning and is crucial in any text mining task.
I have subdivided this into 3 steps:

i. Lemmatization

This part involves part-of-speech (POS) tagging to keep only words that are nouns, proper nouns, or adjectives.
I ignored other parts of speech such as verbs and adverbs, as they did not appear important to my current task.
The resulting words were then lemmatized, which means only the root form of the word was kept.
I used the awesome Python library, spaCy, for doing this processing.

    import spacy

    # load spacy and return english language object
    nlp = spacy.load('en_core_web_sm')

    def lemmatization(doc, allowed_pos=['NOUN','ADJ','PROPN']):
        """filtering allowed POS tags and lemmatizing them"""
        retracted_corpus = []
        for text in doc:
            doc_new = nlp(text)
            retracted_abstract = []
            for token in doc_new:
                if token.pos_ in allowed_pos:
                    # keep the lemma (root form) of the token
                    retracted_abstract.append(token.lemma_)
            retracted_corpus.append(retracted_abstract)
        return retracted_corpus

    # doc is the list of abstract strings
    abstract_lemmatized = lemmatization(doc, allowed_pos=['NOUN','ADJ','PROPN'])

These are scientific documents and sometimes breaking sentences by space alone is not optimal (e.g., a formula or equation may use other characters such as = or +).
Therefore in an additional step I used a different text splitting implementation using some functionality from NLTK, another Python library for natural language processing (NLP).

    from nltk.tokenize import RegexpTokenizer

    # tokenizer that splits only on the selected symbols or whitespace
    # (inside the second character class, ")-/" is a range covering ) * + , - . /)
    tokenizer = RegexpTokenizer(r'\s+|[<>=()-/]|[±°å]', gaps=True)

    def remove_punctuation(text):
        updated_abstract = []
        for doc in text:
            # rejoin the lemmas of one abstract, then split on the pattern above
            sent = ' '.join(doc)
            updated_abstract.append(tokenizer.tokenize(sent))
        return updated_abstract

    abstract_processed = remove_punctuation(abstract_lemmatized)

ii. Removing stopwords

A fairly routine task in NLP is removing stop words, or common words that do not carry enough meaning to differentiate one text from another, e.g., ‘a’, ‘the’, ‘for’, and so on.
I have used NLTK’s stopword library for this.
It contains 179 words.
As a life science and biomedical research corpus contains further words that are not differentiating, such as ‘proof’, ‘record’, ‘researcher’, etc., I added additional words (which I found in the current corpus) to the stopword list, making it 520 words long.
Additionally, in this step I lowercased all the letters, which is needed so as to avoid the algorithm reading ‘Read’ and ‘read’ as two different words.

    # nltk_stopwords: NLTK's English stop words plus the corpus-specific additions described above
    def remove_stopwords(doc):
        retracted_corpus = []
        for text in doc:
            retracted_abstract = []
            for word in text:
                word = word.lower()
                if word not in nltk_stopwords:
                    retracted_abstract.append(word)
            retracted_corpus.append(retracted_abstract)
        return retracted_corpus

    abstract_no_stop = remove_stopwords(abstract_processed)

iii. Making bigrams

Life sciences literature frequently contains bigrams such as ‘cell cycle’ and ‘protein phosphatase’.
These bigrams convey a meaning that is not apparent when considering either word individually.
Therefore, I decided to collect bigrams using Gensim, another great Python library for natural language processing.
This step provides bigrams and unigrams where bigrams are selected according to specific conditions over the frequency of appearance in a corpus.
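To make these "specific conditions" more concrete, here is a minimal sketch of a frequency-based bigram score in the style of Mikolov et al., which is the idea behind `min_count` and `threshold`. It is an illustration under assumptions, not Gensim's implementation: the function name `bigram_scores` and the toy documents are hypothetical, and the vocabulary size here counts unigrams only.

```python
from collections import Counter

def bigram_scores(docs, min_count=5, threshold=10.0):
    """Score adjacent word pairs and keep those scoring above the threshold."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    kept = {}
    for (a, b), n_ab in bigrams.items():
        # pairs that co-occur often, relative to how common each word is, score high
        score = (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            kept[(a, b)] = score
    return kept

# hypothetical toy corpus; the small min_count and threshold suit its tiny size
toy_docs = [
    ["cell", "cycle", "arrest"],
    ["cell", "cycle", "protein"],
    ["protein", "cell", "cycle"],
]
phrases = bigram_scores(toy_docs, min_count=1, threshold=0.5)
```

In this toy corpus only the pair (`cell`, `cycle`) survives, because every other pair occurs just once and therefore scores zero after `min_count` is subtracted.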

    # using gensim to construct bigrams
    from gensim.models.phrases import Phrases, Phraser

    # note: the phrase model is trained on abstract_processed,
    # i.e. before stop-word removal and lowercasing
    bi_phrases = Phrases(abstract_processed, min_count=5, threshold=10)
    bigram = Phraser(bi_phrases)

    # provides unigrams and bigrams
    def dimers(doc):
        updated_abstract = []
        for text in doc:
            di = bigram[text]
            updated_abstract.append(di)
        return updated_abstract

    abstract_bigram = dimers(abstract_no_stop)

After this I removed anything that is only numbers or is less than 2 characters long.

    def abstract_bigram_clean(doc):
        retracted_corpus = []
        for text in doc:
            retracted_abstract = []
            for word in text:
                # drop pure numbers and single-character tokens
                if not word.isdigit() and len(word) > 1:
                    retracted_abstract.append(word)
            retracted_corpus.append(retracted_abstract)
        return retracted_corpus

    abstract_clean = abstract_bigram_clean(abstract_bigram)

From these processing steps, I obtained a list of lists, with each inner list containing the words from a single abstract.
To give an idea of the complexity of the corpus, there are 35,155 unique words in the corpus, which itself contains a little over 6,000 abstracts.
Building an LDA model

Before building a model, we need our corpus (all documents) to be present in a matrix representation.
For this purpose, I prepared a dictionary where each unique word is assigned an index and then used it to make a document-term matrix, also called bag-of-words (BoW).
I have used Gensim for performing all these tasks.
Additionally, there were 11,998 words that occur just once.
I removed them as they do not indicate any pattern that the model will detect.
Moreover, any word that occurs in more than 20% of the documents was removed, as such words do not seem to be restricted to a few topics.
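To show what the dictionary and the bag-of-words representation look like, here is a minimal pure-Python sketch of the same idea. The function name `build_bow`, the toy documents, and the looser `no_above=0.5` cut-off are assumptions chosen to suit a five-document example. Note that this sketch filters by the number of documents a word appears in, and Gensim assigns its own word ids.

```python
from collections import Counter

def build_bow(docs, no_below=2, no_above=0.2):
    """Map words to ids and documents to sorted (word_id, count) pairs."""
    # document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)
    kept = sorted(
        w for w, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )
    token2id = {w: i for i, w in enumerate(kept)}
    bow = []
    for doc in docs:
        counts = Counter(token2id[w] for w in doc if w in token2id)
        bow.append(sorted(counts.items()))
    return token2id, bow

# hypothetical toy corpus: 'study' is in every document, 'rare' in only one
toy_docs = [
    ["cell", "cycle", "cell", "study"],
    ["tumor", "cell", "study"],
    ["tumor", "tumor", "study", "rare"],
    ["gene", "study", "cycle"],
    ["gene", "study"],
]
token2id, bow = build_bow(toy_docs, no_below=2, no_above=0.5)
```

Here `study` is dropped for being too common and `rare` for being too rare, so the first document is reduced to two occurrences of `cell` and one of `cycle`.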

    # create a dictionary
    from gensim.corpora import Dictionary

    dictionary = Dictionary(abstract_clean)
    # keep words that appear in at least 2 documents and in no more than 20% of documents
    dictionary.filter_extremes(no_below=2, no_above=0.2)

    # convert the dictionary into the bag-of-words (BoW)/document term matrix
    corpus = [dictionary.doc2bow(text) for text in abstract_clean]

To build an LDA model we need to provide the number of topics that we expect in the corpus.
This can be tricky and needs multiple trial-and-error attempts as we do not know how many topics are present.
One way to estimate this number is by computing the coherence measure², a score that rates the quality of the topics computed by a topic model and thus differentiates a good topic model from a bad one.
The coherence measure is calculated in the following four steps:

1. Segmentation: the word set t is segmented into word subsets S.
2. Probability calculation: word probabilities P are calculated based on the reference corpus.
3. Confirmation measure: both S and P are used by the confirmation measure to calculate the agreements φ of pairs of S.
4. Aggregation: finally, all confirmations are aggregated to a single coherence value c.
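The four steps can be sketched with a deliberately simplified coherence score. This is not the `c_v` measure: it segments the top words into pairs, estimates probabilities from document-level co-occurrence, uses normalized pointwise mutual information (NPMI) as the confirmation measure, and aggregates with a plain mean. The function name `npmi_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Simplified topic coherence: mean NPMI over all pairs of top words."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # fraction of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    confirmations = []
    # 1. segmentation: split the topic's word set into pairs
    for w1, w2 in combinations(top_words, 2):
        # 2. probability calculation from the reference corpus
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        # 3. confirmation measure: NPMI of the pair
        if p12 == 0:
            npmi = -1.0  # the two words never co-occur
        elif p12 == 1:
            npmi = 0.0   # both words are in every document; treated as uninformative
        else:
            npmi = math.log(p12 / (p1 * p2)) / -math.log(p12)
        confirmations.append(npmi)
    # 4. aggregation: average all confirmations into one value
    return sum(confirmations) / len(confirmations)

# hypothetical toy corpus
toy_docs = [
    ["cell", "cycle"],
    ["cell", "cycle"],
    ["tumor", "cancer"],
    ["tumor", "cancer"],
]
coherent = npmi_coherence(["cell", "cycle"], toy_docs)
incoherent = npmi_coherence(["cell", "tumor"], toy_docs)
```

Words that always appear together give the highest possible score, while words that never share a document give the lowest, which is the intuition behind using coherence to compare topic models.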
In order to find the optimum number of topics in a corpus, one can build multiple LDA models with different numbers of topics.
From these models one can choose the number of topics at which the coherence score peaks and its rapid growth ends.
Increasing the number of topics further can lead to situations where many keywords are shared between different topics.
The Gensim library provides an implementation of the coherence measure.

    import gensim
    from gensim.models import CoherenceModel

    # instantiating an lda model
    LDA = gensim.models.ldamodel.LdaModel

    # computing coherence for different LDA models containing different number of topics
    def calculate_coherence(dictionary, corpus, texts, start, stop, step):
        coherence_scores = []
        for num_topics in range(start, stop, step):
            model = LDA(corpus=corpus, id2word=dictionary, num_topics=num_topics)
            coherence_lda = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
            coherence = coherence_lda.get_coherence()
            coherence_scores.append(coherence)
        return coherence_scores

    coherence_scores = calculate_coherence(dictionary=dictionary, corpus=corpus, texts=abstract_clean, start=2, stop=40, step=3)

We can observe multiple peaks, of almost the same height, in the graph of coherence values.
The first peak comes at 14.
I decided to build my LDA model using 14 topics.
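Picking the first peak can also be done programmatically rather than by eye. The sketch below is one simple way to do it; the helper name `first_peak` is hypothetical, and the scores are made-up numbers for illustration only, not the coherence values from this analysis.

```python
def first_peak(topic_counts, scores):
    """Return the topic count at the first local maximum of the scores."""
    for i in range(1, len(scores) - 1):
        if scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]:
            return topic_counts[i]
    # no interior peak found: fall back to the overall best score
    return topic_counts[scores.index(max(scores))]

# same grid as start=2, stop=40, step=3
topic_counts = list(range(2, 40, 3))
# made-up scores for illustration only (NOT the values from this analysis)
example_scores = [0.30, 0.34, 0.38, 0.41, 0.45, 0.43, 0.44,
                  0.45, 0.44, 0.43, 0.45, 0.44, 0.43]
best = first_peak(topic_counts, example_scores)
```

Choosing the first of several similar peaks favors the smaller model, which tends to keep the topics easier to interpret.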

    lda_model_abstract = LDA(corpus=corpus, id2word=dictionary, num_topics=14, random_state=10, chunksize=6485, passes=100)

Visualization

This is a key step to read and understand the topics obtained from LDA and pyLDAvis helps us with that.
pyLDAvis³ is a Python package for interactive visualization of topics.
It is derived from LDAvis which is a web based visualization tool for topics that are estimated using LDA.
pyLDAvis provides two panels for visualization: The left panel shows topics in the form of circles in two dimensions.
The centers of the circles are determined by calculating inter-topic distances and then projecting them onto two dimensions using multidimensional scaling.
Thus, topics that are closely related are closer.
Additionally, the areas of the circles represent the overall prevalence of each topic.
Thus, the bigger the circle, the more prevalent the topic is in the corpus.
The right panel shows a bar chart of key terms for the selected topic on the left panel.
The overlaid bars represent the topic-specific frequency of various terms in red color and corpus wide frequency in gray color.
Thus, this right panel helps us find the meaning of the topics.
Both the panels are linked in a way that selecting a topic in the left panel shows its important terms in the right panel.
Additionally, selecting a term in the right panel shows its conditional distribution over all the topics.
An interesting feature of pyLDAvis is its method of ranking terms in a topic, which is called relevance.
Relevance is the weighted average of the logarithms of a term’s probability and its lift.
Lift is defined as the ratio of the term’s probability in a topic to its marginal probability across the corpus.
Relevance of a term $w$ to topic $k$ with weight parameter $\lambda$ can be mathematically defined as:

$$r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda) \log\left(\frac{\phi_{kw}}{p_w}\right)$$

where $\phi_{kw}$ is the probability of term $w \in \{1, \dots, V\}$ for topic $k \in \{1, \dots, K\}$, with $V$ being the number of terms in the vocabulary of the corpus; $p_w$ is the marginal probability of term $w$ in the corpus.
$\lambda$ lies between 0 and 1.
It determines the weight given to the probability of term $w$ in topic $k$ relative to the lift associated with it.
When $\lambda = 1$, terms are ranked in decreasing order of their topic-specific probability.
When $\lambda = 0$, terms are ranked in a topic according to their lift.
pyLDAvis provides a user interface to adjust $\lambda$.
For interpreting topics, I have kept $\lambda = 0.6$, which is also the optimum value obtained from a user study³.
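To see how the weight parameter changes the ranking, here is a small sketch that computes relevance as the weighted average of log probability and log lift. The function names and all the probabilities are hypothetical and chosen only for illustration; this is not pyLDAvis code.

```python
import math

def relevance(phi_kw, p_w, lam=0.6):
    """Weighted average of log topic probability and log lift."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

def rank_terms(topic_probs, marginal_probs, lam=0.6):
    """Order the terms of one topic by decreasing relevance."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], marginal_probs[w], lam),
        reverse=True,
    )

# hypothetical probabilities for a single topic
topic_probs = {"cell": 0.10, "apoptosis": 0.04, "study": 0.12}
marginal_probs = {"cell": 0.05, "apoptosis": 0.005, "study": 0.12}

by_probability = rank_terms(topic_probs, marginal_probs, lam=1.0)
by_lift = rank_terms(topic_probs, marginal_probs, lam=0.0)
```

With a weight of 1 the generic word `study` ranks first because it is simply frequent, while with a weight of 0 the topic-specific word `apoptosis` ranks first because its lift is highest. Intermediate values trade off the two.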
You can check the visualization for the current analysis below.
Please check my blog to see and play with the interactive visualization.
The following topics were obtained (listed in decreasing order of their prevalence):

Conclusion:

Firstly, topic modeling using LDA showed good resolution/separation of topic-specific terms for topics 1–9 and 12 (except topic 5), and I found topic interpretation for these fairly easy.
But the interpretation of the other topics was difficult, with many unrelated terms appearing in a single topic, indicating that the topic model is picking up noise or idiosyncratic correlations in the corpus.
Selecting a larger number of topics did not improve topic interpretation.
Importantly, the corpus is not big (only around 6,000 articles), contains only abstracts (short text), and has a vocabulary of 23,157 words; these can be reasons why in some cases the topic model is picking up noise.
Secondly, topics 1 to 7 showed similar topic prevalence, ranging from 10.2% to 7.9% of tokens.
This indicates that most of these topics are equally important or prevalent in the corpus of retracted literature.
In conclusion, in spite of a small dataset, LDA could find 14 topics in the retracted literature, which mainly include: Medical, Cell cycle and proliferation, Cancer, Immunology, Oxidative stress, Genetics, and Central nervous system and cognition.
LDA is a powerful tool that provides an easy and fast way to explore patterns or themes in a corpus.
The source code for this analysis can be found on my Github repository.
This article was originally published at my blog.
Please feel free to visit it.
1. Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), pp. 993–1022.
2. Röder, M., Both, A. and Hinneburg, A., 2015, February. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399–408).
3. Sievert, C. and Shirley, K., 2014, June. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63–70).