- Removes punctuation
- Removes numbers
- Removes stopwords (words that are too common and so make poor keywords for search)
- Applies lemmatization (converts each word to its lemma, e.g. "ran" and "running" both become "run")

Content

This module connects to the sqlite3 database and helps us iterate over the pages and clean their content using the Cleaner module. I added other methods to get the page and URL by id.
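A minimal sketch of what such a Cleaner could look like. The stopword set and lemma map below are tiny stand-ins for illustration only; a real module would typically use a full stopword list and a proper lemmatizer (e.g. from NLTK or spaCy):

```python
import string

# Tiny stand-in resources; a real Cleaner would use a complete
# stopword list and a real lemmatizer instead of these toy mappings.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}
LEMMAS = {"ran": "run", "running": "run", "models": "model"}

def clean(text):
    # Lowercase, then strip punctuation and numbers.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    tokens = text.split()
    # Drop stopwords, then map each remaining word to its lemma.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

print(clean("Running the 2 models, and ran again!"))
# → ['run', 'model', 'run', 'again']
```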
Application

Photo by Jason Leung on Unsplash

Once I had the modules set up, I began scraping for data, training the LDA model and recommending articles.
Collect Data

First, I run the file collectData.py, which expects two arguments to begin extracting data from Wikipedia and storing it in the database:
- category: the category for which we want to develop the article recommender system
- depth: to what depth we want to extract the webpages for a given category
For example, starting with depth 2, the crawler visits the category's articles, goes one step deeper (i.e., to their related articles) with depth 1, and stops at the next level, where the depth reaches 0.
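The depth-limited crawl described above can be sketched roughly like this (crawl and get_related are hypothetical stand-ins for the WikipediaCrawler internals, and the link structure is invented for the demo):

```python
def crawl(page, depth, visited=None, get_related=None):
    """Visit `page`, then recurse into related pages while depth > 0."""
    if visited is None:
        visited = set()
    if page in visited:
        return visited
    visited.add(page)
    # The depth counter decreases by one at each level; the crawl
    # stops expanding once it hits 0.
    if depth > 0 and get_related:
        for related in get_related(page):
            crawl(related, depth - 1, visited, get_related)
    return visited

# Hypothetical link structure for illustration.
links = {"ML": ["NN", "SVM"], "NN": ["CNN"], "SVM": [], "CNN": ["RNN"], "RNN": []}
pages = crawl("ML", 2, get_related=lambda p: links.get(p, []))
print(sorted(pages))  # → ['CNN', 'ML', 'NN', 'SVM']
```

Note that RNN is never visited: it sits three links away from the starting page, beyond depth 2.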
It creates the directory data if it does not exist.
Using WikipediaCrawler, it extracts the pages and stores them in wikiData.db to be used by other files.
On completion, it outputs the message: The database has been generated.

Generate LDA

The next step is to use the database we created, build an LDA model from it and store it in the data folder.
First, I read the database and create a dictionary.
I remove all words that appear in fewer than 5 documents or in more than 80% of the documents. I tried multiple values and settled on these numbers by trial and error.
Then, using doc2bow, I create the bag-of-words representation, which acts as the list of keywords.
Finally, I generate the LDA model and save the model, dictionary and corpus.
Evaluator

Finally, everything is ready. We invoke evaluator.py and pass in a query string, based on which we identify the keywords and list the top 10 articles that match the search criteria.
I read the query and identify the keywords from it.
Then, by invoking the get_similarity method, I calculate the similarity matrix and sort the results in decreasing order so that the most similar documents are at the top. Next, I iterate over these results and present the top 10 URLs, which represent the recommended articles.
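The ranking step can be sketched as follows; the URLs and similarity scores here are hypothetical stand-ins for what get_similarity would return over the real corpus:

```python
def top_matches(scores, urls, k=10):
    """Sort documents by similarity, highest first, and return the top-k URLs."""
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:k]]

# Hypothetical similarity scores for four documents.
urls = ["wiki/A", "wiki/B", "wiki/C", "wiki/D"]
scores = [0.12, 0.87, 0.45, 0.60]
print(top_matches(scores, urls, k=3))  # → ['wiki/B', 'wiki/D', 'wiki/C']
```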
Real Example

Use Case

I created the database with depth 2 and the category Machine Learning. It generated the file wikiData.db. Next, using generateLDA.py, I created the LDA model.
Usage

I used the search query "Machine learning applications" and was recommended the articles shown in the image below:

Recommended articles for ‘Machine learning applications’

Conclusion

In this article, I walked through how I developed an LDA model to recommend articles to users based on a search query.
I worked with Python classes and devised a complete application.
If you like this article, check out my other articles:
- Working with APIs using Flask, Flask RESTPlus and Swagger UI: An introduction to Flask and Flask-RESTPlus (towardsdatascience.com)
- Predicting presence of Heart Diseases using Machine Learning: Application of Machine Learning in Healthcare (towardsdatascience.com)
- Machine Learning Classifier evaluation using ROC and CAP Curves: Learn about ROC and CAP Curves and their implementation in Python (towardsdatascience.com)
comAs always, please feel free to share your thoughts and ideas.
If you need help in a project or an idea, ping me on LinkedIn.
You can find me here.