Auto Generated FAQ with Python Dash, Topic Analysis and Reddit Praw API

Vincent Tatan
May 14

[Figure: Dash Topic Modelling with Reddit API and Python]

Problem Statement

“Damn, after my 3 days leaves, I’m already lost of what’s important on my Slack group”

“There should be a quick access to the posts which are most relevant and get me updated quickly”

The biggest problem with peer reviews or forums is the copious amounts of available information on the site.
Oftentimes, we feel frustrated by the number of comments that are unrelated to what we are searching for.
Take Reddit for example: the home page alone holds a huge number of posts.
All of this clutter is very hard to keep track of.

[Figure: Reddit posts showing how much clutter a forum accumulates after just a few days of inactivity]

In this article, we are going to learn how to make extracting information from forums such as Reddit easier and more intuitive.
One way to do this is to build a dashboard page that extracts critical topics from forums and packages them in a filterable view for a quick overview. I will call this an Auto-Generated FAQ, as it goes through a text corpus and extracts topics to surface trends and patterns, creating a set of Frequently Asked Questions (FAQ) or posts.
This will help us keep in touch with the right information at the right time.
Disclaimer: This disclaimer informs readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.
References are picked up from the list, and any similarities with other works are purely coincidental.

This article was adapted from the author’s paper, written as part of the final assignment for the Master of Computer Science at Georgia Tech.
Why Auto-Generated FAQ?

While there is always a search engine to help us find the information we need on these forums, its functionality is limited, especially for users who are not active or who lag behind the flow of the discussions.
Ultimately, a search engine is useful for those of us who already know what to look for (e.g., the latest machine learning paper).
However, it may not be useful for those who want to contribute to a trending topic or discover new ideas and information related to a topic.
Consider the following matrix of knowledge.

[Figure: Matrix of Knowledge and the solutions in between]

Four domains of knowledge exist during a learning process.
The first involves the known knowns, knowledge that is known and is readily accessible to help tackle problems one is familiar with.
Next would be the known unknowns, or knowledge one is currently unaware of but is accessible.
An example would be looking up a piece of Python syntax that I currently have no knowledge of.
The third form is the unknown knowns.
This is defined as knowledge that one knows exists but has no idea how to access or obtain.
An example would be the completion of a task one has no idea how to start or research.
The final case is unknown unknowns: knowledge one does not have and is unaware of.
In this case, one is actually clueless about what is going on and what is important to know.
This may happen when a person is too busy at work and loses track of the discussion flow in the forum.
Our solution targets areas C and D of the matrix, where users might not know which topics are important at a given time.
Our goal is to provide a knowledge dashboard that lets users take a quick glance at the forum discussion.
Purpose and Github Code

This Proof of Concept (POC) was created as part of an assignment submission for the Master of Computer Science program at Georgia Tech, which the author is currently undertaking.
Some of the content in this article has been modified to better reflect a general audience’s needs.
Please refer to our paper here.
We aim to build this application as a web application with Python and Flask/Dash.
Other tools, such as Github for version control, will also be used.
Please refer to our Github Code here.

Workflow

Let’s get our hands dirty now.
There are 3 steps that we need to consider to create the Auto-Gen FAQ.

[Figure: The workflow of creating Auto-Gen FAQ]

Extracting Reddit with Praw Python Library

How do we extract the Reddit corpus?
Authentication starts with using the praw library for Reddit.
I won’t go into detail on how to get your authentication ready, as there are many resources available.
Feel free to access it here.

[Figure: Reddit praw code to input authorization profiles]

Then we fetch the subreddit information to be exported.
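Since the original code screenshots are not reproduced here, the authentication and subreddit steps above can be sketched roughly as follows. This is a minimal sketch, not the author's code: the helper names (`make_client`, `fetch_hot`, `to_rows`, `write_csv`) and the chosen metadata fields are assumptions for illustration, and the credentials come from your own Reddit app registration.

```python
import csv

# Assumed metadata columns; each is a standard attribute of a praw Submission.
FIELDS = ["id", "title", "score", "num_comments", "created_utc", "url", "selftext"]


def make_client(client_id, client_secret, user_agent):
    """Create a read-only Reddit client from your app credentials."""
    import praw  # imported lazily so the pure helpers below work without praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def fetch_hot(reddit, subreddit_name, limit=100):
    """Return the current hot submissions of one subreddit as a list."""
    return list(reddit.subreddit(subreddit_name).hot(limit=limit))


def to_rows(submissions):
    """Flatten submissions into plain dicts, one per post."""
    return [{field: getattr(s, field) for field in FIELDS} for s in submissions]


def write_csv(rows, path):
    """Write the flattened posts to a CSV file such as topics.csv."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A typical call chain would be `reddit = make_client(...)`, then `hot_python = fetch_hot(reddit, "<your subreddit>")`, then `write_csv(to_rows(hot_python), "topics.csv")`. The subreddit name is left as a placeholder on purpose.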
[Figure: Reddit code to gain a certain subreddit channel]

Next we export hot_python to topics.csv with the following metadata.

[Figure: Metadata retrieved after extracting posts from Reddit Praw]

Topics Extraction

This section illustrates how to do approximate topic modeling in Python.
We will use a technique called Non-negative Matrix Factorization (NMF), which is used to extract topics from bags of words (lists of words).
NMF introduces a deterministic algorithm to create a single representation using the text corpus.
For this reason, NMF is characterized as a machine learning algorithm.
For more information on NMF and other topic modelling techniques, please refer to this research paper here.

[Figure: Using sklearn Count Vectorizer to vectorize words]

This will return a bag of 500 words built from the 1115 posts in the topics data.
We will use NMF to get a document-topic matrix (topics here will also be referred to as “components”) and a list of top words for each topic.
We will make the analogy clear by using the same variable names: doctopic and topic_words.

[Figure: doctopic and topic words created]

We will then generate the topics and visualizations as follows.

[Figure: DocTopic created]

This will create the visualizations for 5 topics. Each topic groups 5 closely related words together, based on the Euclidean distance that NMF minimizes.
Taking the argmax over each document's topic weights also helped us define the dominant topic for each document in the corpus.
The following are the 5 topics found:

Topic 1: omscs program students job course
Topic 2: cs undergrad degree non reviews
Topic 3: georgia tech online master program
Topic 4: courses classes semester students new
Topic 5: time did job offer commitment

Analysis and Visualization of Topics Analysis

We then map the topics to each relevant post, count the number of posts relevant to each topic, and visualize them in a pie chart and a line chart.
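The counting behind those two charts might look like the sketch below, assuming each post already carries a creation timestamp and a dominant topic index. The column names `created_utc` and `dominant_topic` and the four sample rows are assumptions made for illustration.

```python
import pandas as pd

# Invented sample: epoch seconds (UTC) and the dominant topic index of each post.
posts = pd.DataFrame(
    {
        "created_utc": [1546300800, 1546387200, 1548979200, 1549065600],
        "dominant_topic": [0, 1, 0, 0],
    }
)

# Bucket every post into its calendar month.
posts["month"] = pd.to_datetime(posts["created_utc"], unit="s").dt.to_period("M")

# Posts per topic: the numbers behind a pie chart.
topic_counts = posts["dominant_topic"].value_counts().sort_index()

# Posts per month and topic: one line per topic in a line chart.
monthly = (
    posts.groupby(["month", "dominant_topic"]).size().unstack(fill_value=0)
)

print(topic_counts)
print(monthly)
```

Both results are ordinary pandas objects, so they can be handed to any plotting library for the pie and line charts.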
Analysis

From here we can see that the top topics concern job offers for master's and non-master's students, followed by the time commitment for OMSCS students, and then programs and undergrad classes.

1. In May and August, topics about jobs and offers for master's students are on the rise.
2. In October, there is a rising trend of interest in the time and workload of OMSCS students (probably due to the beginning of the new semester).
3. For overall activity, we can see a decline in April but increasing activity in October.

Dash Visualizations

We will use Dash, a Python data visualization framework built on top of Plotly and Flask.
This will be the foundation for our visualization and local deployment.
Please check our presentation and demo below for a more animated view of the application.
Through our application, users will be able to select the topics that have mattered most recently, filter on them, and reveal the timeline.
Furthermore, the table below will showcase the most relevant posts for their chosen topics.
Therefore, rather than clicking through all of the available posts, users can just take a quick glance at this dashboard to catch up with the discussions.
It takes just a few clicks to filter relevant posts.
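A stripped-down version of such a dashboard could be wired up as in the sketch below: a topic dropdown drives both a timeline and a table of matching posts. This is an assumption-laden illustration, not the author's app. The sample posts, component ids, and helper names (`filter_posts`, `timeline_figure`, `build_app`) are invented, and the import style assumes Dash 2.x or later.

```python
from collections import Counter

# Invented sample data; the real app would load the topic-labelled posts.
POSTS = [
    {"title": "Is OMSCS worth it for a job change?", "month": "2019-05", "topic": "Topic 1"},
    {"title": "Job offer after graduating", "month": "2019-08", "topic": "Topic 1"},
    {"title": "Another job question", "month": "2019-05", "topic": "Topic 1"},
    {"title": "Time commitment per course", "month": "2019-10", "topic": "Topic 5"},
]


def filter_posts(posts, topic):
    """Keep only the posts whose dominant topic matches the selection."""
    return [p for p in posts if p["topic"] == topic]


def timeline_figure(posts, topic):
    """Build a Plotly figure dict counting the given posts per month."""
    counts = Counter(p["month"] for p in posts)
    months = sorted(counts)
    return {
        "data": [
            {"x": months, "y": [counts[m] for m in months],
             "type": "scatter", "mode": "lines+markers"}
        ],
        "layout": {"title": {"text": f"Posts per month: {topic}"}},
    }


def build_app(posts):
    """Assemble the Dash app: dropdown -> timeline + table of posts."""
    from dash import Dash, Input, Output, dash_table, dcc, html  # lazy import

    topics = sorted({p["topic"] for p in posts})
    app = Dash(__name__)
    app.layout = html.Div(
        [
            dcc.Dropdown(
                id="topic",
                options=[{"label": t, "value": t} for t in topics],
                value=topics[0],
                clearable=False,
            ),
            dcc.Graph(id="timeline"),
            dash_table.DataTable(
                id="posts",
                columns=[{"name": "Title", "id": "title"},
                         {"name": "Month", "id": "month"}],
            ),
        ]
    )

    @app.callback(
        Output("timeline", "figure"),
        Output("posts", "data"),
        Input("topic", "value"),
    )
    def update(topic):
        # Re-filter on every dropdown change and refresh both components.
        selected = filter_posts(posts, topic)
        return timeline_figure(selected, topic), selected

    return app
```

To try it locally, call `build_app(POSTS).run(debug=True)` on a recent Dash release (older releases use `run_server` instead). Keeping the filtering and figure logic in plain functions makes them easy to check without starting a server.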
[Figure: Dash Python Visualization]

Survey: How useful is our Application?

After building this prototype, we ran a survey to confirm the app’s usefulness.
In the results, 93% of responses rated the application as useful to very useful.
Users also found the application very intuitive, which potentially saved them the time otherwise spent manually scouring through uncategorized posts.
[Figure: Usefulness of the application]

[Figure: Suggested Improvements]

Demo

[Video: Demo for Auto Generated FAQ for Online Master of Computer Science Educational Tech]

Conclusion and Future Work

This project gave us a chance to tackle a practical and relevant problem that many of us seem to face on typical forums such as Reddit or educational forums.
We could add the suggested improvements to its functionality, but for now, it seems to handle exactly what we need.
Acknowledgments

We would like to thank our fellow students and my mentor, Jace Van Auken, for their support and helpful comments on the work products that culminated in this paper.
Lastly, thanks to Ranon Sim, who oversees the general project management of this tool.

Finally…

Whew… That’s it for my idea, which I have put into writing.
I really hope this has been a great read for you guys.
With that, I hope my idea can be a source of inspiration for you to develop and innovate.
Please reach out to me via my LinkedIn and subscribe to my Youtube Channel.
Comment below with suggestions and feedback.
Happy coding :).