In other words, what information is flowing and persisting in these communities?

Information Systems and Data Science often go hand in hand in the field of Social Media Analytics.
What allows their union is the area of Natural Language Processing (NLP): processing language generally requires some form of quantification so that we can model the information before we torture it with math.
For this project, I will be looking at some standard NLP methods to quantify and ‘observe’ textual information (tweets).
This will be followed by some analysis and plotting of various relationships that we can use to make some descriptive inferences of the community profile.
The higher objectives of this post are:
- To introduce all readers (technical and non-technical alike) to some basics of dealing with social media text as data, and possibly help them see how their data can be (or is) used and classified. Maybe even introduce some privacy concerns…
- To explore the quality of results we can get from natural language processing, specifically in the context of Twitter info-based communities
- To achieve a data-centric view of what this community as a whole, as well as its individual users, provides to the root user (me in this case)

What We Will Discuss
1. Quantifying Information (Term Frequencies and Named Entities)
2. Discovering Topics in Tweets
3. Building User Profiles (based on their tweet content and role in the communal information flow)

The Scope and the Data

I will scope this post to the analysis of intra-community properties without going into insights gained by inter-community analysis.
What this means is, we will only look at one community at a time.
For the purpose of showing the entire flow, I will use my community #6 from stage one, a community consisting of young Muslims spread across various fields and interests [Figure 1.
Recall that this is a ‘subjective’ analysis: the group of users that make up this community is only the subset of users that I follow from the broader community.
In this sense, this community is a sample of a larger community, which means insights could very well carry over but would not necessarily be completely accurate.
Side Note: I have actually begun to follow more people that I would say are from the 'Muslim Community' since I collected this data, but they will (unfortunately?) not be included in this analysis, as it builds on my algorithmically found ego-community.
Furthermore, since once again we are concerned with information ‘flow’, we will only be looking at the tweets that ‘flowed’ in the community — which I have approximated as tweets that were retweeted more than once within the community.
These tweets fall into 2 specific groups:
- Within Community Flow (WithinFlow): tweets generated (originally tweeted) by someone within the community and then propagated (retweeted) by someone within the community.
- Outside Community Flow (OutsideFlow): tweets generated by a user not within the community but then propagated by someone within the community.
Figure 1.1: Information Flow Ratio

Figure 1.1 shows the weekly distribution of the flow ratio: each week, the tweets propagated within the community include 10–25 times as many tweets generated outside the community as tweets generated within it.
This can be seen as a social property of the community.

Flow Ratio: #ofRTsGeneratedOutside / #ofRTsGeneratedWithin

Figure 1.2 below shows the weeks of information flow this analysis will look at.
Be advised that the maximum number of tweets that can be pulled from Twitter for any given user is ~3200.
Therefore, if a user tweeted (including quotes and retweets) more than 3200 times during the year, we will not have all of their 2018 tweets, so the distribution is a little skewed. Nevertheless, the difference between outer and inner flow is obvious.
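As a rough illustration of the flow ratio defined above (not the notebook's actual code), the sketch below labels each retweet as within-flow or outside-flow and computes the ratio per ISO week. The user names, the record layout, and the helper names `flow_of` and `weekly_flow_ratio` are all hypothetical, and the "retweeted more than once" filter is left out for brevity.

```python
from collections import Counter
from datetime import date

# Hypothetical community members and retweet records (illustrative only).
community = {"alice", "bob", "carol"}
retweets = [
    # (original author, retweeting user, date of the retweet)
    ("alice", "bob", date(2018, 3, 5)),
    ("newsdesk", "bob", date(2018, 3, 6)),
    ("newsdesk", "carol", date(2018, 3, 7)),
    ("writer", "alice", date(2018, 3, 8)),
    ("bob", "carol", date(2018, 3, 13)),
    ("newsdesk", "alice", date(2018, 3, 14)),
]

def flow_of(author, retweeter, members):
    """Label a retweet 'within' or 'outside'; None if it never enters the community."""
    if retweeter not in members:
        return None
    return "within" if author in members else "outside"

def weekly_flow_ratio(records, members):
    """Return {(iso_year, iso_week): outside_count / within_count}."""
    counts = Counter()
    for author, retweeter, day in records:
        flow = flow_of(author, retweeter, members)
        if flow is None:
            continue
        iso = day.isocalendar()
        counts[(iso[0], iso[1], flow)] += 1
    ratios = {}
    for year, week in sorted({(y, w) for y, w, _ in counts}):
        within = counts[(year, week, "within")]
        outside = counts[(year, week, "outside")]
        # The ratio is undefined for weeks with no within-community retweets.
        ratios[(year, week)] = outside / within if within else None
    return ratios

ratios = weekly_flow_ratio(retweets, community)
print(ratios)  # {(2018, 10): 3.0, (2018, 11): 1.0}
```

Weeks without any within-community retweets are reported as `None` rather than as an infinite ratio; that is a design choice of this sketch, not something the post specifies.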
Figure 1.2: Tweet Data Timeline

This categorization of flow results in a further categorization of users, each serving a mixture of 3 roles within the community:
- Generator: someone who tweets information that flows within the community (creates information).
- Internal Propagator: someone who propagates (retweets/quotes) information generated by a member of the community to other users within the community (spreads information within the community).
- External Propagator: someone who propagates information generated by someone who is not a member of the community to users within the community (brings outside information into the community).

Of course, a user can serve multiple roles to varying degrees, and we will analyze this later on.
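The three roles can be tallied directly from retweet records. The following is a minimal sketch under assumed data: the user names and the `role_counts` helper are hypothetical, and it counts once per retweet event (so a tweet retweeted twice credits its author twice), which may differ from how the original analysis counted.

```python
from collections import defaultdict

# Hypothetical community and (original author, retweeting user) pairs.
community = {"alice", "bob", "carol"}
retweets = [
    ("alice", "bob"),       # alice generates, bob propagates internally
    ("newsdesk", "bob"),    # bob brings outside information in
    ("newsdesk", "carol"),  # carol brings outside information in
    ("bob", "carol"),       # bob generates, carol propagates internally
]

def role_counts(records, members):
    """Count how often each community member acts in each of the three roles."""
    counts = defaultdict(lambda: {
        "generator": 0,
        "internal_propagator": 0,
        "external_propagator": 0,
    })
    for author, retweeter in records:
        if retweeter not in members:
            continue  # the retweet never reached the community
        if author in members:
            counts[author]["generator"] += 1
            counts[retweeter]["internal_propagator"] += 1
        else:
            counts[retweeter]["external_propagator"] += 1
    return dict(counts)

roles = role_counts(retweets, community)
```

In this toy data, `bob` ends up with one count in every role, which is the "mixture of roles" situation described above.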
Lastly, the tweets are also categorized but in a more involved manner.
Every tweet was preprocessed to strip out user mentions, hashtags, URLs, media, etc., until only the raw text remained and we could use prominent text processing libraries (spaCy in this case) to tokenize and tag the tweets.
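A minimal sketch of this kind of cleaning step, using only regular expressions, might look like the following. The patterns, the handling of the "RT @user:" prefix, and the `clean_tweet` name are assumptions for illustration; the actual preprocessing rules in the notebook may differ, and the spaCy tokenizing and tagging step is not shown here.

```python
import re

# Assumed patterns; real tweets have more edge cases than these cover.
RT_PREFIX_RE = re.compile(r"^RT\s+@\w+:\s*")  # leading "RT @user:" marker
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweet(text):
    """Strip retweet markers, URLs, mentions and hashtags, then tidy whitespace."""
    text = RT_PREFIX_RE.sub("", text)
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

sample = "RT @someone: Happiness is a category of slaves #philosophy https://example.com/abc"
cleaned = clean_tweet(sample)
print(cleaned)  # Happiness is a category of slaves
```

The cleaned string is what would then be handed to a tokenizer and tagger.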
Figure 1.3 is a sample of the data we can get from preprocessing text in this manner.

Tweet = "Happiness is a category of slaves"

Figure 1.3: Sample Tweet Processing

Information Extraction

Words as not Just Words … but as Numbers

Words are never ‘only words’, they matter because they define the contours of what we can do.
- Slavoj Zizek

To initiate our investigation, it is important to return to the fundamentals, to the very units that carry information: words.
Speaking to the mentioned Zizek quote, to discover the ‘words’ is to define the ‘contours’: the surface boundary of the shape that constrains the community as is (I confess I have completely misappropriated his statement for my own gain, for which I can only believe he would be proud).
This might seem very obvious in one realm (words carry information… duh) but becomes more meaningful in the area of Information Systems, as it helps to answer a critical question: how do we quantify ‘information’? The answer is almost childlike: by ‘counting’ words.
Important Note: Keep in mind that all our analysis and insights are time-dependent; they only indicate the behaviour of the community within the timeframe for which we have their tweeting history (roughly speaking, the year 2018).
Let’s take a look at the frequencies and plot out some of the more significant terms [Figure 2.1].
Let’s also specifically take a look at ‘Named Entities’ (NE) [Figure 2.2].
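To make the counting concrete, here is a small sketch of how relative frequencies, per-flow ranks, and a rank difference could be computed for terms that appear in both flows. The toy token lists and function names are hypothetical, and the sign convention for the rank difference (within-flow rank minus outside-flow rank, so that terms prominent only in the outside flow sort first) is an assumption about how the charts are ordered.

```python
from collections import Counter

def relative_frequencies(token_lists):
    """Share of all tokens in a flow that each term accounts for."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def ranks(freqs):
    """Rank 1 = most frequent term; ties are broken alphabetically."""
    ordered = sorted(freqs, key=lambda t: (-freqs[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}

def rank_difference_table(within_tokens, outside_tokens):
    """Rows of (term, within_rank, outside_rank, difference) for shared terms."""
    within_rank = ranks(relative_frequencies(within_tokens))
    outside_rank = ranks(relative_frequencies(outside_tokens))
    shared = set(within_rank) & set(outside_rank)
    rows = [
        (t, within_rank[t], outside_rank[t], within_rank[t] - outside_rank[t])
        for t in shared
    ]
    # Largest difference first: prominent outside, less prominent within.
    return sorted(rows, key=lambda r: (-r[3], r[0]))

# Hypothetical tokenized tweets for each flow.
within = [["islam", "ideology", "feminism"], ["ideology", "islam", "ideology"]]
outside = [["trump", "islam", "school"], ["trump", "islam", "trump", "ideology"]]
table = rank_difference_table(within, outside)
print(table)  # [('islam', 2, 2, 0), ('ideology', 1, 3, -2)]
```

Here 'ideology' lands at the bottom because it ranks higher within the community than in the outside flow.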
How to look at these charts:
- Frequencies of terms in “Within-Flow” are in red
- Frequencies of terms in “Outside-Flow” are in blue
- The X axis is the frequency (i.e. 0.1 = 10%)
- Numbered labels are the term’s ranking in the respective flow (i.e. 1 = the highest occurring term within the specific flow)
- Charts are ordered by ‘Rank Difference’, implying:
  - Words near the top occur most significantly in the Outside-Flow and not so significantly in the Within-Flow
  - Words in the middle are similarly significant in both flows
  - Words near the bottom occur more significantly in the Within-Flow and not so significantly in the Outside-Flow

The Meaning Behind the Numbers

How do we extract insights from the above figures? I’ll point out some and leave the rest for you to note.
Sample Readings:
- [Muslim, Islam, Muslims, Islamic] are high-occurring NEs in both flows as well as some of the highest-occurring words overall, which could point to a major discussion point or identity of this community.
- [Liberalism, Ideology, Feminism, Capitalism, Nature…] are some of the major terms found near the bottom of Figure 2.1. They are more significant in WithinFlow than OutsideFlow, implying the community is inclined towards generating information within itself around these terms rather than importing it from outside.
- [Trump, Student, School, Attack…] are some of the terms found near the top of Figures 2.1 and 2.2. The information this community propagates around these terms is mostly generated outside the community (most likely related to sharing news).
- [Shari’a, Eleventh Contentions, Holy Quran, Dostoevsky…] are some terms found near the bottom of Figure 2.2. Significant information around these entities was generated within the community and spread. Dostoevsky might seem like an odd one in this list, but tracing it highlights the generation and spread of a certain article in the WithinFlow of the community:

"On the heels of the horrific Pittsburgh synagogue massacre guest contributor examines Dostoevsky's predictions about ideological radicalization and asks what shapes the psychology of the modern terrorist" – @TraversingTrad
https://traversingtradition.com/2018/10/29/dostoevskys-strange-ideas-and-the-modern-terrorist/

Another way to look at this data (if there are not a lot of points and the labels are clear) is a scatter plot where the axes are the relative frequencies (log scaled) from the two flows [Figure 2.3].
The plot below is for Named Entities and depicts some of the analysis we made above (e.g. note the terms [Muslim, Muslims, Islam, Islamic] in the top right of the chart, indicating significance in both OutsideFlow and WithinFlow).

In the same spirit as the above points, we can continue to extract more comparative properties that help to define the community.
Yet, though this exercise may prove valuable, our spirit would eventually wane as we realize the tediousness of this task.
We will also start to recognize some inaccuracies and begin to ask questions like ‘Wait, is this term really significant, or is it just showing up because it’s associated with this other term and the other term is really the one that’s significant?’
This leads us deeper: we don’t just want a bunch of words, we want to get down to the crux of what all these words represent.
If words defined the contour of this community, we now want to get to the ‘latent sources’ of the contour: what is ‘causing’ these terms to appear? This is the motivation behind the NLP technique of topic modelling.

Topic Modelling & Analysis

Topic-Term Profile

Put simply, topic modelling takes a collection of ‘documents’ (tweets in our case), which are made of various ‘terms’ (words in the tweets), and finds N (the number of topics) unique weighting strategies to apply to the terms such that each ‘document’ is categorized into a mixture of the N topics.
There are numerous tutorials and details about this you can search for so I will not bother to go into more detail than that.
Specifically, I vectorized (basically quantified) the tweets using TF-IDF and then used LDA to find the ‘Topics’.
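For readers who have not met TF-IDF before, the sketch below shows the textbook version of the weighting (term frequency times the log of inverse document frequency) on toy token lists. It is only an illustration: the post does not say which implementation was used, library versions such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization that this sketch omits, and the LDA step (for example scikit-learn's `LatentDirichletAllocation`) is not shown.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical, already-cleaned and tokenized tweets.
docs = [
    ["ramadan", "mubarak"],
    ["ramadan", "article"],
    ["article", "philosophy", "article"],
]
vectors = tfidf(docs)
```

In the first toy tweet, 'mubarak' gets a higher weight than 'ramadan' because it appears in fewer documents; a term that appears in every document gets a weight of zero.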
Figure 3.1 depicts our final topics in WithinFlow.
This sort of approach doesn’t give us a great picture of the ‘viewpoint’ of the community around the topic, just what the topic is — we know ‘what’ they are talking about, not necessarily their opinion on it.
(Though more involved analysis can certainly estimate opinion as well; we’ll see a small example of that later when we discuss polarity and subjectivity.)
There are many other terms that have a weight associated with them from one or more of the topics, but I have only included some of the more significant ones so that readers get an idea of what the topic 'means'.
Figure 3.2: Topic Names

The sizes of the bubbles in Figure 3.1 indicate the ‘weight’ of the term (on the y-axis) to the topic (i.e. Topic 1 in the WithinFlow has ‘Article’ as its most significant term).
If we take the 7 most significant terms in each topic, we get a term-summary of sorts for each topic [Figure 3.2].
We can then take the same topic model and apply it to the OutsideFlow to see how topic weights compare in the information flow coming from outside the community.
Figure 3.3 visualizes a comparison of topic weights over the two flows.
Roughly speaking, I can vaguely see some specific topics popping out: Philosophy (#6), Feminism / Women Studies (#4), Ramadan (#5), Statehood / Muslim-Related Politics (#3), Prophetic Sayings (Hadith) + Religious anecdotes (#2), Article Sharing (#1).
The others still make sense but seem to be a mix of things.
While we’ve clustered terms into topics, we need to go up one level and see how these topics model entire tweets.
Topic-Tweet Profile

Each tweet is assigned a weight for each topic, making a topicVector.
We can then spatially map out tweets after some feature reduction on the topicVector to bring it into 2-D space.
We can see the clusters that form as well as tweets that are not as much associated with any specific topic but are somewhere between them in relative space.
Figure 3.4 visualizes the tweets and clusters for WithinFlow (if you download the notebook file, you can hover over and zoom in on each point to explore which tweet each dot corresponds to; hovering over different clusters will show you which ‘kind’ of tweets are contributing to our topics!).
Note that the size of each circle (each circle = a tweet) represents the tweet's weight for its respective topic: the larger circles are on the outskirts, meaning those tweets are more closely related to their respective topics.
It's as if the ‘latent’ topic sources surround the overall information flow.
We can see isolated clusters of tweets that fall closely within their respective topics, but others are spread out among other topics (i.e. a tweet could be 10% topic 1, 40% topic 2, 50% topic 3, etc.).
The tweets in the middle of the ‘ring’ are a mix of multiple topics with lower weights and therefore are not isolated into a cluster.
These tweets could be interpreted as not really falling into any topic, or being hard to classify — they’re outliers nevertheless.
We can also apply our topic models to the OutsideFlow, shown in Figure 3.5.
We see more of a mix and less defined boundaries in the OutsideFlow tweet clusters — this is expected as generally the information flowing from outside will not be so well-defined as internal propagation.
Once again, you are encouraged to open the notebook file and hover over various clusters to get a feel for what the clusters represent and what kind of tweets are causing cluster-mixing.
Topic Profitability

Using topic weights, we can categorize each tweet into a specific topic by choosing the topic with the largest weight for that tweet.
We can then visualize the retweet count (how many times the tweet was retweeted) distributions for each topic within each flow separately; in this manner we can measure the ‘profitability’ of our topics.
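The assignment step is just an argmax over each tweet's topic vector. A minimal sketch, assuming each tweet is a (topicVector, retweet count) pair and using hypothetical helper names, could group log10 retweet counts by dominant topic like this (topics are indexed from 0 here, unlike the #1, #2, … labels used in the post):

```python
import math
from collections import defaultdict

def dominant_topic(topic_vector):
    """Index of the topic with the largest weight for this tweet."""
    return max(range(len(topic_vector)), key=lambda i: topic_vector[i])

def log_retweets_by_topic(tweets):
    """Group log10(retweet count) by each tweet's dominant topic."""
    grouped = defaultdict(list)
    for topic_vector, retweet_count in tweets:
        if retweet_count > 0:  # log10 is undefined for zero retweets
            grouped[dominant_topic(topic_vector)].append(math.log10(retweet_count))
    return dict(grouped)

# Hypothetical (topicVector, retweet count) pairs.
tweets = [
    ([0.1, 0.7, 0.2], 100),
    ([0.6, 0.3, 0.1], 10),
    ([0.2, 0.5, 0.3], 1000),
]
grouped = log_retweets_by_topic(tweets)
```

Each list in `grouped` is the raw material for one per-topic distribution; plotting one such set per flow gives the kind of comparison described above.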
Each chart in Figure 3.6 shows the distribution of retweet counts for each type of information flow (blue = WithinFlow, orange = OutsideFlow).
These distributions give us a probabilistic idea of how many retweets a tweet within this topic gets.
Note that the x-axis is log10 scaled (i.e. 2 = 100 retweets, 3 = 1000 retweets, etc.).
This means even slight differences between the distributions in the charts can indicate large increases or declines in profit (#ofRetweets).

Sample Readings:
- Tweets in topic #3 (Statehood) that are in WithinFlow are most likely to get ~10–100 retweets.
- Tweets in topic #2 (Hadith, prophetic sayings) get the same amount of retweets (~10–100) whether they are internally generated or brought from outside the community.
- Tweets in topic #1 (Article sharing) get more retweets on average in WithinFlow than OutsideFlow, but only OutsideFlow tweets get more than ~100 retweets.
Topic Polarity & Subjectiveness

Another interesting measure that can be explored is a tweet’s polarity (how much ‘emotion’ is expressed in the text, ranging from -1.0 to 1.0, i.e. how ‘polarized’ it is) and its subjectivity (how ‘personal’ the expression is, ranging from 0 to 1).
I used a pretty weak method of calculating these measures (simply aggregating the polarity and subjectivity of each word in the tweet, which can be obtained from predefined libraries and mappings), but nevertheless we can make some cool, alien-looking graphs!
These are joint density graphs, which model bivariate (2-variable) distributions; they are a nice visual to tell us where most of the tweets are on a polarity-subjectivity scale.
The colour indicates the ‘density’. For example, the maxTopic=3 chart for WithinFlow (red) shows a deep red circle centred around (polarity = ~0.1, subjectivity = ~0.25), indicating that most of the tweets in topic 3 have those polarity/subjectivity values.
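To show what word-level aggregation means in practice, here is a minimal sketch that averages per-word scores over the words found in a lexicon. The tiny `LEXICON` and its values are made up for illustration, and averaging is only one possible way to aggregate; the post does not say which library or aggregation was actually used.

```python
# Hypothetical lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "horrific": (-1.0, 1.0),
    "great": (0.8, 0.75),
    "proud": (0.8, 1.0),
    "modern": (0.2, 0.3),
}

def polarity_subjectivity(tokens, lexicon=LEXICON):
    """Average the scores of the tokens that appear in the lexicon."""
    scored = [lexicon[t] for t in tokens if t in lexicon]
    if not scored:
        return 0.0, 0.0  # no known words: treat the tweet as neutral and objective
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return polarity, subjectivity

score = polarity_subjectivity("the horrific attack on modern society".split())
```

Only 'horrific' and 'modern' are in the toy lexicon, so the example tweet comes out at roughly polarity -0.4 and subjectivity 0.65. Each tweet's pair of values becomes one point in the joint density plots.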
Sample Readings:
- Tweets in topic #4 (Woman-Muslim-Man-Abortion-Hijab-Feminist-Want) are more subjective in WithinFlow (generated internally within the community) than the tweets brought into the flow from outside the community.
- Tweets in topic #3 (Statehood) are more spread out on the polarity/subjectivity scale in WithinFlow compared to the tweets in OutsideFlow, which are more densely packed.
These are general trends, and I will take the time again to remind the readers that our analysis is both subjective (based on my personal followings) and temporal (the information flow in 2018-ish).
Let’s move on to users, because everyone knows… the real gossip isn’t about ideas… it’s about the people!

Since we know which tweet came from which user in the community, we can go up a level again and investigate at the user level.

Topic-User Profile

FYI: All these user profiles were public (at the time I collected this data, so legally I’m cool).
If a user has 1000+ followers (pretty much an arbitrary number I ‘intuitively’ decided), I’ve kinda just assumed they are openly public and should be okay with profiling their tweets.
For users with <1000 followers, I have messaged and asked if they were okay with it, and removed their data if they indicated otherwise.
However, if you wish for your name to be removed for any reason, let me know!

Before we mix user profiles with our generated topics, let’s start by profiling our users individually… there are algorithms that can help us find the most discriminative terms in a subset of documents (i.e. tweets by a specific user) compared to the entire set of documents (all the tweets in the information flow).
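The post does not name the algorithm used for this, so the following sketch substitutes one simple option: a smoothed log-ratio between a term's share of one user's tokens and its share of all tokens in the flow. The function name, the smoothing value, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def discriminative_terms(user_tokens, all_tokens, top_n=3, smoothing=1.0):
    """Rank a user's terms by a smoothed log-ratio against the whole flow."""
    user_counts, all_counts = Counter(user_tokens), Counter(all_tokens)
    vocab_size = len(all_counts)
    user_total, all_total = len(user_tokens), len(all_tokens)
    scores = {}
    for term in user_counts:
        # Add-one style smoothing keeps rare terms from dominating.
        p_user = (user_counts[term] + smoothing) / (user_total + smoothing * vocab_size)
        p_all = (all_counts[term] + smoothing) / (all_total + smoothing * vocab_size)
        scores[term] = math.log(p_user / p_all)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Hypothetical tokens: the whole flow, and one user's share of it.
all_tokens = ["islam"] * 6 + ["trump"] * 4 + ["dostoevsky"] * 2 + ["article"] * 4
user_tokens = ["dostoevsky", "dostoevsky", "article", "islam"]
top_terms = discriminative_terms(user_tokens, all_tokens, top_n=2)
print(top_terms)  # ['dostoevsky', 'article']
```

Note that 'islam' is the most frequent term overall but is not discriminative for this user, which mirrors the point that discriminative terms are not the same as most significant terms.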
Figure 3.8 shows the most discriminative terms in each user’s generated tweets (genDiscTerms) and in the tweets they propagated (propDiscTerms) that ‘set them apart’ from the rest of the information flow.
This does not mean these are their most significant terms!
"N/A" means there weren't enough tweets to really discriminate any terms. Sorry!
*For those of you who know '@dimashqee', his account has been deactivated, so we don't know which tweets he has retweeted, even though he tends to be a major player in this community around certain topics.

This already gives us an idea of each user! But we want to be able to profile users based on their contributions to the topics we have found in our information flow.
To accomplish this, we can aggregate our topic-tweet weights from the previous section by the User who Tweeted and the User who ReTweeted.
This allows us to profile the users in the community with respect to their 3 roles that we highlighted earlier (Generator, Internal Propagator, External Propagator).
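As a rough sketch of this aggregation, the code below sums topic weights per user and then normalizes each topic so that the users' shares add up to 100%. The `occupation_percentages` name and the toy rows are hypothetical; in practice the rows would be grouped by the tweeting user for the Generator role and by the retweeting user for the two Propagator roles.

```python
from collections import defaultdict

def occupation_percentages(rows, n_topics):
    """rows: iterable of (user, topic_vector). Returns {user: [% per topic]}."""
    totals = defaultdict(lambda: [0.0] * n_topics)
    for user, topic_vector in rows:
        for k, weight in enumerate(topic_vector):
            totals[user][k] += weight
    # Total weight per topic across all users, used to normalize each column.
    column_sums = [sum(t[k] for t in totals.values()) for k in range(n_topics)]
    return {
        user: [
            100.0 * t[k] / column_sums[k] if column_sums[k] else 0.0
            for k in range(n_topics)
        ]
        for user, t in totals.items()
    }

# Hypothetical (user, topicVector) rows for one role.
rows = [
    ("alice", [0.75, 0.25]),
    ("bob", [0.25, 0.25]),
    ("alice", [0.0, 0.5]),
]
occupation = occupation_percentages(rows, n_topics=2)
print(occupation)  # {'alice': [75.0, 75.0], 'bob': [25.0, 25.0]}
```

Here 'alice' occupies 75% of both toy topics for this role, and each topic column sums to 100% across users.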
Figures 3.9, 3.10, and 3.11 depict User-Topic Occupation Percentages. To read them:
- Occupation Percentage: the size of the circle indicates the % of the specific information flow (WithinFlow or OutsideFlow) around that topic that the specific user ‘occupies’ (which can mean how much info they generate or how much they propagate, depending on the chart)
- All columns sum to 100%
- The last column, ‘TopicSum’, indicates the total % of information flow the user ‘occupies’ in the respective role of the chart.

Some of the numbers are hard to see here; I once again point readers to the linked notebook, in which you can hover over the points and see the complete topic name as well as a clearer Occupation %.
Occupation sounds a little… harsh, but it is meant to be neutral here. :)

We can also use scatter plots to see a high-level role profile of the user space.
Figures 3.12, 3.13, and 3.14 show higher-level relationships between different user roles in the community (i.e. Figure 3.12 is a comparison between a user’s total Internal Propagation Occupation % in the community (across all topics) and the user’s total Generation Occupation % in the community).
GEN = Generator, IPROP = Internal Propagator, OPROP = Outer/External Propagator
Note: those with very low occupation % are excluded.
Figures 3.9 to 3.14 collectively are, in my view, one of the more valuable measures, capturing user behaviour not only by content but also by type of activity.
Here are some sample high-level readings that can be gained from the above 6 figures.
Sample Readings:
- ‘@TraversingTradition’ is a major generator within this community’s internal information flow, generating over 24% of the internal material this year, but propagates only 7% of the internal material to the rest of the community. It only seems to effectively engage with 6/9 major topics within the community.
- ‘@AndrewStodghill’ and ‘@SeekingErudite’ are large propagators in both kinds of information flow but very low generators. These users can be seen as important nodes for the continuity of information flow even though they do not necessarily generate material.
- TheSalafiFeminist (‘@AnonyMousey’) brings the largest proportion of outside information into the information flow for topic #2 (Feminism / Women Studies)
- There seems to be a kind of 30–70 rule for each topic… 30% of users (~5) generate and propagate ~70% of the information in each topic.
- The generation occupation of topic #4 (related to feminism, hijab, etc.) is much more distributed than, say, the generation occupation of topic #2 (related to prophetic sayings (Hadith), religious anecdotes and quotes, etc.), which is 50% monopolized by a single user. Such a trend could prompt further research, as it may hint that topic #4 is less of an ‘echo chamber’ than topic #2.
Of course, there are a lot more insights, depending entirely on what kind of questions you are looking to answer and if you’re trying to determine the behaviour of a single user, etc.
…but wait, there’s more! We can also break down the various user roles by topic and explore inter-topic correlations.
To save you from a barrage of charts, I have included the analysis in Appendix A for anyone interested.
Concluding Remarks

Our analysis has helped us to successfully form a Term-level, Tweet-level, and User-level profile of the information flow surrounding the selected ego-community.
Overall, the results have been quite satisfactory!

Rather than topic modelling blindly on a set of tweets, the initial breakdown into communities of users has helped to constrain the problem and find topics that are truly relevant and insightful.
This is further validated by major contributors to the information flow like ‘@TraversingTradition’ being highlighted in our user-profiling stage.
Reminder that once programmed, these kinds of analysis (and much more detailed, advanced processes) take minutes to perform — this means when you are consenting to use Twitter as a public platform, you’re also making your data public for tools and algorithms that are more efficient and effective than you may think.
Of course, my purpose was not ‘malicious’ (I promise!) but a more malicious purpose would have the same access…

Since I follow these users personally, the characteristics, content, and behaviours of these users extracted from this analysis were not much of a surprise and seem to agree with the intuitive understanding I have qualitatively gained of the community over time.
However, as this analysis can be completed for any set group of users, this sort of topic modelling analysis can preemptively give us an idea of the total (and/or user-specific) information gain this community can provide.
The obvious use case for such an understanding in today’s social media models points to targeted marketing and advertisement, but one can also imagine less ‘corporate’ uses that gain sociological information for studies, research, policy development, etc.

Highly weighted terms in both WithinFlow and OutsideFlow of our analysis, like [Islam, Muslims, etc.], help us to inversely validate our community detection and feel good about our algorithmic grouping of users.

Though we scoped our analysis to only one community, there is certainly room for inter-community comparisons, perhaps even building base metrics that define a community’s information flow and discovering that communities are made up in predictable fashions in practice (i.e. in terms of a distribution of their user roles)?

The analysis here has been mainly descriptive, but this information can definitely be used to build predictive models as well, and help us see effects and influences that can develop in the future.
The main purpose of this post was to classify the substance of the information flow so I’ve remained shy in attaching interpretation to our observations and extracting predictive insights.
This is because, when interpreting, we face a soft limit of technocratic approaches: the observations, as well-plotted and interesting as they are, could mean anything and everything.
For this problem, (I believe) data scientists have to make a brave attempt at leveraging the field of sociology to provide us with a theory or two to interpret what we have found — emphasizing a philosophical view I hold dearly: the necessity of backing data science related analytical processes with ‘Theory’.
Though community detection also followed from the concept of a ‘sociological’ ego-network, this was still a very superficial reference to the sociological field.
To further our analysis of the content of information flow, we will require something much, much deeper.
(hint: Goffman? Bourdieu? Sartre? Heidegger?)

Community Detection (Stage 1) helped us to find ‘where’ the information flowed, and Topic Modelling (Stage 2) has helped us to find ‘what’ information is flowing; the next stage (3) is the ‘why’.
In the future, I plan to wrap this project up by introducing connections between data science analytics and sociological theories (in the field of social media analytics) and how they can help us to interpret and constrain the meaning from our results.
Stay tuned…

Appendix A: Inter-Topic Correlations Between Roles

Now, I know this looks like a lot of charts, dots, and lines, but don’t be intimidated! Figures A.1 and A.2 show the relationships between external and internal user roles (External Propagator vs. Internal Propagator / Generator) by topic (e.g. the top left chart in Figure A.1 relates External Propagation Occupation % for topic 0 to Generation Occupation % for topic 0, and the top right chart relates External Propagation Occupation % for topic 0 to Generation Occupation % for topic 8).
In our case, we have a pretty small sample size (~22 users), so these charts are to be taken with a grain of salt, but in principle we can use these visualizations to get an even deeper understanding of specific topic relations.
To read the plots below, it is best to look for oddities as you scan from left to right or top to bottom.
What we are looking for is peculiar behaviour, i.e. if the internal propagation occupation % of topic X is correlated with the generation occupation % of topic Y in a different manner than with other topics, this could hint towards the topics having an effect on each other.
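One way to put numbers on such a grid of scatter plots is a correlation coefficient per pair of topics. The sketch below uses the Pearson correlation as an assumed measure (the post does not state which statistic, if any, sits behind the trend lines), with hypothetical users, occupation percentages, and helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def correlation_grid(role_a, role_b):
    """grid[i][j] = correlation of role_a's topic i with role_b's topic j across users."""
    users = sorted(set(role_a) & set(role_b))
    n_topics = len(next(iter(role_a.values())))
    return [
        [
            pearson([role_a[u][i] for u in users], [role_b[u][j] for u in users])
            for j in range(n_topics)
        ]
        for i in range(n_topics)
    ]

# Hypothetical occupation % per topic for two roles (2 topics, 3 users).
oprop = {"alice": [10, 40], "bob": [30, 20], "carol": [60, 40]}
gen = {"alice": [50, 10], "bob": [30, 30], "carol": [20, 60]}
grid = correlation_grid(oprop, gen)
```

In this toy data `grid[0][0]` is negative: users who propagate more outside information on the first topic generate less on it. With only a handful of users, as in the real sample, such coefficients are very noisy and are best read as hints.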
GEN = Generator, IPROP = Internal Propagator, OPROP = Outer/External Propagator

Sample Readings for Figure A.1:
- There is a stronger trend showing that users propagating external information around topic #0 tend to generate less information around topics 4, 5, 6, 7, and 8.
- Generally, looking at the diagonal, users who generate more information around a certain topic tend to propagate less external information.

Sample Readings for Figure A.2:
- There is a peculiarly strong positive correlation between internally propagating topic #2 and externally propagating topic #2.
- Looking at the diagonal, it seems that, generally, users who internally propagate any given topic more also tend to externally propagate the same topic more.
Thank you for reading!

My general purpose in these posts is a hands-on approach to exploring concepts I’m personally curious about. So, regardless of your field, if you have any interesting ideas you want to discuss or collaborate on where you think Data Science can provide some value, feel free to message me and we can talk.