An important consideration here is that information theory deals with the limits of communication, so a given entropy rate and its corresponding predictability set an upper limit on how well any predictive model can perform on this data.
The random predictability (with a vocabulary of 5000 words) would be 0.02%, whereas the average Twitter user in their dataset is characterized by 53% predictability (corresponding to an entropy rate of 6.
This would mean that, in an ideal model built on this Twitter data, more than every second predicted word would be correct.
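A common way to turn an entropy rate into a predictability limit is Fano's inequality. The sketch below assumes that this is the conversion behind the numbers above, which this article does not state explicitly. The function names and the bisection tolerance are my own illustrative choices, and the vocabulary size of 5000 is taken from the text.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def fano_entropy(p, vocab_size):
    """Entropy rate (bits per word) that corresponds to predictability p
    under Fano's inequality: S = H(p) + (1 - p) * log2(N - 1)."""
    return binary_entropy(p) + (1 - p) * math.log2(vocab_size - 1)


def max_predictability(entropy_rate, vocab_size, tol=1e-10):
    """Invert fano_entropy by bisection to get the predictability limit."""
    lo, hi = 1.0 / vocab_size, 1.0
    if entropy_rate >= math.log2(vocab_size):
        return lo  # no better than random guessing
    if entropy_rate <= 0.0:
        return hi  # fully predictable
    # fano_entropy decreases from log2(N) to 0 as p goes from 1/N to 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano_entropy(mid, vocab_size) > entropy_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Random guessing over 5000 equally likely words
print(f"{max_predictability(math.log2(5000), 5000):.2%}")  # prints 0.02%

# Entropy rate implied by 53% predictability under this formula
print(round(fano_entropy(0.53, 5000), 2))
```

Lower entropy rates map to higher predictability limits, which is why any extra information that reduces the uncertainty about your next word raises the ceiling for a model.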
Yet the main part of this story deals with your friends.
Analogous to entropy, cross-entropy here is the number of bits needed to predict your texts when the prediction is based on your friend's texts.
In this article, the authors chose the 15 presumably closest friends for each individual (those the individual mentioned most often on Twitter).
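To make the idea of cross-entropy more concrete, here is a deliberately simplified sketch. It is not the estimator used in the study, which this article does not spell out. Instead it fits a bag-of-words (unigram) model with add-one smoothing on one person's words and measures how many bits per word that model needs to encode another person's words. The function name and the toy sentences are made up for illustration.

```python
import math
from collections import Counter


def cross_entropy_bits(target_tokens, source_tokens):
    """Average bits per word needed to encode target_tokens with a
    unigram model fit on source_tokens (add-one smoothing)."""
    if not target_tokens:
        raise ValueError("target_tokens must not be empty")
    vocab = set(target_tokens) | set(source_tokens)
    counts = Counter(source_tokens)
    total = len(source_tokens) + len(vocab)  # smoothed normalizer
    bits = 0.0
    for word in target_tokens:
        p = (counts[word] + 1) / total
        bits -= math.log2(p)
    return bits / len(target_tokens)


# Toy data (invented): a friend who talks like you vs. a stranger
you = "coffee first then code then more coffee".split()
close_friend = "coffee then code with you then coffee".split()
stranger = "the match was a total disaster tonight".split()

print(cross_entropy_bits(you, close_friend))
print(cross_entropy_bits(you, stranger))
```

The fewer bits a friend's model needs for your words, the more information about you that friend carries. A unigram model ignores word order, so it only hints at what a sequence-based entropy rate would capture.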
The first important point is that there is information about you in your social circle.
Combining predictive information from you with that from your friends increases the predictability to over 60% (64% for an infinite number of friends). There are definitely diminishing returns at work here, though: the effect of the first friend added to the model is a lot more palpable than the effect of the tenth.
This is all well and good for improving predictions, but now comes the whopper.
Remove yourself from the social media network and only use your friends to predict your text and you end up at ~56% predictability with 15 friends or ~61% with an infinite number of friends.
Just to make it really clear: using a mere 8–9 friends (and no data from the individual), you break even with using information from the actual person, and you can get up to 95% of the maximum predictability of individual+friends!
If you’re a data science person who wants to make use of this, here’s another insight: this predictability is especially pronounced for individuals who posted a lot (thereby strongly influencing/imprinting their friends), combined with friends who didn’t post a lot (as they would otherwise be too varied in their expressions) and who mention the individual frequently.
Therefore, the embedding a person leaves in the network when they delete their profile varies according to their personality and network.
Here are some caveats/limitations to this study: As with all such studies, this social embedding effect unfortunately only works in practice if you have at some point been an active social media user, since that is how your social circle is identified.
Yet if there is some other way to link you to your friends (GPS co-location, mentions in posts without tagging, etc.) and if they are on social media, this caveat is void.
Another consideration is that the predictability gained with social media might be limited to texts posted on social media.
And while this still allows for probing sentiments and attitudes, a model built on this might not be accurate in predicting, say, longform articles written by the individual.
The most important limitation, also mentioned by the authors, is that this embedded information may change over time as your social circle evolves (hell, some of them might even quit social media as well).
Thus, you might have a short time-frame to develop a well-performing model from the moment the individual quits social media (good for you, individual!).
It would be interesting to see whether this could be mitigated by including more friends in the model, by carefully choosing friends with a low degree of ‘change’ to preserve the embedding, or simply by relying on older archived social media data (the internet forgets nothing).
This is why prediction from your friends after you quit social media will be important.
Source: Google Trends
In summary, I think this article is a great example of the way we preserve information in our environment.
Think of cities for instance.
What else are they but the memory of crossroads, natural harbors and trade routes?
In the same vein, our friends are representative of at least a part of us and carry information about us with them.
And apparently, with enough friends, this could be enough to teach a model to know us better than if it had ‘studied’ us directly.
With the exodus from traditional social media platforms mentioned earlier, this might be a golden opportunity for advertising firms or agencies interested in your political leanings to maintain or expand their predictive potential using machine learning and data mining.
I wonder how easy to build and how potent a machine learning model built just on friends would be in practice! Let me know if you give it a try!
Some additional notes / trivia from the article which you may find interesting:
– Social media texts are more extreme than ‘conventional texts’, with some being very predictable and some intractable.
– Based on cognitive limits, Dunbar’s number postulates a maximum number of around 150 friends per individual.
Yet the average number of Facebook friends is over 300 and the average number of LinkedIn connections clocks in at over 500.
This means social media platforms could have quite the impact compared to conventional friendship networks.
– There is more long-term information in the posts of individuals themselves compared to their friends.
This shows in the diminishing returns: friends' posts from a while back contribute much less to predicting an individual's text than their recent posts do.
– The limits presented here might be extendable as the authors excluded hyperlinks from their data.
Using co-training or other approaches might result in a model which includes information about these hyperlinks and is therefore able to leverage more information and achieve better predictions.