Seeding Viral Growth: An Application of Graph EmbeddingSimulating different seeding techniques to maximize information diffusion in a networkShaw LuBlockedUnblockFollowFollowingApr 15Are Influencers Overrated?It is intuitive enough that you need influencers to spread awareness of your product or service.
In one of my graduate classes, however, I came across the notion that influencers may be overrated.
The basic idea is that influencers tend to concentrate in large sub-networks.
If you rely on them to promote your product, you risk missing many smaller satellite networks (more detail here).
This is a wake-up call that many intuitions that we take for granted may not be corrected, and, if implemented hastily, could be costly in the long run.
Over the weekend I decided to test several seeding strategies on different datasets.
As it turns out, there is no simple answer as to whether influencers are overrated.
The growth dynamics heavily depends on the structure of the graph clusters, node embeddings, and how influencers connect with the rest.
Two datasets — Facebook politicians and Deezer music streaming––were particularly illustrative.
After running the GEMSEC graph embedding algorithm, the two datasets are visualized by the top 3 principal components as shown above.
Can you guess which of the following seeding techniques work best for each dataset?Simple random seeding: this is the naive way of seeding, but it has its proponents.
The main advantage is simple, cheap, and fair.
Random seeding on stratified sampling by cluster size: by relying on stratified sampling, we are assured not to miss seeding small clusters.
Influencer seeding: simply choose influencers with the most connections.
Cluster-based influencer seeding: similar to option 2, but we are picking top influencers in each cluster.
This post will focus on high level discussion, and leave the implementation details to the end.
DatasetsReal world social-network datasets were used to train the GEMSEC model.
The datasets are avaialble in the repository.
Here I focused on two datasets:Facebook politician page networks: These graphs represent mutual like networks among verified Facebook pages.
Deezer user-user friendship networks: friendship networks from the music streaming site Deezer (here we focused on Hungary only).
Graph EmbeddingThe only information available in the dataset is edges between nodes.
We know who is connected to whom, but we don’t know why they are connected, or how similar they are.
If we were to model information spread, we need some kind of similarity measure to estimate how likely one person will spread the information to another.
In a music service, the similarity can be music taste.
In a social network, the similarity can be a proxy of intimacy (shared conversation, photos, etc.
)To get around this limitation, I trained the GEMSEC graph embedding model to directly learn node embeddings while clustering them at the same time (hence the name Graph EMbedding with SElf Clustering).
Let me waive the mathematical details, which is available in the original paper.
The important point is that we now have the graph structure and the node embeddings.
SimulationThe diffusion process is modeled like a random walk.
The probability of infection is correlated with the cosine similarity between two nodes.
I made the simplifying assumption that nodes with negative cosine similarity will not infect each other.
In the beginning, only the seed nodes are infected.
The seed nodes are placed in a queue.
Every time a node is dequeued, it tries to infect all its neighbors.
Newly infected nodes (including the old node itself) are appended to the queue in a breadth-first-search fashion.
ObservationThe result is summarized in the two plots below.
Cluster-based influencer seeding outperformed direct influencer seeding in the Facebook politician dataset, but they showed little difference in the Deezer music dataset.
Cluster-based random seeding outperformed simple random seeding in both datasets.
This is expected, as we are essentially performing stratified sampling by cluster size.
Diffusion curve, Facebook politician datasetDiffusion curve, Deezer music dataset (Hungary)What causes the difference in influencer seeding?.The primary reason is that influencers are not equally powerful in all clusters.
If we only choose the top influencers, only 12 out of the 20 clusters will have qualified influencers.
By stratified sampling, we are forcing clusters that have less powerful influencers to be seeded.
A second reason is attributed to the graph structure.
The politician dataset has very polarized, clearly distinct, and well-separated clusters.
The Deezer music streaming dataset is more blurry.
This is understandable, people can enjoy different music genres, but politicians stick to specific (and often opposite) political stances.
The density plots for the top influencers’ neighbor nodes clearly demonstrate the polarized pattern: politicians’ friends are all alike; music lovers forge friendships across clusters.
Kernel density estimation, top influencer’s cosine similarity with connected nodes (Facebook politician)Kernel density estimation, top influencer’s cosine similarity with connected nodes (Deezer music)SummaryGraphs are among the hardest and most profitable problems to work with.
The purpose of this blog is not to propose any universal truth, but to illustrate some perspectives from which you could inspect your own graph datasets.
Some details worth your attention are:When knowledge of the graph is readily available, or easily derivable via machine learning, influencer seeding are much preferred compared to random seeding.
The similarity measure should be a good proxy of infection probability.
Ronaldo has 161 million Instagram followers, but it doesn’t mean he can influence all of them (most of whom are very different from him).
Are clusters well-defined?.Are individual nodes highly polarized?.If so, then influencers in specific clusters could be very influential.
Are influences equally influential in all clusters or segments?.If not, you may need stratified sampling to catch small satellites networks.
Instead of relying on intuition, run simulation and incorporate as much information as currently available.
Although simulation often turns out to be grossly inaccurate, it helps raise valuable questions.
Experimentation DetailsFirst, clone the GEMSEC repository to your workstation.
The datasets are also available in the repository.
Second, create the environment with the required packages.
You can simply run my gemsec_env.
conda env create -f gemsec_env.
ymlTo replicate my result, run the commands in the experiment.
It’s recommended that you use a multi-core-CPU virtual machine.
The algorithm supports multi-threading.
Plots, summary statistics, and animation codes are available here.
Sarkar and C.
GEMSEC: Graph Embedding with Self Clustering.