Let’s take a look at five potential ways of answering those questions.
Measure 1: Common NeighborsThe most intuitive and simplest measure is to count the number of individuals that both individual A and individual B are connected to.
How many common neighbors do Jack and Liz have?The answer is two: Mike and Cindy.
Now all that’s left to do is count the number of common neighbors for all non-existent connections and sort them from highest to lowest.
Measure 2: Jaccard CoefficientThe Jaccard coefficient is very similar to the number of common neighbors with the difference that the number retrieved in measure one is normalized.
To do so, one takes the number of common neighbors and divides it by the total number of neighbors.
In other words, one divides the intersection of A and B by their union:In the case of Liz and Jack, we have already determined that the number of common neighbors is two.
How many neighbors do Liz and Jack have in total?Liz and Jack have five neighbors in total.
As a result, their Jaccard coefficient is two divided by five or 0.
As for measure one, one would now have to calculate the Jaccard Coefficient for all non-existent connections and compare them.
Measure 3: Resource Allocation IndexDon’t get confused by the name.
This measure is similar to the Jaccard coefficient in that it normalizes using the total number of neighbors.
This time, however, we’re looking at the common neighbors of the individuals we are interested in.
Mathematically expressed, resource allocation looks like this:All this means is that we want to calculate the sum of one over the number of total neighbors for each common neighbor of our individuals of interest.
For this one, let’s calculate the Resource Allocation index for Elle and Ronald.
Elle and Ronald have one common neighbor, Steve.
Steve himself has three connections.
Thus, applying the formula above gives us 1/3.
Because Elle and Ronald only have one common neighbor, we don’t need to do anything else.
If they had another neighbor, we’d do the same thing we did for Steve and add those two numbers.
Measure 4: Adamic-Adar IndexThe Adamic-Adar index is pretty much equivalent to the Resource Allocation index except for the fact that instead of dividing by the total number of neighbors, one divides by the log of the total number of neighbors.
Mathematically expressed:Measure 5: Preferential Attachment ScoreThe Preferential Attachment Model is an attempt to create a blueprint of the essential structure of many social networks.
It assumes that individuals (nodes) with many connections (a high degree) get more new connections (neighbors) than individuals with fewer connections.
In other words, if you already have many friends, you are going to meet more new people that someone with fewer friends.
To calculate the preferential attachment score, one simply multiplies the number of connections the individuals of interest have with each other.
This calculation is best explained visually.
Remember Liz and Jack?.If not, here’s a little refresher:As you can see, both Liz and Jack have three neighbors.
Thus, their preferential attachment score is 3*3=9.
Who’s Most Likely Your Next Friend?Now that we’ve gathered all the needed measures, we can finally answer our initial question.
One option to do so would be to choose one of these measures, calculate the scores, and then compare the results.
However, as I wanted to avoid introducing any bias into this project, I decided to calculate the scores for all measures, normalize them, and take the average score.
This way, I avoid having to choose between the measures and also receive a resulting score that ranges between zero and one.
Let’s assume we want to find out who Jack is most likely to connect with next.
Can you guess who it is?.Very unsurprisingly at this point, it’s Liz with a score of 0.
Let’s do another one: Who is Mark most likely to connect with?.The answer is Liz.
Now, looking at the structure of our social network, is there something standing out to you that would challenge that result?Taking another look at this social network, it seems like there are two communities with stronger connections within them than with the other community.
When calculating probabilities of who someone is most likely to connect with, not paying attention to community structures could definitely result in biased results.
People within a more tightly knit community should be more likely to meet people from their own community than people from different communities.
To account for the factor of community structures within a social network, I’m going to introduce two new measures that are extensions of the previously discussed measures that were introduced by Soundarajan and Hopcroft (2012).
Adapted Common NeighborsInstead of simply counting the number of common neighbors of two individuals, this adapted common neighbors measure first counts the number of common neighbors and then counts how many of these common neighbors are in the same community as individuals A and B.
Finally, those two numbers are summed up.
In our case, let’s look at Ronald and Elle:Ronald and Elle have one common neighbor, Steve.
Steve is part of the same community as Elle and Ronald.
Thus, the adapted common neighbors score is 1+1=2.
Adapted Resource AllocationAgain, this one is very similar to the measure introduced above.
The only difference is that the Resource Allocation score proposed by Soundarajan and Hopcroft (2012) only takes individuals into account that are in the same network as the two individuals of interest.
As explained above, f(u)=0 when the individual is not in the same network as X and Y and f(u)=1 when they are in the same network.
The formula makes the measure look more complicated than it actually is.
What’s the adapted Resource Allocation score for Steve and Pete?They have one common neighbor, Elle.
However, as they are both in different communities, f(u)=0.
Therefore, as there are no other common neighbors to consider, their adapted Resource Allocation score is zero too.
Let’s try to find Mark’s most likely next connection again.
Using the measures I’ve just introduced, we find that Mark’s most likely next collection is actually Elle!.This result seems to make more sense given that we’ve classified both to be part of the same community.
ConclusionSocial network analysis is a fascinating topic to explore.
In addition to just evaluating the edges between nodes, one could also, given a set of labels, use these measures as features themselves in supervised machine learning.
Besides that, modeling who will eventually become good friends might also be more accurate if more data for each node is available, such as their interests, age, gender, etc.
In case you should be interested in the code used to conduct this analysis, you can find the complete notebook on my GitHub.
References: Holder, Mark D.
& Coleman, Ben, The Contribution of Social Relationships to Children’s Happiness (2009), Journal of Happiness Studies Otte, Evelien & Rousseau, Ronald, Social network analysis: a powerful strategy, also for the information sciences (2002), Journal of Information Science Soundarajan, Sucheta & Hopcroft, John, Using community information to improve the precision of link prediction methods (2012), Proceedings of the 21st International Conference on World Wide Web Romero, Daniel, Applied Social Network Analysis in Python, Coursera.. More details