# Prison Escape – Solving Prisoner’s Dilemma with Machine Learning

Prison Escape – Solving Prisoner’s Dilemma with Machine LearningDemystifying the Most Famous Game Theory ProblemGreg SurmaBlockedUnblockFollowFollowingApr 8In today’s article, we are going to demystify the most famous Game Theory problem – Prisoner’s Dilemma.

We are going to study the problem itself, as well as the strategies that can be used to approach it.

Ultimately we are going to conduct a tournament to find the most successful strategy.

By the end of this article, you will be familiar with the Prisoner’s Dilemma mechanics and its implications that can be useful in many real-world situations.

Prisoner’s Dilemma (PD)Two members of a criminal gang are arrested and imprisoned.

Each prisoner is in solitary confinement with no means of communicating with the other.

The prosecutors lack sufficient evidence to convict the pair on the principal charge, but they have enough to convict both on a lesser charge.

Simultaneously, the prosecutors offer each prisoner a bargain.

Each prisoner is given the opportunity either to betray the other by testifying that the other committed the crime, or to cooperate with the other by remaining silent.

The offer is:If A and B each betray the other, each of them serves 2 years in prisonIf A betrays B but B remains silent, A will be set free and B will serve 3 years in prison (and vice versa)If A and B both remain silent, both of them will only serve 1 year in prison.

Above payoff matrix is crucial to understand the underlying mechanics of the problem.

While we can clearly see that there is no one-size-fits-all solution, can we come up with a strategy that will maximize our rewards?Although prisoner’s interrogation is a good and descriptive example, it’s just one of the many possible scenarios that can fit the PD game model.

Before we dive deeper into PD, let’s move on to the real-life examples to find out what may be the other possibilities of this game theory problem.

Real-Life ExamplesWhile the PD model is interesting to study on its own, it gets even more interesting if we can see that it may be used to model many real-world situations.

Here are a couple of examples.

SportsDoping usage is a classic example of PD in sports.

If neither of the two competitive athletes takes drugs, none of them benefits.

If only one of them takes the drugs, then that athlete gains a significant advantage over the other one.

However, if both of them take it, their benefits even out and due to the dangers of drugs usage, they are worse off comparing to the situation when neither of them took doping.

EconomicsOne of the best examples of PD in economics is price setting in the free market environment.

Assuming that there are only two competitors on the particular market if none of them reduces the price, neither of them gains any additional benefits.

However if one of the competitors decides to lower the price, it may achieve an economic advantage over the other entity.

In the aftermath of such a situation, the competitor with the higher price usually reduces its price to regain its market share and align with its competitor.

Ultimately both companies end up worse off with lowered prices.

Even though its usually illegal, some of the companies engage in price collusion and set the price at a given level to prevent the above situation.

PoliticsA good example of PD in politics is an arms race.

When two hostile countries don’t arm, neither of them benefits.

However if one of them arms, it gains a significant advantage over the other one.

But when both of them arms, neither of them benefits and they both are worse off because of the arming costs.

Such a scenario was visible during the Cold War between the Soviet Union and the United States.

Both of the countries poured enormous resources on military and they were aware that ‘the other side’ is doing the same so they didn’t engage in the military confrontation as they were both aware that the chances of winning are roughly equal.

Ultimately, the Soviet Union couldn’t withstand such tremendous economic costs which were one of the reasons for its dissolution in 1991.

I hope that the above examples convinced you that Game Theory is worth demystifying.

With that being said, let’s proceed to the PD approaches.

StrategiesI encourage you to check out the corresponding Python Notebook that underlies this project.

gsurma/prison_escapePrisoner's Dilemma research environment.

Contribute to gsurma/prison_escape development by creating an account on…github.

comClassicCooperator – Always cooperatesDefector – Always defectsRandom – Moves randomlyTitForTat – Initially cooperates, then mimics opponent’s movesGrumpy – Defects after a certain level of grumpiness which increases when the opponent defects and decreases when the opponent cooperatesAlternator – Alternates between cooperating and defectingMachine LearningAll Neural Network based strategies have the following features.

EvolvedANN – Neural Network with a hidden layer of size 10EvolvedANN5 – Neural Network with a hidden layer of size 5EvolvedANN5Noise05 – Neural Network with a hidden layer of size 5, trained with noise=0.

05TournamentHaving defined our strategies, we can create a tournament and find out which one performs best.

This is exactly what happened in 1980 when the professor of political science at the University of Michigan Robert Axelrod organized such tournament.

He invited a number of well-known game theorists to submit strategies to be run by computers.

In the tournament, programs played games against each other.

Each strategy specified whether to cooperate or defect based on the previous moves of both the strategy and its opponent.

Let’s create such a tournament on our own!Depending on the number of players and games, it may take a while to compute.

Then we can visualize and analyze the results.

ResultsAbove box-plot shows the results of our tournament.

EvolvedANNNoise05 appeared to be the most successful one and Random ended up with the worst score.

All EvolvedANN strategies and TitForThat have very similar high scores.

It’s important to point out that TitForTat appeared to be the most successful strategy among the classic ones, which is exactly what happened during the original Axelrod’s tournament in 1980.

What’s more, if we investigate the moves of the EvolvedANN strategies, we’ll realize that the behavior they are presenting is very similar to the TitForThat.

It implies that Neural Networks simply learned the TitForThat strategy which means that replaying opponent’s moves is a relatively good strategy in most situations.

It’s consistent with the Axelrod Tournament’s conclusions that TitForThat usually works the best but we cannot conclude that it’s ‘the best strategy’ as it heavily relies on the opponent moves and pairing it with for example Random wouldn’t yield satisfactory results.

Realizing that there are no silver bullets in PD, we should study specific matchups between the strategies.

How to interpret this chart?The more yellow the color, the better the matchup for the ‘row’ strategy.

Let’s look at the classic example of Defector (row) vs Cooperator (column).

The color for this matchup is purely yellow, because Defector always wins with Cooperator.

There is no single best strategy for the Prisoner’s Dilemma.

Each individual strategy will work best when matched against a “worse” strategy.

In order to win, a player must figure out his opponent’s strategy and then pick a strategy that is best suited for the situation.

What’s Next?With the Prison Escape project, we’ve showed how fascinating and fun to study Game Theory can be.

I encourage you to come up with your own strategies and compare them with already existing ones.

Meanwhile, stay tuned for Part 2 of the project where we are going to create a Reinforcement Learning Agent for the Prisoner’s Dilemma problem.

Don’t forget to check the project’s github page.

gsurma/prison_escapePrisoner's Dilemma research environment.

Contribute to gsurma/prison_escape development by creating an account on…github.