That means the more shots a team takes, the fewer points it is likely to earn.

While it seems to defy logic at first, in hindsight, every shot attempt that does not convert to a goal invariably hands possession back to the opposing team and gives them the upper hand, thus the negative correlation.

Sticking to a simple model, I decided then to use the full-time goal count for the home and away team as parameters.

Show me the math

Effectively speaking, the outcome of the match is based on the number of goals scored on either side.

Hence, we need to model the probability distribution of the goals scored.

One of the most common methods to do so is via the Poisson distribution.

The Poisson distribution measures the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant rate and independently of the time since the last event.

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

Poisson distribution for x occurrences of the event, where λ is the average rate and e is Euler’s number.

To understand why this model fits our case, we can consider a goal scored to be an event.

Then within the span of 90 minutes of play, each such event can occur any number of times independently.

To give an example, let’s try predicting the probability that a match between Arsenal and Leicester City ends with the scoreline 2–1.
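As a rough sketch of that calculation (not the original code): the snippet below treats the two teams' goal counts as independent Poisson variables, so the probability of a scoreline is the product of the two individual probabilities. Arsenal is assumed to be the home side, the rates 1.8 and 1.1 are made-up placeholders rather than values estimated from data, and `poisson_pmf` and `scoreline_probability` are illustrative names.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


def scoreline_probability(home_goals: int, away_goals: int,
                          lam_home: float, lam_away: float) -> float:
    """P(home scores home_goals and away scores away_goals), assuming the
    two goal counts are independent Poisson variables."""
    return poisson_pmf(home_goals, lam_home) * poisson_pmf(away_goals, lam_away)


# Hypothetical scoring rates, for illustration only (not estimated from data).
ARSENAL_HOME_RATE = 1.8
LEICESTER_AWAY_RATE = 1.1

p_2_1 = scoreline_probability(2, 1, ARSENAL_HOME_RATE, LEICESTER_AWAY_RATE)
print(f"P(Arsenal 2-1 Leicester) = {p_2_1:.3f}")
```

With real data, the two placeholder rates would be replaced by the λ values estimated for each team.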

What remains then is to figure out the constant rate (λ):

It can be intuitively seen that this parameter reflects the performance of a team, the better team having a higher rate of scoring goals on average.

Also, this rate would depend on both the attacking strength of the team and the defensive strength of the opponent.

Lastly, we also have to account for the home advantage, that is, take into consideration that a team generally plays better at its home ground.

Based on the discussion above, we can define the parameter λ as the average number of goals scored by a team at a particular venue, which can be computed using the past data.
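One way this average could be computed from past results is sketched below. It is only the simplest reading of the definition: goals scored per game, split by home and away, with no adjustment for the opponent's defensive strength. The match records and the name `average_goals_by_venue` are hypothetical.

```python
from collections import defaultdict

# Hypothetical past results: (home_team, away_team, home_goals, away_goals).
past_matches = [
    ("Arsenal", "Leicester", 3, 1),
    ("Leicester", "Arsenal", 0, 2),
    ("Arsenal", "Chelsea", 1, 1),
    ("Chelsea", "Leicester", 2, 0),
]


def average_goals_by_venue(matches):
    """Return {(team, venue): average goals scored}, venue is "home" or "away"."""
    totals = defaultdict(int)
    games = defaultdict(int)
    for home, away, home_goals, away_goals in matches:
        totals[(home, "home")] += home_goals
        games[(home, "home")] += 1
        totals[(away, "away")] += away_goals
        games[(away, "away")] += 1
    return {key: totals[key] / games[key] for key in games}


rates = average_goals_by_venue(past_matches)
print(rates[("Arsenal", "home")])    # Arsenal's average goals per home game
print(rates[("Leicester", "away")])  # Leicester's average goals per away game
```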

Building the model

Let’s build some statistics then:

Using the above stats, we can now formulate the λ parameter as follows:

Simulating the matches

As discussed before, a match between two teams can end in 3 possible outcomes: a home team win (H), an away team win (A) or a tie (T).

Let the home team score X goals and away team score Y goals.

Then the home team wins if X > Y, the away team wins if X < Y, and the match is tied if X = Y.

We have already seen how to calculate the probability that the match ends with the scoreline X-Y.

Also, we can put a practical upper limit on the number of goals scored by a team at, say, 10.
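To see why a cap of 10 loses very little, one can build the grid of scoreline probabilities and check how much of the total probability it covers. This is a sketch with hypothetical rates (1.8 and 1.1), and `score_grid` is an illustrative name.

```python
import math

MAX_GOALS = 10  # practical cap on goals per team


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


def score_grid(lam_home, lam_away, max_goals=MAX_GOALS):
    """grid[x][y] = P(home scores x and away scores y), each from 0 to max_goals."""
    return [
        [poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
         for y in range(max_goals + 1)]
        for x in range(max_goals + 1)
    ]


grid = score_grid(1.8, 1.1)  # hypothetical rates, for illustration only
covered = sum(sum(row) for row in grid)
print(f"Probability covered by scores up to {MAX_GOALS}: {covered:.6f}")
```

For typical football scoring rates the covered probability is very close to 1, so the scorelines left out barely matter.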

Finally, since the score lines are mutually exclusive, their probabilities can simply be added together:

Thus, we can simulate a match between Home (H) and Away (A) teams and predict the points scored by the teams:

Putting it all together

To predict the final standings then, we simply simulate all the league matches using the model and add up the predicted point scores to build the points table.
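The following is a minimal sketch of that pipeline, not the original code. It makes two assumptions: the goal counts of the two teams are independent, and the predicted points for a match are the expected points (3 for a win, 1 for a tie, weighted by the outcome probabilities). Awarding points by the single most likely outcome would be another option. The rates, fixtures and function names are hypothetical.

```python
import math
from collections import defaultdict


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


def outcome_probabilities(lam_home, lam_away, max_goals=10):
    """Return (P(home win), P(tie), P(away win)) by summing over scorelines."""
    p_home = p_tie = p_away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
            if x > y:
                p_home += p
            elif x == y:
                p_tie += p
            else:
                p_away += p
    return p_home, p_tie, p_away


def expected_points(lam_home, lam_away):
    """Expected points for (home, away): 3 per win, 1 per tie."""
    p_home, p_tie, p_away = outcome_probabilities(lam_home, lam_away)
    return 3 * p_home + p_tie, 3 * p_away + p_tie


def predict_table(fixtures, rates):
    """Add up expected points over all fixtures and sort, best team first."""
    table = defaultdict(float)
    for home, away in fixtures:
        home_pts, away_pts = expected_points(rates[(home, "home")],
                                             rates[(away, "away")])
        table[home] += home_pts
        table[away] += away_pts
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scoring rates and fixtures, for illustration only.
rates = {
    ("Arsenal", "home"): 1.8, ("Arsenal", "away"): 1.3,
    ("Leicester", "home"): 1.4, ("Leicester", "away"): 1.1,
}
fixtures = [("Arsenal", "Leicester"), ("Leicester", "Arsenal")]

standings = predict_table(fixtures, rates)
for team, points in standings:
    print(f"{team}: {points:.2f}")
```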

The final result obtained:

So it seems that Liverpool and Man City will have the top finish, with Chelsea jumping ahead of Tottenham.

Man United is predicted to finish at 5th place with Arsenal close behind.

The results seem to agree with the general public opinion then: let’s just bring Fergie back (please).

Find the complete code here.

Conclusion

As always, there is plenty of room for improvement.

Some ideas to try:

- Considering time as a factor: the form of a team can play an important role, and time-weighted averages can be used to give more importance to recent matches
- Checking whether adding manager rankings at the time as a parameter can improve the model’s accuracy
- Improving the model’s underestimation of draws: real-world draws happen more often than the model’s average estimate of ties

Despite the shortcomings, the model is a good starting point with decent accuracy.

And the exercise was fun. After all, it got me first place in the event :D