An Introduction to the Powerful Bayes’ Theorem for Data Science Professionals

Let E denote the event that a red ball is chosen, and let A, B, and C denote the events that the respective box is picked.

We are required to calculate the conditional probability P(A|E).

We have prior probabilities P(A) = P(B) = P(C) = 1/3, since all boxes have an equal probability of being picked.

P(E|A) = (number of red balls in box A) / (total number of balls in box A) = 2/5

Similarly, P(E|B) = 3/4 and P(E|C) = 1/5.

Then the evidence is:

P(E) = P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C) = (2/5)*(1/3) + (3/4)*(1/3) + (1/5)*(1/3) = 0.45

Therefore,

P(A|E) = P(E|A)*P(A) / P(E) = (2/5)*(1/3) / 0.45 ≈ 0.296
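Here’s a quick way to verify this arithmetic, as a minimal Python sketch using the box contents from the illustration above:

```python
# Posterior P(box A | red ball) for the three-box illustration above.
priors = {"A": 1/3, "B": 1/3, "C": 1/3}        # P(A), P(B), P(C)
likelihoods = {"A": 2/5, "B": 3/4, "C": 1/5}   # P(E|box): fraction of red balls per box

# Evidence P(E) = sum over boxes of P(E|box) * P(box)
evidence = sum(likelihoods[b] * priors[b] for b in priors)

posterior_A = likelihoods["A"] * priors["A"] / evidence
print(f"P(E)   = {evidence:.2f}")     # 0.45
print(f"P(A|E) = {posterior_A:.3f}")  # ~0.296
```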

Applications of Bayes’ Theorem

There are plenty of applications of Bayes’ Theorem in the real world.

Don’t worry if you do not understand all the mathematics involved right away.

Just getting a sense of how it works is good enough to start off.

Bayesian Decision Theory is a statistical approach to the problem of pattern classification.

Under this theory, it is assumed that the underlying probability distribution for the categories is known.

Thus, we obtain an ideal Bayes Classifier against which all other classifiers are judged for performance.

We will discuss the three main applications of Bayes’ Theorem:

1. Naive Bayes’ Classifiers
2. Discriminant Functions and Decision Surfaces
3. Bayesian Parameter Estimation

Let’s look at each application in detail.

Naive Bayes’ Classifiers

This is probably the most famous application of Bayes’ Theorem, and arguably the most powerful.

You’ll come across the Naive Bayes algorithm a lot in machine learning.

Naive Bayes’ Classifiers are a set of probabilistic classifiers based on the Bayes’ Theorem.

The underlying assumption of these classifiers is that all the features used for classification are independent of each other.

That’s where the name ‘naive’ comes from, since it is rare that we obtain a set of totally independent features.

These classifiers work exactly the way we solved the illustration above, just with many more features that are assumed to be independent of each other.

Here, we need to find the probability P(Y|X), where X is an n-dimensional random variable whose component random variables X_1, X_2, …, X_n are independent of each other:

P(Y|X) ∝ P(Y) * P(X_1|Y) * P(X_2|Y) * … * P(X_n|Y)

Finally, the Y for which P(Y|X) is maximum is our predicted class.
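As a tiny illustration of that factorization, here is a hand-computed sketch; the two binary features and all probability tables below are invented purely for the example:

```python
# Naive Bayes by hand: P(Y|X) is proportional to P(Y) * P(X_1|Y) * P(X_2|Y).
# The classes and probability tables below are made-up illustrative values.
priors = {"class_1": 0.4, "class_0": 0.6}   # P(Y)
p_x1 = {"class_1": 0.7, "class_0": 0.1}     # P(X_1 = 1 | Y)
p_x2 = {"class_1": 0.5, "class_0": 0.3}     # P(X_2 = 1 | Y)

# Observed X = (X_1 = 1, X_2 = 1): multiply the prior by each feature likelihood.
scores = {y: priors[y] * p_x1[y] * p_x2[y] for y in priors}
total = sum(scores.values())                # the 'evidence', used only for scaling
posteriors = {y: s / total for y, s in scores.items()}

print(posteriors)
print("Predicted class:", max(posteriors, key=posteriors.get))
```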

Let’s talk about the famous Titanic dataset.

We have features such as passenger class, sex, age and fare. Remember the problem statement? We need to calculate the probability of survival conditional on all the other variables available in the dataset. Then, based on this probability, we predict whether the person survived or not, i.e., class 1 or 0.
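If you’d like a code preview before diving in, here is a minimal sketch using scikit-learn’s GaussianNB on a few Titanic-style columns; the file name, feature choice and train/test split are assumptions for illustration, not the exact setup from the referenced article:

```python
# A minimal Naive Bayes sketch on Titanic-style data (file name and columns assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("train.csv")                                   # hypothetical Titanic CSV
df = df[["Survived", "Pclass", "Sex", "Age", "Fare"]].dropna()
df["Sex"] = (df["Sex"] == "female").astype(int)                 # encode sex as 0/1

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GaussianNB().fit(X_train, y_train)       # estimates P(X_i|Y) and P(Y) from the data
print("Accuracy:", model.score(X_test, y_test))  # predicts the Y that maximizes P(Y|X)
```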

This is where I pass the buck to you.

Refer to our popular article to learn about these Naive Bayes classifiers in detail along with the relevant code in both Python and R and try solving the Titanic dataset yourself.

If you get stuck anywhere, you can find me in the comments section below!

Discriminant Functions and Decision Surfaces

The name is pretty self-explanatory.

A discriminant function is used to “discriminate” its argument into its relevant class.

Want an example? Let’s take one! You might have come across Support Vector Machines (SVM) if you have explored classification problems in machine learning.

The SVM algorithm classifies the vectors by finding the differentiating hyperplane which best segregates our training examples.

This hyperplane can be linear or non-linear. These hyperplanes are our decision surfaces, and the equation of the hyperplane is our discriminant function.

Make sure you check out our article on Support Vector Machine.

It is thorough and includes code both in R and Python.

Alright – now let’s discuss the topic formally.

Let w_1, w_2, …, w_c denote the c classes that our data vector X can be classified into. Then the decision rule becomes:

Decide w_i if g_i(X) > g_j(X) for all j ≠ i

These functions g_i(X), i = 1, 2, …, c, are known as discriminant functions. They separate the vector space into c decision regions – R_1, R_2, …, R_c – corresponding to each of the c classes.

The boundaries of these regions are called decision surfaces or boundaries.

If g_i(X) = g_j(X) is the largest value among the c discriminant functions, then the classification of vector X into class w_i or w_j is ambiguous.

So, X is said to lie on a decision boundary or surface.

Check out the below figure:

Source: Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern Classification. John Wiley & Sons.

It’s a pretty cool concept, right? The 2-dimensional vector space is divided into two decision regions, R_1 and R_2, separated by two hyperbolas.

Note that any function f(g_i(X)) can also be used as a discriminant function if f(.) is a monotonically increasing function. The logarithm is a popular choice for f(.).

Now, consider a two-category case with classes w_1 and w_2. The ‘minimum error-rate classification’ decision rule becomes:

Decide w_1 if P(w_1|X) > P(w_2|X); otherwise decide w_2, with P(error|X) = min{P(w_1|X), P(w_2|X)}

P(w_i|X) is a conditional probability and can be calculated using Bayes’ Theorem. So, we can restate the decision rule in terms of likelihoods and priors:

Decide w_1 if P(X|w_1)*P(w_1) > P(X|w_2)*P(w_2); otherwise decide w_2

Notice that the ‘evidence’ in the denominator is merely used for scaling, and hence we can eliminate it from the decision rule. Thus, an obvious choice for the discriminant functions is:

g_i(X) = P(X|w_i)*P(w_i)   or   g_i(X) = ln(P(X|w_i)) + ln(P(w_i))

The two-category case can generally be handled with a single discriminant function:

g(X) = g_1(X) – g_2(X) = ln(P(X|w_1) / P(X|w_2)) + ln(P(w_1) / P(w_2))

Decide w_1 if g(X) > 0; decide w_2 if g(X) < 0; if g(X) = 0, X lies on the decision surface.

In the figure above, g(X) is a linear function of the 2-dimensional vector X.
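To make this concrete, here is a small sketch (not from the original article) of the single discriminant function g(X), with assumed univariate Gaussian class-conditional densities and made-up priors:

```python
# Two-category discriminant: g(x) = ln P(x|w1)/P(x|w2) + ln P(w1)/P(w2)
# The class-conditional densities and priors below are made-up assumptions.
import numpy as np
from scipy.stats import norm

prior_w1, prior_w2 = 0.6, 0.4
p_x_given_w1 = norm(loc=0.0, scale=1.0)   # P(x|w1) ~ N(0, 1)
p_x_given_w2 = norm(loc=2.0, scale=1.0)   # P(x|w2) ~ N(2, 1)

def g(x):
    """Single discriminant function for the two-category case."""
    return (np.log(p_x_given_w1.pdf(x) / p_x_given_w2.pdf(x))
            + np.log(prior_w1 / prior_w2))

for x in [-1.0, 1.0, 3.0]:
    decision = "w1" if g(x) > 0 else ("w2" if g(x) < 0 else "on the decision surface")
    print(f"x = {x:+.1f}   g(x) = {g(x):+.2f}   ->  decide {decision}")
```

With equal class variances, this g(x) is linear in x, which matches the linear boundary just described.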

However, more complicated decision boundaries are also possible.

Bayesian Parameter Estimation

This is the third application of Bayes’ Theorem.

We’ll use univariate Gaussian Distribution and a bit of mathematics to understand this.

Don’t worry if it looks complicated – I’ve broken it down into easy-to-understand terms.

You must have heard about the ultra-popular IMDb Top 250.

This is a list of 250 top-rated movies of all time.

The Shawshank Redemption is #1 on the list with a rating of 9.2/10.

How do you think these ratings are calculated? The original formula used by IMDb claimed to use a “true Bayesian estimate”.

The formula has since changed and is not publicly disclosed.

Nonetheless, here is the previous formula (Source: Wikipedia):

W = (R*v + C*m) / (v + m)

The final rating W is a weighted average of R (the straight average rating of the movie) and C (the mean rating across all films), with weights v (the number of votes for the movie) and m (the minimum votes required), respectively.
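As a quick numeric sketch of how the weighting behaves (all numbers below are made up, not real IMDb data):

```python
# Old IMDb-style weighted rating: W = (R*v + C*m) / (v + m).
# All values below are invented for illustration.
def weighted_rating(R, v, C, m):
    return (R * v + C * m) / (v + m)

C = 7.0       # mean rating across all films
m = 25000     # minimum number of votes required
R = 9.2       # straight average rating of this movie

for v in [100, 25_000, 2_000_000]:
    print(f"votes = {v:>9,}  ->  W = {weighted_rating(R, v, C, m):.2f}")
# Few votes: W stays near C; many votes: W approaches R.
```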

m is a prior estimate.

As the number of votes v increases and surpasses m, the minimum votes required, W approaches R, the straight average for the movie. As v gets closer to zero (fewer votes are cast for the movie), W approaches C, the mean rating for all films.

We generally do not have complete information about the probabilistic nature of a classification problem.

Instead, we have a vague idea of the situation along with a number of training examples.

We then use this information to design a classifier.

The basic idea is that the underlying probability distribution has a known form.

We can, therefore, describe it using a parameter vector Θ.

For example, a Gaussian distribution can be described by Θ = [μ, σ²].

Then, we need to estimate this vector.

This is generally achieved in two ways:

Maximum Likelihood Estimation (MLE): The assumption is that the underlying probability distribution has an unknown but fixed parameter vector Θ. The best estimate maximizes the likelihood function:

p(D|θ) = p(x_1|θ) * p(x_2|θ) * … * p(x_n|θ) = likelihood of θ with respect to the set of samples D

I recommend reading this article to get an intuitive and in-depth explanation of maximum likelihood estimation along with a case study in R.
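As a tiny simulated illustration (assuming Gaussian samples with a known standard deviation), the likelihood is maximized essentially at the sample mean:

```python
# MLE sketch: for Gaussian data with known sigma, the log-likelihood of mu
# peaks at the sample mean. The data here are simulated, not real.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=200)   # true mu = 5, sigma = 2

def log_likelihood(mu, data, sigma=2.0):
    # log p(D|mu) = sum of the log densities of the individual samples
    return np.sum(-0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

candidates = np.linspace(3, 7, 401)
mle = candidates[np.argmax([log_likelihood(m, samples) for m in candidates])]
print("Grid-search MLE of mu:", round(mle, 2))
print("Sample mean          :", round(float(samples.mean()), 2))  # essentially the same
```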

Bayesian Parameter Estimation: In Bayesian learning, Θ is assumed to be a random variable, as opposed to an “unknown but fixed” value in MLE.

We use training examples to convert a distribution on this variable into a posterior probability density.

We can write it informally as:

P(Θ|data) = P(data|Θ) * P(Θ) / P(data), where data represents the set of training examples

Key points you should be aware of:

- We assume that the probability density p(X) (the rule from which the samples are drawn) is unknown but has a known parametric form. Thus, it can be said that p(X|Θ) is completely known.
- Any prior information that we might have about Θ is contained in a known prior probability density p(Θ).
- We use training samples to find the posterior density p(Θ|data). This should sharply peak at the true value of Θ.

Demonstration of Bayesian Parameter Estimation – Univariate Gaussian Case

Let me demonstrate how Bayesian Parameter Estimation works.

This will provide further clarity on the theory we just covered.

First, let p(X) be normally distributed with a mean μ and variance σ², where μ is the only unknown parameter we wish to estimate.

Then:

p(X|Θ) = p(X|μ) ~ N(μ, σ²)

We’ll ease up on the math here.

So, let the prior probability density p(μ) also be normally distributed with mean µ' and variance σ'² (which are both known).

Here, p(Θ|data) = p(μ|data) is called the reproducing density, and p(Θ) = p(μ) is called the conjugate prior.

Since the argument of exp() in the resulting posterior is quadratic in μ, it represents a normal density.

Thus, if we have n training examples with sample mean x̄_n, we can say that p(μ|data) is normally distributed with mean μ_n and variance σ_n². For a Gaussian prior and known σ², these are:

μ_n = (n*σ'²*x̄_n + σ²*µ') / (n*σ'² + σ²)
σ_n² = (σ'²*σ²) / (n*σ'² + σ²)

My observations:

- As n increases, σ_n² decreases. Hence, the uncertainty in our estimate decreases.
- Since uncertainty decreases, the density curve becomes sharply peaked at its mean μ_n.
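Here is a minimal sketch of this posterior update (simulated data; the known variance and the prior parameters are chosen arbitrarily), showing σ_n² shrinking and μ_n settling near the true mean as n grows:

```python
# Bayesian estimate of a Gaussian mean with known variance sigma^2
# and a Gaussian prior N(mu_prior, sigma2_prior). All numbers are made up.
import numpy as np

sigma2 = 4.0                          # known data variance
mu_prior, sigma2_prior = 0.0, 10.0    # prior mean and variance for mu
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=1000)  # true mean = 3

for n in [1, 10, 100, 1000]:
    xbar = data[:n].mean()
    mu_n = (n * sigma2_prior * xbar + sigma2 * mu_prior) / (n * sigma2_prior + sigma2)
    sigma2_n = (sigma2_prior * sigma2) / (n * sigma2_prior + sigma2)
    print(f"n = {n:>4}   mu_n = {mu_n:5.2f}   sigma_n^2 = {sigma2_n:.4f}")
# As n increases, sigma_n^2 decreases and the posterior peaks sharply at mu_n.
```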

Here’s a superb and practical implementation of Bayesian Statistics and estimation: Bayesian Statistics explained to Beginners in Simple English.

End Notes

The beauty and power of Bayes’ Theorem never cease to amaze me.

A simple concept, given by a Presbyterian minister who died more than 250 years ago, has its use in some of the most prominent machine learning techniques today.

You might have a few questions on the various mathematical aspects we explored here.

Feel free to connect with me in the comments section below and let me know your feedback on the article as well.
