Probabilistic Graphical Models: Bayesian Networks

Consider a decision such as choosing a restaurant, which depends on several features: the venue, cuisine, distance from home, pricing, etc.

In general, we could write a custom program to answer our query (a nested if-else), but that would not be robust.

If it encounters an additional feature, we might have to rewrite the model, which is hardly a feasible solution.

To overcome the aforementioned complication, we will use a different approach: Declarative Representation.

In this paradigm, we construct a model based on the task we would like to reason about.

The model encodes the knowledge of how the system works.

The key feature of this methodology is the separation of knowledge and reasoning.

Decision-making in real-world applications comes with a level of uncertainty, which translates into uncertainty in the model we build.

That's where probability comes in.

The math of probability theory provides us with a framework for considering multiple outcomes and their likelihoods.

Concepts of Probability

Axioms of Probability:

1. For any event A, the probability of occurrence of the event is always greater than or equal to zero.

Figure-1: Probability of an event A

2. If there are disjoint events in a sample space, then the probability of the union of all events is the sum of the individual probabilities.

Figure-2: Union of all disjoint events

3. An event involving the universal set has a probability of 1.
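As a minimal sketch, we can check these three axioms on a toy discrete distribution in Python (the sample space and probabilities below are illustrative, not from the figures):

```python
# A toy sample space for a fair die roll; the probabilities are illustrative.
probabilities = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

# Axiom 1: the probability of any event is greater than or equal to zero.
assert all(p >= 0 for p in probabilities.values())

# Axiom 2: for disjoint events, the probability of the union is the sum
# of the individual probabilities, e.g. P({1} or {2}) = P({1}) + P({2}).
p_one_or_two = probabilities[1] + probabilities[2]

# Axiom 3: the event covering the whole sample space has probability 1.
total = sum(probabilities.values())
print(p_one_or_two, total)
```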

Random Variable: A random variable is a function which maps each outcome in a sample space to a value.

Figure-3: G, H and A are random variables mapping outcomes to values.
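To make the definition concrete, here is a small sketch (the sample space is a hypothetical example, not one of the variables in the figure): a random variable is just a function from outcomes to values.

```python
# Sample space: the ordered outcomes of two coin flips.
sample_space = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

# A random variable X maps each outcome to a value: here, the number of heads.
def X(outcome):
    return sum(1 for flip in outcome if flip == "H")

values = [X(o) for o in sample_space]
print(values)  # [2, 1, 1, 0]
```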

Marginal Distribution

After defining our random variable, we can consider the distribution of events which can be described using it.

This distribution is often referred to as Marginal Distribution over a random variable (let's say X).

We denote it by P(X).

Figure-4: Marginal Distribution of a random variable, G

Joint Distribution

In many situations, we want to involve several random variables.

Each random variable corresponds to a certain attribute of an event.

In real-world problems, we are interested in the probability of multiple events occurring together, which is represented by the Joint Distribution.

In general, if we have a set of random variables {X1, X2, X3, …, Xn}, then the joint distribution is denoted by P(X1, X2, X3, …, Xn).

Figure-5: Joint Distribution of two random variables, G and I.

Joint Distribution is fundamental, as it lets us query information about the data in terms of marginal or conditional distributions.

In other words, we can derive the marginal and conditional distribution from it.
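As a small sketch of this derivation, we can marginalize a joint table by summing out one variable; the joint distribution below over Grade and Intelligence uses made-up numbers, not the values from the figures:

```python
# A hypothetical joint distribution P(G, I) over Grade (g0/g1) and
# Intelligence (i0/i1); the numbers are illustrative.
joint = {
    ("g0", "i0"): 0.3, ("g0", "i1"): 0.1,
    ("g1", "i0"): 0.2, ("g1", "i1"): 0.4,
}

# Marginalizing out I recovers P(G): sum the joint over all values of I.
p_g = {}
for (g, i), p in joint.items():
    p_g[g] = p_g.get(g, 0.0) + p

print(p_g)  # the marginal P(G); its values sum to 1
```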

Conditional Probability

This type of probability distribution involves prior knowledge of a random variable, which affects the probability of occurrence of the target variable.

It is usually denoted by P(X|Y).

This can be interpreted as ‘Probability of X given Y has already occurred’.

It can be factorized in terms of Marginal and Joint Distributions.
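As a numeric sketch of this factorization, P(X|Y) = P(X, Y) / P(Y), with a hypothetical joint table (the numbers are illustrative):

```python
# A hypothetical joint distribution P(X, Y) over two binary variables.
joint_xy = {
    ("x0", "y0"): 0.2, ("x0", "y1"): 0.2,
    ("x1", "y0"): 0.1, ("x1", "y1"): 0.5,
}

# Marginal P(Y = y1), obtained by summing out X.
p_y1 = sum(p for (x, y), p in joint_xy.items() if y == "y1")

# Conditional via the factorization: P(X = x1 | Y = y1) = P(x1, y1) / P(y1).
p_x1_given_y1 = joint_xy[("x1", "y1")] / p_y1
print(p_x1_given_y1)  # 0.5 / 0.7, roughly 0.714
```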

Figure-6: Factorization of Conditional Probability

Conditional Independence

In general, P(X|Y) is not equal to P(X): learning that Y is true changes our probability over X.

However, in some cases equality holds, which means that learning Y does not change our probability of X; we then say X and Y are independent.

Conditional independence is a more common situation, in which two events are independent given an additional event.

We say an event X is conditionally independent of event Y given an event Z when P(X | Y, Z) = P(X | Z).

Shortcomings of Joint Probability

Figure-7: Joint Distribution of n random variables

The problem with using the Joint Distribution for inference is that it is too complex to handle, and we have to use the Chain Rule with all dependencies to parameterize it.

Figure-8: Chain Rule expansion of Joint Distribution of Fig. 7

Assuming that each random variable takes a binary value, the joint distribution needs 2^n − 1 values, which is computationally expensive and, from a statistical point of view, requires huge amounts of data to learn the parameters.
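A quick sketch of how fast this parameter count grows:

```python
# Parameters needed for a full joint distribution over n binary variables:
# 2^n outcomes, minus one because the probabilities must sum to 1.
def full_joint_params(n):
    return 2 ** n - 1

for n in (5, 10, 20):
    print(n, full_joint_params(n))  # grows exponentially in n
```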

There is a natural way to split up the joint distribution, i.e. to generate a marginal and a conditional distribution from it, which makes it more interpretable than the parent table. This process is known as Conditional Parameterization.

(But it does not reduce the number of parameters!)

In real-world scenarios, we make certain assumptions about the random variables involved in the joint distribution: whether or not they are dependent on each other.

In that case, the Chain Rule simplifies as follows:

Figure-9: Chain Rule Simplification
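To see the saving the simplification buys, we can count parameters under independence assumptions. The network structure below is a hypothetical 5-node example, not one from the figures:

```python
# With independence assumptions, each binary node with k binary parents needs
# 2^k parameters (one P(node = 1 | parent configuration) per configuration).
# A hypothetical 5-node network; the parent lists are illustrative.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["B"], "E": ["C"]}

factorized_params = sum(2 ** len(ps) for ps in parents.values())
full_params = 2 ** len(parents) - 1
print(factorized_params, full_params)  # 10 vs 31
```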

In the upcoming section, we will discuss how to include independence among the features, and one of the ways to denote the Joint Distribution graphically.

Bayesian Networks

Until now, we saw that adding conditional independence to the distribution largely simplifies the chain rule notation, leading to fewer parameters to learn.

A Bayesian Network can be viewed as a data structure: it provides a factorization of the Joint Distribution.

Suppose we have n random variables, all of which are independent given another random variable C. The graphical representation is as follows:

Figure-10: Naive Bayes Model

In this case, we made a very naive assumption: that all random variables are independent of each other given C, which highly simplifies the chain rule notation used to represent the model.

This model is formally known as the Naive Bayes Model (which is used as one of the classification algorithms in the Machine Learning domain).
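A minimal Naive Bayes sketch, assuming binary word features and made-up probabilities (the classes and words are hypothetical): the posterior is proportional to the prior times the product of per-feature conditionals.

```python
# Naive Bayes: P(C | x) is proportional to P(C) * product_i P(x_i | C).
# All probabilities below are made-up illustrative numbers.
p_class = {"spam": 0.4, "ham": 0.6}
# P(word present | class), assumed conditionally independent given the class.
p_word_given_class = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.6},
}

def posterior(words):
    scores = {}
    for c, prior in p_class.items():
        score = prior
        for w in words:
            score *= p_word_given_class[c][w]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalize to sum to 1

print(posterior(["offer"]))  # "spam" gets the higher posterior here
```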

A Bayesian Network aids us in factorizing the joint distribution, which helps in decision making.

(We started off with the idea of decision making, remember?)

Conventions involved:

1. Nodes: Random Variables
2. Edges: Indicate Dependence

A graph can be seen in two ways: as a data structure which provides a skeleton for representing the Joint Distribution in a factorized way, or as a representation of conditional independence assumptions about the distribution.

Figure-11: Bayesian Network along with Local Probability Model

I have given an example of decision making in terms of whether a student will receive a Recommendation Letter (L), based on various dependencies.

Grade (G) is the parent node of Letter (L). We have assumed that SAT Score (S) depends solely on Intelligence (I).

Grade is dependent on the Difficulty (D) of the exam and the Intelligence of the student.
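This structure can be sketched directly as code: the factorized joint is P(D, I, G, S, L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G). The conditional probability values below are made up for illustration; only the structure comes from the example.

```python
from itertools import product

# Hypothetical CPDs for the binary student network (0/1 values).
p_D = {0: 0.6, 1: 0.4}                    # P(D = 1) = 0.4: exam difficulty
p_I = {0: 0.7, 1: 0.3}                    # intelligence
p_G = {(0, 0): 0.3, (0, 1): 0.9,          # P(G = 1 | D, I)
       (1, 0): 0.05, (1, 1): 0.5}
p_S = {0: 0.05, 1: 0.8}                   # P(S = 1 | I)
p_L = {0: 0.1, 1: 0.9}                    # P(L = 1 | G)

def joint(d, i, g, s, l):
    # The factorization the graph gives us: a product of local models.
    pg = p_G[(d, i)] if g == 1 else 1 - p_G[(d, i)]
    ps = p_S[i] if s == 1 else 1 - p_S[i]
    pl = p_L[g] if l == 1 else 1 - p_L[g]
    return p_D[d] * p_I[i] * pg * ps * pl

# Sanity check: the factorized joint sums to 1 over all 2^5 assignments.
total = sum(joint(*a) for a in product((0, 1), repeat=5))
print(total)
```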

Each node is a random variable and has a local probability model associated with it.

This graph gives us a natural factorization of the joint distribution.

Figure-12: Factorized Term of Joint Distribution

Now, one important point to note is that this model is built on our assumptions about how the world works.

Put differently, it encodes the standard way we reason when making such decisions.

But this might differ from person to person.

Some people might think Recommendation Letter (L) depends on SAT Score(S) too.

In that case, the model becomes:

Figure-13: Assumed Model

Rules of Bayesian Networks

Rule 1: A node is not independent of its parents.

Rule 2: A node is not independent of its parents even when we are given the values of other variables.

Rule 3: Given its parents, a node is independent of all other variables except its descendants (child nodes).

Closing Remarks

Bayesian Networks help us in decision making and simplify complex problems by encoding various independencies.

As we read, the Joint Distribution on its own is not capable of giving us an interpretable inference.

Furthermore, the broader family of probabilistic graphical models also includes undirected models such as Restricted Boltzmann Machines.

Kushal Vala, Junior Data Scientist at Datametica Solutions Pvt Ltd.