By plugging this into the chain rule, we find that in this scenario we get P(x, y) = P(x|y) ⋅ P(y) = P(x) ⋅ P(y).

This leads us directly to our definition of independence.

Two variables x and y are said to be independent if P(x, y) = P(x) ⋅ P(y).
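As a rough illustration of this definition, here's a small Python sketch that checks whether a joint distribution factors into the product of its marginals. The helper names (`marginals`, `is_independent`) and the two coin examples are made up for this illustration.

```python
from itertools import product
from math import isclose

def marginals(joint):
    """Return the marginals P(x) and P(y) from a joint table {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def is_independent(joint, tol=1e-9):
    """True if P(x, y) equals P(x) * P(y) for every pair of values."""
    px, py = marginals(joint)
    return all(
        isclose(joint.get((x, y), 0.0), px[x] * py[y], abs_tol=tol)
        for x, y in product(px, py)
    )

# Hypothetical example: two fair coin flips that don't influence each other.
two_coins = {(a, b): 0.25 for a in "HT" for b in "HT"}

# Hypothetical example: the second "flip" always copies the first.
copied_coin = {("H", "H"): 0.5, ("T", "T"): 0.5}

print(is_independent(two_coins))    # True
print(is_independent(copied_coin))  # False
```

The copied coin fails the check because P(H, H) = 0.5, while P(H) ⋅ P(H) = 0.25.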

A similar concept is that of conditional independence.

Two variables x and y are called conditionally independent given another variable z if P(x, y|z) = P(x|z) ⋅ P(y|z).

Let’s do an example to see what this is all about.

Let’s assume x is a random variable indicating whether I brought an umbrella to work, and y is a random variable indicating whether my grass is wet.

It seems pretty obvious that these events are not independent.

If I brought an umbrella it probably means it’s raining, and if it’s raining my grass is wet.

Now let’s assume we observe variable z, which represents that it is in fact raining outside.

Now, regardless of whether I brought an umbrella to work, you know that my grass is wet.

So the condition of rain has made my umbrella independent of my grass being wet!

Independence and conditional independence become very important when we need to represent very large joint distributions.

Independence lets us factor our distribution into simpler terms, enabling efficient memory usage and faster calculations.
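To make the umbrella example a bit more tangible, here's a sketch in Python. All of the probabilities are invented for illustration, and the variable names are assumptions of this sketch. It builds the joint from the factored form P(z) ⋅ P(x|z) ⋅ P(y|z), then checks independence with and without knowledge of the rain.

```python
from itertools import product
from math import isclose

# Made-up numbers, for illustration only.
p_rain = {1: 0.3, 0: 0.7}                  # P(z)
p_umbrella_given_rain = {1: 0.9, 0: 0.1}   # P(x = 1 | z)
p_wet_given_rain = {1: 0.95, 0: 0.05}      # P(y = 1 | z)

def binary_prob(p_one, value):
    """P(value) for a binary variable whose probability of being 1 is p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Build the joint as P(x, y, z) = P(z) * P(x|z) * P(y|z).
joint = {
    (x, y, z): p_rain[z]
    * binary_prob(p_umbrella_given_rain[z], x)
    * binary_prob(p_wet_given_rain[z], y)
    for x, y, z in product((0, 1), repeat=3)
}

def prob(x=None, y=None, z=None):
    """Sum the joint over every entry that matches the values that were given."""
    return sum(
        p for (xi, yi, zi), p in joint.items()
        if x in (None, xi) and y in (None, yi) and z in (None, zi)
    )

# Without knowing z: is P(x, y) == P(x) * P(y) for every x, y?
marginally_independent = all(
    isclose(prob(x=x, y=y), prob(x=x) * prob(y=y))
    for x, y in product((0, 1), repeat=2)
)

# Given rain (z = 1): is P(x, y | z) == P(x | z) * P(y | z) for every x, y?
pz = prob(z=1)
conditionally_independent = all(
    isclose(prob(x=x, y=y, z=1) / pz, (prob(x=x, z=1) / pz) * (prob(y=y, z=1) / pz))
    for x, y in product((0, 1), repeat=2)
)

print(marginally_independent)      # False
print(conditionally_independent)   # True
```

Notice that the joint over three binary variables was built from just five numbers, whereas a full table would need seven free entries. That gap grows quickly as we add variables, which is the kind of saving factoring buys us.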

We’ll see this concretely in a future post on Bayesian Networks!

Functions of Random Variables

It’s often useful to create functions which take random variables as input.

Let’s consider a trip to the casino.

It costs $2 to play my favorite game “guess a number between 1 and 10”.

If you guess correctly you win $10.

If you guess incorrectly you win nothing.

Let x be a random variable indicating whether you guessed correctly.

Then we can write a function h(x) = {$8 if x = 1, and -$2 if x = 0}.

In other words, if you guess right you get $10 minus the $2 you paid to play, otherwise you just lose your $2.

You might be interested in knowing in advance what the expected outcome will be.

Expectation:

The expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is the average value of h(x) weighted by P(x).

For a discrete x, we write the expected value of h(x) with respect to P(x) as:

\[ \mathbb{E}_{x \sim P}[h(x)] = \sum_x P(x) \, h(x) \]

If x had been continuous, we would replace the summation with an integral (I’ll bet you’re seeing a pattern by now).

So the expectation acts as a weighted average over h(x), where the weights are the probabilities of each x.
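Here's a minimal Python sketch of that weighted average, applied to the guessing game's payoff h(x) from above. The function name `expectation` is just a label for this sketch, and the 1/10 win probability assumes each of the ten numbers is equally likely to be correct.

```python
def expectation(pmf, h):
    """Expected value of h(x) under a discrete distribution pmf = {x: P(x)}."""
    return sum(p * h(x) for x, p in pmf.items())

# x = 1 means a correct guess; assume each of the 10 numbers is equally likely.
p_guess = {1: 1 / 10, 0: 9 / 10}

def h(x):
    """Net winnings in dollars: the $10 prize minus the $2 fee, or just -$2."""
    return 8 if x == 1 else -2

expected_winnings = expectation(p_guess, h)
print(round(expected_winnings, 2))
```

Swapping in a different `h` or a different set of probabilities lets you price any other discrete game the same way.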

What’s the expected value of playing the guessing game at the casino if we assume we have a 1/10 chance of guessing the correct number?

𝔼[h(x)] = P(winning) ⋅ h(winning) + P(losing) ⋅ h(losing) = (1/10) ⋅ $8 + (9/10) ⋅ (-$2) = $0.80 + (-$1.80) = -$1.00

So on average, we’ll lose $1 every time we play!

Another nice property of expectations is that they’re linear.

Let’s assume g is another function of x, and α and β are constants.

Then we have:

\[ \mathbb{E}[\alpha \, h(x) + \beta \, g(x)] = \alpha \, \mathbb{E}[h(x)] + \beta \, \mathbb{E}[g(x)] \]

Variance and Covariance:

We saw variance with respect to a Gaussian distribution when we were talking about continuous random variables.

In general, variance is a measure of how much random values vary from their mean.

Similarly, for functions of random variables, the variance is a measure of the variability of the function’s output from its expected value.

The variance of h(x) with respect to P(x) is:

\[ \mathrm{Var}(h(x)) = \mathbb{E}\left[ (h(x) - \mathbb{E}[h(x)])^2 \right] \]

Another important concept is covariance.

Covariance is a measure of how linearly related two random variables (or functions on random variables) are with each other.

The covariance between functions h(x) and g(y) is written as:

\[ \mathrm{Cov}(h(x), g(y)) = \mathbb{E}\left[ (h(x) - \mathbb{E}[h(x)]) \, (g(y) - \mathbb{E}[g(y)]) \right] \]

When the absolute value of covariance is high, the two functions tend to vary far from their means at the same time.

When the sign of the covariance is positive, the two functions map to higher values together.

If the sign is negative, one function maps to higher values, while the other function maps to lower values and vice versa.

The visualization at the beginning of this post shows samples from a joint Gaussian distribution with positive covariance between the variables.

You can see that as the first variable increases, so does the second.
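If you'd like to poke at variance and covariance numerically, here's a small Python sketch over a discrete joint distribution. It uses the standard definitions Var(h) = 𝔼[(h − 𝔼[h])²] and Cov(h, g) = 𝔼[(h − 𝔼[h]) ⋅ (g − 𝔼[g])]. The joint table and function names are invented for illustration.

```python
def expectation(joint, f):
    """E[f(x, y)] under a discrete joint distribution {(x, y): P(x, y)}."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

def variance(joint, f):
    """Var(f) = E[(f - E[f])^2]."""
    mean = expectation(joint, f)
    return expectation(joint, lambda x, y: (f(x, y) - mean) ** 2)

def covariance(joint, f, g):
    """Cov(f, g) = E[(f - E[f]) * (g - E[g])]."""
    mean_f, mean_g = expectation(joint, f), expectation(joint, g)
    return expectation(joint, lambda x, y: (f(x, y) - mean_f) * (g(x, y) - mean_g))

# Made-up joint distribution in which x and y tend to move together.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def h(x, y):
    return x        # depends on x only

def g(x, y):
    return y        # depends on y only

def neg_g(x, y):
    return -y       # flipping g flips the sign of the covariance

print(variance(joint, h))
print(covariance(joint, h, g))      # positive: h and g rise together
print(covariance(joint, h, neg_g))  # negative: one rises as the other falls

# Linearity of expectation, checked with arbitrary constants alpha=2, beta=3.
lhs = expectation(joint, lambda x, y: 2 * h(x, y) + 3 * g(x, y))
rhs = 2 * expectation(joint, h) + 3 * expectation(joint, g)
print(abs(lhs - rhs) < 1e-12)
```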

Moments:

Note that we can calculate the expectation and variance for a random variable by replacing the function h(x) with x itself.

The expectation of a distribution is its mean, or first moment.

The variance of a distribution is its second central moment, that is, the second moment taken about the mean.

Higher order moments for probability distributions capture other characteristics like skewness and kurtosis.
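As a quick sketch of how these quantities relate, the Python below computes raw and central moments of a discrete distribution and then forms skewness and kurtosis as the standardized third and fourth central moments. The fair die and the helper names are assumptions made for illustration.

```python
def raw_moment(pmf, k):
    """k-th raw moment E[x^k] of a discrete distribution {x: P(x)}."""
    return sum(p * x ** k for x, p in pmf.items())

def central_moment(pmf, k):
    """k-th central moment E[(x - mean)^k]."""
    mean = raw_moment(pmf, 1)
    return sum(p * (x - mean) ** k for x, p in pmf.items())

# A fair six-sided die, used purely as an illustration.
die = {face: 1 / 6 for face in range(1, 7)}

mean = raw_moment(die, 1)                       # first moment
var = central_moment(die, 2)                    # second central moment
skewness = central_moment(die, 3) / var ** 1.5  # asymmetry around the mean
kurtosis = central_moment(die, 4) / var ** 2    # weight in the tails

print(mean, var, skewness, kurtosis)
```

A symmetric distribution like the die has a skewness of (numerically almost) zero.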

Important Distributions

We’ve covered most of the important aspects of probability theory.

These ideas act as building blocks for developing the underpinnings of the majority of statistics and machine learning.

In order to master probability theory and start bridging the gap toward statistics, one needs to become somewhat familiar with the more useful probability distributions.

The functional forms of probability distributions can be intimidating.

My advice is to not focus too much on that aspect and instead focus on what types of situations each distribution is good at modeling.

Some examples of model/purpose descriptions include:

Bernoulli: models the outcome of coin flips and other binary events

Binomial: models the number of successes in a series of Bernoulli trials (a series of coin flips, etc.)

Geometric: models how many flips are necessary before you get a success

Multinomial: a generalization of the Binomial to more than two outcomes (like a die roll)

Poisson: models the number of events that occur in a certain interval

For continuous distributions it’s also useful to know the shape.

For example, we saw that the Gaussian distribution is shaped like a bell, with most of its density close to the mean.

The Beta distribution can take on a wide range of shapes over the interval [0,1].

This makes the Beta distribution a good choice for modeling our beliefs about particular probabilities.

It’s also important to remember that these well formed distributions are more like templates than anything else.

The true distribution of your data is probably not so nice and may even be changing over time.
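For readers who do want a peek at the functional forms, here's a sketch of a few of the discrete distributions listed above, written as plain Python functions using their standard probability mass functions. The function names are my own labels for this sketch, and the geometric version counts the trial on which the first success happens.

```python
from math import comb, exp, factorial

def bernoulli_pmf(k, p):
    """P(k) for one binary trial with success probability p (k is 0 or 1)."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, rate):
    """Probability of k events in an interval that averages `rate` events."""
    return rate ** k * exp(-rate) / factorial(k)

print(binomial_pmf(5, 10, 0.5))  # exactly 5 heads in 10 fair flips
print(geometric_pmf(3, 0.5))     # first head arrives on the third flip
print(poisson_pmf(0, 2.0))       # no events when 2 are expected on average
```

Each function answers a "what situation does this model?" question from the list, which is usually the more useful thing to remember.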

Great, but what’s all this have to do with Machine Learning?

The goal of this post was to build up our language of probability so that we can frame machine learning in a probabilistic light.

I’ll cover specific machine learning algorithms and applications in future posts, but I’d like to describe a bit of what we’ve just enabled.

Supervised Learning:

In supervised machine learning, our goal is to learn from labeled data.

Data being labeled means that for some inputs X, we know the desired outputs Y.

Some possible tasks include:

Identify what’s in an image.

Predict the price of a stock given some features about the company.

Detect if a file is malicious.

Diagnose a patient with an illness.

How can probability help us in these scenarios?

We can learn a mapping from X to Y in various ways.

First, you could learn P(Y|X), that is to say, a probability distribution over possible values of Y given that you’ve observed a new sample X.

Machine learning algorithms that find this distribution are called discriminative.

Imagine I tell you that I saw an animal that had fur, a long tail, and was two inches tall.

Can you discriminate between possible animals and guess what it was?

Alternatively, we could instead try to learn P(X|Y), the probability distribution over inputs given labels.

Algorithms for doing this are called generative.

Given that I want a mouse, can you describe the possible heights, furriness, and length of tails that mice have?

Enumerating the possible values for the features is sort of like generating all possible mice.

You may be wondering how knowing a generative model would help us with our task of classifying animals.

Remember Bayes’ Rule?

From our training data we can learn P(Y), the probability of any specific animal, and P(X), the probability of any specific configuration of the features.

Using these terms we can answer queries in the form of P(Y|X) using Bayes’ Rule.
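Here's a toy Python sketch of that idea: turning a generative model P(X|Y) and a prior P(Y) into P(Y|X) with Bayes' Rule, P(Y|X) = P(X|Y) ⋅ P(Y) / P(X). The animals, the single "height" feature, and every number are made up for illustration. In practice these tables would be estimated from training data.

```python
# Made-up numbers for a toy "what animal did I see?" problem.
prior = {"mouse": 0.6, "cat": 0.4}   # P(Y)

# P(X | Y) for a single discrete feature: the animal's height category.
likelihood = {
    "mouse": {"tiny": 0.9, "small": 0.1},
    "cat": {"tiny": 0.05, "small": 0.95},
}

def posterior(x):
    """Return P(Y | X = x) via Bayes' Rule."""
    unnormalized = {y: likelihood[y][x] * prior[y] for y in prior}
    evidence = sum(unnormalized.values())   # P(X = x)
    return {y: value / evidence for y, value in unnormalized.items()}

tiny_posterior = posterior("tiny")
print(tiny_posterior)
```

Note that P(X) falls out as the normalizing sum over labels. The result is a full distribution over animals, not just a single guess.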

It’s possible to learn a mapping from X to Y which isn’t in the form of a probability distribution.

We could fit a deterministic function f to our training data such that f(X) ≈ Y.

What makes having a distribution better?

Well, imagine an algorithm is diagnosing your illness, and it tells you that you have a month left to live.

The function f can’t express to you how confident it is in the assessment.

Maybe you have features that the algorithm never saw in the training data, causing it to more or less guess an outcome.

The probabilistic model quantifies uncertainty; the regular function does not.

Unsupervised Learning:

Unsupervised learning is a broad set of techniques for learning from unlabeled data, where we just have some samples X but no output Y.

Common unsupervised tasks include:

Grouping similar data points together (clustering).

Taking high dimensional data and projecting it into a meaningful lower dimensional space (dimension reduction, factor analysis, embedding).

Representing the data with a distribution (density estimation).

Characterizing the distribution of unlabeled data is useful for many tasks.

One example is anomaly detection.

If we learn P(X), where X represents normal bank transactions, then we can use P(X) to measure the likelihood of future transactions.

If we observe a transaction with low probability, we can flag it as suspicious and possibly fraudulent.
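A minimal sketch of this idea in Python, under some loud assumptions: the transaction amounts are invented, P(X) is modeled as a single Gaussian (real transaction data is rarely that tidy), and the `threshold` value is an arbitrary cutoff chosen for the example. It uses `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist

# Made-up "normal" transaction amounts in dollars.
normal_amounts = [23.5, 19.0, 31.2, 27.8, 22.1, 25.4, 29.9, 21.7, 26.3, 24.0]

# Assumption: amounts are roughly Gaussian, so we fit a normal as our P(X).
model = NormalDist.from_samples(normal_amounts)

def is_suspicious(amount, threshold=1e-3):
    """Flag a transaction whose density under the fitted model is below threshold."""
    return model.pdf(amount) < threshold

print(is_suspicious(26.0))    # a typical amount
print(is_suspicious(500.0))   # far outside anything seen before
```

A real system would pick the density model and the cutoff much more carefully, but the shape of the solution is the same: learn P(X), then flag what it considers unlikely.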

Clustering is one of the canonical problems of unsupervised learning.

Given some data points originating from separate groups, how can we determine which group each point belongs to?

One method is to assume that each group is generated from a different probability distribution.

Solving the problem then becomes finding the most likely configuration of these distributions.
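One common way to do this is to fit a mixture of Gaussians with the expectation-maximization (EM) algorithm. The Python below is a bare-bones sketch for two groups in one dimension. The data points, the initialization at the extremes, and the fixed number of iterations are all assumptions of this sketch, not a recipe for real data.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def fit_two_gaussians(points, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with a basic EM loop."""
    # Crude initialization (an assumption of this sketch): start at the extremes.
    means = [min(points), max(points)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point came from each component.
        resp = []
        for x in points:
            scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(scores)
            resp.append([score / total for score in scores])
        # M-step: re-estimate each component from its softly assigned points.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / len(points)
            means[k] = sum(r[k] * x for r, x in zip(resp, points)) / mass
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, points)) / mass
            stds[k] = max(sqrt(var), 1e-3)   # floor avoids a collapsed component
    return weights, means, stds, resp

# Made-up data: one group near 1 and another near 10.
points = [0.8, 1.1, 0.9, 1.3, 1.0, 9.7, 10.2, 10.0, 9.9, 10.4]
weights, means, stds, resp = fit_two_gaussians(points)

# Assign each point to the component that most likely generated it.
labels = [0 if r[0] > r[1] else 1 for r in resp]
print(means, labels)
```

The "most likely configuration" here is the set of weights, means, and standard deviations that EM settles on. The cluster labels fall out of it as a by-product.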

Dimension reduction is the other main area of unsupervised learning.

High dimensional data takes up memory, slows down computations, and is hard to visualize and interpret.

We’d like to have ways of reducing the data to a lower dimension without losing too much information.

One can think of this problem as finding a distribution in a lower dimensional space with similar characteristics to the distribution of the original data.

Reinforcement Learning:

The field of reinforcement learning is all about training artificial agents to perform well at specific tasks.

The agents learn by taking actions in their environment and observing reward signals based on their behavior.

The goal of the agent is to maximize its expected long term reward.

Probability is used in reinforcement learning for several aspects of the learning process.

You may have picked up on the word “expected” in the goal.

The agent’s learning process often revolves around quantifying the uncertainty of the utility of taking one specific action over another.
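To give a flavor of what estimating expected reward can look like, here's a sketch of an epsilon-greedy agent on a made-up three-armed slot machine. This is one simple, well-known strategy chosen for illustration; the win probabilities, `epsilon`, step count, and seed are all assumptions of the sketch.

```python
import random

def run_bandit(true_win_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent estimating the expected reward of each action."""
    rng = random.Random(seed)
    n_actions = len(true_win_probs)
    estimates = [0.0] * n_actions   # running estimate of E[reward | action]
    counts = [0] * n_actions        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                          # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # The environment pays 1 with the action's (hidden) win probability.
        reward = 1.0 if rng.random() < true_win_probs[action] else 0.0
        counts[action] += 1
        # Incremental sample mean of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, counts, best_action)
```

The agent never sees the true win probabilities. It only sees noisy rewards, and its estimates are sample averages standing in for the expectations it cares about.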

Conclusion

This has been a gentle overview of the language of probability theory with a brief discussion on how we will apply these concepts to more advanced machine learning and statistics moving forward.

If you’d like to tackle probability theory from another angle, I highly recommend checking out this amazing visual introduction from Seeing Theory: seeing-theory.brown.edu

See you next time!