# Probability for Machine Learning (7-Day Mini-Course)

This is called the “Boy or Girl Problem” and is one of many common toy problems for practicing probability.

I would love to see what you come up with.

In the next lesson, you will discover probability distributions for random variables.

In this lesson, you will discover a gentle introduction to probability distributions.

In probability, a random variable can take on one of many possible values, e.g. events from the state space.

A specific value or set of values for a random variable can be assigned a probability.

There are two main classes of random variables.

A discrete random variable has a finite set of states; for example, the color of a car.

A continuous random variable has a range of numerical values; for example, the height of humans.

A probability distribution is a summary of probabilities for the values of a random variable.

A discrete probability distribution summarizes the probabilities for a discrete random variable.

Some examples of well-known discrete probability distributions include the Bernoulli, binomial, multinomial, and Poisson distributions.

A continuous probability distribution summarizes the probability for a continuous random variable.

Some examples of well-known continuous probability distributions include the normal (Gaussian), exponential, and Pareto distributions.

We can define a normal distribution with a mean of 50 and a standard deviation of 5 and sample random numbers from this distribution.

We can achieve this using the normal() NumPy function.

The example below samples and prints 10 numbers from this distribution.
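A minimal version of this example might look as follows, using NumPy's `normal()` function (the seed is added here only so the results are repeatable):

```python
# Sample random numbers from a normal distribution
# with mean 50 and standard deviation 5.
from numpy.random import seed, normal

# seed the random number generator for repeatability
seed(1)
# sample 10 numbers from the distribution
sample = normal(loc=50, scale=5, size=10)
print(sample)
```

Each run with a different seed (or no seed) will print a different set of 10 numbers centered around 50.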

Running the example prints 10 numbers randomly sampled from the defined normal distribution.

For this lesson, you must develop an example to sample from a different continuous or discrete probability distribution function.

For a bonus, you can plot the values on the x-axis and the probability on the y-axis for a given distribution to show the density of your chosen probability distribution function.

I would love to see what you come up with.

In the next lesson, you will discover the Naive Bayes classifier.

In this lesson, you will discover the Naive Bayes algorithm for classification predictive modeling.

In machine learning, we are often interested in a predictive modeling problem where we want to predict a class label for a given observation.

One approach to solving this problem is to develop a probabilistic model.

From a probabilistic perspective, we are interested in estimating the conditional probability of the class label given the observation, or the probability of class y given input data X.

Bayes Theorem provides an alternate and principled way for calculating the conditional probability using the reverse of the desired conditional probability, which is often simpler to calculate.

The simple form of the calculation for Bayes Theorem is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

where the probability that we are interested in calculating, P(A|B), is called the posterior probability, and the marginal probability of the event, P(A), is called the prior.

The direct application of Bayes Theorem for classification becomes intractable, especially as the number of variables or features (n) increases.

Instead, we can simplify the calculation and assume that each input variable is independent.

Although a dramatic simplification, this simpler calculation often gives very good performance, even when the input variables are highly dependent.

We can implement this from scratch by assuming a probability distribution for each separate input variable, calculating the probability of each specific input value belonging to each class, and multiplying the results together to give a score used to select the most likely class.

The scikit-learn library provides an efficient implementation of the algorithm if we assume a Gaussian distribution for each input variable.

To use a scikit-learn Naive Bayes model, first the model is defined, then it is fit on the training dataset.

Once fit, probabilities can be predicted via the predict_proba() function and class labels can be predicted directly via the predict() function.

The complete example of fitting a Gaussian Naive Bayes model (GaussianNB) to a test dataset is listed below.
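A sketch of such an example, using a small synthetic dataset from `make_blobs` (assumed here for illustration; the exact dataset may differ):

```python
# Fit a Gaussian Naive Bayes model on a small synthetic dataset.
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# generate a 2-class classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# define and fit the model
model = GaussianNB()
model.fit(X, y)
# predict the class membership probabilities for the first example
yhat_prob = model.predict_proba([X[0]])
print('Predicted Probabilities:', yhat_prob)
# predict the class label for the first example
yhat_class = model.predict([X[0]])
print('Predicted Class:', yhat_class)
print('Truth: y=%d' % y[0])
```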

Running the example fits the model on the training dataset, then makes predictions for the same first example that we used in the prior example.

For this lesson, you must run the example and report the result.

For a bonus, try the algorithm on a real classification dataset, such as the popular toy classification problem of classifying iris flower species based on flower measurements.

I would love to see what you come up with.

In the next lesson, you will discover entropy and the cross-entropy scores.

In this lesson, you will discover cross-entropy for machine learning.

Information theory is a field of study concerned with quantifying information for communication.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event.

Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

We can calculate the amount of information there is in an event using the probability of the event.

We can also quantify how much information there is in a random variable.

This is called entropy and summarizes the amount of information required on average to represent events.

Entropy can be calculated for a random variable X with K discrete states as follows:

H(X) = -sum(p(k) * log(p(k)) for k = 1 to K)

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

It is widely used as a loss function when optimizing classification models.

It builds upon the idea of entropy and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

We can make the calculation of cross-entropy concrete with a small example.

Consider a random variable with three events as different colors.

We may have two different probability distributions for this variable.

We can calculate the cross-entropy between these two distributions.

The complete example is listed below.
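A sketch of such an example, with assumed illustrative distributions P and Q over the three events:

```python
# Calculate cross-entropy between two discrete distributions.
from math import log2

# cross-entropy H(P, Q) in bits
def cross_entropy(p, q):
    return -sum(p[i] * log2(q[i]) for i in range(len(p)))

# two distributions over three events (e.g. red, green, blue)
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# cross-entropy of Q from P, then P from Q
print('H(P, Q): %.3f bits' % cross_entropy(p, q))
print('H(Q, P): %.3f bits' % cross_entropy(q, p))
```

Note that cross-entropy is not symmetric: H(P, Q) and H(Q, P) are generally different values.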

Running the example first calculates the cross-entropy of Q from P, then P from Q.

For this lesson, you must run the example and describe the results and what they mean.

I would love to see what you come up with.

In the next lesson, you will discover how to develop and evaluate a naive classifier model.

In this lesson, you will discover how to develop and evaluate naive classification strategies for machine learning.

Classification predictive modeling problems involve predicting a class label given an input to the model.

Given a classification model, how do you know if the model has skill or not? This is a common question on every classification predictive modeling project.

The answer is to compare the results of a given classifier model to a baseline or naive classifier model.

Consider a simple two-class classification problem where the number of observations is not equal for each class (e.g. it is imbalanced), with 25 examples for class-0 and 75 examples for class-1.

This problem can be used to consider different naive classifier models.

For example, consider a model that randomly predicts class-0 or class-1 with equal probability.

How would it perform? We can calculate the expected performance using a simple probability model.

We can plug in the occurrence of each class (0.25 and 0.75) and the predicted probability for each class (0.5 and 0.5) and estimate the performance of the model.

It turns out that this classifier is pretty poor.

Now, what if we consider predicting the majority class (class-1) every time? Again, we can plug in the predicted probabilities (0.0 and 1.0) and estimate the performance of the model.

It turns out that this simple change results in a better naive classification model, and is perhaps the best naive classifier to use when classes are imbalanced.
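These expected-performance calculations can be sketched directly as simple arithmetic:

```python
# Expected accuracy of two naive strategies on an imbalanced
# dataset with P(class-0)=0.25 and P(class-1)=0.75.

# class base rates
p_class0, p_class1 = 0.25, 0.75

# strategy 1: predict each class with equal probability (0.5 / 0.5)
random_accuracy = p_class0 * 0.5 + p_class1 * 0.5
print('Random guessing: %.3f' % random_accuracy)  # 0.500

# strategy 2: always predict the majority class (0.0 / 1.0)
majority_accuracy = p_class0 * 0.0 + p_class1 * 1.0
print('Majority class: %.3f' % majority_accuracy)  # 0.750
```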

The scikit-learn machine learning library provides an implementation of the majority class naive classification algorithm called the DummyClassifier that you can use on your next classification predictive modeling project.

The complete example is listed below.
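A sketch of such an example, constructing the imbalanced dataset described above (the feature values are a stand-in, since the DummyClassifier ignores its input):

```python
# Evaluate the majority-class naive classifier on an imbalanced dataset.
from numpy import asarray
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# define an imbalanced dataset: 25 examples of class-0, 75 of class-1
X = asarray([[i] for i in range(100)])
y = asarray([0] * 25 + [1] * 75)
# define and fit the majority-class model
model = DummyClassifier(strategy='most_frequent')
model.fit(X, y)
# evaluate on the same dataset
yhat = model.predict(X)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.3f' % accuracy)  # Accuracy: 0.750
```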

Running the example prepares the dataset, then defines and fits the DummyClassifier on the dataset using the majority class strategy.

For this lesson, you must run the example and report the result, confirming whether the model performs as we expected from our calculation.

As a bonus, calculate the expected probability of a naive classifier model that randomly chooses a class label from the training dataset each time a prediction is made.

I would love to see what you come up with.

In the next lesson, you will discover metrics for scoring models that predict probabilities.

In this lesson, you will discover two scoring methods that you can use to evaluate the predicted probabilities on your classification predictive modeling problem.

Predicting probabilities instead of class labels for a classification problem can provide additional nuance and uncertainty for the predictions.

The added nuance allows more sophisticated metrics to be used to interpret and evaluate the predicted probabilities.

Let’s take a closer look at the two popular scoring methods for evaluating predicted probabilities.

Logistic loss, or log loss for short, calculates the negative log likelihood of the observed class labels given the predicted probabilities.

Although developed for training binary classification models like logistic regression, it can be used to evaluate multi-class problems and is functionally equivalent to calculating the cross-entropy derived from information theory.

A model with perfect skill has a log loss score of 0.0.

The log loss can be implemented in Python using the log_loss() function in scikit-learn.
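For example, a small sketch with mock binary predictions (the specific labels and probabilities are illustrative):

```python
# Calculate log loss for a set of predicted probabilities.
from sklearn.metrics import log_loss

# true class labels
y_true = [0, 0, 1, 1, 1]
# predicted probability of class-1 for each example
y_prob = [0.1, 0.3, 0.8, 0.9, 0.6]
# calculate log loss (lower is better, 0.0 is perfect)
loss = log_loss(y_true, y_prob)
print('Log Loss: %.3f' % loss)
```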

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

The Brier score can be calculated in Python using the brier_score_loss() function in scikit-learn.
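For example, a small sketch using the same kind of mock binary predictions (the specific values are illustrative):

```python
# Calculate the Brier score for a set of predicted probabilities.
from sklearn.metrics import brier_score_loss

# true class labels
y_true = [0, 0, 1, 1, 1]
# predicted probability of class-1 for each example
y_prob = [0.1, 0.3, 0.8, 0.9, 0.6]
# calculate the Brier score (lower is better, 0.0 is perfect)
score = brier_score_loss(y_true, y_prob)
print('Brier Score: %.3f' % score)
```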

For this lesson, you must run each example and report the results.

As a bonus, change the mock predictions to make them better or worse and compare the resulting scores.

I would love to see what you come up with.

This was the final lesson.

Well done! Take a moment and look back at how far you have come.

You discovered how to work with probability distributions, the Naive Bayes classifier, entropy and cross-entropy, naive classifier baselines, and metrics for scoring predicted probabilities.

Take the next step and check out my book on Probability for Machine Learning.

How did you do with the mini-course? Did you enjoy this crash course? Do you have any questions? Were there any sticking points? Let me know.