ML Algorithms: One SD (σ)- Bayesian Algorithms

An intro to machine learning Bayesian algorithms

Sagi Shaier, Feb 18

The obvious questions to ask when facing a wide variety of machine learning algorithms are “which algorithm is better for a specific task, and which one should I use?”

The answers to these questions vary depending on several factors, including: (1) the size, quality, and nature of the data; (2) the available computational time; (3) the urgency of the task; and (4) what you want to do with the data.

This is one section of the many algorithms I wrote about in a previous article.

In this part I have tried to present and briefly explain, as simply as possible, the main algorithms (though not all of them) that are available for Bayesian tasks.

Bayesian Algorithms: A family of algorithms that all share a common principle, i.e. every pair of features being classified is assumed to be independent of each other.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.

Bayes’ formula provides the relationship between P(A|B) and P(B|A).

· Naive Bayes

A Naive Bayes algorithm assumes that the features it uses are conditionally independent of one another given some class.

It provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c).
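Written out, the calculation mentioned here is Bayes' theorem. Under the naive independence assumption, the likelihood also factorizes over the individual features. The feature index notation x_1, ..., x_n is an assumption added for illustration and does not appear in the article:

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
```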

For example, assume that you have a few emails which are already classified as spam or ham.

Now suppose that you want to classify a new email as spam or ham.

Naive Bayes frames this as “what is the probability that the new email is spam/ham given that it contains particular words?” (e.g. the probability that an email that contains the word “Viagra” is classified as spam/ham).
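As a minimal sketch of that question for a single word, the posterior can be computed directly from Bayes' theorem. All of the probabilities below are made-up numbers chosen for illustration, and the function name is hypothetical:

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | word) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # P(word): total probability of seeing the word in any email.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence

# Made-up values: 40% of emails are spam; the word appears in
# 25% of spam emails and in 1% of ham emails.
result = posterior_spam(p_spam=0.4, p_word_given_spam=0.25, p_word_given_ham=0.01)
print(round(result, 3))  # 0.943
```

With several words, a Naive Bayes classifier would multiply one such likelihood term per word, which is where the independence assumption comes in.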

Some things to consider:

Useful for very large data sets. You can use the Naive Bayes classification algorithm with a small data set, but precision and recall will stay very low.

Since the algorithm assumes independence, you lose the ability to exploit the interactions between features.

· Gaussian Naive Bayes

The general term Naive Bayes refers to the independence assumptions in the model, rather than the particular distribution of each feature.

Up to this point we have said nothing about the distribution of each feature, but in Gaussian Naive Bayes, we assume that each feature follows a Gaussian (normal) distribution within each class.

Because of the assumption of the normal distribution, Gaussian Naive Bayes is used in cases when all our features are continuous.

For example, in the Iris dataset the features are sepal width, petal width, etc.

These features take continuous values (widths and lengths), so we can’t represent them in terms of their occurrences, and we need to use Gaussian Naive Bayes here.
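The idea can be sketched in plain Python: estimate a mean and variance per feature and per class, then score a new sample with the normal density. The measurements below are made-up, loosely Iris-like values rather than the real dataset, and the function names are hypothetical:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A tiny constant keeps the variance above zero (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Made-up petal length and petal width values in cm.
X = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.6, 0.25]))  # setosa
print(predict(model, [4.6, 1.3]))   # versicolor
```

Summing log densities instead of multiplying raw densities is a common choice to avoid numerical underflow when there are many features.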

Some things to consider:

It assumes the distribution of features is normal.

It is usually used when all our features are continuous.

· Multinomial Naive Bayes

The term Multinomial Naive Bayes simply tells us that the feature counts are assumed to follow a multinomial distribution.

It’s used when we have discrete data (e.g. movie ratings ranging from 1 to 5, where each rating occurs with a certain frequency).

In text learning, we use the count of each word to predict the class or label.

This algorithm is mostly used for document classification problems (whether a document belongs to the category of sports, politics, technology, etc.).

The features/predictors used by the classifier are the frequencies of the words present in the document.
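A minimal sketch of document classification from word counts might look like the following. The four training documents are invented, the function names are hypothetical, and the add-one (Laplace) smoothing is an assumption of this sketch rather than something the article specifies:

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Count words per class; alpha is the Laplace smoothing constant."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab, alpha

def predict(model, doc):
    """Return the class with the highest log posterior for the document."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for word in doc.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["the team won the match", "the player scored a goal",
        "the senate passed the bill", "the president signed the law"]
labels = ["sports", "sports", "politics", "politics"]
model = fit_multinomial_nb(docs, labels)
print(predict(model, "the team scored"))    # sports
print(predict(model, "the senate signed"))  # politics
```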

Some things to consider:

Used with discrete data.

Works well for data which can easily be turned into counts, such as word counts in text.

· Averaged One-Dependence Estimators (AODE)

AODE is a semi-naive Bayesian learning method.

It was developed to address the attribute independence problem of the popular naive Bayes classifier.

It does so by averaging over all of the models in which every attribute depends on the class and a single other attribute.

It frequently develops more accurate classifiers than naive Bayes at the cost of a small increase in the amount of computation.
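In symbols, the averaging described above is commonly written as follows. The notation is an assumption added for illustration: y is the class, x_1, ..., x_n are the attribute values, F(x_i) is the number of training examples containing the value x_i, and m is a minimum-frequency threshold used to skip rarely seen values:

```latex
\hat{P}(y \mid x_1, \ldots, x_n) \;\propto\;
\sum_{i \,:\, F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j=1}^{n} \hat{P}(x_j \mid y, x_i)
```

Each term of the sum is one "one-dependence" model in which attribute i acts as the extra parent of every other attribute.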

Some things to consider:

It is designed for nominal data, and it often achieves lower error rates than regular Naive Bayes at a modest extra computational cost.

· Bayesian Belief Network (BBN)

A probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph.

For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.

Given symptoms, the network can be used to compute the probabilities of the presence of various diseases (another example can be seen above in the image).

A BBN is a special type of diagram (called a directed graph) together with an associated set of probability tables.
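To make the "graph plus probability tables" idea concrete, here is a sketch of the smallest possible network, Disease -> Symptom, with one table per node. The numbers are invented and the function name is hypothetical:

```python
def posterior_disease(p_disease, p_symptom_given_disease, p_symptom_given_healthy):
    """Return P(disease | symptom observed) for a two-node network."""
    # Joint probability of each disease state together with the symptom.
    joint_sick = p_disease * p_symptom_given_disease
    joint_healthy = (1.0 - p_disease) * p_symptom_given_healthy
    return joint_sick / (joint_sick + joint_healthy)

# Made-up probability tables:
# P(disease) = 0.01, P(symptom | disease) = 0.90, P(symptom | no disease) = 0.05.
belief = posterior_disease(0.01, 0.90, 0.05)
print(round(belief, 3))  # 0.154
```

Observing the symptom raises the belief in the disease from 1% to about 15%, which is the kind of revision in the light of evidence that these networks are used for.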

Another example is tossing a coin.

The coin can take two values, heads or tails, with a 50% probability each.

We call these probabilities “beliefs” (i.e. our belief that the state coin=head is 50%).

Some things to consider:

BBNs enable us to model and reason about uncertainty.

The most important use of BBNs is in revising probabilities in the light of actual observations of events.

They can be used to understand what caused a certain problem, or the probabilities of different effects given an action, in areas like computational biology and medicine for risk analysis and decision support.

· Bayesian Network (BN)

Bayesian networks are a type of Probabilistic Graphical Model (probabilistic because they are built from probability distributions).

These networks can be used for predictions, anomaly detection, diagnostics, automated insight, reasoning, time series prediction and decision making under uncertainty.

The goal of these networks is to model conditional dependence, which is often used to reason about causation.

For example: if you’re outside of your house and it starts raining, there is a high probability that your dog will start barking.

This, in turn, will increase the probability that the cat will hide under the couch.

So you can see how information about one event (rain) allows you to make inferences about a seemingly unrelated event (the cat hiding under the couch).
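The rain, dog, and cat chain can be sketched as a three-node network, Rain -> Bark -> Hide. The probabilities below are invented for illustration and the function names are hypothetical:

```python
def p_hide_given_rain(rain, p_bark_given_rain, p_hide_given_bark):
    """P(cat hides | rain), summing over whether the dog barks."""
    p_bark = p_bark_given_rain[rain]
    return (p_bark * p_hide_given_bark[True]
            + (1.0 - p_bark) * p_hide_given_bark[False])

def p_rain_given_hide(p_rain, p_bark_given_rain, p_hide_given_bark):
    """P(rain | cat hides): reasoning backwards from an observation."""
    hide_and_rain = p_rain * p_hide_given_rain(True, p_bark_given_rain, p_hide_given_bark)
    hide_and_dry = (1.0 - p_rain) * p_hide_given_rain(False, p_bark_given_rain, p_hide_given_bark)
    return hide_and_rain / (hide_and_rain + hide_and_dry)

# Made-up probability tables.
p_rain = 0.2
p_bark_given_rain = {True: 0.8, False: 0.1}   # P(dog barks | rain?)
p_hide_given_bark = {True: 0.7, False: 0.05}  # P(cat hides | dog barks?)

print(p_hide_given_rain(True, p_bark_given_rain, p_hide_given_bark))   # about 0.57
print(p_hide_given_rain(False, p_bark_given_rain, p_hide_given_bark))  # about 0.115
print(p_rain_given_hide(p_rain, p_bark_given_rain, p_hide_given_bark)) # about 0.553
```

The first two lines show the forward prediction (rain makes the cat hiding more likely), and the last line shows the same network explaining an observation.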

Some things to consider:

You can use them to make future predictions.

Useful for explaining observations.

Bayesian networks are very convenient for representing similar probabilistic relationships between multiple events.

· Hidden Markov models (HMM)

An HMM is a class of probabilistic graphical model that allows us to predict a sequence of unknown (hidden) variables from a set of observed variables.

For example, predicting the weather (hidden variable) based on the type of clothes that someone wears (observed).

The observed clothing can be a swimsuit, an umbrella, etc.

These observations are basically the evidence.
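One standard way to recover the most likely hidden sequence is the Viterbi algorithm, which the article does not name; it is used here as an illustration. The start, transition, and emission probabilities are made up:

```python
states = ("sunny", "rainy")
start = {"sunny": 0.6, "rainy": 0.4}
trans = {"sunny": {"sunny": 0.7, "rainy": 0.3},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
emit = {"sunny": {"swimsuit": 0.6, "t-shirt": 0.35, "umbrella": 0.05},
        "rainy": {"swimsuit": 0.05, "t-shirt": 0.25, "umbrella": 0.7}}

def viterbi(observations):
    """Return (probability, path) of the most likely hidden weather sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the best previous state to transition from.
            prob, path = max(
                ((best[prev][0] * trans[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0])
            new_best[s] = (prob * emit[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

prob, path = viterbi(["swimsuit", "t-shirt", "umbrella"])
print(path)  # ['sunny', 'sunny', 'rainy']
```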

HMMs are known for their use in reinforcement learning and temporal pattern recognition such as handwriting, speech, part-of-speech tagging, gesture recognition, and bioinformatics.

HMMs answer questions like:

Given a model, what is the likelihood of sequence S happening?

Given a sequence S and a number of hidden states, what is the optimal model which maximizes the probability of S?

Some things to consider:

HMMs are suitable for applications that deal with recognizing something based on a sequence of features.

HMMs can be used to model processes which consist of different stages that occur in definite (or typical) orders.

An HMM needs to be trained on a set of seed sequences and generally requires a larger seed than simple Markov models.
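The first question above, the likelihood of a sequence S under a given model, is typically answered with the forward algorithm. The sketch below uses made-up weather and clothing probabilities, and the function name is hypothetical:

```python
states = ("sunny", "rainy")
start = {"sunny": 0.6, "rainy": 0.4}
trans = {"sunny": {"sunny": 0.7, "rainy": 0.3},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
emit = {"sunny": {"swimsuit": 0.6, "t-shirt": 0.35, "umbrella": 0.05},
        "rainy": {"swimsuit": 0.05, "t-shirt": 0.25, "umbrella": 0.7}}

def sequence_likelihood(observations):
    """Forward algorithm: P(observations | model), summed over all hidden paths."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[prev] * trans[prev][s] for prev in states)
                 for s in states}
    return sum(alpha.values())

print(sequence_likelihood(["swimsuit", "umbrella"]))  # roughly 0.097
```

The second question (finding the model that maximizes the probability of S) is a training problem and is not covered by this sketch.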

· Conditional random fields (CRFs)

A classical ML model for sequential data.

It is a type of discriminative classifier that models the decision boundary between the different classes.

The difference between discriminative and generative models is that while discriminative models try to model the conditional probability distribution, i.e., P(y|x), generative models try to model the joint probability distribution, i.e., P(x,y).

The underlying principle of CRFs is that they apply logistic regression to sequential inputs.
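For the common linear-chain case, this idea is usually written as the following conditional distribution. The notation is an assumption added for illustration: f_k are feature functions over neighboring labels and the input, \lambda_k are learned weights, and Z(x) normalizes over all possible label sequences:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

With a single position and no dependence on the previous label, this reduces to multiclass logistic regression.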

Hidden Markov Models share some similarities with CRFs, one being that they are also used for sequential inputs.

CRFs are most often used for NLP tasks.

Suppose you have a sequence of snapshots from a day in your friend’s life.

Your goal is to label each image with the activity it represents (eating, sleeping, driving, etc.).

One way to do it is to ignore the fact that the snapshots have a sequential nature, and to build a per-image classifier.

For example, you can learn that dark images taken at 5am are usually related to sleeping, while images with food tend to be about eating, and so on.

However, by ignoring the sequential aspect, we lose a lot of information.

As an example, what happens if you see a close-up picture of a mouth: is it about talking or eating?

If you know that the previous snapshot is a picture of your friend eating, then it’s more likely this picture is about eating.

Hence, to increase the accuracy of our labeler, we should consider the labels of nearby photos, and this is precisely what a conditional random field does.
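The effect of considering neighboring labels can be sketched with a toy scorer. This is only the decoding step with hand-picked scores, not a trained CRF: the per-image scores, the transition bonus, and the function names are all hypothetical, and a real implementation would learn the weights and use dynamic programming instead of brute force:

```python
from itertools import product

labels = ("eating", "talking", "sleeping")

# Hypothetical per-image scores (higher means a better fit).
image_scores = [
    {"eating": 2.0, "talking": 0.2, "sleeping": 0.1},  # plate of food
    {"eating": 1.0, "talking": 1.2, "sleeping": 0.1},  # close-up of a mouth
]

def transition_score(prev, cur):
    """Hypothetical bonus: an activity tends to persist between snapshots."""
    return 1.0 if prev == cur else 0.0

def sequence_score(seq):
    """Sum of per-image scores plus bonuses for neighboring label pairs."""
    score = sum(image_scores[t][label] for t, label in enumerate(seq))
    score += sum(transition_score(a, b) for a, b in zip(seq, seq[1:]))
    return score

# Labeling each image on its own ignores the sequence.
per_image = [max(scores, key=scores.get) for scores in image_scores]
# Scoring whole label sequences lets the neighbor influence the decision.
best_sequence = max(product(labels, repeat=len(image_scores)), key=sequence_score)

print(per_image)      # ['eating', 'talking']
print(best_sequence)  # ('eating', 'eating')
```

Taken alone, the mouth close-up leans slightly towards talking, but once the previous snapshot's label is considered, eating wins.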

Some things to consider:

A CRF predicts the most likely sequence of labels that corresponds to a sequence of inputs.

Compared to an HMM, since a CRF does not have as strict independence assumptions, it can accommodate any context information.

CRFs also avoid the label bias problem.

A CRF is highly computationally complex at the training stage of the algorithm, which makes it very difficult to re-train the model when newer data becomes available.

Until next time,

Bobcat.