For this task, we have to take a look at data that we have already received.
The dataset contains 50 non-spams and 20 spams, which have been manually labelled.
This makes a total of 70 emails.
Let’s define and analyze some features that can make an email spam.
For example, a feature can be an email without a subject.
Now we look at all 70 emails and find that 4 of the 50 non-spam emails and 10 of the 20 spam emails have no subject.
With the recorded data, we will try to find the probability that an email without a subject is spam.
We can see that, from our emails without a subject, four are non-spam emails, while ten are spam emails.
This gives us the following answer: 10 / (10 + 4) ≈ 71.43% is the probability that an email without a subject is spam.
We can then label future emails as spam or non-spam using this rule: if an email without a subject is received, the probability of that email being spam is 71.43%.
We can do the same with other features, such as whether the email contains the sentence “You are the big winner” or “Claim your offer”.
We can then combine all the features to classify the email.
This algorithm is called the Naïve Bayes classifier.
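As a minimal sketch, the calculation for the single "no subject" feature can be written out with Bayes' rule, using only the counts given above. The function name `spam_probability` and its parameter names are our own choice for illustration, and the sketch covers one feature only, not the combination of several features.

```python
def spam_probability(spam_with, spam_total, ham_with, ham_total):
    """P(spam | feature) from labelled counts, via Bayes' rule."""
    total = spam_total + ham_total
    p_spam = spam_total / total                   # prior: share of spam emails
    p_ham = ham_total / total                     # prior: share of non-spam emails
    p_feature_given_spam = spam_with / spam_total
    p_feature_given_ham = ham_with / ham_total
    # Overall probability of seeing the feature in any email.
    evidence = p_feature_given_spam * p_spam + p_feature_given_ham * p_ham
    return p_feature_given_spam * p_spam / evidence


# 10 of 20 spam emails and 4 of 50 non-spam emails have no subject.
p = spam_probability(spam_with=10, spam_total=20, ham_with=4, ham_total=50)
print(f"{p:.2%}")  # 71.43%
```

The result agrees with the direct count of 10 spam emails among the 14 emails without a subject.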
The method is also used for recommendation and sentiment analysis.
It is easy to implement.
The next method we will see is also a simple one and is very well suited to the analysis of large amounts of data.
K-means Clustering
K-means is one of the most commonly used ML methods for grouping objects (cluster analysis).
We have bicycles and want to use them for our Bike Rental System.
For this purpose, we decide to place three bicycle rental stations in one district.
We study the area and find that the people who travel most by bicycle live in apartments, as shown on the following map.
With this arrangement, it seems right to place one station in each group of apartments, because the apartments in a group are close to each other.
However, as we said at the beginning, the fact is that the computer has to be taught how to do a task.
That means it does not know how to do this at this point.
In this case, we need to use a new method to find a good placement.
We first place the three stations at random, at the positions where the bicycles are shown in the picture.
After placing the stations, we assume that people will rent bicycles from the closest station.
That is, people living close to the orange station will rent from the orange station, those close to the pink station from the pink one, and those close to the yellow station from the yellow one.
But if we look at the distance of the inhabitants of the yellow apartments from the yellow station, we realize that it makes more sense to place the station in the center of the yellow apartments, and to repeat this step for the pink and orange stations.
Having moved the stations, we can now reassign each apartment to its closest station.
Looking at the three apartments next to the five orange apartments, we see that they are closer to the orange station than to the pink station, so we mark them as orange. We do the same with the two yellow apartments next to the pink ones.
After moving the pink station to the center of its customers, we find that three yellow apartments are closer to the pink station than to the yellow station, so we color them pink.
We then move the yellow station to the center of the remaining yellow apartments, which are closest to this station.
This method is called k-means.
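The two alternating steps described above (assign each apartment to its closest station, then move each station to the center of its apartments) can be sketched in plain Python. The coordinates below are invented for illustration, and the names `k_means` and `closest` are our own. This is a sketch under those assumptions, not a tuned implementation.

```python
import math
import random


def closest(point, stations):
    """Index of the station nearest to the given point."""
    return min(range(len(stations)), key=lambda i: math.dist(point, stations[i]))


def k_means(apartments, stations, max_rounds=100):
    """Alternate between assigning apartments and moving stations."""
    stations = list(stations)
    groups = [[] for _ in stations]
    for _ in range(max_rounds):
        # Step 1: every apartment is assigned to its closest station.
        groups = [[] for _ in stations]
        for apartment in apartments:
            groups[closest(apartment, stations)].append(apartment)
        # Step 2: every station moves to the center of its apartments.
        moved = []
        for station, group in zip(stations, groups):
            if group:
                cx = sum(x for x, _ in group) / len(group)
                cy = sum(y for _, y in group) / len(group)
                moved.append((cx, cy))
            else:
                moved.append(station)  # a station without customers stays put
        if moved == stations:
            break  # nothing moved, so the grouping is stable
        stations = moved
    return stations, groups


# Hypothetical apartment coordinates forming three groups.
apartments = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
random.seed(0)
start = random.sample(apartments, 3)  # random initial station positions
stations, groups = k_means(apartments, start)
for station, group in zip(stations, groups):
    print(station, group)
```

Because the starting positions are random, the final grouping can depend on the start. A common remedy is to run the procedure several times and keep the best result.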
If we know how many clusters we want to have at the end, then we can use k-means.
If we have no idea how many clusters we want, there is another algorithm for that, called hierarchical clustering.
Hierarchical Clustering
Hierarchical clustering is another method that can be used to cluster the apartments or buildings in the district, without specifying the number of groups or clusters.
With this placement of apartments in a district, it would make sense to say that if two apartments are close, then the inhabitants of the apartments can rent bikes in the same station.
With this rule, we start by grouping the two closest apartments.
We then group the next two closest apartments.
We repeat the previous step.
The next closest apartments are then the ones marked in the picture, but they are a bit far away from each other.
If the distance exceeds a certain length, the algorithm stops.
This method is called hierarchical clustering and is used if we do not know the number of clusters we want to have, but have an idea of what that should look like.
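Here is a small sketch of this merging procedure. We assume that the distance between two groups is the distance between their two closest apartments (often called single linkage), which is only one of several possible choices. The coordinates, the function name `hierarchical_clustering` and the threshold `max_distance` are invented for illustration.

```python
import math


def hierarchical_clustering(apartments, max_distance):
    """Merge the two closest groups until they are too far apart."""
    clusters = [[a] for a in apartments]  # every apartment starts alone

    def gap(c1, c2):
        # Assumption: group distance = distance of the closest pair of members.
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the pair of groups with the smallest gap.
        d, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break  # the closest groups are too far apart, so we stop
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical apartment coordinates.
apartments = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
for cluster in hierarchical_clustering(apartments, max_distance=2):
    print(cluster)
```

The number of clusters is not given in advance. It follows from the distance threshold.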
Linear Regression
In this example, we try to estimate the price of a car based on its size, measured as length x width x height (LxWxH).
For this, we do a small study with three cars.
The smallest car costs 15,000 € and the largest car costs 45,000 €.
We now want to estimate the price of the third car, whose size lies between those of the other two.
For this purpose, we arrange the cars in a grid, where the x-axis corresponds to the size of the cars and the y-axis to the price of the cars.
To make the task a little easier, we use data that we had previously recorded from other cars.
These are represented by the brown dots.
We can see that these points roughly form a line.
We then draw a line that best fits these brown dots.
With the help of the line, we can estimate the price of the car in the middle, which corresponds to about 30,000 €.
This method is called linear regression.
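To make this concrete, here is a minimal sketch that fits a line to a handful of recorded cars and reads off the price of a mid-sized car. The sizes (in cubic meters) and the intermediate prices are invented for illustration. Only the 15,000 € and 45,000 € endpoints come from the example. For brevity, the sketch uses the closed-form least squares formula instead of an iterative search.

```python
# Hypothetical recorded cars: (size in cubic meters, price in euros).
recorded = [(8.0, 15000.0), (10.0, 22000.0), (12.0, 31000.0),
            (14.0, 37000.0), (16.0, 45000.0)]


def fit_line(points):
    """Closed-form least squares fit of price = slope * size + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(recorded)


def predict(size):
    """Estimated price for a car of the given size."""
    return slope * size + intercept


print(predict(12.0))  # 30000.0
```

With these invented numbers, the fitted line passes through 30,000 € at a size of 12 cubic meters, even though the recorded car of that size costs 31,000 €. The line summarizes the trend; it does not reproduce each point.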
To find the green line above, the one that best fits the data, we use another method called gradient descent.
Let’s make a short stop and talk about this method.
Gradient Descent
Let us assume that we are at the top of a mountain and we need to find the shortest distance to the foot of the mountain.
We can make it to the foot of the mountain step by step.
We first need to find the direction that takes us furthest down the mountain.
We then take a step in that direction.
After that, we repeat this process: we look for the best direction again and take another step.
We repeat the action until we reach the foot of the mountain.
This is the gradient descent algorithm.
It is widely used in machine learning.
To solve our mountain problem, we take small steps in the right direction until we reach the foot of the mountain (the solution).
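As a toy illustration, we can let a simple curve play the role of the mountain. The function `height`, with its lowest point at x = 3, is an arbitrary choice made for this sketch, as are the step size and the number of steps.

```python
def height(x):
    """Hypothetical mountain profile; its lowest point is at x = 3."""
    return (x - 3.0) ** 2


def slope(x):
    """Derivative of height: tells us which way is downhill."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, step_size=0.1, steps=100):
    x = start
    for _ in range(steps):
        # Take a small step against the slope, i.e. downhill.
        x -= step_size * slope(x)
    return x


foot = gradient_descent(start=10.0)
print(foot, height(foot))
```

Each step moves x closer to 3, where the height is smallest. A step size that is too large can overshoot the valley, while one that is too small makes the descent slow.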
Linear Regression (continued)
We now return to linear regression, where we were trying to explain how to find the line that best fits the data.
Let us try with four points and find a line that fits these four points well.
A computer does not know how to do that, so we first draw a random line.
Now we check how well or badly this line fits the data.
This is done by calculating the error.
For this, we calculate the distance of each of the four points to the line and add these distances up to get the error.
Then we move the line a little, calculate the error again, and see that the error has become smaller.
We keep this step, repeat the process, and reduce the error until we find a good solution.
This process of minimizing the error is done by gradient descent.
When we are on top of the mountain, we have a big error.
With each step we take down in the right direction, we reduce the error.
In real examples, we do not want to work with negative distances, so we use the squared distances instead.
This is called the least squares method.
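Putting both ideas together, the following sketch fits a line to four points by gradient descent on the average squared error. The four points are invented and lie exactly on y = 2x + 1, so we know which line to expect. We also assume that the distance is measured vertically, and that we start from a flat line instead of a random one so that the result is reproducible.

```python
# Four hypothetical points that lie on the line y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def squared_error(slope, intercept, points):
    """Average squared vertical distance between the points and the line."""
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


def fit_line(points, learning_rate=0.05, steps=5000):
    slope, intercept = 0.0, 0.0  # starting line (flat, chosen for reproducibility)
    n = len(points)
    for _ in range(steps):
        grad_slope = 0.0
        grad_intercept = 0.0
        for x, y in points:
            residual = slope * x + intercept - y
            # Derivatives of the average squared error.
            grad_slope += 2 * residual * x / n
            grad_intercept += 2 * residual / n
        # Move the line a little in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


slope, intercept = fit_line(points)
print(round(slope, 3), round(intercept, 3))  # 2.0 1.0
```

The error of the starting line is large, and every step reduces it a little, just like walking down the mountain.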
Logistic Regression
For this example, our task is to classify brain tumors as benign or malignant.
Basically, benign brain tumors can be distinguished from malignant ones by two characteristics: they usually grow slowly (and damage the surrounding tissue mainly by increasing pressure), and they are less likely to recur than malignant tumors, which grow rapidly and can invade healthy tissue.
We will use these two features, the growth rate and the recurrence rate of the tumor, to classify the two types of brain tumor.
In this example we have recorded some data with these parameters, e.g. a growth rate of 20% and a recurrence rate of 20% for data point x, which is a benign tumor.
For data point y, we have a growth rate of 79% and a recurrence rate of 61%, which is a malignant tumor.
We now have a new data point with a growth rate of 62% and a recurrence rate of 45%.
The colors define the class label to which each instance belongs.
With the data we have, we want to determine which tumor type it is.
For this purpose, we arrange the data in a grid, where the x-axis corresponds to the growth rate and the y-axis to the recurrence rate.
We use previously recorded data of malignant and benign tumors.
Looking closer at the points, we can see that the points can be separated with a line.
This line is the model.
Most of the red dots are above the green line and most of the gray dots are below the line.
Every time we have a new point, we will be able to assign it to a tumor type using the model.
The points above the line are the malignant tumors and the points below the line are the benign tumors.
For the new data point, with a growth rate of 62% and a recurrence rate of 45%, we can say that this tumor is malignant.
This method is called logistic regression.
As with linear regression, we now examine how to find the line that separates the data.
Let’s take a simple example and try to find the line that best separates the red dots from the brown dots.
In our example, two red dots and one brown dot are on the wrong side of the line, which means three errors.
We try to minimize the error by using gradient descent.
If we move the line in the right direction, we can see that the error is reduced.
Now we have only two errors.
We then repeat the process until we have no more errors.
With real examples, we do not minimize the number of errors directly but use a function called the log loss function, which weights the errors.
In our example, two of the eight dots are not correctly classified.
The log loss function assigns a large penalty to the incorrectly classified points and a small penalty to the correctly classified points.
We add up the penalties of all points to get the value of the log loss function.
To make this clearer, we replace each penalty with a number: a large number for a large penalty and a small number for a small penalty.
In the second step, we see that the error is bigger than in the next step.
The procedure here is to minimize the error function and to find the line that best separates the data.
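The sketch below shows one way to put this into code: a log loss that adds up one penalty per point, and a gradient descent loop that moves the line to reduce it. The eight data points (growth rate and recurrence rate as fractions, with label 1 for malignant) are invented, apart from the two recorded examples mentioned earlier. The function names, step size and number of steps are our own assumptions.

```python
import math

# Hypothetical data: (growth rate, recurrence rate); label 1 = malignant, 0 = benign.
points = [(0.20, 0.20), (0.30, 0.25), (0.25, 0.35), (0.35, 0.30),
          (0.79, 0.61), (0.70, 0.70), (0.85, 0.55), (0.65, 0.75)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def predict(weights, point):
    """Estimated probability that the tumor is malignant."""
    w_growth, w_recurrence, bias = weights
    z = w_growth * point[0] + w_recurrence * point[1] + bias
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, points, labels):
    """Sum of penalties: small for confident correct points, large for wrong ones."""
    total = 0.0
    for point, label in zip(points, labels):
        p = predict(weights, point)
        p_true = p if label == 1 else 1.0 - p  # probability given to the true class
        total += -math.log(p_true)
    return total


def train(points, labels, learning_rate=0.5, steps=5000):
    weights = [0.0, 0.0, 0.0]  # starting line
    n = len(points)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (growth, recurrence), label in zip(points, labels):
            error = predict(weights, (growth, recurrence)) - label
            # Gradient of the log loss, scaled by 1/n to keep the steps small.
            grad[0] += error * growth / n
            grad[1] += error * recurrence / n
            grad[2] += error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


weights = train(points, labels)
print(log_loss([0.0, 0.0, 0.0], points, labels), log_loss(weights, points, labels))
print(predict(weights, (0.62, 0.45)))  # estimated probability of malignancy
```

The first printed line compares the log loss before and after training. The last line gives the model's estimate for the new tumor, and the outcome depends entirely on the invented training data.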
Now we know a bit about machine learning algorithms.
Let us see more on the next page.