Let’s Find Donors For Charity With Machine Learning Models

Recall is the ratio of true positives (words classified as spam that actually are spam) to all the words that were actually spam. Different metrics matter to different degrees in different problems, so we need to know which one matters most for ours. For instance, identifying someone who does not make more than $50,000 as someone who does would be detrimental to CharityML, since they are looking for individuals who are likely to donate. Therefore, a model's ability to precisely predict those that make more than $50,000 is more important than its ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall:

F_beta = (1 + beta^2) · (precision · recall) / (beta^2 · precision + recall)

With beta = 0.5, the score weights precision more heavily than recall, which is exactly what we want here.

Naive Baseline Predictor

If we chose a model that always predicted that an individual made more than $50,000, what would that model's accuracy and F-score be on this data set? We want to know what a model that does no training at all would look like, so that we have a baseline to beat.

Supervised Learning Models

There are plenty of supervised learning models to choose from, far more than could be listed exhaustively here. Out of the options I considered, I tried and tested three of them: SVM, AdaBoost, and Random Forest (a sketch of this comparison appears at the end of this section).

The AdaBoost model is the best fit for our problem. Its F-score on the testing data set is the highest of the three. Unlike with Random Forest, there is also no big gap between its training and testing F-scores, which matters because we do not want our model to overfit the training set and return an inflated F-score. AdaBoost's accuracy and F-score are the highest at every training set size, and its training and prediction times are very low, which means the model is computationally fast. Its iterative nature also lets it handle a high number of attributes well, as in our case. Hence it is a good choice.

How AdaBoost Works

AdaBoost, short for adaptive boosting, is an ensemble algorithm that uses iterative training to produce an accurate model. It starts with a weak learner, an initial classification of the data done with decision stumps, meaning the data is separated with just a single split. It is called "weak" because at this point the data is not classified very well yet. The further iterations make the learners focus on the misclassified points. To be more precise, in the first step the weak learner separates the data with all points weighted equally. If there are misclassified points, the next iteration assigns them higher weights, so the new weak learner tries to capture most of the previous errors. The essence is that the model focuses on the errors by weighting them more heavily. This iterative process continues for as many rounds as we specify, and with every iteration the model captures the data better and better. The weak learners are then combined, each assigned a weight according to its performance, and predictions are made from the weighted combination of the weak classifiers. The final learner is a strong learner built from the weak ones.

Model Tuning

We can improve the chosen model further by using Grid Search. The idea is to try different values for some parameters, such as the number of estimators or the learning rate, in order to achieve better performance metrics.

The optimized model performed better than the unoptimized model: the accuracy score increased from 0.8576 to 0.8651, and the F-score increased from 0.7246 to 0.7396.
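To make the tuning step concrete, here is a minimal sketch of what such a grid search could look like with scikit-learn. The parameter grid is illustrative rather than the exact one I used, and the data below is a synthetic stand-in for the preprocessed census features, included only so the example runs end to end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned, one-hot encoded census data
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# F-beta with beta = 0.5 weights precision more heavily than recall
scorer = make_scorer(fbeta_score, beta=0.5)

# Illustrative grid: number of weak learners and the learning rate
param_grid = {"n_estimators": [50, 200, 500], "learning_rate": [0.5, 1.0, 1.5]}

grid = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, scoring=scorer, cv=5)
grid.fit(X_train, y_train)

best_clf = grid.best_estimator_
predictions = best_clf.predict(X_test)
print("Best parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, predictions))
print("F-score (beta=0.5):", fbeta_score(y_test, predictions, beta=0.5))
```

Passing a make_scorer built around beta = 0.5 keeps the tuning aligned with the metric we actually care about, rather than with plain accuracy.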
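Stepping back to the model comparison promised above, the three candidates can be evaluated along these lines. Again, this is a hedged sketch with synthetic stand-in data and default hyperparameters, not my exact training pipeline.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed census data
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVM": SVC(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, clf in models.items():
    start = time()
    clf.fit(X_train, y_train)
    train_time = time() - start

    # A large gap between training and testing F-scores is a sign of overfitting
    train_f = fbeta_score(y_train, clf.predict(X_train), beta=0.5)
    test_pred = clf.predict(X_test)
    print(f"{name}: train time {train_time:.2f}s, "
          f"train F0.5 {train_f:.4f}, test F0.5 {fbeta_score(y_test, test_pred, beta=0.5):.4f}, "
          f"test accuracy {accuracy_score(y_test, test_pred):.4f}")
```

Running a comparison like this at several training set sizes is what the model-selection discussion above is based on.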
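Finally, for reference, the naive baseline can be checked in a few lines. The tiny label array below is purely hypothetical; in the project the real income labels are used instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score

# Hypothetical labels: 1 = makes more than $50,000, 0 = does not
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# The naive model always predicts the positive class
naive_pred = np.ones_like(y_true)

print("Naive accuracy:", accuracy_score(y_true, naive_pred))
print("Naive F0.5:", fbeta_score(y_true, naive_pred, beta=0.5))
```

Because every prediction is positive, recall is 1 and precision equals the share of people who actually make more than $50,000, so both accuracy and the F-score come out low.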
Remember that the Naive Predictor gave us an accuracy score of 0.2478 and an F-score of 0.2917, which is not surprising because the naive model does no training on the data.

Feature Extraction

Out of the 13 features in this data set, I was curious to see which ones have the highest predictive power, and what would happen if we used only those in our model (a sketch of this step is included at the very end of the post). In the reduced model, both the accuracy and the F-score decreased: with fewer features the model generalizes slightly worse than the full model. However, the drop in scores is small, and in return training is faster because the model uses fewer features. So if training time were a limiting factor, this trade-off would make sense, since we would not lose much in terms of performance.

Summary

So what did we do? We took a data set and set out to classify people who make more than $50,000 annually. We cleaned the data, normalized the numerical features, and converted the categorical variables into numerical ones so that we could use them in our models. We shuffled and split the data into training and testing sets. We set a naive baseline predictor and built three other models. We chose AdaBoost as the best of the three, tuned it further and made it slightly better, and tried a version with only the five most important features, which performed slightly worse.

Final Words

This was it 🙂 If you made it through here, I want to say a big thank you. I hope this was a good and clear application of supervised machine learning. It is a powerful tool in data science, something that I am currently studying and want to master. Feel free to comment below if you have questions, and you can always have a look at this project, along with many others, on my GitHub. God bless you all!
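As promised above, here is a minimal sketch of the feature-importance step. It uses synthetic stand-in data and placeholder feature names, so treat it as an outline of the approach rather than the exact notebook code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 preprocessed census features
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(13)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the (tuned) AdaBoost model and rank features by importance
best_clf = AdaBoostClassifier(random_state=42).fit(X_train, y_train)
top5 = X_train.columns[np.argsort(best_clf.feature_importances_)[::-1][:5]]
print("Top 5 features:", list(top5))

# Retrain on the five most important features and compare with the full model
reduced_clf = AdaBoostClassifier(random_state=42).fit(X_train[top5], y_train)
reduced_pred = reduced_clf.predict(X_test[top5])
print("Reduced accuracy:", accuracy_score(y_test, reduced_pred))
print("Reduced F0.5:", fbeta_score(y_test, reduced_pred, beta=0.5))
```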
