It appears that the paid/free status does not influence account termination.
On the other hand, there are more active males than females, and more canceling males than females as well. It seems that males tend to cancel more than females; in other words, gender appears to affect the churn decision.
Let’s see when the churn users are most active
The churn users are most active at the beginning of the month, while most cancellations happen at the end of the month, which is logical, as users cancel to avoid renewal fees.
Does the users’ operating system affect their activity?
Here we find that the happiest users (no cancellations) are the iPad users, followed by the iPhone users. Most users who tend to churn are those who use Windows 8.x, Windows XP, and Linux. This may raise a question about the software these customers use: is it as good and easy as the iPad and iPhone software that makes those customers happy? There are many other features we explored, which are detailed in the GitHub page below.
GitHub: drnesr/Sparkify — Udacity DSND capstone project.
Creating or extracting possibly influencing features (feature engineering)
After exploring the dataset and identifying the features to include, or those that need to be extracted from the existing data, we ended up with the following set of possibly influencing features.
Categorical features (features with discrete values): gender, subscription level, and operating system.
Numeric features (continuous values): average file length per session, session duration, session count, and total subscription days, in addition to the frequency of actions like thumbs-up, thumbs-down, friend invitations, and files listened to per session.
Machine learning algorithms only deal with numeric values; hence, we should convert categorical features to numbers, for example, by giving one gender a 1 and the other a 0.
If the feature contains unordered values, we should encode them with the one-hot-encoding method, which creates a separate column for each value, with 1 or 0 indicating whether it applies.
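The project itself does this in Spark, but the idea can be sketched in plain Python; the column names below are illustrative, not the actual dataset schema:

```python
# Minimal one-hot encoding sketch (plain Python; the project uses Spark).
# The 'os' values below are illustrative, not the actual dataset categories.
def one_hot(rows, column):
    """Replace an unordered categorical column with one 0/1 column per value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for v in values:
            new_row[f"{column}_{v}"] = 1 if row[column] == v else 0
        encoded.append(new_row)
    return encoded

users = [{"userID": 1, "os": "iPad"}, {"userID": 2, "os": "Windows"}]
print(one_hot(users, "os"))
# each row now has os_iPad and os_Windows columns instead of os
```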
By finishing this operation, we have a new dataset for analysis, in the form of userID >> feature01, feature02, …
Modeling
We tested 5 machine learning classification models to see which produces the highest accuracy.
The models used are Logistic Regression, Decision Tree Classifier, Gradient-Boosted Trees (GBTs), Random Forest, and Multilayer Perceptron Classifier.
As a final adjustment to the data, we normalized all the input features and combined them into one vector. Then we divided the dataset into 80% for training the model and 20% for testing.
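A minimal sketch of these two steps in plain Python (in Spark this is done with tools like VectorAssembler and randomSplit; the numbers here are illustrative):

```python
import random

def min_max_normalize(column):
    """Scale a numeric column into the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into training and testing sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

session_hours = [12.0, 3.0, 7.5, 21.0]       # made-up values
print(min_max_normalize(session_hours))      # [0.5, 0.0, 0.25, 1.0]
train, test = train_test_split(list(range(100)))
print(len(train), len(test))                 # 80 20
```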
The logistic regression model
As we see in the table above, the accuracy of the logistic regression model is relatively good: 82% and 75% for the training and testing datasets. The other measures, like precision, recall, and F-score, are slightly lower than the accuracy values. This shows good performance of the model in detecting churn customers.
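For reference, all of these measures can be derived from the confusion-matrix counts; a minimal sketch in plain Python (the counts below are made up for illustration, not the model's actual results):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts, not the actual confusion matrix from the project.
acc, prec, rec, f1 = classification_metrics(tp=30, fp=10, fn=10, tn=50)
print(acc, prec, rec, f1)  # 0.8 0.75 0.75 0.75
```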
The attached chart shows the weight of each feature and its directional effect; for example, the submit_upgrade feature is an indicator that the customer is happy and will not unsubscribe soon. Another important feature for happy users is the number of songs they listen to (the NextSong feature). On the other hand, the mean_session_hours and the frequency of visiting the Help, save_settings, and Home pages are indicators of unhappy customers who will churn soon.
The Decision Tree Classifier model
This model, like all the tested classification models, has a feature-importance output, which indicates how much each feature influences the results, regardless of whether its effect is positive or negative. The feature importances show that the most influential feature is days_total_subscription, which indicates an effect of the subscription length on the churn probability. Next come the number of thumbs_down, Roll_advert, and the other features shown.
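Tree models derive these importances from how much each split reduces the impurity of the churn labels; a toy sketch of the underlying idea in plain Python (the labels and split are made up):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 churn labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_gain(labels, left_idx):
    """Impurity reduction achieved by splitting the labels into two groups."""
    left = [labels[i] for i in left_idx]
    right = [labels[i] for i in range(len(labels)) if i not in left_idx]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

# Toy example: a split that perfectly separates churners gets maximum gain,
# so the feature producing it is ranked as most important.
labels = [1, 1, 0, 0]
print(split_gain(labels, left_idx={0, 1}))  # 0.5 (all impurity removed)
```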
This model appears to be very strict, as it neglects the effects of 31 of the 37 features, concentrating on only 6.
Despite that, the accuracy and other performance measures are very high on both the training and testing datasets.
The Gradient-Boosted Trees (GBTs) model
This model has higher accuracy and performance measures on the training dataset than the previous two, but the results on the test dataset are worse, which means that the model overfits the data. The feature importances show that the most important feature is the number of NextSong visits (songs played), which appears to be an indicator of a satisfied customer, as is the Thumbs_Up indicator. The Error page is the runner-up here, which appears to indicate that a user who keeps running into errors will leave soon.
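Overfitting shows up exactly this way: a large gap between training and testing scores. A trivial check in plain Python (the scores and the 10% threshold are illustrative assumptions):

```python
def overfit_gap(train_accuracy, test_accuracy, threshold=0.10):
    """Flag a model whose train/test accuracy gap exceeds a chosen threshold."""
    gap = train_accuracy - test_accuracy
    return gap, gap > threshold

# Illustrative numbers only, not the actual model scores.
gap, overfits = overfit_gap(train_accuracy=0.95, test_accuracy=0.70)
print(round(gap, 2), overfits)  # 0.25 True
```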
The Random Forest model
This model, like the GBT before it, shows obvious overfitting, with very high training accuracy and low testing accuracy.
The random forest model agrees with the Decision Tree Classifier on feature importance, as both show that the most important indicators are days_total_subscription and Thumbs_Down, while it agrees with the GBT in treating all the features as somewhat important.
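Grouping the long tail of low-importance features makes such charts readable; a sketch in plain Python, assuming a 3% threshold (the weights below are made up, and the MINOR label mirrors the chart's category):

```python
def group_minor_importances(importances, threshold=0.03):
    """Collapse features below the threshold into a single MINOR category."""
    grouped = {"MINOR": 0.0}
    for feature, weight in importances.items():
        if weight < threshold:
            grouped["MINOR"] += weight
        else:
            grouped[feature] = weight
    return grouped

# Illustrative weights, not the actual model output.
weights = {"days_total_subscription": 0.40, "Thumbs_Down": 0.30,
           "Roll_advert": 0.02, "save_settings": 0.01}
print(group_minor_importances(weights))
```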
(Note that all the features with less than 3% importance are collected in the MINOR category.)
Conclusion
The machine learning modeling succeeded in predicting which customers' activity will most probably end in unsubscribing.
Despite the good results of all the models, the Decision Tree Classifier model appears to be the best here.
However, the other models need to be re-tuned with different settings to reduce the overfitting.