Customer Churn Prediction with PySpark on Sparkify Data
om tripathi
Feb 21

This is Udacity's capstone project, using Spark to analyze user behavior data from the music app Sparkify.

Overview:
Millions of users stream their favorite songs through our service, either on the free tier, which plays advertisements between songs, or on the premium subscription, which plays songs ad-free for a monthly fee.
Users can upgrade, downgrade or cancel their service at any time.
It is crucial that the users love the service.
Every time a user interacts with the service (e.g. downgrading, playing songs, logging out, liking a song, hearing an ad) it generates data.
All of this data contains the key insights to keeping the users happy and allowing the company to thrive.
Sparkify is a fictional online music streaming service, and we are interested in whether we can predict which users are at risk of canceling their service.
The goal is to identify these users before they leave so they can hypothetically receive discounts and incentives to stay.
The dataset contains two months of Sparkify user behavior logs.
Each log entry contains some basic information about the user as well as information about a single action.
A single user can have many entries.
Some of the users in the data have churned, and they can be identified by their account cancellation events.
We will work on the given data set and engineer relevant features for predicting churn.
Customer churn is when an existing customer, user, player, subscriber or any kind of return client stops doing business or ends the relationship with a company.
So let's get started. For the full project, please visit my GitHub repo: link

Step1: Load and Clean Dataset
As in any data science process, we first import the data set and do some cleaning.
In my experience so far, this is one of the most important steps in the data science process, because model performance depends on it and it gives you an understanding of the data set.
Most data scientists spend 70 to 80% of their time cleaning and analyzing data.
We have used the smaller data set here, which is around 230 MB.
First, we import all the libraries and create a Spark session.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, avg, desc, countDistinct, count, when, concat, lit
from pyspark.sql.types import IntegerType, DateType
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from pyspark.ml.feature import CountVectorizer, IDF, Normalizer, PCA, RegexTokenizer, StandardScaler, StopWordsRemover, StringIndexer, VectorAssembler
from pyspark.sql import Window
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# create a Spark session
spark = SparkSession.builder.getOrCreate()

# load data
df = spark.read.json("….json")

Below is the data schema:

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)

The page column seems to be a very important feature here, since it lets us understand user behavior. Below are some of its values:
Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add to Playlist, Home, Upgrade, Submit Downgrade, Help, Add Friend, Downgrade, Cancel, About, Thumbs Down, Thumbs Up, Error.
As we can see, if a user visits the Cancellation Confirmation or Downgrade page, we can infer that the user is not happy with the service and possibly falls into the churn category.
Now we will do some cleaning: removing rows with an empty user ID, as well as guest and logged-out user data.
# Although there do not appear to be any NA values, drop them to be on the safe side
df = df.dropna(how="any", subset=["userId", "sessionId"])
# remove rows whose userId is an empty string
df = df.filter(df["userId"] != "")
# count the remaining rows; it should be 286500 - 8346 = 278154
df.count()
# remove guest and logged-out records
df = df.filter((df.auth != 'Guest') & (df.auth != 'Logged Out'))
df.show()

Around 8,346 rows with an empty user ID have been removed from the data set.
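As a rough illustration of the cleaning rules above, here is a plain-Python sketch (not Spark) that applies the same filters to a handful of made-up event records. The records and the `clean_events` helper are hypothetical; only the field names `userId`, `sessionId` and `auth` and the filter conditions come from the steps above.

```python
# Toy event records; the values are invented for illustration only.
events = [
    {"userId": "30", "sessionId": 29, "auth": "Logged In"},
    {"userId": "", "sessionId": 8, "auth": "Logged Out"},      # empty userId
    {"userId": "", "sessionId": 8, "auth": "Guest"},           # empty userId
    {"userId": "9", "sessionId": None, "auth": "Logged In"},   # missing sessionId
    {"userId": "9", "sessionId": 12, "auth": "Logged In"},
]


def clean_events(rows):
    """Keep rows with a usable userId and sessionId that are not Guest / Logged Out."""
    kept = []
    for row in rows:
        if row["userId"] is None or row["sessionId"] is None:
            continue  # same idea as dropna on userId and sessionId
        if row["userId"] == "":
            continue  # empty-string user IDs carry no user information
        if row["auth"] in ("Guest", "Logged Out"):
            continue  # guest and logged-out activity is not tied to an account
        kept.append(row)
    return kept


cleaned = clean_events(events)
removed = len(events) - len(cleaned)
print(len(cleaned), "rows kept,", removed, "rows removed")
```

Comparing the row count before and after, as the article does with 286500 and 278154, is a cheap sanity check that the filters removed what was expected.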
Now we are ready to go ahead with step 2.

Step2: Exploratory Data Analysis
Once we have defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned.
You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or the number of songs played.
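First, let's create a new column, churn, set to 1 if the user visited the Cancellation Confirmation page and 0 otherwise.

# collect the IDs of all users who reached the cancellation confirmation page
churned_user_ids = df.filter(df.page == 'Cancellation Confirmation') \
    .select('userId') \
    .distinct() \
    .rdd.flatMap(lambda x: x) \
    .collect()
# flag every row belonging to a churned user
df = df.withColumn('churn', when(col('userId').isin(churned_user_ids), 1).otherwise(0))

Below are some of the analyses I have done. I am including only the graphs here; for the detailed code, you can visit my GitHub repo.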
1:- Average number of songs played: churned users vs unchurned users
2:- Average number of page visits: churned users vs unchurned users
3:- Number of users churned while being: free subscribers vs paid subscribers
4:- Ratio of male to female users among churned and unchurned users
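The first comparison in the list can be sketched without Spark. The snippet below is a plain-Python illustration on invented records: it flags every user who reached the `Cancellation Confirmation` page as churned, then averages the number of song plays per user in each group. The `NextSong` page value used to count plays, the sample records and the helper names are assumptions made for the example, not taken from the analysis above.

```python
from collections import defaultdict

# Invented (userId, page) events for illustration only.
events = [
    ("1", "NextSong"), ("1", "NextSong"), ("1", "Thumbs Up"),
    ("2", "NextSong"), ("2", "Cancellation Confirmation"),
    ("3", "NextSong"), ("3", "NextSong"), ("3", "NextSong"), ("3", "NextSong"),
]

# A user is labelled churned if they ever hit the cancellation confirmation page.
churned_ids = {uid for uid, page in events if page == "Cancellation Confirmation"}

# Count song plays per user; users with no plays still get an entry of 0.
songs_per_user = defaultdict(int)
for uid, page in events:
    songs_per_user[uid] += 1 if page == "NextSong" else 0


def average_songs(churn_flag):
    """Average song plays per user for the churned (1) or retained (0) group."""
    counts = [n for uid, n in songs_per_user.items()
              if (uid in churned_ids) == bool(churn_flag)]
    return sum(counts) / len(counts) if counts else 0.0


print("churned:", average_songs(1))
print("retained:", average_songs(0))
```

The same pattern (label users first, then aggregate per user, then average per group) applies to the other comparisons in the list, with a different per-user quantity in place of song plays.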
Now that we have done some exploratory data analysis, it is time for some feature engineering.

Step3: Feature Engineering
Once we have familiarized ourselves with the data, it is time to build the features that look promising for training our model.
After analyzing all the columns above, I decided to use the following features in my model:
1:- Gender
2:- UserAgent
3:- Status
4:- Page
Once the columns were identified, we have to make sure they are all numeric so that they can be fed into the model we choose.
The Gender, UserAgent and Page columns had to be converted into numeric values using a combination of string indexing and one-hot encoding.
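To make the two steps concrete, here is a plain-Python sketch of what string indexing followed by one-hot encoding does to a categorical column. It illustrates the idea rather than Spark's implementation: indices are assigned by descending frequency (the default ordering of Spark's `StringIndexer`), and the last category is encoded as an all-zero vector, mirroring the `dropLast` default of `OneHotEncoder`. The helper names and sample values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec


genders = ["M", "F", "M", "M", "F"]           # invented sample column
mapping = fit_string_index(genders)           # most frequent value gets index 0
encoded = [one_hot(mapping[g], len(mapping)) for g in genders]
print(mapping)
print(encoded)
```

Dropping the last category keeps the encoded columns from being perfectly collinear, which matters for linear models such as logistic regression.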
# build pipeline
Gender_indexer = StringIndexer(inputCol="gender", outputCol='Gender_Index')
User_indexer = StringIndexer(inputCol="userAgent", outputCol='User_Index')
Page_indexer = StringIndexer(inputCol="page", outputCol='Page_Index')
Gender_encoder = OneHotEncoder(inputCol='Gender_Index', outputCol='Gender_Vec')
User_encoder = OneHotEncoder(inputCol='User_Index', outputCol='User_Vec')
Page_encoder = OneHotEncoder(inputCol='Page_Index', outputCol='Page_Vec')
assembler = VectorAssembler(inputCols=["Gender_Vec", "User_Vec", "Page_Vec", "status"], outputCol="features")
indexer = StringIndexer(inputCol="churn", outputCol="label")

Now it's time to do some machine learning.

Step4: Modeling
Split the full dataset into train, test, and validation sets.
Test out several of the machine learning methods you learned.
Evaluate the accuracy of the various models, tuning parameters as necessary.
Determine your winning model based on test accuracy and report results on the validation set.
Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
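A small worked example shows why accuracy can mislead when churners are rare. The sketch below uses invented labels (10 churners out of 100 users) and two hypothetical sets of predictions; the `binary_scores` helper is written for illustration and computes F1 for the churn class only. Note that Spark's `MulticlassClassificationEvaluator` reports a weighted F1 across classes by default, so its number will generally differ from the single-class F1 shown here.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Invented ground truth: 10 churners followed by 90 retained users.
y_true = [1] * 10 + [0] * 90

# Hypothetical model A: predicts "no churn" for everyone.
always_stay = [0] * 100
acc_a, f1_a = binary_scores(y_true, always_stay)

# Hypothetical model B: finds 6 of the 10 churners and raises 4 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86
acc_b, f1_b = binary_scores(y_true, some_model)

print("always stay: accuracy", acc_a, "F1", f1_a)
print("some model:  accuracy", acc_b, "F1", round(f1_b, 3))
```

The two models are nearly tied on accuracy, yet only the second one identifies any churners, and F1 is the metric that makes that difference visible.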
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)
pipeline = Pipeline(stages=[Gender_indexer, User_indexer, Page_indexer, Gender_encoder, User_encoder, Page_encoder, assembler, indexer, lr])
# Train/test split: as a first step, break the data set into 90% training data and set aside 10%.
# Set the random seed to 42.
rest, validation = df.randomSplit([0.9, 0.1], seed=42)
paramGrid = ParamGridBuilder() \
    .addGrid(…, [… 0.1]) \
    .build()
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator(), numFolds=3)
cvModel_q1 = crossval.fit(rest)
cvModel_q1.avgMetrics
Output :- [0.8539187024054737]

Here I have shown only one example, logistic regression, to illustrate how to build an ML pipeline in Spark.
In my GitHub repo, you can find one more model pipeline.
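To clarify what the cross-validation step reports, here is a plain-Python sketch of the idea behind 3-fold cross-validation over a parameter grid: for each candidate value, train on two folds, score on the third, and average the three scores. Everything in it is hypothetical: the `evaluate` function is a stand-in that returns a made-up score instead of fitting a model, and the candidate values are placeholders. In Spark, the fitted cross-validator model exposes the corresponding per-candidate averages as `avgMetrics`.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Shuffle row indices once and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def evaluate(train_idx, test_idx, reg_param):
    """Stand-in for fit + score; returns a made-up number, not a real metric."""
    train_share = len(train_idx) / (len(train_idx) + len(test_idx))
    return train_share / (1.0 + reg_param)


def cross_validate(n_rows, k, candidates):
    """Return one averaged score per candidate, in the order the candidates were given."""
    folds = k_fold_indices(n_rows, k)
    avg_metrics = []
    for reg_param in candidates:
        scores = []
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            scores.append(evaluate(train_idx, test_idx, reg_param))
        avg_metrics.append(sum(scores) / k)
    return avg_metrics


avg_metrics = cross_validate(n_rows=90, k=3, candidates=[0.0, 0.1])
best = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
print(avg_metrics, "best candidate index:", best)
```

The list has one entry per grid candidate, so a grid with a single value yields a single averaged score. The held-out validation split is not touched during this search and is reserved for the final report.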

Improvement:
As with anything in life, there is always room for improvement.
One thing that could be explored further is to predict not only users who would cancel their subscription altogether, but also paid users who might downgrade to a free membership.
This is also a concern to Sparkify, as it would lower the subscription fees they receive.

Conclusion
We are done with the project. Below are the steps we followed:
1) Loaded the data
2) Exploratory data analysis
3) Feature engineering and checking for multicollinearity
4) Model building and evaluation
5) Identifying important features
6) Remedial actions to reduce churn
7) Potential improvements to the model
Overall, I would say that this was a very exciting project to work on.
This was a real-world scenario for many online businesses that rely on subscriptions (both paid and unpaid).
Utilizing the Spark technologies allowed me to get a better feel of Big Data Technologies and all of the potential that it has out in the real world.