Yet Another Full Stack Data Science Project
A CRISP-DM Implementation
Ram Saran Vuppuluri
May 2
Photo by rawpixel.com from Pexels

Introduction
I came across the term “Full Stack Data Science” for the first time a couple of years back when I was searching for Data Science meetups in Washington D.C.
Coming from a software development background, I am quite familiar with the term Full Stack Developer, but Full Stack Data Science sounded mystical.
With more and more companies incorporating Data Science & Machine Learning in their traditional software applications, the term Full Stack Data Science makes more sense now than at any point in history.
Software development methodologies have been refined over the years to deliver high-quality software applications with short turnaround times.
Unfortunately, traditional software development methodologies do not work well in the context of Data Science applications.
In this blog post, I focus on using the Cross-Industry Standard Process for Data Mining (CRISP-DM) to develop a viable full stack Data Science product.
I firmly believe in the proverb “the proof of the pudding is in the eating,” so I have implemented the Starbucks challenge with the CRISP-DM methodology as a sister project for this post, and I refer to it at multiple places below.

Starbucks Dataset Overview
Starbucks generated the data set using a simulator program that mimics how people make purchasing decisions and how promotional offers influence those decisions.
Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable characteristics.
People perform various events, including receiving offers, opening offers, and making purchases.
As a simplification, there are no specific products to track.
Only the amounts of each transaction or offer are recorded.
There are three types of offers that can be sent:
- buy-one-get-one (BOGO)
- discount
- informational

In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount.
On receiving a discount, a user gains a reward equal to a fraction of the amount spent.
In an informational offer, there is no reward, but neither is there a required amount that the user is expected to spend.
Offers can be delivered via multiple channels.
The primary task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present these offers.

What is the Cross-Industry Standard Process for Data Mining (CRISP-DM)?
The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is an open standard process that describes conventional approaches used by data mining experts.
CRISP-DM is a cyclic process that breaks down into six phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

[Figure: Cross-Industry Standard Process for Data Mining (CRISP-DM), by Kenneth Jensen]

Business Understanding
This phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.
For the current scenario, we are going to:
- Perform exploratory data analysis on multivariate frequency distributions.
- Build a machine learning model that predicts whether or not someone will respond to an offer.
- Build a machine learning model that predicts the best offer for an individual.
- Build a machine learning model that predicts purchasing habits.

Data Understanding
There are two ways in which the Data Understanding phase is practiced:
- Start with an existing data collection and proceed with activities to get familiar with the data, to discover first insights into the data, or to detect interesting subsets to form hypotheses for concealed information.
- Recognize specific interesting questions and then collect data related to those questions.

The transition from Business Understanding to Data Understanding is not linear; it is cyclic.
In this project, we use only the data provided by Starbucks, accepting the challenge of working within its inherent limitations.
We are therefore practicing the first method.
Starbucks has distributed the data in three JSON files:
- portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
- profile.json: demographic data for each customer
- transcript.json: records for transactions, offers received, offers viewed, and offers completed

Data Preparation
The data preparation phase covers all activities to construct the final dataset from the initial raw data.
Data preparation is often said to account for roughly 80% of the effort in the process.
Data wrangling and Data Analysis are the core activities in the Data Preparation phase of the CRISP-DM model and are the first logical programming steps.
Data Wrangling is a cyclic process, and often we need to revisit the steps again and again.
Data Wrangling is language and framework independent, and there is no one right way.
For the sister project, I am using Python as the programming language of choice and Pandas as the data manipulation framework.
As a rule of thumb, I approach data wrangling in two steps:
- Assess Data: In this step we perform syntactic and semantic checks on the data and identify any issues along with potential fixes.
- Clean Data: In this step we implement the data fixes identified in the assessment step. We also run small unit tests to make sure the data repairs are working as expected.

I performed the Data Wrangling on all three of the data sources provided by Starbucks.

Data Wrangling: portfolio.json
From visual and programmatic assessment, the Portfolio data set has only ten rows with no missing data.
However, the data is not in a machine-learning-friendly structure.
We are going to apply one-hot encoding to the “channels” and “offer_type” columns.
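
To make the encoding step concrete, here is a minimal sketch in Pandas. The three sample rows, the "id" column, and the assumption that "channels" stores a list of channel names per offer are illustrative and not taken from the sister project.

```python
import pandas as pd

# Hypothetical three-row sample; the real portfolio data set has ten offers.
# Assumption: "channels" holds a list of channel names for each offer.
portfolio = pd.DataFrame(
    {
        "id": ["offer_a", "offer_b", "offer_c"],
        "offer_type": ["bogo", "discount", "informational"],
        "channels": [
            ["email", "mobile"],
            ["web", "email"],
            ["email", "mobile", "social"],
        ],
    }
)

# "channels" is a list column: explode it to one row per (offer, channel),
# one-hot encode, then collapse back to one row per offer.
channel_dummies = (
    pd.get_dummies(portfolio["channels"].explode(), prefix="channel")
    .groupby(level=0)
    .max()
    .astype(int)
)

# "offer_type" is a plain categorical column, so get_dummies applies directly.
type_dummies = pd.get_dummies(portfolio["offer_type"], prefix="offer_type").astype(int)

# Replace the two original columns with their encoded versions.
portfolio_ml = pd.concat(
    [portfolio.drop(columns=["channels", "offer_type"]), channel_dummies, type_dummies],
    axis=1,
)
print(portfolio_ml)
```

Each offer ends up as one row of 0/1 indicator columns, which is the shape most scikit-learn estimators expect.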
[Code listing: Data Wrangling, portfolio.json]

Data Wrangling: profile.json
From the visual assessment of the Profile data set:
- The “became_member_on” column is not in DateTime format.
- If the age information is missing, the age is populated with the default value 118.

[Figure: Age frequency distribution, before data wrangling]

From the programmatic assessment of the Profile data set:
- The gender and income columns have missing data.
- The rows with missing gender and income information are the same rows that have the age value 118.

The following fixes are implemented on the Profile data set in the clean phase:
- Drop rows with missing values, which should implicitly drop rows with age 118.
- Convert became_member_on to the Pandas DateTime datatype.
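
The two clean-phase fixes, together with the kind of small checks mentioned earlier, could look roughly like the sketch below. The sample rows are made up, and storing became_member_on as a YYYYMMDD integer is an assumption about the raw format.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows. Assumption: became_member_on arrives as a
# YYYYMMDD integer, and rows without gender/income carry the default age 118.
profile = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4"],
        "gender": ["F", None, "M", None],
        "age": [55, 118, 42, 118],
        "income": [72000.0, np.nan, 51000.0, np.nan],
        "became_member_on": [20170715, 20180212, 20160409, 20171120],
    }
)

# Fix 1: drop rows with missing values.
profile_clean = profile.dropna().copy()

# Fix 2: convert became_member_on to a Pandas datetime.
profile_clean["became_member_on"] = pd.to_datetime(
    profile_clean["became_member_on"].astype(str), format="%Y%m%d"
)

# Small checks that the repairs behave as expected.
assert not profile_clean.isna().any().any()
assert (profile_clean["age"] != 118).all()
assert pd.api.types.is_datetime64_any_dtype(profile_clean["became_member_on"])
```

Keeping the checks next to the fixes makes it cheap to rerun them every time the wrangling steps are revisited.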
The data is still not in a machine-learning-friendly structure.
I will create a new ML-friendly Pandas data frame with the following changes:
- Apply one-hot encoding to the gender column.
- Split the “became_member_on” column into year, month, and day columns, and drop the original became_member_on column.

[Code listing: Data Wrangling, profile.json]

Once the data wrangling step is completed, there are no rows with missing values (the rows with age 118 are implicitly dropped).

[Figure: Age frequency distribution, after data wrangling]

Data Wrangling: transcript.json
From a visual and programmatic assessment, there are no data issues in the Transcript data set.
However, the data does not explicitly record whether a promotion influenced the user.
A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the offer.
We will apply multiple data transformations to extract this information.
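
One possible way to derive that flag is sketched below. It is a simplification under stated assumptions: the column names "person", "event", and "time" are illustrative, and a transaction counts as influenced when it happens strictly after the person's first "offer viewed" event, ignoring which offer was viewed and how long it stays valid.

```python
import pandas as pd

# Hypothetical, simplified transcript; column names are assumptions.
transcript = pd.DataFrame(
    {
        "person": ["p1", "p1", "p1", "p2", "p2", "p3", "p3"],
        "event": [
            "offer received", "offer viewed", "transaction",
            "offer received", "transaction",
            "transaction", "offer viewed",
        ],
        "time": [0, 6, 12, 0, 6, 0, 6],
    }
)

# Earliest time at which each person viewed any offer.
first_view = (
    transcript.loc[transcript["event"] == "offer viewed"]
    .groupby("person")["time"]
    .min()
)

# Keep only the purchases and attach each buyer's first view time
# (NaN when the person never viewed an offer).
transactions = transcript.loc[transcript["event"] == "transaction"].copy()
transactions["first_view_time"] = transactions["person"].map(first_view)

# A purchase is flagged only if it happened after the offer was viewed.
# Comparisons against NaN are False, so non-viewers are never flagged.
transactions["influenced"] = transactions["time"] > transactions["first_view_time"]
print(transactions[["person", "time", "influenced"]])
```

In this toy data only p1 is flagged: p2 never viewed an offer, and p3 purchased before viewing one.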
[Code listing: Data Wrangling, transcript.json]

Now that all three data frames are cleaned, let us consolidate them into one data frame.
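
A consolidation along these lines can be sketched with two merges. The key columns ("person", "offer_id", and an "id" column in both profile and portfolio) and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical cleaned frames; column names and values are assumptions.
transcript = pd.DataFrame(
    {
        "person": ["p1", "p1", "p2", "p9"],
        "event": ["offer received", "transaction", "offer received", "transaction"],
        "offer_id": ["o1", None, "o2", None],
        "time": [0, 6, 0, 12],
    }
)
profile = pd.DataFrame(
    {"id": ["p1", "p2"], "age": [55, 42], "income": [72000.0, 51000.0]}
)
portfolio = pd.DataFrame(
    {"id": ["o1", "o2"], "offer_type": ["bogo", "discount"], "duration": [7, 10]}
)

combined = (
    transcript
    # Inner join: events from people dropped during profile cleaning disappear.
    .merge(profile.rename(columns={"id": "person"}), on="person", how="inner")
    # Left join: plain transactions have no offer, so offer columns stay empty.
    .merge(portfolio.rename(columns={"id": "offer_id"}), on="offer_id", how="left")
)
print(combined)
```

Renaming the key columns before merging avoids ending up with two ambiguous "id" columns in the consolidated frame.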
[Code listing: Data Wrangling, consolidation]

Exploratory Data Analysis: multivariate frequency distributions
From the transcript data, we identify five kinds of events:
- Offer received
- Offer viewed
- Offer completed
- Transaction (purchases)
- Influenced (only if the purchase was made after the offer was viewed)

The characteristics inferred from the clean data are listed below the corresponding visualizations.

[Figure: Event Distribution by Gender]
- Profiles registered as Male made the largest number of transactions.
- Profiles registered as Female are much more likely to be influenced by the promotions.

[Figure: Event Distribution by Income]
- Individuals in the data set are within the income range of 30,000 to 120,000.
- Most of the individuals in the data set make less than 80,000 per annum.
- Most of the transactions were made by individuals earning less than 80,000 per annum.

[Figure: Event Distribution by Offer Type]
- BOGO offers have a higher rate of influence.
- Informational offers have negligible influence.
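
Frequency distributions like the ones above can be tabulated with a cross-tabulation before plotting. The sketch below uses a tiny made-up sample, and the column names "gender" and "event" are assumptions.

```python
import pandas as pd

# Hypothetical consolidated sample: one row per event.
events = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "F", "M", "F", "M"],
        "event": [
            "transaction", "transaction", "transaction", "influenced",
            "offer viewed", "influenced", "influenced",
        ],
    }
)

# Raw counts of each event type per gender.
counts = pd.crosstab(events["gender"], events["event"])

# Row-normalized shares, so groups of different sizes can be compared.
shares = pd.crosstab(events["gender"], events["event"], normalize="index")

print(counts)
print(shares.round(2))
```

Looking at shares rather than raw counts matters when one group is larger than the other, because a bigger group produces more events of every kind.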

Even though data exploration did not yield much new knowledge, it produced one critical insight: the target classes are imbalanced.
When working on classification models in particular, this information is vital for deciding which evaluation metrics to use.

Modeling & Evaluation
Modeling is not always a necessary step; whether it is needed depends on the scope of the project.
For this project, I am going to build machine learning models that:
- Predict whether or not someone will respond to an offer.
- Predict the best offer for an individual.
- Predict purchasing habits.

All three are ensemble models.
For the classification models, because of the imbalance in the target classes, we are going to use precision, recall, and F1 score as the evaluation metrics.
For the regression model, we are going to use mean squared error (MSE) and R2 as the evaluation metrics.
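
To see why accuracy alone is misleading on imbalanced classes, here is a small dependency-free sketch of these metrics. The toy labels and the helper names are illustrative only.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mse_and_r2(y_true, y_pred):
    """Mean squared error and coefficient of determination."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return ss_res / n, 1 - ss_res / ss_tot


# Imbalanced toy labels: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a "model" that never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)
print(accuracy, precision, recall, f1)

mse, r2 = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(mse, r2)
```

The useless classifier reaches 90% accuracy while its precision, recall, and F1 are all zero, which is exactly the failure that accuracy hides.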

Predicting whether or not someone will be influenced by an offer
[Code listing: Model for predicting whether or not someone will be influenced by an offer]

I employed a grid search to find the model that yields high F1 score values.
AdaBoostClassifier with a learning rate of ~1.87 and 10 estimators produced a consolidated F1 score of 0.87 on both the training and testing datasets, achieving a good balance between bias (underfitting) and variance (overfitting).
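
A grid search of this kind might be set up as in the sketch below. It runs on synthetic, imbalanced data from scikit-learn rather than the Starbucks data, and the parameter grid, the "f1" scoring choice, and the split sizes are assumptions, not the settings used in the sister project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with an 80/20 class imbalance.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid; the values are assumptions.
param_grid = {"n_estimators": [10, 50], "learning_rate": [0.5, 1.0, 1.87]}

# Cross-validation happens inside the training split only,
# so the test split is never seen during the search.
search = GridSearchCV(
    AdaBoostClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X_train, y_train)

# Similar train and test scores suggest neither underfitting nor overfitting.
train_f1 = f1_score(y_train, search.predict(X_train))
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(train_f1, 2), round(test_f1, 2))
```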
[Figure: Precision, recall, and F1 score for training and testing data sets]

Not all features in the data set are utilized to make predictions.
We can get the weight of each feature from the model’s perspective.
The weights are in the range of 0 to 1, and the weights of all features add up to 1.
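
In scikit-learn, tree-based ensembles expose these weights through the fitted model's feature_importances_ attribute. The sketch below checks the two properties just described on synthetic data; the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data; feature names are placeholders.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# Pair each feature with its importance and list them from most to least important.
importances = dict(zip(feature_names, model.feature_importances_))
for name, weight in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {weight:.3f}")

total = sum(importances.values())
print("total:", round(total, 6))
```

A feature with weight 0 was never used for a split, which is how a model can end up relying on only a handful of columns.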
[Figure: Feature importance for the model to predict influence]

The model that predicts whether an individual is influenced by a promotion is highly dependent on the amount spent.
In other words, it relies on information that is available only after the action (the purchase) has taken place.
Therefore we cannot use this model as it is.
Ideally, we would collect more data to address this problem.
As I am working only with the data provided by Starbucks, I cannot build the desired model.

Predicting the best offer for an individual
[Code listing: Model for predicting the best offer for an individual]

I employed a grid search to find the model that yields high F1 score values.
Unfortunately, the maximum consolidated F1 score that could be achieved was “0.
Gathering more data should help to increase the F1 score, but as I am working only within the limits of data provided, I cannot deploy this model.
[Figure: Precision, recall, and F1 score for training and testing data sets]

Predicting the purchasing habits
[Code listing: Model for predicting the purchasing habits]

I employed a grid search to find the model that yields decent R2 and MSE values.
GradientBoostingRegressor with a learning rate of 0.1 and 100 estimators produced an R2 score of ~0.32 and an MSE of ~2300 on both the training and testing datasets, achieving a good balance between bias (underfitting) and variance (overfitting).
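
Comparing the same metrics on the training and testing splits is how that balance can be checked. The sketch below uses the learning rate and estimator count quoted above, but on synthetic regression data, so its scores will not match the figures reported for the Starbucks data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with added noise.
X, y = make_regression(n_samples=500, n_features=6, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    learning_rate=0.1, n_estimators=100, random_state=0
)
model.fit(X_train, y_train)

# Score both splits with the same metrics; a large gap between
# them would point to overfitting.
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
scores = {}
for name, (X_part, y_part) in splits.items():
    predictions = model.predict(X_part)
    scores[name] = {
        "r2": r2_score(y_part, predictions),
        "mse": mean_squared_error(y_part, predictions),
    }
    print(name, scores[name])
```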
[Figure: R2 and MSE for training and testing data sets]
[Figure: Feature importance for the model to predict amount]

Unlike the model used to predict whether an individual will be influenced, the model to predict purchasing power depends on multiple features, and none of them are after-the-fact attributes.
I will use this model in the Web application to make the predictions.

Deployment
Generally, this will mean deploying a code representation of the model into an application to score or categorize new unseen data as it arises.
Importantly, the code representation must also include all the data prep steps leading up to modeling so that the model will treat new raw data in the same manner as during model development.
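
One common way to keep the data prep and the model together is a scikit-learn Pipeline that is serialized as a single object. The sketch below is an illustration of that idea under assumptions: the scaler, the synthetic data, and the use of pickle are placeholders, not necessarily what the web application does.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The prep step and the model live in one object, so new raw data
# goes through exactly the same transformations as the training data.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", GradientBoostingRegressor(random_state=0)),
    ]
)
pipeline.fit(X, y)

# Serialize once at training time, then load inside the application.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

new_rows = rng.normal(size=(5, 3))
same = bool(np.allclose(pipeline.predict(new_rows), restored.predict(new_rows)))
print(same)
```

Because the restored object contains the fitted scaler as well as the regressor, the application only needs to call predict on raw feature rows.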
I have created a web application that uses the data analysis and the pre-trained model to predict the purchase amount for a profile and offer code combination.
A complete description of the steps needed to launch the web application is in the README.
Below are the screenshots from the web application:

[Screenshot: Overview of the data set]
Path: http://0.0:3001/

[Screenshot: Predict Amount]
Path: http://0.0:3001/predict_amt

Conclusion
Contrary to conventional software development, it is not always feasible to fulfill the business requirements.
Sometimes more data will help, but in our current case, we are set to work within the data provided.
In this project, evaluation on the test set was part of the modeling phase, which is not common in the real world.
Because there are no dedicated testers for this project, modeling and evaluation were done together.
At no point was the testing data exposed to the models during training.
Source code for this analysis is hosted on GIT.