Trap DS Projects: Beware of “Easy” Segmentation Projects
99% chance you’re not ready
Randy Au
Mar 27
tl;dr: Unbounded, poorly scoped projects that rely on a ton of other things are about as fun as you’d expect.
Somewhere in the world today, someone is saying some version of this sentence to a data scientist: “We’ve got lots of customer behavior data. Let’s cluster customers into some useful segments and see what we can learn about our business. How hard can it be?”
Spoiler alert: hard. Very hard. Deceptively hard.
Most systems aren’t ready.
And you’re on the boat.
Over my decade+ at startups, I’ve embarked on no fewer than three of these projects, either on my own or at the direction of various people.
None of them ended well, for essentially one reason: we were not ready for it.
Segmentation sounds easy
https://dilbert.com/strip/2000-11-13
Of all the machine learning algorithms floating around out there, the clustering ones are among the easiest to understand and implement.
Stuff like k-means sounds easy to apply. In broad strokes you just…

1. Throw activity data into the algo
2. Cluster
3. Eyeball the output against reality
4. Iterate?
5. Profit!!!

Except, no.
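To be fair, the happy path really is only a few lines. Here’s a toy sketch of steps 1–3, with invented, random “activity” features and a bare-bones hand-rolled k-means (numpy only; every name and number here is made up for illustration):

```python
import numpy as np

# Step 1: "throw activity data into the algo" -- here just random numbers
# standing in for hypothetical per-user features (names are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 200 users x 3 made-up activity features

# Step 2: cluster, with a minimal k-means loop.
def kmeans(X, k, iters=50, seed=0):
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, then reassign.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(X, k=5)

# Step 3: "eyeball the output" -- all you actually get back is one label
# per user and 5 centroids; the interpretation is entirely on you.
print(labels.shape, centroids.shape)
```

Steps 4 and 5 are where reality intrudes, as the rest of this post explains.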
“Throw data into the algo”? Really?
Warning bells should go off whenever the words “throw X into Y” get tossed around.
Data is almost never in a state that can be “thrown” anywhere.
This stage alone can kill your project once you start going down the route of figuring out what data you’re metaphorically throwing around.
First, there’s the basic question: do you even have data?
Just because you’ve been collecting things into a database doesn’t make it useful for machine learning.
If you’ve been lazy about logging critical events consistently, or have massively biased samples, or just throw away the data for cost reasons, you’re going to have to revisit all those decisions.
— But I’ve got data; someone even mentioned a Data Warehouse thing before?
Most likely you have transactional data in a store optimized for OLTP workloads; that’s the natural state for many production systems.
But analytics typically ask questions that look at the data sideways.
This is why “data warehousing” became a big topic when “data mining” was the hot stuff in industry about 15–20 years ago.
Remember when “data mining” was a thing?
Now, if you happen to have a nice, clean, maintained data warehouse somewhere, with curated metrics and nicely laid out data cubes, you’re actually in a fairly good starting position for a clustering project.
I haven’t seen many startups under 100 employees that have one of these things.
They’re complex systems that require maintenance, especially as other systems evolve and new data is created.
Most startups don’t have the cash for that level of handholding just for analytics.
— I’ve actually got data, so I can use it, right?
Let’s assume you’ve got some data in a good state to be used: organized and relatively bug-free.
We’re ready, right?
Not so fast! Next you need to do some form of feature generation/selection.
While the algorithm itself doesn’t care if you throw random numbers in, you as the analyst and end user do care.
Unless you love spurious correlations, you must have some idea (a hypothesis, if you will) as to what features matter and what doesn’t.
Do you have the necessary domain knowledge to make that educated guess? Does anyone?
So you’re going to need to know at least some basic drivers of your business, or have an idea of what is correlated with the success you seek.
Which means you need to have done basic research about who your customers are, what they normally do, even if just at a high level.
Let’s solidify this with some concrete examples
Let’s say you think the sequence of pages a user views is important for distinguishing their future behavior.
How long of a sequence matters? Because storing/processing infinitely long sequences is $$$.
What about behavior over multiple sessions? Do you even have sessions? Is there enough overlap in behavior with other users that you have useful entropy? Where do you cut off the long tail?
What about whether they used a coupon, registered via a print ad link, or bought 5 widgets on their first day? Maybe it’s their country of origin, or whether they have a credit card on file, or their profile text? Are all those things instrumented correctly, and tied back to the user appropriately in time and space?
— Wait, I’m just going to use EVERYTHING! Why can’t I just ML now?
Because of two big issues. The first is the various forms of information leakage, where existing data gives away your output.
Imagine you found out users who had a personal sales call are very likely to use a free trial… because the sales team deletes old unsuccessful leads from the CRM in a misguided effort to “save space”.
Or you have a “currently paying customer” flag in your system (super common), and if you blindly include it, it’s going to be a GREAT predictor of revenue relative to everyone who doesn’t have that flag.
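A tiny, fully invented illustration of that second example: the “currently paying” flag is just the outcome restated as a feature, so anything that reads it straight off looks perfect while telling you nothing new:

```python
# All data here is made up. "currently_paying" is the revenue outcome
# in disguise, i.e. textbook label leakage.
users = [
    {"currently_paying": 1, "revenue": 49.0},
    {"currently_paying": 1, "revenue": 99.0},
    {"currently_paying": 0, "revenue": 0.0},
    {"currently_paying": 0, "revenue": 0.0},
]

# "Predict" revenue > 0 directly from the leaky flag.
correct = sum((u["currently_paying"] == 1) == (u["revenue"] > 0) for u in users)
accuracy = correct / len(users)
print(accuracy)  # perfect accuracy, zero insight
```

Any feature that is effectively derived from the thing you’re trying to discover will dominate your clusters the same way.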
The other big issue is that all the data you want to use is very likely not mutually compatible and consistent.
Is there a consistent user key spanning all the data? There are fields that indicate current state, other fields that represent historic states in time, and other states that are implicit (such as “users don’t get a row in this settings field until they leave the default state, because the default state is assumed”).
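The implicit-state case in particular bites people during joins. A toy sketch (table and field names invented): users who never changed a setting have no row at all, so a naive join silently drops them instead of recording the default:

```python
# Invented example: only non-default settings get stored as rows.
all_users = ["u1", "u2", "u3"]
settings_rows = {"u2": "dark"}  # u1 and u3 never left the default

DEFAULT_THEME = "light"  # the implicit state the table never records

# A naive join over settings_rows would only ever see u2; instead,
# fill in the assumed default for everyone missing a row.
theme = {u: settings_rows.get(u, DEFAULT_THEME) for u in all_users}
print(theme)
```

Every implicit default like this has to be discovered and encoded by hand before the data tells a consistent story.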
Just because the data exists doesn’t mean it’ll work together to tell a consistent story about the world.
— Fine, so I just look over the fields and clean them up. How hard can that be?
Sure. So you go to write the queries to gather the data, and immediately you start uncovering bugs, and the numbers don’t quite add up, so you have to figure out what’s going on and fix it, right? Then you get sucked into efforts to fix various systems, and maybe implement new systems to replace bad ones.
Suddenly that 2-week “fun” project is on its third month.
Since many of those features won’t be in a format that’s usable by a clustering algorithm, you’re going to want to stage them somewhere.
Maybe a temporary table there, an aggregation table here.
Then you’ll want to track these over time, and it’d be nice if it were all in one system.
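The staging step itself is usually mundane roll-up work. A toy sketch of turning raw event rows into one feature row per user (event names and fields invented):

```python
# Invented raw event log, one dict per event.
events = [
    {"user": "u1", "event": "page_view"},
    {"user": "u1", "event": "purchase"},
    {"user": "u2", "event": "page_view"},
    {"user": "u1", "event": "page_view"},
]

# Roll up into a per-user "aggregation table" suitable for clustering.
agg = {}
for e in events:
    row = agg.setdefault(e["user"], {"page_views": 0, "purchases": 0})
    if e["event"] == "page_view":
        row["page_views"] += 1
    elif e["event"] == "purchase":
        row["purchases"] += 1
print(agg)
```

Now multiply this by every feature, keep it fresh over time, and version it somewhere shared.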
Wait, this is starting to sound like a data warehouse (another project where dreams go to die, but I digress)…
Cluster
Seriously, the easiest part of the process
If you manage to reach this step, this part usually isn’t too bad. Usually.
It’s often a pain to figure out the exact format of data your particular clustering package takes, because there are 15 million ways to create and store a “distance matrix” and everyone thinks their format is self-evident and doesn’t really need documentation… But aside from that, just press the buttons!
“Eyeball clusters against reality”
— Because that’s a well-defined problem, right?
Let’s say you run your k-means attempt with 5 clusters.
We’re not sure if 5 is the right number of clusters; it’s just a number we pulled out of the air, and we came into this project blind. But we have humans, and we can check to see how right it “feels”, right?
https://xkcd.com/1838/
How will that work in practice? We dump out examples of rows and try to derive some rhyme or reason from them.
Our hope is that we look at the clusters and it sorta kinda “makes sense” in an ill-defined way.
Essentially you’d have to look at what the machine has classified, and see if your over-creative human brain can come up with a believable explanation to name each cluster.
But there are no real guarantees that you’re finding something useful to you.
It’s not clear whether that lump of data points is a useful group or a dumb artifact of the features we selected.
Even if the clusters represent a “real phenomenon”, we might not be able to make enough sense of it to use it broadly.
In any such ambiguous instance, you’ll be wondering whether you should keep going with what you found, or scrap it all and try again.
Oh yeah, did I forget to mention that, depending on the clustering method you use and the properties of your data, there might be issues with how stable the clusters are across multiple runs and over time?
Iterate
— Because this was so fun the first time, we want to do it a few more times until we feel we’ve got it right
Paaaar-tay.
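That stability worry is at least cheap to check before iterating further: re-run the clustering with several random seeds and compare the resulting partitions in a way that ignores the arbitrary cluster numbering. A toy sketch (numpy only, plain k-means, invented data):

```python
import numpy as np

# Invented data: 100 "users" with 2 made-up features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

def kmeans_labels(X, k, seed, iters=50):
    """Minimal k-means, returning only the per-point labels."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def signature(labels):
    """Partition signature that ignores arbitrary cluster numbering."""
    groups = {}
    for i, c in enumerate(labels.tolist()):
        groups.setdefault(c, []).append(i)
    return frozenset(frozenset(g) for g in groups.values())

# If this set has more than one element, your clusters aren't stable.
sigs = {signature(kmeans_labels(X, k=5, seed=s)) for s in range(5)}
print(f"{len(sigs)} distinct partition(s) across 5 seeds")
```

If the partitions disagree across seeds, no amount of eyeballing one run will save you.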
(By the way, how many months have passed by this point?)
Profit?
If you’re lucky.
The end goal of the whole clustering exercise is to either 1) learn new interesting things about your users, or 2) apply the newly found groups to other things, which will be useful further on.
You’ll want to try doing those things now that you (presumably) have a clustered set of users.
Hopefully they work and it unlocks all sorts of insights and magical unicorns.
So is there no hope?!?
There’s some!
Do some qualitative homework first
Notice that when you validate these arbitrary unsupervised clusters, you’re essentially using your own mental models of the data as the reference point. The clusters don’t feel “correct” if you can’t come to a coherent understanding of what makes that a cluster.
Since you’re in need of this reference point anyway, do it first! Then you’ll have some a priori hypotheses for the number of clusters you’re looking for, as well as a better sense of which potential features matter to the model.
All of this will help your model, and you might even find that this homework alone is good enough: there may be no compelling need to do the data mining at all.
Limit your scope as much as possible!
The key is to recognize that the more data you want to use, the more pain is involved, until everything becomes impossible.
The way to safely survive is to very tightly control the scope.
It means you should have:

- A very firm, short list of factors that users should be clustered against (hypotheses are encouraged!)
- Your data collection systems debugged ahead of time
- Your data access/movement/pipelining infrastructure in a manageable state (can you get to everything and copy it to where you need to process it?)
- A strong willingness to push back if all these fundamental pieces aren’t available

Know when to stop and push back
Open-ended problems will suck up as much time as you’re willing to throw at them, so you should reevaluate the whole project if you find that you’re doing more data engineering work than clustering.
The data engineering part might actually be necessary work that moves the company forward, but it’s wrong to hide all that work under the umbrella of “doing a clustering project” assigned to a lone data person.
That amount of work should be a serious effort with different resources and discussions with management.
It’s important to push back before you get too invested into things.