maybe even their academic background?

As a data enthusiast myself, I found it only natural to turn to actual data to answer these questions. And so the journey begins…

The data

First things first: data.
At first, I considered using LinkedIn profiles of actual Data PMs as a dataset.
By comparing profiles of Data PMs to non-Data PMs, one would expect to see the main differences stand out.
However, even when leaving the scraping and privacy aspects aside, the self-written role descriptions on LinkedIn are too personalized and not easily comparable.
And so I turned to job descriptions.
These are written with a clear focus on what the role requires, are better structured, and also easier to obtain.
There was still no ready-made dataset to use, but this seemed like an easier problem to solve.
com provides a pretty good aggregation of links to job postings, which I could then follow to scrape their content.
I could also easily select companies that offer both a Data PM and a non-Data PM position, to hopefully get a better signal-to-noise ratio, though admittedly this may introduce a slight bias toward larger companies (those with two or more PM openings).
The resulting dataset includes 100 recent job postings from 50 different companies, where each company contributes one Data PM position and one non-Data PM position.
For this purpose, I’ve defined “Data PM” as having the word “Data” in the title; this seemed like a reasonable starting point, but I’d welcome any feedback as to what such a definition may have missed in the bigger picture.

Data processing

An interesting exploration step to take as we get started is to look at the role titles we actually got in this sample.
Taking the 50 Data PM postings and removing all the standard PM title keywords (including “Senior”, “Associate”, “Product Manager”, “Product Management” etc.) yields a distribution of title fragments that indicates what these data products seem to be about.
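
A minimal sketch of how such title fragments could be extracted and counted. This is an illustration rather than the code used for this post: the keyword list beyond the four named above, the sample titles, and the function name are all hypothetical.

```python
import re
from collections import Counter

# Generic PM title keywords to strip. "principal" and "lead" are
# hypothetical additions; the post names only the first four.
GENERIC_KEYWORDS = [
    "product management", "product manager",
    "senior", "associate", "principal", "lead",
]

def title_fragment(title):
    """Return what is left of a job title once generic PM keywords are removed."""
    text = title.lower()
    for keyword in GENERIC_KEYWORDS:
        text = re.sub(r"\b" + re.escape(keyword) + r"\b", " ", text)
    # Replace leftover punctuation with spaces, then collapse whitespace.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())

# Made-up titles, for illustration only.
sample_titles = [
    "Senior Product Manager, Data Platform",
    "Product Manager - Data Science",
    "Associate Product Manager, Data Platform",
]
fragment_counts = Counter(title_fragment(t) for t in sample_titles)
```

Counting the fragments with a `Counter` gives the distribution directly, and rare fragments can then be grouped into an “Other” bucket.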
As the graph below shows, around two thirds of the postings focus on a short list of general names: Data Science, Data Platform, Data Products, or simply Data.
The “Other” third is made up of a long list of more specific terms all across the data pipeline — Data Integration, Ingestion, Modeling, Foundation, Analysis, Strategy and more.
It therefore seems reasonable to assume that in most cases Data PMs own the entire data domain in their organisation, end to end.
Next, we’ll process the job posting text itself.
The general approach I took was to compare the two groups, the 50 Data postings and the 50 non-Data postings, and look for features that best separate these two classes.
We’ll start by tokenizing, removing stopwords, and stemming the terms (all using nltk), then parsing the resulting texts to extract n-grams and their frequency in each class.
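
The steps above can be sketched roughly as follows. To keep the example free of downloads it uses only the standard library: the stopword set is a tiny stand-in for nltk's English stopword list, stemming is left out (nltk's `PorterStemmer` could be applied to each token), and the function names and the `max_n` limit are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); nltk's list is far larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "with",
             "in", "for", "our", "you", "will"}

def tokens(text):
    """Lowercase, split into alphabetic words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(words, max_n=3):
    """Return the set of all 1..max_n-grams in a token list."""
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            found.add(" ".join(words[i:i + n]))
    return found

def document_frequency(postings, max_n=3):
    """Count, for each n-gram, how many postings contain it at least once."""
    counts = Counter()
    for text in postings:
        counts.update(ngrams(tokens(text), max_n))
    return counts
```

Running `document_frequency` once on the Data postings and once on the non-Data postings gives the per-class counts that the rest of the analysis builds on. Using a set per posting means an n-gram repeated inside one posting is still counted once.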
For each n-gram, we’ll register the number of postings it was found in, the ratio of Data to non-Data counts, and the Information Gain measure when using this n-gram to separate the two classes.
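
For reference, here is a sketch of the Information Gain computation, assuming the standard entropy-based definition: the entropy of the class labels minus the weighted entropy after splitting the postings by whether they contain the n-gram. The function names are hypothetical, and the default class sizes of 50 reflect this dataset.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos and neg members."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(data_with, nondata_with, n_data=50, n_nondata=50):
    """Gain from splitting postings on presence of an n-gram.

    data_with / nondata_with: number of Data / non-Data postings
    that contain the n-gram.
    """
    total = n_data + n_nondata
    with_total = data_with + nondata_with
    without_total = total - with_total
    h_before = entropy(n_data, n_nondata)
    h_after = (
        (with_total / total) * entropy(data_with, nondata_with)
        + (without_total / total)
        * entropy(n_data - data_with, n_nondata - nondata_with)
    )
    return h_before - h_after
```

An n-gram that appears in every Data posting and no non-Data posting gets the maximum gain of 1 bit, while one that appears equally often in both classes gets 0.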
We’ll also remove low-count n-grams (those found in fewer than 5% of postings), as well as ones with zero or very low information gain.
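
A small sketch of this filter, under stated assumptions: the 5% threshold is taken relative to all 100 postings, and the `min_gain` cutoff is a hypothetical value, since no exact figure for "very low" is given here.

```python
def keep_ngram(data_count, nondata_count, gain,
               n_postings=100, min_share=0.05, min_gain=0.01):
    """Keep an n-gram only if it is frequent enough and informative enough.

    min_gain is a hypothetical cutoff chosen for illustration.
    """
    share = (data_count + nondata_count) / n_postings
    return share >= min_share and gain >= min_gain
```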

Analysis and findings

Now we’re finally ready to see the actual results… so what are the keywords and key phrases that differentiate a Data PM from other PMs?

The above table shows the top 10 n-grams by their information gain, or how well they separate the two groups.
Examining the top items indicates that organisations view Data PMs as professionals who build data platforms, work with teams of data scientists and data engineers, and derive or work with data models.
It’s worth noting that the term data by itself is no longer unique to Data PMs: it also appears frequently in non-Data PM job descriptions (hence the low “Data Ratio” value). Other top terms and phrases appear in only a small number of postings, but when they do appear they are highly informative.
Looking for names of data tools, we find that very few of them rank high in the list.
SQL stands out as a tool mentioned in 16 Data postings vs. 3 non-Data postings, Tableau appears in 9 Data postings vs. 1 non-Data posting, while Python appears in only 6 postings, all of them for Data PMs.
All of these terms combined cover about 40% of the Data postings, illustrating that many Data PMs are expected to be able to access and manipulate data, from basic SQL to actual coding.
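
Note that the combined coverage is a union, not a sum: the individual counts (16, 9 and 6) add up to 31, but a posting that mentions both SQL and Python is counted only once. A minimal sketch of such a coverage computation, with hypothetical function names and made-up sample postings:

```python
import re

TOOLS = ("sql", "tableau", "python")

def mentions_any(text, terms=TOOLS):
    """True if the posting text contains at least one of the terms as a word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(term in words for term in terms)

def coverage(postings, terms=TOOLS):
    """Share of postings that mention at least one of the terms."""
    if not postings:
        return 0.0
    return sum(mentions_any(p, terms) for p in postings) / len(postings)

# Made-up snippets, for illustration only.
sample_postings = [
    "Strong SQL skills",
    "Python and SQL experience",
    "Roadmap ownership",
    "Stakeholder management",
]
sample_coverage = coverage(sample_postings)
```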
Flipping the list around to rank by a high ratio of non-Data to Data postings, we can learn which terms strongly predict non-Data postings.
Not surprisingly, we’ll find user-facing keywords such as engage, delight and experience, but more interestingly there are quite a few classic product skills and terms, such as product backlog, product definition, portfolio and launch.
That can be interpreted as an indication that Data PMs are assumed to be experienced PMs who have already mastered product management basics, and so the posting focuses on the Data-specific aspects.
What about degree requirements? In general, degree requirements, whether undergraduate, graduate or MBA, do not seem to carry any particular significance for Data PMs, showing zero information gain.
On the other hand, statistics, whether as a degree or simply as background, is a clear marker of Data postings, with 30% mentioning it versus only 2% of non-Data postings.
While this small dataset may not be large enough to be a representative sample, it does give an interesting snapshot of how the job market views the role of a Data PM right now.
If you have any further insights, I’d love to hear your thoughts in the comments!

Originally published at http://alteregozi.com on April 28, 2019.