Framing Data with DataFrames

Framing Data with DataFramesChanneling Open Data into Business FrameworksAdam DickBlockedUnblockFollowFollowingApr 8At the outset of any challenge, but especially for the complex and ambiguous ones that data scientists often face, structuring a thought process to approach the fundamental issue is one of the initial tasks.

Many industries employ subject matter experts that have domain-specific knowledge, where unique paradigms of thinking are continually developed, taught and enhanced.

A functional organizational structure separates employees with specific skills into departments such as Human Resources, Sales, Marketing, Finance and Engineering, etc.

This can produce a silo effect, in which useful information is created, but not necessarily shared.

In response, companies assemble cross-functional teams to facilitate sharing and learning across the organization.

These teams can increase overall company effectiveness because they strengthen the firm’s interdependent internal network.

For example, strategic decisions can be improved by incorporating, say, information and insights inferred by data scientists.

So let’s take a look at how rapidly growing data can be channeled into intuitive frameworks that are familiar to business leaders.

Exploratory Data AnalysisBusiness strategists and consultants often explore an industry by performing a Situation Analysis, which considers the 5C’s (Context, Consumer, Company, Competition and Collaboration).

The Growth-Share Matrix, commonly known as The BCG Matrix, was introduced by the Boston Consulting Group in 1970 and is a popular business framework for visualizing a market analysis.

The Growth-Share Matrix (a.



The BCG Matrix), The Charts that Changed the WorldThe BCG Matrix evaluates a market based on constituent growth rates and market share, creating four quadrants called Dogs, Question Marks, Stars and Cash Cows.

Diversified companies have business lines in mature, high-share markets that generate excess cash flow as well as market segments with high-growth opportunities that often require significant cash investments.

Dogs refer to low-growth, low-share markets that typically struggle to break-even.

The conventional wisdom of analysts is to liquidate business in this area.

Question Marks are high-growth, low-share markets that have both potential and risk.

New businesses target this area through a select and divest strategy.

Stars are Question Marks that have been widely adopted to become market leaders.

Continued growth pushes firms to invest due to high competition.

Cash Cows are mature markets where entrenched players can maintain their position with minimal investment, allowing them to milk reliable cash flows.

Let’s use a real example to see how we can apply the concepts in the BCG Matrix to a specific industry.

Moreover, we can take open data that is freely available and organize the information in a meaningful way to populate the chart.

Finally, we can use render our market analysis in an interactive, cloud-based visualization that can be queried to access layered information.

The Pet Care IndustryFrom Anaconda to Python to Pandas, data scientists have an strong affinity for animals.

Robbrecht van Amerongen has compiled a list of favorites in the Zoo of Programming Languages.

As inspiration, we can generate a BCG Matrix by searching for available open data sources for the fast-growing and much-beloved pet care industry (note: market research can cost up to $9,450).

The American Pet Products Association (APPA) estimates that the pet care industry will exceed $75 billion in 2019, increasing by 3.

9% over the previous year.

Consumer spending categories are led by pet food ($32 billion), veterinary care ($19 billion) and supplies & OTC medicine ($16 billion).

Now let’s scan the web for information sources to develop our own market analysis.

New York NeighborhoodsWe could search for a dataset on Kaggle, but would only find seven results for pets, although there is a myriad of cat and dog image repositories.

The most authoritative and complete dataset is the NYC Dog Licensing Dataset, which tracks 15 columns and 122k records as part of the NYC Open Data initiative.

The dataset includes Borough and Zip Code information for licensed dogs in New York, which offers the potential to pivot on the data by location, but Borough seems too coarse and Zip Code seems too fine.

So let’s refer to the New York State Department of Health’s ZIP Code Definitions of New York City Neighborhoods that subdivides the city into 42 manageable Neighborhoods.

First 10 Neighborhoods in New York, Mixed Format (Wide and Long ), NYS Dept.

of HealthThe initial Pandas DataFrame is arranged in a Multi-Hierarchical Index, except for Zip Codes, which are packed into a Wide Format.

The goal of this table, though, is to map the Zip Code of each record in the NYC Dog Licensing Dataset to its respective Neighborhood.

So the first task is to reshape the table into a Long Format for easier processing, which can be indexed by Zip Code.

First 10 Neighborhoods in New York, Long FormatCompetitor ConcentrationIn addition to the licensed dog data that represents the New York consumer market, we can also gather information about business competition.

The Yelp Fusion API has a comprehensive database of various kinds of businesses in New York.

By searching for pet stores and pet services, we can collect 1,502 unique businesses actively catering to the pet care industry in 2019.

First 10 New York Pet Stores and Pet Services Actively Operating in 2019, Yelp Fusion APIMarket concentration is a fundamental measure of industry competitiveness, where markets can range between perfect and monopolistic competition.

Without revenue or client traffic data for each store, we have to make a large and unrealistic assumption that each store is equivalent.

Having said that, we can count the number of stores of each chain as a proxy for total market share.

Concentration Ratios aggregate the total market share held by the top firms in the industry.

For example, CR4 for our top four firms stands at only 8.

9%, while CR8 is 11.

9%, indicating a highly distributed market.

Alternatively, the Herfindahl-Hirschman Index, used by the Department of Justice in anti-trust cases, is 31 out of a range of 0 (perfect competition) and 10,000 (monopoly).

Concentration Ratios for the Top 8 Firms in the New York Pet Care IndustrySimilarly, we can group our competitor data by neighborhood and measure geographic concentration or density.

We measure CR4 to be 30.

6% and CR8 to be 49.

9%, meaning that half of all pet stores and services are located in eight neighborhoods.

The neighborhood HHI of 448, which is the sum of the squares of all neighborhood shares, is still considered to be unconcentrated.

Concentration Ratios for the Top 8 Neighborhoods in the New York Pet Care IndustryMarket SizeNYC Open Data conveniently lets us download the entire NYC Dog Licensing Dataset as a CSV file, which we can import into a Pandas DataFrame.

We notice that it has the License Issued Year as well as the License Expired Year, allowing us to count the number of actively licensed dogs in a given year.

This metric is very similar to the Monthly Active Users (MAU) in digital marketing.

First 10 of 121,713 New York Licensed Dogs from 2014–2022, NYC Open DataNow that we have gathered our neighborhood, pet business and licensed dog datasets, we can insert them into a SQL database and join them in a single DataFrame through some powerful queries.

By selecting only the active dog licenses in every calendar year, we can calculate the 5-Year Compound Annual Growth Rate (CAGR) from 2014 to 2019 for each neighborhood.

Market Share and Growth Rate (2014–2019) of Pet Care Industry for First 10 Neighborhoods in New YorkNow that we have calculated the market share and growth rate of licensed dogs in each New York neighborhood, we can render our analysis as a BCG Matrix in a data visualization engine such as Plotly.

We can add richness to the graph by sizing the data points as the number of pet businesses in each neighborhood and shading them by the number of licensed dogs per store.

The Growth-Share Matrix of Licensed Dogs in New York exhibits a striking pattern of neighborhoods that follow a trajectory from starting out as Dogs, progressing to Question Marks, becoming Stars and maturing into Cash Cows.

We can readily see that four neighborhoods enjoy high market share and two of them present opportunities since they are underserved by pet businesses.

Growth-Share Matrix of Licensed Dogs in New York (Interactive Plot)The source code written for this analysis is available on GitHub, which includes gathering the datasets from Yelp and NYC Open Data, wrangling the data in Python, Pandas and SQL, and rendering the BCG Matrix in Plotly.

The entire Plotly code is short and can be automatically uploaded and hosted by Plotly, which provides a dashboard for modifications and embeddable links.

Visualizing the BCG Growth-Share Matrix with Plotly (Python Code)Mind the GapThe market share of pet businesses is only calculated by the number of physical store locations held by a business or a within a given neighborhood, which could be improved with revenue or foot traffic data.

The New York City Economic Development Corporation estimates that only 20% of dogs are licensed in New York, so it is important to note that the NYC Licensed Dog Dataset is only a sample of the entire dog population.

The Dog Dataset began tracking data in 2014, so there is a noticeable spike in registrations in the subsequent years.

This growth in registrations should not be confused with the growth of the actual dog population.

The Dog Dataset was last updated in 2017, so recently issued licenses are not captured, while ongoing expirations sunset the dog population by 2022.

This gap in data can affect the CAGR rate depending on years considered.

Given the limitations with any dataset, we should now feel confident that we can collect open data sources, perform a market analysis and visualize the BCG Matrix so that our data can be easily relatable to business leaders.


. More details

Leave a Reply