Why we should know about our Data

Introduction to Endogeneity

Examples describing different types of endogeneity

ashutosh nayak

Jun 1

Example 1: An ice cream vendor sells ice cream on a beach.

He collects data for total sales (Y) and selling price (X) for 2 years.

He gives the data to a data scientist asking him to find the optimal selling price.

The data scientist fits a linear regression model, Y = alpha + beta*X + error, and finds the optimal price by differentiating the profit function, profit = (X - cost)(alpha + beta*X), with respect to X.
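As a minimal sketch of that differentiation step, assuming the fitted slope beta is negative so that the profit has a maximum: setting the derivative of (X - cost)(alpha + beta*X) to zero gives X* = cost/2 - alpha/(2*beta). The function names and the numbers below are made up for illustration.

```python
def optimal_price(alpha, beta, cost):
    """Price that maximizes (X - cost) * (alpha + beta * X), assuming beta < 0."""
    if beta >= 0:
        raise ValueError("profit has no finite maximum unless beta < 0")
    # d/dX [(X - cost)(alpha + beta*X)] = alpha + 2*beta*X - beta*cost = 0
    return cost / 2 - alpha / (2 * beta)


def profit(price, alpha, beta, cost):
    """Profit implied by the fitted linear demand model."""
    return (price - cost) * (alpha + beta * price)


# Hypothetical fitted values: alpha = 100, beta = -10, cost = 2
best = optimal_price(100.0, -10.0, 2.0)
print(best, profit(best, 100.0, -10.0, 2.0))
```

Note that the formula is only as good as the estimates of alpha and beta that go into it.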

However, the vendor used to raise the price of the ice creams on hot days, when demand went up.

He forgot to mention his pricing strategy to the data scientist.

As a result, the fitted regression suggests that sales increase as the selling price increases (beta > 0), or at least that the slope is not as negative as it really is.

Thus the "optimal" selling price from the model is, at the very least, sub-optimal (if not harmful to the business).
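The ice cream story can be reproduced with a small simulation. This is only a sketch on synthetic data: the coefficients (a true price slope of -10, the way price follows temperature, the noise levels) and the helper names are assumptions chosen for illustration, not the vendor's numbers.

```python
import numpy as np


def simulate_ice_cream(n=730, seed=0):
    """Synthetic daily data; every coefficient here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    temperature = rng.normal(25.0, 5.0, n)
    # Vendor's undisclosed strategy: charge more on hot days.
    price = 2.0 + 0.1 * temperature + rng.normal(0.0, 0.2, n)
    # True demand: price lowers sales (slope -10), heat raises them.
    sales = 100.0 - 10.0 * price + 4.0 * temperature + rng.normal(0.0, 5.0, n)
    return temperature, price, sales


def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


temperature, price, sales = simulate_ice_cream()
naive_slope = ols(sales, price)[1]              # temperature omitted
full_slope = ols(sales, price, temperature)[1]  # temperature included
print(f"price slope without temperature: {naive_slope:.2f}")
print(f"price slope with temperature:    {full_slope:.2f}")
```

In this setup the regression on price alone returns a positive slope even though the true effect of price on sales is negative; once temperature enters the model, the estimated slope is negative again.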

Example 2: A merchandise store gives customized coupons to its customers.

There are different coupons (with different offers).

The store collects data on each customer's response, that is, whether they used the coupon or not (Y), and on the coupon details (X).

They want to know the effectiveness of each coupon so that they can customize coupons and send them to different customers in the future.

The data scientist fits a model to predict the probability of purchase (in the next month or next few months) for each customer from each coupon.

Again, the store had a strategy for deciding whom to send coupons to and which coupons to send (instead of sending coupons at random).

For example, if they thought a customer would buy even without a coupon, they did not send one, or if they thought a person might buy, they sent that person more coupons.

This strategy was not disclosed to the data scientist.

The model could therefore give a biased ‘effectiveness’ of coupons if more coupons were sent to the customers with a higher chance of purchase.
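A rough simulation makes the coupon bias concrete. It assumes a single coupon type, a true lift of 0.05 in purchase probability, and a store that is more likely to send a coupon to customers who are already likely to buy; the function name and all numbers are hypothetical.

```python
import numpy as np


def estimated_coupon_lift(targeted, n=200_000, true_lift=0.05, seed=0):
    """Naive lift: purchase rate with a coupon minus purchase rate without.

    All numbers are illustrative assumptions, not the store's data.
    """
    rng = np.random.default_rng(seed)
    # Each customer's chance of buying without any coupon.
    base = rng.uniform(0.1, 0.6, n)
    # Targeted: likely buyers are more likely to get a coupon.
    # Randomized: every customer gets one with probability 0.5.
    send_prob = base if targeted else np.full(n, 0.5)
    coupon = rng.random(n) < send_prob
    buys = rng.random(n) < base + true_lift * coupon
    return buys[coupon].mean() - buys[~coupon].mean()


targeted_lift = estimated_coupon_lift(targeted=True)
random_lift = estimated_coupon_lift(targeted=False)
print(f"estimated lift, targeted coupons:   {targeted_lift:.3f}")
print(f"estimated lift, randomized coupons: {random_lift:.3f}")
```

With targeting, the naive comparison credits the coupon with purchases that would have happened anyway, so the estimated lift comes out well above the assumed 0.05; with randomized coupons it stays close to 0.05.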

Example 3: The rich become richer and the poor become poorer (conceptually similar to Example 2, as explained later; this example is included to show that endogeneity is not restricted to data science).

This is also called the Matthew effect in sociology.

Hard work helps movie artists win awards.

Producers want to affiliate themselves with artists who have won awards because they associate awards with acting quality, so award-winning artists get better opportunities and thus reap more recognition and more awards (again, the rich become richer).

All these examples violate one of the fundamentals of Design of Experiments: randomization.

Often, a company cannot afford randomization.

In field experiments, the subjects are assigned to different treatment levels randomly (or we should randomly set the value of independent variables and observe dependent variables).

But in the above examples, this is violated, because the independent variables depend on a strategy chosen by the ice cream vendor, the merchandise store, or the movie producer.

In other words, the independent predictor variables (X) should be exogenous and not endogenous.

Exogenous variables are not driven by other factors (observable or unobservable).

In the first example, the price was driven by temperature.

In the second example, coupons were driven by past sales.

Endogenous variables are correlated with the error terms — let’s deconstruct this.

The response (Y) in a model should be explained by the independent variables (X), and everything that cannot be explained by X is included in the error term.

(We know the adverse effects of multicollinearity in linear regression; in the case of multicollinearity, X1 is explained by X2.)

If there is a factor that could explain Y but is unobserved, say Z, it is included in the error.

But if Z could explain any of the X (in example 1, Z = temperature, X = price), the error term is correlated with X (hence, the price is an endogenous variable).

Thus, the issue of endogeneity arises when we have a Z that is related to Y, but it is also related to X and not included in the model.
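The correlation between X and the error term can be checked directly on synthetic data. The sketch below uses Z = temperature and X = price as in example 1; the coefficients beta and gamma, the noise levels, and the variable names are assumptions for illustration. It also compares the slope obtained without Z to the textbook omitted-variable approximation beta + gamma*Cov(X, Z)/Var(X).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
beta, gamma = -10.0, 4.0                      # assumed true effects of X and Z on Y
z = rng.normal(25.0, 5.0, n)                  # Z: temperature, not given to the modeller
x = 2.0 + 0.1 * z + rng.normal(0.0, 0.2, n)   # X: price, driven by Z
noise = rng.normal(0.0, 5.0, n)
y = 100.0 + beta * x + gamma * z + noise

# If Z is left out, it is absorbed (up to a constant) into the error term
# of the model Y = alpha + beta*X + error.
error = gamma * z + noise
corr_x_error = np.corrcoef(x, error)[0, 1]

# Slope from regressing Y on X alone, versus beta + gamma * Cov(X, Z) / Var(X).
cov_xy = np.cov(x, y)
naive_slope = cov_xy[0, 1] / cov_xy[0, 0]
cov_xz = np.cov(x, z)
predicted_slope = beta + gamma * cov_xz[0, 1] / cov_xz[0, 0]

print(f"corr(X, error) = {corr_x_error:.2f}")
print(f"naive slope = {naive_slope:.2f}, predicted by the bias formula = {predicted_slope:.2f}")
```

The stronger the link between Z and X, and between Z and Y, the further the naive slope drifts from the assumed beta.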

There is a wide range of sources of endogeneity: what if the vendor changes location based on the time of day, or changes the price based on the queue?

The sources of endogeneity can be broadly classified as:

Omitted variables (as explained by the ice cream example, where the price is the endogenous variable and temperature is the omitted variable).

Simultaneity (example 2), where X causes Y and Y causes X.

This is the most common source and also the difficult one.

In the case of omitted variables, if the structural information is known, the fix is straightforward (it is covered in Part 2 of this blog).

Example 2 is an example of simultaneity, as coupons cause sales and sales lead to more coupons for that customer.

In example 1, if the vendor changes the price as a control (studied in dynamic programming or sequential control theory), based on the queue length, it is simultaneity as queue causes price and price causes queue.

Simultaneity can also be caused by self-selection.

Selection bias (example 3): we cannot determine the effect of a treatment on a subject if the sampling is biased.

For example, the sample is biased if we collect ice cream sales data only on weekends or only in a posh neighborhood.

We will overestimate the acting quality of an actor if our sample contains mostly award winners.

Dr. Heckman won a Nobel prize in economics on this topic (here).
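Of the sources listed above, simultaneity is perhaps the least intuitive, so here is a stylized sketch of the queue-and-price case. It assumes two linear equations, price = 5 + b*queue + u and queue = 20 + d*price + v, with made-up coefficients and continuous values (real queue lengths are counts); none of these numbers come from the vendor's data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
d = -2.0   # assumed true effect of price on queue length
b = 0.5    # assumed effect of queue length on the price the vendor sets
u = rng.normal(0.0, 1.0, n)   # shocks to price
v = rng.normal(0.0, 1.0, n)   # shocks to queue length

# Solve  price = 5 + b*queue + u  and  queue = 20 + d*price + v
# jointly for each observation (the reduced form).
price = (5.0 + b * (20.0 + v) + u) / (1.0 - b * d)
queue = 20.0 + d * price + v

# Ordinary least-squares slope of queue on price.
cov = np.cov(price, queue)
ols_slope = cov[0, 1] / cov[0, 0]
print(f"true effect of price on queue: {d:.2f}, OLS estimate: {ols_slope:.2f}")
```

Because the shock v to the queue feeds back into the price, price is correlated with the error of the queue equation, and the regression slope lands well away from the assumed d no matter how much data is collected.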

Revision of some definitions:

P(X=x, Y=y) is a joint distribution

P(X=x|Y=y) is a conditional distribution of X given Y

P(X=x) = sum [ P(X=x|Y=y)P(Y=y) over all y ] is the marginal of X

If P(X=x, Y=y) = P(X=x)P(Y=y), then X is independent of Y

I hope the following statement makes more sense now (it simply means that the independent variables are set strategically, that is, the independent variables are endogenous):

Endogeneity arises when the marginal distribution of the independent variable is not independent of the conditional distribution of the dependent variable given the independent.
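The definitions above can be checked numerically on a small joint table. The 2x2 probabilities below are arbitrary numbers picked for illustration.

```python
import numpy as np

# Joint distribution P(X=x, Y=y): rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)      # marginal of X
p_y = joint.sum(axis=0)      # marginal of Y
p_x_given_y = joint / p_y    # conditional P(X=x | Y=y); each column sums to 1

# Marginal of X recovered as sum over y of P(X=x | Y=y) * P(Y=y).
p_x_from_conditional = (p_x_given_y * p_y).sum(axis=1)

# X and Y are independent only if the joint equals the product of the marginals.
independent = bool(np.allclose(joint, np.outer(p_x, p_y)))

print("P(X):", p_x)
print("P(X) from conditionals:", p_x_from_conditional)
print("independent:", independent)
```

Here the product of the marginals differs from the joint table, so X and Y are not independent in this example.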

What if we don’t consider endogeneity?

The examples show two consequences: firstly, the actual impact of the independent variables in the model can be underestimated or overestimated.

Secondly, decisions based on inferences from the model could be sub-optimal.

Moreover, when the data scientist fits the model without accounting for endogeneity, the model fit is better, which increases confidence in something that is wrong.

This claim is explained in Part II.

The contents are heavily taken from here and here.

Here, we will go through the case study of the ice cream vendor discussed here.

Here, we will go through an introduction to simultaneity and the references used in this post.

Originally published at https://medium.com on June 2, 2019.