Why we should know about our Data

Dealing with Type I endogeneity

Using the ice-cream vendor example to explain Type I endogeneity

ashutosh nayak

Jun 1

Here we discussed the meaning of endogeneity through examples, a few possible sources of endogeneity, and why it is important.

This part focuses on the ice-cream vendor example to dive deeper into why we should be careful about endogeneity.

The linear regression model is given by:

Sales_i = alpha + beta Price_i + error_i (m1)

error_i ~ N(0, var_e)

Following Part I, Price_i is an endogenous variable, as it can be explained by Temperature_i.

Thus |cov(Price_i, error_i)| > 0.

Let Price_i ~ N(p, var_p).

Treating Sales_i and Price_i as jointly following a bivariate normal distribution, we can derive the conditional distribution of Sales_i | Price_i.

This is a standard result, and its derivation can be checked here.
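For reference, here is a sketch of what that conditional distribution looks like under m1. It assumes Price_i and error_i are jointly normal, and it introduces one symbol that is not in the original text, c = cov(Price_i, error_i). The shifted intercept and slope are written as alpha' and beta'.

```latex
\begin{align*}
E[\mathrm{Sales}_i \mid \mathrm{Price}_i]
  &= \alpha + \beta\,\mathrm{Price}_i + E[\mathrm{error}_i \mid \mathrm{Price}_i] \\
  &= \alpha + \beta\,\mathrm{Price}_i + \frac{c}{\mathrm{var}_p}\,(\mathrm{Price}_i - p) \\
  &= \underbrace{\left(\alpha - \frac{c\,p}{\mathrm{var}_p}\right)}_{\alpha'}
   + \underbrace{\left(\beta + \frac{c}{\mathrm{var}_p}\right)}_{\beta'}\,\mathrm{Price}_i \\[4pt]
\mathrm{Var}(\mathrm{Sales}_i \mid \mathrm{Price}_i)
  &= \mathrm{var}_e - \frac{c^2}{\mathrm{var}_p} \;\le\; \mathrm{var}_e
\end{align*}
```

When c = 0 (no endogeneity), alpha' = alpha, beta' = beta, and the conditional variance equals var_e.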

As shown in the figure to the left, if we ignore endogeneity:

1. We are not calculating the true coefficients: we estimate alpha' and beta' instead of alpha and beta, which leads to sub-optimality.

2. The variance from the model (Sales_i | Price_i) is lower than the actual variance. This could give us false confidence that the model performs (fits) well.
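The two effects listed above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the author's code. All parameter values are made up for illustration, and c = cov(Price_i, error_i) is a symbol introduced here. For a bivariate normal, the regression of Sales_i on Price_i should recover the slope beta + c / var_p and the residual variance var_e - c^2 / var_p, rather than beta and var_e.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed "true" parameters of m1 (illustrative values only).
alpha, beta = 100.0, -2.0
p, var_p, var_e = 10.0, 4.0, 9.0
c = 3.0  # assumed cov(Price_i, error_i); non-zero means endogeneity

# Draw (Price_i, error_i) jointly normal with covariance c.
mean = [p, 0.0]
cov = [[var_p, c], [c, var_e]]
price, error = rng.multivariate_normal(mean, cov, size=n).T
sales = alpha + beta * price + error  # model m1

# Ordinary least squares of Sales on Price, ignoring endogeneity.
X = np.column_stack([np.ones(n), price])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
alpha_hat, beta_hat = coef
resid_var = np.var(sales - X @ coef)

print(f"true beta      = {beta:.3f}, estimated = {beta_hat:.3f}, "
      f"theory beta'  = {beta + c / var_p:.3f}")
print(f"true alpha     = {alpha:.3f}, estimated = {alpha_hat:.3f}, "
      f"theory alpha' = {alpha - c * p / var_p:.3f}")
print(f"true var_e     = {var_e:.3f}, residual variance = {resid_var:.3f}, "
      f"theory = {var_e - c**2 / var_p:.3f}")
```

The fitted line tracks alpha' and beta', and the residual variance comes out below var_e, so the fit looks tighter than the true noise level warrants.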

Remedy

1. Predict Price'_i using Temperature_i and replace Price_i with Price'_i in m1. Here, Temperature_i is called an instrumental variable (IV), and this approach is called the IV method.

2. Predict Price'_i using Temperature_i, find E_i = Price_i - Price'_i, and use:

Sales_i = alpha + beta Price_i + gamma E_i + error_i (m2)

If there is endogeneity, the mass of gamma would shift away from 0. Or simply, |gamma| > 0 with confidence (statistically significant). This approach is called the control function approach.
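Both remedies can be sketched on simulated data. This is a minimal illustration in Python with NumPy, not the author's implementation. The data-generating process is an assumption made for the sketch: a hypothetical unobserved `shock` drives both price and the error term (the source of endogeneity), and Temperature_i affects Sales_i only through Price_i, which is what makes it usable as an instrument here. The helper `ols` and all numeric values are likewise illustrative.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 100_000
alpha, beta = 100.0, -2.0  # assumed true coefficients of m1

# Assumed data-generating process (illustrative only).
temperature = rng.normal(25.0, 5.0, n)
shock = rng.normal(0.0, 1.0, n)  # hypothetical unobserved driver
price = 2.0 + 0.3 * temperature + shock + rng.normal(0.0, 0.5, n)
error = 2.0 * shock + rng.normal(0.0, 1.0, n)  # correlated with price
sales = alpha + beta * price + error

ones = np.ones(n)

# Naive OLS on m1, ignoring endogeneity.
_, beta_ols = ols(np.column_stack([ones, price]), sales)

# First stage: predict Price'_i from Temperature_i.
Z = np.column_stack([ones, temperature])
price_hat = Z @ ols(Z, price)

# IV method: replace Price_i with Price'_i in m1.
_, beta_iv = ols(np.column_stack([ones, price_hat]), sales)

# Control function: add E_i = Price_i - Price'_i as a regressor (m2).
E = price - price_hat
_, beta_cf, gamma = ols(np.column_stack([ones, price, E]), sales)

print(f"true beta             = {beta:.3f}")
print(f"naive OLS beta        = {beta_ols:.3f}")
print(f"IV method beta        = {beta_iv:.3f}")
print(f"control function beta = {beta_cf:.3f}, gamma = {gamma:.3f}")
```

In this linear setting the two remedies give the same estimate of beta, and gamma lands clearly away from 0 because the simulated price is endogenous. The sketch reports point estimates only. The standard errors from these plain second-stage regressions are not valid as they stand, so a significance test on gamma would need a dedicated IV routine or a bootstrap.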

Endogeneity from omitted variables requires knowledge of the problem structure, but if that structure is known, handling it is straightforward.

However, it is very difficult to find the right IVs.

Care must be taken, as using a weak or wrong IV would lead to worse results.
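To see why a weak IV is risky, the sketch below repeats a small simulated study many times and compares how far the IV estimate lands from the true beta when the instrument is strongly or only barely related to price. It is a rough illustration in Python with NumPy under an assumed data-generating process. The `strength` parameter, the helper names, and all numeric values are hypothetical.

```python
import numpy as np

def iv_slope(z, x, y):
    """Single-instrument IV estimate of the slope: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def median_abs_error(strength, beta=-2.0, n=500, reps=300, seed=2):
    """Median |beta_iv - beta| over repeated simulated samples.

    `strength` is the assumed effect of temperature on price; a value
    close to 0 makes temperature a weak instrument.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(reps):
        temperature = rng.normal(25.0, 5.0, n)
        shock = rng.normal(0.0, 1.0, n)  # hypothetical unobserved driver
        price = (2.0 + strength * temperature + shock
                 + rng.normal(0.0, 0.5, n))
        sales = (100.0 + beta * price + 2.0 * shock
                 + rng.normal(0.0, 1.0, n))
        errors.append(abs(iv_slope(temperature, price, sales) - beta))
    return float(np.median(errors))

strong = median_abs_error(strength=0.3)
weak = median_abs_error(strength=0.003)

print(f"median |error| with a strong instrument: {strong:.3f}")
print(f"median |error| with a weak instrument:   {weak:.3f}")
```

With the weak instrument, the denominator cov(z, x) is close to 0, so the estimate becomes unstable and the typical error grows substantially.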

The following statement is my opinion, and the claim could be wrong: highly non-linear functional approximations of Sales_i = f(Price_i), for example random forests, hidden Markov models, or deep learning, may handle endogeneity, as the underlying non-linear hidden structure aims to find Price'_i.

In factor analysis or SVD, hidden/latent factors are similar to IVs if this problem is handled as a joint pdf of Sales_i and Price_i.

The more common and difficult endogeneity issue is the case of simultaneity.

An introduction to simultaneity is presented here.

Originally published at https://medium.com on June 2, 2019.