Why we should know about our Data: Dealing with Type II Endogeneity
Examples from the literature dealing with Type II endogeneity
Ashutosh Nayak, Jun 2
After an introduction and a look at a ‘simpler’ form of endogeneity (Type I), this part explores the more difficult problem of endogeneity by simultaneity.
Simultaneity arises when Y causes X and X causes Y (the rich get richer).
The problem is difficult because:
1. adding Instrumental Variables (IVs) may not help
2. it is the most prevalent form and also difficult to diagnose (or recognize)
3. the standard assumption of X being i.i.d. does not hold (the effect of the number of coupons in January could last till March).
If simultaneity is not corrected for, the ill effects discussed in Part II for omitted variables still apply: the model is wrong, and we end up confident in the wrong model.
In most companies that give coupons to customers, who gets a coupon is decided by some strategy (it could be the model built by the previous data scientist), so coupons depend on past sales even as sales depend on coupons.
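The feedback loop between coupons and sales can be simulated to see what it does to a plain regression. The sketch below is only an illustration and is not taken from the referenced papers: it assumes a hypothetical two-equation system with made-up coefficients (`beta` for the effect of coupons on sales, `gamma` for the effect of sales on coupons) and standard normal shocks.

```python
import numpy as np

def simulate_feedback(n=100_000, beta=0.5, gamma=0.5, seed=0):
    """Draw (coupons, sales) from a hypothetical simultaneous system:
         sales   = beta  * coupons + u
         coupons = gamma * sales   + v
    Both equations are solved jointly (reduced form) for every draw."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # unobserved sales shock
    v = rng.normal(size=n)  # unobserved coupon shock
    denom = 1.0 - beta * gamma
    coupons = (gamma * u + v) / denom
    sales = (beta * v + u) / denom
    return coupons, sales

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

coupons, sales = simulate_feedback()
beta_hat = ols_slope(coupons, sales)

# With unit-variance shocks the OLS slope converges to
# (beta + gamma) / (1 + gamma**2) = 0.8, not to the true beta = 0.5.
print(f"true beta = 0.5, OLS estimate = {beta_hat:.3f}")
```

More data does not help here: the estimate settles on the wrong number, and the standard errors shrink around it.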
Simultaneity cannot be covered in full in a single blog post.
So here are some pointers, with examples, on where to read more about it.
Drug detailing: Pharma companies spend on advertising to physicians (detailing).
They spend more on physicians with higher prescription volumes.
Data scientists can use lagged prescriptions in the model, so that current prescriptions depend not only on the current level of detailing but also on previous prescriptions.
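A minimal sketch of the lagged-prescription idea, on simulated data. Everything here is an assumption made for illustration (the coefficients, the rule that detailing follows last period's prescriptions, and the names `rx` and `detail`); it is not the model from the Journal of Marketing Research paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
rx = np.zeros(n)      # prescriptions per period
detail = np.zeros(n)  # detailing effort per period

# Hypothetical data-generating process: the firm details more where last
# period's prescriptions were high, and prescriptions respond to detailing
# (true effect 0.3) and to their own past value.
for t in range(1, n):
    detail[t] = 0.5 * rx[t - 1] + rng.normal()
    rx[t] = 1.0 + 0.3 * detail[t] + 0.6 * rx[t - 1] + rng.normal()

def fit(y, *cols):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

y, d, lag = rx[1:], detail[1:], rx[:-1]
b_naive = fit(y, d)[1]        # ignores past prescriptions
b_lagged = fit(y, d, lag)[1]  # controls for lagged prescriptions

print(f"naive: {b_naive:.2f}, with lagged prescriptions: {b_lagged:.2f}")
```

In this setup the naive regression credits detailing with volume that was already there, while the lagged specification lands close to the assumed effect of 0.3. This works only because the simulated errors are independent over time; serially correlated errors would need more care.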
Movies/products/shows are released sequentially in different cities, depending on the competing movies being shown in the theatres and on movie reviews from the cities where the movie was previously released.
Thus, the decision of which city to launch in next depends on the movie's performance so far.
By selecting the right instrumental variables (or control variables that make “other things equal”), endogeneity can be handled.
The idea is: exogenous variables that drive box office in previous cities will drive the box office in the next city.
Thus, finding the right IVs can help counter simultaneity.
In the paper, the authors used metrics from competing movies as IVs.
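As a rough illustration of how an IV counters simultaneity, the sketch below runs two-stage least squares by hand on simulated data. The system, its coefficients, and the variable names (`screens` as a stand-in for the release decision, `z` as a competing-movie metric) are all hypothetical and are not the specification used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta, gamma, delta = 0.5, 0.5, 1.0  # made-up structural coefficients

z = rng.normal(size=n)  # instrument: shifts screens, not box office directly
u = rng.normal(size=n)  # unobserved box-office shock
v = rng.normal(size=n)  # unobserved release-decision shock

# Hypothetical simultaneous system, solved jointly:
#   box_office = beta  * screens    + u
#   screens    = gamma * box_office + delta * z + v
screens = (gamma * u + delta * z + v) / (1.0 - beta * gamma)
box_office = beta * screens + u

def fit(y, x):
    """Intercept and slope of a least-squares regression of y on x."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_ols = fit(box_office, screens)[1]

# Stage 1: keep only the part of screens explained by the instrument.
a0, a1 = fit(screens, z)
screens_hat = a0 + a1 * z
# Stage 2: regress the outcome on that exogenous part.
b_iv = fit(box_office, screens_hat)[1]

print(f"true beta = {beta}, OLS = {b_ols:.3f}, 2SLS = {b_iv:.3f}")
```

The result depends entirely on `z` being unrelated to the box-office shock `u`. The simulation guarantees that by construction; with real data it has to be argued.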
A company may give coupons to a buyer based on R, F, M values (the commonly used Recency, Frequency and Monetary value).
RFM at t-1 might be used as control variables (since they will be uncorrelated with the error at t).
A control function approach (Part II) can also be considered, where R, F, M are regressed on R, F, M at t-1 and the difference between the predicted and actual values of R, F, M is used as an independent variable when modeling sales.
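The control function steps can be sketched on simulated data. To keep it short, a single made-up `freq` variable stands in for R, F, M, and the coefficients are assumptions chosen so that the unexplained part of `freq` also moves sales.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Hypothetical data: frequency at t depends on frequency at t-1 plus a
# shock eta, and the same shock also moves sales (the endogeneity).
freq_lag = rng.normal(size=n)
eta = rng.normal(size=n)
freq = 0.7 * freq_lag + eta
sales = 2.0 * freq + 1.5 * eta + rng.normal(size=n)

def fit(y, *cols):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_naive = fit(sales, freq)[1]

# Step 1: regress freq at t on freq at t-1 and keep the residual
# (actual minus predicted).
s0, s1 = fit(freq, freq_lag)
resid = freq - (s0 + s1 * freq_lag)

# Step 2: add that residual as an extra regressor in the sales model.
b_cf = fit(sales, freq, resid)[1]

print(f"true effect = 2.0, naive = {b_naive:.2f}, control function = {b_cf:.2f}")
```

The residual absorbs the part of `freq` that is correlated with the sales error, so the coefficient on `freq` is identified from the part predicted by the lag.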
In the ice-cream vendor example, instead of using temperature as an IV, the joint probability distribution of price and error_i could be used.
IV and control function approaches need the exact structure of data collection, which is not always available.
The model considered in Example 1 (drug detailing) assumes knowledge of the response model (how X|Y leads to Y), and this is not always known either.
The copula method is a model-free approach that uses the joint distribution of the regressor X and error_i.
A copula is a function that couples an m-dimensional multivariate distribution to its m one-dimensional marginals.
This method uses maximum likelihood to obtain the joint distribution.
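For intuition, here is a sketch of a simpler variant from the Gaussian copula literature. Note that it is not the maximum likelihood route described above: it adds the normal score of the regressor's empirical CDF as an extra regressor. The ice-cream data are simulated, and the coefficients, the lognormal price, and the names `price`, `sales` and `p_star` are assumptions for illustration only.

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
rho = 0.7  # assumed dependence between price and the demand error

# Hypothetical data with a Gaussian copula between price and error:
# a latent normal drives a (non-normal) lognormal price and is correlated
# with the demand error.
latent = rng.normal(size=n)
error = rho * latent + np.sqrt(1 - rho**2) * rng.normal(size=n)
price = np.exp(latent)
sales = -1.0 * price + error  # true price effect is -1.0

def fit(y, *cols):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_naive = fit(sales, price)[1]

# Empirical CDF of price via ranks, scaled by n + 1 to stay inside (0, 1),
# then mapped to normal scores with the inverse standard normal CDF.
ranks = price.argsort().argsort() + 1
inv_cdf = NormalDist().inv_cdf
p_star = np.array([inv_cdf(r / (n + 1)) for r in ranks])

# Adding the normal score as a regressor soaks up the part of the error
# that moves with price.
b_copula = fit(sales, price, p_star)[1]

print(f"true effect = -1.0, naive = {b_naive:.2f}, copula-corrected = {b_copula:.2f}")
```

No instrument is needed, but the approach relies on price not being normally distributed. If it were, `p_star` would be nearly collinear with `price` and the effect could not be separated out.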
I hope the three parts have given some idea of the importance of knowing the source of the data (how it was created or collected).
While modern machine learning methods are capable of handling endogeneity by omitted variables, simultaneity needs some brainstorming before building the model.
The references are:
Non-technical Guide to endogeneity (base paper for Part I, II of the blog)
Drug Detailing: Journal of Marketing Research
Effect of User Reviews on Movie performance: Marketing Science
Using copulas to handle endogeneity: Marketing Science
Coupons on Revenue: Management Science
Originally published at https://medium.com on June 2, 2019.