Replication Crisis, Misuse of p-values and How to Avoid Them as a Data Scientist [Part I]

Shubham Gupta · Mar 16

Replication and reproducibility are now cornerstones of scientific advancement.
They matter in fields ranging from economics, sports, and politics to sociology, psychology, and even medicine.
Replication is done to confirm the conclusions of a study and to publish a reproducible analysis. It is an essential component everywhere from medical science journals to artificial intelligence research. Replication not only confirms studies but also reduces the risk of bad decisions and of expensive research failing. For data scientists and leaders, it is valuable both for journal publications and for organisational decisions.
But since the start of this decade, the scientific community has observed methodological flaws in many published studies. False findings are a primary cause of the crisis in today's scientific community. One prime example: AMGEN, an American biotech company, attempted to replicate the top 50 cancer studies published in leading journals and was able to reproduce only 11% of the results. Events like these led the community to identify similar failures across published studies and to coin the term "Replication Crisis".
Before we take a deep dive into the Replication Crisis, let me clarify the difference between replication and reproducibility. Quoting from Plesser's paper, where the ACM defines both key terms:

Replicability (different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials.
For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.
Reproducibility (Different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials.
For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
If you look closely, reproducibility is simply replication under a different experimental setup: the data are generated and collected differently, yet the goal is to confirm the same results that were originally published. But trial and procedural errors in initial studies have let the replicability crisis bleed into reproducibility as well, which is why the terms "Replication Crisis" and "Reproducibility Crisis" are used almost interchangeably.
Wikipedia defines Replication Crisis as:“The replication crisis (or replicability crisis or reproducibility crisis) is an ongoing (2019) methodological crisis primarily affecting parts of the social and life sciences in which scholars have found that the results of many scientific studies are difficult or impossible to replicate or reproduce on subsequent investigation, either by independent researchers or by the original researchers themselves.
”

xkcd knows!

As a data scientist, one should be aware of the causes of the replication crisis, which can potentially ruin your models and studies.
There is a risk and a chance of error involved in every study a data scientist performs. One should not be afraid to admit mistakes and to identify what went wrong in a publication.
These days, with work funded by big brands and names, there are a large number of unscrupulous researchers who care more about attention and sparkly headlines than about good science. A data scientist can fall into this trap if they are biased towards a particular set of results that might make a headline among their peers. Everyone should keep in mind that no single analysis will reveal the singular truth; multiple iterations and/or reproductions are required before drawing a conclusion.
Identify your results

How do we know whether a study is of mediocre or low quality? A data scientist needs to watch for a few telltale outcomes while replicating or reproducing a study. The following are key ways a replicated/reproduced study can go wrong:

- The replication fails to find an effect that was claimed in the earlier study.
- A new effect is found that was not mentioned in the earlier study.
- Evidence in support of an effect is weaker than that claimed by the researchers.
- The model's performance was overestimated because of a fluke in the data.
- A smaller effect is found than in the original study, and the difference is material.
- A larger effect is found than in the original study, and the difference is material.
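A toy simulation can show why a replication often finds a smaller effect than the original study. The numbers below (a true effect of 0.1, a pilot of 20 observations, a replication of 2,000) are illustrative assumptions, not from any real study: with a small sample, the observed effect can land far from the truth by chance, while a larger independent replication regresses toward it.

```python
# Sketch: small pilot studies can overstate an effect purely by chance.
# TRUE_EFFECT and the sample sizes are hypothetical values for illustration.
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.1   # assumed small true effect
NOISE_SD = 1.0      # assumed measurement noise

def observed_effect(n):
    """Mean of n noisy measurements of TRUE_EFFECT."""
    return statistics.mean(random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(n))

original = observed_effect(20)       # small pilot study
replication = observed_effect(2000)  # larger, independent replication

# The replication estimate sits much closer to the true effect;
# the pilot estimate can be several times larger (or smaller, or negative).
print(f"original: {original:+.3f}, replication: {replication:+.3f}")
```

The standard error of the mean shrinks with the square root of the sample size, which is exactly why a single small study is weak evidence on its own.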
Check quality of studies

Replicating results and then finding issues in the studies should not be our first step; we need to be careful while designing our experimental studies in the first place. You can identify a low-quality or mediocre study with some important checks, which also tell you about the accountability of the author. Of the numerous checks I have looked into, I boiled them down to a few important ones:

- The data scientists don't know how they arrived at a point in the analysis and/or don't share a documented method of study.
- They do not list all the data points they covered and/or the points they excluded.
- They do not mention which model they ran to find the statistical evidence.
- There are errors in programming or reporting.
- The experiments are poorly designed, which includes data leakage.
- p-values are not very well understood.
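Data leakage deserves a concrete picture. A minimal, hypothetical sketch: computing a preprocessing statistic (here, a feature mean used for centering) on the full dataset before splitting lets information from the test rows leak into training. The toy numbers are made up for illustration.

```python
# Sketch of a common leakage bug: the centering statistic is computed over
# train AND test rows, so the "held-out" data influenced preprocessing.
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy feature column; last row is test
train, test = data[:4], data[4:]

# Leaky: mean computed over the whole dataset, test rows included.
leaky_mean = statistics.mean(data)     # 22.0 -- dragged up by the test row

# Correct: statistics come from the training rows only.
clean_mean = statistics.mean(train)    # 2.5

leaky_test = [x - leaky_mean for x in test]
clean_test = [x - clean_mean for x in test]
```

The two pipelines hand the model very different test inputs; the leaky one quietly used the test data during preprocessing, which inflates evaluation results.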
We saw how replication/reproduction allows us to weed out falsely significant findings, and we jotted down the points that can help us discard bad studies and identify mediocre ones. But what if we have to start our own study, build models, and make sure we are performing at our best?
Goodman et al., in their paper What does research reproducibility mean?, started a dialogue for 'A New Lexicon for Research Reproducibility'. I found it fascinating because their methodology makes authors accountable for their actions, and it also lists a few of the differences that affect the approach to reproducibility across scientific domains, such as the degree of determinism, the signal-to-measurement-error ratio, the purposes to which findings will be put, the closeness of fit between hypothesis and experimental design or data, and the consequences of false conclusions.
They mention a few important aspects of keeping your work reproducible. There are three major categories and one minor category of reproducibility to take care of while working on any experiment:

Methods reproducibility is meant to capture the original meaning of reproducibility, that is, the ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results.
Data scientists need to be careful while designing the data flows or pipelines for operations, and keep their sources and methodology well documented.
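In practice, methods reproducibility starts with recording everything another team would need to rerun your pipeline. A minimal sketch, using only the standard library: fix and record the random seed, and write a small run manifest (interpreter version, a hash of the input data, the result) next to the outputs. The manifest fields and filenames here are illustrative conventions, not a standard.

```python
# Sketch: make a computational run re-runnable by pinning the seed and
# recording run metadata alongside the result. Field names are illustrative.
import hashlib
import json
import platform
import random

SEED = 42
random.seed(SEED)

raw_data = b"col1,col2\n1,2\n3,4\n"   # stands in for the real input file

run_manifest = {
    "seed": SEED,
    "python": platform.python_version(),
    "data_sha256": hashlib.sha256(raw_data).hexdigest(),
    # a placeholder "analysis result" -- fully determined by the seed
    "result": sum(random.random() for _ in range(3)),
}

manifest_json = json.dumps(run_manifest, indent=2)
print(manifest_json)
```

With the seed, the data hash, and the environment recorded, an independent group can verify they are rerunning the same computation on the same inputs.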
Results reproducibility refers to what was previously described as “replication,” that is, obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible.
Robustness and generalisability[minor]: Robustness refers to the stability of experimental conclusions to variations in either baseline assumptions or experimental procedures.
It is somewhat related to the concept of generalisability (also known as transportability), which refers to the persistence of an effect in settings different from and outside of an experimental framework.
For example, my baseline values for a currency exchange rate might change between two separate periods, but a generalised model from quantitative economics should still work fine across both.
Inferential reproducibility, not often recognised as a separate concept, is the making of knowledge claims of similar strength from a study replication or reanalysis.
This is not identical to results reproducibility, because not all investigators will draw the same conclusions from the same results, or they might make different analytical choices that lead to different inferences from the same data.
The replication crisis is challenging a lot of research, man-hours, investments, and decisions made over the last few years. Replication has been avoided because it is expensive and its value is not taught in academia. But we should also discuss the importance of replication to the scientific community, and how it can build trust and hold authors/scientists more accountable. Key points to keep in mind to avoid the replication crisis:

- Even though it is expensive, we should perform replications often.
- We should publish more null results that do not support a hypothesis, so that p-hacking can be avoided.
- We need to share our data with the public when publishing, so the data can be studied and rectifications can be suggested.
The ASA (American Statistical Association) has suggested correct uses of p-values, which I will cover in Part II. Until then, I leave you, fellow data scientists, to be responsible and thorough with your data, and to remain unbiased towards your results.