Modernizing Academic Data Science

Four principles from software engineering to improve scientific reproducibility

Patrick Beukema · Dec 14

Summary

Reproducibility is essential to scientific progress, yet most results cannot be reproduced. Today's research papers rely heavily on programmed analysis pipelines and statistical models. For decades, software engineering has evolved best practices for achieving high-quality, reproducible code. However, scientists are not typically trained in these practices. This post covers four fundamental principles that will promote greater reproducibility in scientific research: code review, testing, version control, and documentation.

[Photo by Conner Baker on Unsplash]

The Problem of Irreproducible Research

Towards the end of my graduate career I researched the reproducibility crisis as a side project, which, for an increasingly disillusioned Ph.D. student, was about as smart as pouring gasoline on a fire. The fact that some published research findings are false is neither new nor particularly alarming. What is concerning, however, is the mounting evidence that most published research findings are false.

Psychology bears the brunt of the bad press, but the crisis affects most fields. Statistical power across all of neuroscience, for example, is about 20 percent. And while the latest posterior odds of the power pose or extrasensory perception probably do not keep you up at night, the consistent failure to replicate findings in cancer biology, genetics, or pharmacology may be cause for concern.

[Comic by KC Green]

Academia's warped incentive structure lies at the heart of much of the statistical abuse. Significant results lead to money and fame, while insignificant results lead to nothing at all. Selective reporting, facilitated by the dichotomization of evidence, is both easy and professionally convenient.
The problem is not just explicit p-hacking but also, and more insidiously, the implicit hacking that results from abundant researcher degrees of freedom and our bias towards positive results. This is why reproducibility is a concern in so many fields, including machine learning and AI.

Towards Reproducible Research

Growing awareness of the lack of reproducibility has led to a number of recent proposals, such as moving the significance threshold, supplementing analyses with effect sizes and confidence intervals, or switching to Bayes factors. These are all statistically principled recommendations, and yet deeply unsatisfying. Why? Because they are at best stopgaps, and a lasting solution requires a cultural transition.

One place to start that transition is the software development that is central to many academic research projects. It is therefore logical to borrow the principles from the software industry that promote reproducibility, transparency, and robustness:

Code review • Testing • Version control • Documentation

Why are these practices practically nonexistent in academia? One reason is that graduate schools have not kept up with the exponential growth in data. Even among the highest-ranked neuroscience Ph.D. programs, for example, coursework in quantitative methods and programming fundamentals is optional or outdated. Harvard's quantitative methods boot camp for incoming graduate students is still focused purely on Matlab. Incidentally, closed-source code and expensive licenses are probably not the greatest recipe for reproducible science.

The result is that students must discover the best practices of modern data science and development on their own: labs have adopted the programming languages without the supporting machinery. Academic labs are not industrial software development teams, nor should they be.
But the tools used in industry to ensure reproducibility, functionality, and transparency are highly relevant to academia and will foster more reproducible science.

Core Principles to Improve Scientific Reproducibility

1) Code Review

Consider how closely a manuscript is scrutinized before publication compared to how little attention is paid to the code that produces the results. How many people, besides the original author, ever even look at the code?

On development teams, code is routinely peer reviewed by multiple people. This helps ensure the intended functionality, the identification of bugs, and adherence to standards.

How do you do it? Read the code line by line and ask yourself:

- Am I able to understand the code easily?
- Does the code do what it is supposed to do? Run it.
- Does the code follow standards (e.g., the Python style guide or the R style guide)?
- Is the code modular?

The reviewer should be familiar with the programming language but may not have expertise in the exact physiological, statistical, or mathematical model. It is therefore a good idea to complement review with a bulletproof testing regimen.

2) Testing

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

There are many kinds of software testing, both manual and automatic, covering varying degrees of granularity. They generally fall into three categories: unit tests for individual functions or classes, integration tests for multiple components, and end-to-end tests for an entire project. The purpose of testing is to verify that a piece of software behaves as intended and to anticipate how it could break (like this cancer study that was sunk by a single line of code).

In the context of academic data science, one of the most common concerns is spurious findings, and testing helps guard against them. For example, linear discriminant analysis (LDA) is routinely used as a classifier in many research papers.
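One of LDA's assumptions is that its input features are not collinear. A guard test for that assumption might look like the following minimal, NumPy-based sketch (the function name and the 0.95 correlation threshold are illustrative choices, not universal standards):

```python
import numpy as np

def assert_no_collinearity(X, threshold=0.95):
    """Raise a ValueError if any pair of features in X is highly correlated.

    X : 2-D array of shape (n_samples, n_features)
    threshold : absolute pairwise correlation above which features are
        treated as collinear (0.95 is an illustrative default).
    """
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation matrix
    np.fill_diagonal(corr, 0.0)           # ignore each feature's self-correlation
    worst = np.abs(corr).max()
    if worst > threshold:
        i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
        raise ValueError(
            f"Features {i} and {j} are collinear (|r| = {worst:.3f}); "
            "LDA assumes non-collinear features."
        )

# Two independent features pass; a (near-)duplicated feature fails.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
assert_no_collinearity(np.column_stack([a, b]))   # passes silently
try:
    assert_no_collinearity(np.column_stack([a, a + 1e-6 * b]))
except ValueError as e:
    print(e)
```

Running a check like this at the top of an analysis script stops the pipeline at the violated assumption instead of letting it silently produce an invalid classifier.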
However, the classification results can easily be misinterpreted if the assumptions of the model are not satisfied. One of the most critical assumptions is that the features are not collinear, so a test that halts execution and raises an error when that assumption is violated guards against silently spurious results.

Testing also enables much faster debugging. This is especially salient when relying on complex code bases that depend on rapidly evolving libraries, perhaps integrated with legacy code from another lab.

Good testing requires thinking carefully about the data and the possible failure points, which makes writing tests a useful exercise in itself. Here is a more detailed guide to getting started with testing in Python.

3) Version control

Version control systems answer the question "who did what, and when?" by recording and maintaining versions of the code that can be compared across time. Version control has been around for about 50 years, yet we have somehow found a way to keep naming files version_final-edits_december_2018. If you are not familiar with version control, you are creating more work for your colleagues and your future self.

Currently one of the most popular systems for version control is Git, used in conjunction with GitHub (a Git hosting service).

"New Developer? You should've learned Git yesterday. My number one piece of advice for new developers: Learn Git." (codeburst.io)

Learning Git is an investment that immediately pays for itself. It prevents losing your work, forgetting what worked yesterday, having no functioning version of your code to share, and so on.

Try not to let the many thousands of "Easy Git Guides" convince you that it is some trivial aspect of the software workflow that everyone already knows perfectly well. Even Linus Torvalds, who created Git, called it a "goddamn idiotic truckload of sh*t" when it breaks.
It takes time to use Git effectively; here are some resources to help.

4) Documentation

You may have discovered the missing key to strong AI, but no one can tell if your code is impossible to understand. While there are those who prefer to manually patch binaries, it is far more convenient to read and edit code that includes text strings explaining what that code does. Those helpful strings are called docstrings. Here is an example of a docstring for a function written in Python:

[Example docstrings]

The text between the triple quotes (""") explains what the function does and describes the expected inputs and outputs along with their associated types. In Python, the convention is to include a docstring as the first statement of every module, function, class, and method.

You are not obligated to follow any particular rules when writing documentation, but using a standard, like the one used by the NumPy library in Python, allows for automatic generation of useful reference guides. The rule of thumb is to use whatever standard the team you are joining is already accustomed to; the specifics matter less than adherence and consistency.

Documentation is not the only aspect of readability, but it is a very powerful starting point.

To summarize: software that is reviewed, tested, version controlled, and documented will improve the robustness and reproducibility of results. These principles should not curtail creative exploration but rather help remove unwanted stochasticity.

If you are interested in further optimization, there are other useful practices, especially containerization, packaging, and continuous integration, that would increase the value of academic research projects. It is worth remembering that reproducibility is a spectrum, and that small changes can have large impacts.

Please reach out if you would like to chat; you can find me online here.
