Garrett answered this question in his excellent talk on R Markdown, explaining how reproducible work is essential for good data science and good business.
I’ll echo David Smith and agree that in this case a slide is worth a thousand words:
[Image: Evidence for p-hacking and the need for reproducing published results]
Other talks that built on this theme were Gabor’s talk on package installations, Jim Hester’s talk on dependencies, Mike Smith’s talk on rmarkdown, and of course, Karthik’s talk on reproducible research.
(There may have been others; I can’t wait to go watch them all!)
My talk on RStudio Package Manager built on similar themes.
One thing you may notice about this list: talking about reproducibility often leads to talking about R packages.
Why?
[Image: Lionel explaining why R packages can be broken even if they installed successfully]
Consider the case of the poor bloke outlined in this rlang issue.
(Warning: The following story is based on real events).
They created an awesome analysis in R.
Time passed, decisions were made, and then someone asked if they could double-check and re-run the analysis.
They open up their beautifully version controlled code, click “run”, and watch in horror as dplyr returns a terrifying red error message.
Code that had worked no longer worked.
The same code that had worked no longer worked.
Having a package management strategy is key to avoiding this type of pain.
It isn’t the sexiest part of our job, but I think all of us as data scientists have a responsibility to think about package management.
Need help getting the discussion started? Try installing the countdeps package and running the function countdeps() from within your project.
Tweet @rstudio with the results and tag #gottacountemall.
[Image: Invitation to join the #gottacountemall challenge]
Need an even more controversial conversation starter? Try this on: “Docker doesn’t solve our dependency problem”.
I bet you’ll get some interesting feedback.
[Image: Why Docker can fail to reproduce R environments]
/begin sidebar
Docker is a powerful tool for specifying what a compute environment should look like.
However, a Dockerfile that contains the line install.packages can give different results depending on when the Dockerfile is built into an image.
Unless you are careful about keeping old images around, you could be headed for an unpleasant surprise.
Instead, consider Docker to be a tool that can complement a package management strategy.
For instance, you may replace your install.packages command with a packrat restore, or a reference to a frozen repository.
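As a rough sketch of those two options, the R code below could stand in for a bare install.packages call, for example when run through Rscript during an image build. The repository URL is a hypothetical placeholder, not a real endpoint, and the packrat branch assumes the project was previously snapshotted so that a packrat/packrat.lock file exists.

```r
# Option 1: point R at a frozen repository.
# NOTE: this URL is a hypothetical placeholder; substitute the snapshot
# URL of whatever frozen repository your team maintains.
frozen_repo <- "https://packages.example.com/cran/2019-01-31"

# install.packages() reads its default repository from the "repos" option,
# so later installs resolve against the snapshot, not a live CRAN mirror.
options(repos = c(CRAN = frozen_repo))

# Option 2: restore the versions recorded in a packrat lockfile.
# Assumes packrat is installed and the project was snapshotted earlier
# (packrat::snapshot() writes packrat/packrat.lock).
if (requireNamespace("packrat", quietly = TRUE) &&
    file.exists(file.path("packrat", "packrat.lock"))) {
  packrat::restore()
}
```

Either way, the package versions are pinned by something under your control rather than by the date the image happened to be built.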
/rant
To summarize, I am incredibly excited for R in 2019.
I predict we’ll see R running in production, data science teams growing and embracing code, and a flourishing of an already brilliant community.
Let’s be sure we do our due diligence and reproduce all the great parts of our work.
Let’s talk about reproducibility.
Let’s talk about package management.