If you’ve ever tried using R in a Docker container (and you really should!), you may have discovered what a wonderful tool it is for putting R into production. With Docker, you can use a Dockerfile to specify how an image of a computer is built, and then you can easily run those images as containers in a repeatable fashion. Unfortunately, building an image for R can be really slow, sometimes taking up to half an hour! The build is slow because it has to install every R package your code needs, and installing R packages on Linux usually means compiling them from source, often with lots of C++ behind the scenes. And you may have to rerun all of these package installation steps every time you make changes to what’s in the image. Over the course of developing a piece of production R code you may have to build the image 50 times, so long build times can really be a drain.
Really excited that https://t.co/VV8S7GkE9H is now publicly available — among other things it provides Linux binaries for CRAN packages, which makes install so so much faster. Learn more at https://t.co/zId043OT7x #rstats— Hadley Wickham (@hadleywickham) July 3, 2020
Thankfully, that changed this week as RStudio announced a new public package manager. Their package manager does several things that are very helpful for putting R in production:
RStudio takes snapshots of all R packages on CRAN several times a week. This means that their packages will always be the same for a given snapshot, and you can trust that your Docker images will be deterministic. So you don’t have to worry that rebuilding a Docker image months in the future will have packages with different code in them. Previously this functionality was available from MRAN (Microsoft’s daily snapshots of CRAN), so now we have two companies taking snapshots of CRAN continuously. Nice.
Their package manager includes Linux binaries. This means that Docker images that pull packages from RStudio’s package manager won’t have to do the laborious task of recompiling all of the package code each time the image is rebuilt. This should absolutely solve the problem of Docker builds taking forever, since compiling the packages from source was the slow part.
And the good news is that if you’re using Rocker images for R in Docker (which you already are if you followed the earlier blog posts Heather and I wrote, or our Keras pet names generator repo, or T-Mobile’s R TensorFlow API repo), then getting the cool new RStudio package manager is a breeze! By updating your Rocker base image to R version 4.0.0 or above, you’ll be using RStudio’s package manager.
But does RStudio’s package manager truly make a difference in build time? I decided to find out. I ran a test where I made a Docker image that only installed a single R package: dplyr. The package dplyr always takes foreeeeever to build. It has both a lot of R dependencies and a lot of C++ to compile. It is almost always the longest package to install in any R Docker image I’ve personally worked with.
For the experiment I built two different Docker images, one without RStudio’s package manager and one with it, and timed just the step of installing dplyr:
# Uses MRAN for packages (should be slow)
FROM rocker/r-ver:3.6.3
RUN install2.r --error dplyr
# Uses RStudio for packages (should be faster)
FROM rocker/r-ver:4.0.0
RUN install2.r --error dplyr
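The post doesn’t show how the builds were timed, so here is one rough way to reproduce the comparison in shell. This is a sketch under assumptions: the two Dockerfiles above are saved as `Dockerfile.mran` and `Dockerfile.rspm` (hypothetical file names), and `time_cmd` is a helper made up for this example. Pulling the base images first and building with `--no-cache` keeps downloads and cached layers out of the measurement, so the elapsed time is mostly the `install2.r` step.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and print the whole
# seconds it took. The command's exit status is passed through.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo $((end - start))
    return $status
}

# Example usage (requires Docker; file and tag names are placeholders):
#
#   docker pull rocker/r-ver:3.6.3
#   docker pull rocker/r-ver:4.0.0
#   time_cmd docker build --no-cache -f Dockerfile.mran -t dplyr-mran .
#   time_cmd docker build --no-cache -f Dockerfile.rspm -t dplyr-rspm .
```

Whole-second resolution is coarse, but it is plenty for builds that differ by minutes. Numbers will vary with network speed and hardware, so don’t expect to match the timings in this post exactly.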
The results? The MRAN installation took 4 minutes and 33 seconds, while the RStudio package manager one took 1 minute and 29 seconds. That’s less than a third of the build time, and the only work required was to change 3 characters in the Dockerfile.
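To double-check the “less than a third” claim, the arithmetic can be spelled out in a few lines of shell. The only inputs are the two timings reported above; `to_seconds` is a helper name invented for this sketch.

```shell
#!/bin/sh
# Convert minutes and seconds into total seconds.
to_seconds() {
    echo $(( $1 * 60 + $2 ))
}

mran=$(to_seconds 4 33)   # MRAN build: 4 min 33 s
rspm=$(to_seconds 1 29)   # RStudio package manager build: 1 min 29 s

echo "MRAN build: ${mran}s"
echo "RStudio package manager build: ${rspm}s"

# Shell arithmetic is integer-only, so awk handles the ratios.
awk -v slow="$mran" -v fast="$rspm" 'BEGIN {
    printf "fraction of original time: %.3f\n", fast / slow
    printf "speedup: %.2fx\n", slow / fast
}'
```

That works out to 89 seconds against 273 seconds, so the faster build takes just under a third of the time, or a bit more than a 3x speedup.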
So hopefully, with this new system in the world, 30-minute R Docker builds will be a thing of the past, and something that will be passed down in data science lore as “the dark times.”