R Docker faster

R Docker images will build much faster thanks to RStudio's package manager


If you’ve ever tried using R in a Docker container (and you really should!), you may have discovered how wonderful a tool it is for putting R into production. With Docker, you can use a Dockerfile to specify how an image of a computer is built, and then you can easily run those images as containers in a repeatable fashion. Unfortunately, the process of building an image for using R can be really slow, sometimes taking up to half an hour! The process has been slow because building an image requires installing each R package your code needs. Compiling R packages on Linux can be really slow because it involves compiling lots of C++ on the backend. And you may have to rerun all of these package installation steps every time you make changes to what’s in the image. Over the course of developing a piece of production R code you may have to build the image 50 times, so long build times can really be a drain.

Thankfully, that changed this week as RStudio announced a new public package manager. Their package manager does several things that are very helpful for putting R in production:

And the good news is that if you’re using Rocker images for R in Docker (which you already are if you followed Heather’s and my blog posts from earlier, or our Keras pet names generator repo, or T-Mobile’s R TensorFlow API repo), then getting the cool new RStudio package manager is a breeze! By updating your Rocker base image to R version 4.0.0 or above, you’ll be using RStudio’s package manager.

But does RStudio’s package manager truly make a difference in build time? I decided to find out. I ran a test where I made a Docker image that only installed a single R package: dplyr. The package dplyr always takes foreeeeever to build. It has both a lot of R dependencies and a lot of C++ to compile. It is almost always the slowest package to install in any R Docker image I’ve personally worked with.

For the experiment I built two different Docker images, one without RStudio’s package manager and one with it, and timed just the step of installing dplyr:

# Uses MRAN for packages (should be slow)
FROM rocker/r-ver:3.6.3
RUN install2.r --error dplyr


# Uses RStudio for packages (should be faster)
FROM rocker/r-ver:4.0.0
RUN install2.r --error dplyr
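If you want to try a comparison like this yourself, here is a minimal sketch of one way to time each build. It is not necessarily how the numbers in this post were measured. The function name `time_build`, the file names `Dockerfile.mran` and `Dockerfile.rspm`, and the image tags are all hypothetical. It also times the whole build rather than only the install step, but with the base image pulled ahead of time the `RUN install2.r` step should account for nearly all of it.

```shell
# Sketch only: the function name, file names, and tags are assumptions.
time_build() {
  base_image="$1"   # e.g. rocker/r-ver:3.6.3
  dockerfile="$2"   # e.g. Dockerfile.mran (hypothetical file name)
  tag="$3"          # e.g. dplyr-mran (hypothetical tag)

  # Pull the base image first so the download is not counted in the timing.
  docker pull "$base_image"

  # --no-cache forces the package installation step to run again
  # instead of reusing a cached layer from an earlier build.
  time docker build --no-cache -f "$dockerfile" -t "$tag" .
}

# Example calls, assuming each Dockerfile above is saved under these names:
# time_build rocker/r-ver:3.6.3 Dockerfile.mran dplyr-mran
# time_build rocker/r-ver:4.0.0 Dockerfile.rspm dplyr-rspm
```

Without `--no-cache`, a second build would reuse the cached layer and finish almost instantly, which would make the comparison meaningless.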

The results? The MRAN installation took 4 minutes and 33 seconds, while the RStudio package manager one took 1 minute and 29 seconds. That’s less than a third of the build time, and the only work required was to change 3 characters in the Dockerfile.
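As a quick sanity check on that comparison, the two timings can be converted to seconds and compared directly. This is just arithmetic on the numbers reported above; the variable names are mine.

```shell
# Build times reported above, converted to seconds.
mran_seconds=$((4 * 60 + 33))   # 4 minutes 33 seconds
rspm_seconds=$((1 * 60 + 29))   # 1 minute 29 seconds

# Share of the original build time, as a whole-number percentage
# (shell integer division rounds down).
percent_of_original=$((rspm_seconds * 100 / mran_seconds))

# Time saved on this single install step.
seconds_saved=$((mran_seconds - rspm_seconds))

echo "MRAN: ${mran_seconds}s, RStudio package manager: ${rspm_seconds}s"
echo "New build takes about ${percent_of_original}% of the old time"
echo "Saved ${seconds_saved}s on one package install"
```

That works out to 273 seconds versus 89 seconds, or about 32% of the original time, with 184 seconds saved on a single package.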

So hopefully, with this new system in the world, 30-minute R Docker builds will be a thing of my past, and something that will be passed down in data science lore as “the dark times.”