Pretending to have a SQL database with RStudio
All the fun of SQL with nothing to manage
Here is a data team riddle for you. Suppose you had an analyst on your team who loves working with SQL. Unfortunately, none of your data is in a SQL database, instead it is scattered across csv files, in AWS S3 buckets, whatever. The data isn’t particularly large, but the fact that it’s scattered and not in a SQL database is making it a real hassle for the analyst to do their job. And because of IT requirements, you can’t just go make a SQL database to put it all in. How can you, intrepid data team member, create a way for this analyst to easily write SQL queries on the data in a nice maintainable GUI without you having to set up and run a complex SQL database yourself? And without the analyst having to learn another programming language besides SQL?
One answer it turns out, is with RStudio.
While RStudio Desktop and RStudio Server are integrated development environments that were originally made to handle R, they can now interoperate with other languages surprisingly well, including Python and SQL. In this riddle of a situation where an analyst wants SQL, we could write a small R script that loads the data from whatever format it’s originally stored in and into an in-memory SQLite database using R. Then, once we have in-memory SQLite tables of data, the analyst can query the code using SQL commands to their delight. By wrapping it all in an RMarkdown file all of the commands can be run and the analysis output can be recorded and saved in an easy way.
First, it’s only a little bit of R code to create a SQL database in-memory in R, then load tables into it. All you need to do is use the DBI and RSQLite packages. The R code below does two things. First, it creates a connection con
to a new SQLite in-memory database with DBI and RSQLite. Second, it loads some tables from csv files into that database. Each read.csv
command reads a table from a csv file, then the dbWriteTable
command wrapped around places the table into the database. Below is an example R script that sets this up.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "customers", read.csv("customer_data.csv"))
dbWriteTable(con, "orders", read.csv("orders.csv"))
R is really flexible and can load data from tons of places! So if instead of a csv file you had an Excel file you could use readxl::readxlsx
or you could pull data from the cloud or do a million other things to get data into the table. The analyst could learn a few commands to load data without really having to learn R in full and instead mostly staying in the SQL world.
Now from the same R script, you can use SQL queries on those tables with the dbGetQuery
command:
dbGetQuery("SELECT * FROM customers LIMIT 10")
While it has that slightly pesky dbGetQuery()
wrapper, this is otherwise just a standard SQL query that the analyst should feel comfortable writing. So for each SQL query the analyst wants to write, they can use a dbGetQuery()
command.
But wait, it gets way better with RMarkdown!
So that’s pretty decent, but it becomes amazing when you put it in an RMarkdown file. An RMarkdown file lets you split your code into chunks and include formatted text between the code. The code chunks don’t have to be just R, they can also be SQL. Which means you can write an RMarkdown file like this:
## Here is the setup code!
```{r setup}
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "customers", read.csv("customer_data.csv"))
dbWriteTable(con, "orders", read.csv("orders.csv"))
```
## Here is the sql code!
```{sql, connection=con}
SELECT * FROM customers LIMIT 10
```
## And here is different sql code!
```{sql, connection=con}
SELECT * FROM orders LIMIT 10
```
So now each of the analysts queries can be a nice SQL chunk in the RMarkdown file with descriptive text before and after. And the Rmarkdown file can be knitted into a nice HTML output that includes all the queries, the text, and the outputs:
This Rmd file and the corresponding HTML output is a great way for an analyst to do SQL work and save it (in the Rmd file) as well as save a record of what the actual output was (in the HTML file). It really beats the more traditional method of “save each output as a separate csv” or the tragic “copy and paste the results from SQL Server Management Studio into Excel” that are often used by analysts with SQL
The somewhat unusual case of a SQL using analyst who doesn’t have a SQL database is rare, but the case of “a person doesn’t quite have the data in a way they want it” is common. This example goes to show that R and RStudio are extremely good tools for bridging that gap between desires and reality with a toolset that data experts have built up to be incredibly flexible and customizable.