A shallow benchmark of R data compression methods

Protocol

There are several reasons why you might want to save R objects, such as data frames, to a file. Depending on the use case, your tradeoff between execution time and resulting file size will be different.

Here we propose a quick (and not that clean) benchmark of some of the ways R can save data. This benchmark is based on three criteria: the time needed to write the object to disk, the resulting file size, and the time needed to read it back into R.

First, we randomly generate a medium-sized data frame (10 million rows) with columns of different types:

strings <- c("Nulla nulla purus, sagittis ornare bibendum vitae, imperdiet a ipsum. Nam non dolor viverra, faucibus sapien nec, maximus elit. Aliquam consequat odio urna, in tincidunt leo rhoncus eget. Pellentesque habitant morbi tristique senectus et netus.",
             "Aenean semper aliquet sapien. Vivamus tempor consequat aliquam. Proin consectetur tincidunt augue sit amet congue. Curabitur ac est a elit ultrices tempus. Curabitur molestie commodo malesuada. Suspendisse vulputate nibh et accumsan fringilla. Proin a urna.",
             "Curabitur maximus ullamcorper mi, ac condimentum erat luctus vitae. Maecenas laoreet venenatis erat. Vestibulum viverra ullamcorper diam ac sodales. Donec pulvinar, lectus non placerat pretium, elit lectus finibus nulla, molestie sagittis leo quam eget mauris. Suspendisse et nisl dictum, lacinia leo vitae, fermentum sapien. Donec.",
             "Phasellus vestibulum feugiat purus, sed dictum libero tincidunt quis. Suspendisse nec leo sollicitudin, efficitur mauris id, venenatis nibh. Morbi nulla.")

set.seed(1337)
n <- 10000000
d <- data.frame(
  num1 = rnorm(n),                                                     # numeric
  num2 = sample(1:10000, n, replace = TRUE),                           # integer
  char = sample(c("Green", "Red", "Blue", "Yellow", "Black"), n, replace = TRUE),  # character
  fac  = factor(sample(c("Green", "Red", "Blue"), n, replace = TRUE)), # factor
  logical = sample(c(TRUE, FALSE), n, replace = TRUE),                 # logical
  ltxt = sample(strings, n, replace = TRUE),                           # long character strings
  stringsAsFactors = FALSE
)

The resulting object size is 343.3 MB (in terms of R memory usage). Not terribly large, but enough for the benchmark to take quite some time to run.
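This figure can be checked with object.size(); the exact call below is my reconstruction:

format(object.size(d), units = "Mb")
## [1] "343.3 Mb"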

This object is saved to disk, and then read back into R. We record the time needed to compress and write the object, the resulting file size, and the time needed to read it back. To mitigate variations due to other processes on my laptop, each operation has been run four times and the median time taken. Furthermore, the benchmark has been run with as few other processes as possible, and no active network connection.
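As an illustration, here is a minimal sketch of such a timing protocol. benchmark_format() and its write_fun/read_fun arguments are hypothetical helpers of mine, not the code actually used for this post:

## Hypothetical helper: time a write/read round trip 'reps' times and
## report the median elapsed times plus the file size on disk
benchmark_format <- function(write_fun, read_fun, file, reps = 4) {
  write_times <- replicate(reps, {
    if (file.exists(file)) file.remove(file)
    unname(system.time(write_fun(file))["elapsed"])
  })
  read_times <- replicate(reps, unname(system.time(read_fun(file))["elapsed"]))
  c(write_s = median(write_times),
    size_mb = file.size(file) / 1024^2,
    read_s  = median(read_times))
}

## Example: saveRDS()/readRDS() with default (gzip) compression
benchmark_format(function(f) saveRDS(d, f), readRDS, "d.rds")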

The following data formats have been tested: CSV (written and read with base R, readr and data.table), Rdata (save() and load() with gzip, bzip2 and xz compression at each compression level), and RDS (saveRDS() and readRDS() with gzip, bzip2 and xz compression).

Limitations

Of course, there are several limitations to this benchmark: it uses a single, artificially generated data frame, the timings are specific to my laptop, and only a handful of formats and packages are covered.

With these limitations in mind, let's show the results anyway.

Results

Here are the results for the CSV file format:

Format   Method / Compression                   Write time (s)   Size (MB)   Read time (s)
CSV      write.csv + read.csv                               41        2885             105
CSV      readr::write_csv + readr::read_csv                 81        2748              22
CSV      readr::write_csv + data.table::fread               92        2748              13
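For reference, the three round trips above correspond to calls of the following form (the file name and the row.names = FALSE option are assumptions on my part):

## Base R
write.csv(d, "d.csv", row.names = FALSE)
d2 <- read.csv("d.csv")

## readr
readr::write_csv(d, "d.csv")
d2 <- readr::read_csv("d.csv")

## readr for writing, data.table for reading
readr::write_csv(d, "d.csv")
d2 <- data.table::fread("d.csv")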

And the results for Rdata and RDS. Yes, I know, I shouldn't be allowed to play with colors:

Format   Method / Compression   Compression level   Write time (s)   Size (MB)   Read time (s)
Rdata    gzip                                   1               20         168             13
Rdata    gzip                                   2               21         164             13
Rdata    gzip                                   3               22         161             14
Rdata    gzip                                   4               29         134             13
Rdata    gzip                                   5               31         131             13
Rdata    gzip                                   6               37         126             13
Rdata    gzip                                   7               48         125             13
Rdata    gzip                                   8               84         122             13
Rdata    gzip                                   9              128         121             14
Rdata    bzip2                                  1              370         134             44
Rdata    bzip2                                  2              423         118             48
Rdata    bzip2                                  3              462         112             50
Rdata    bzip2                                  4              481         109             50
Rdata    bzip2                                  5              516         107             52
Rdata    bzip2                                  6              529         106             51
Rdata    bzip2                                  7              547         105             52
Rdata    bzip2                                  8              577         105             52
Rdata    bzip2                                  9              621         104             52
Rdata    xz                                     1               61         120             20
Rdata    xz                                     2               77         115             20
Rdata    xz                                     3              105         114             20
Rdata    xz                                     4              149         116             20
Rdata    xz                                     5              196         115             21
Rdata    xz                                     6              314         113             20
Rdata    xz                                     7              315         112             21
Rdata    xz                                     8              314         112             20
Rdata    xz                                     9              326         112             20
RDS      gzip                             default               37         126             13
RDS      bzip2                            default              638         105             67
RDS      xz                               default              317         113             20
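As a sketch, each Rdata row corresponds to a save()/load() pair and each RDS row to a saveRDS()/readRDS() pair, along these lines (file names are placeholders):

## Rdata: compression method and level are explicit arguments to save()
save(d, file = "d.Rdata", compress = "xz", compression_level = 3)
load("d.Rdata")

## RDS: saveRDS() takes the compression method and uses its default level
saveRDS(d, "d.rds", compress = "bzip2")
d2 <- readRDS("d.rds")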

And here is a plot which tries (and fails, I must say) to summarize all this. Shameless self-promotion: it has been generated with the wonderfully fabulous scatterD3 package.

Conclusion

So, what can we say by looking at these too-colorful tables and this not-very-readable plot? CSV clearly comes at a cost: the files are about twenty times bigger than any of the compressed binary formats, and base read.csv is by far the slowest reader, even if data.table::fread brings read times down to the level of the binary formats. Among the binary formats, bzip2 compresses slightly better than gzip but at a very high cost in write and read times, while xz offers good compression for more reasonable write times.

But the good news here is that the default values for these R functions seem perfectly suitable. If you use save() or saveRDS() without any additional arguments, gzip compression at level 6 is used, which seems a reasonable tradeoff between speed and size. If you use xz or bzip2, the maximum compression level 9 is used, which seems sound: if you choose one of these compression methods, you most likely favor file size over writing speed.
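To make these defaults explicit (behavior as documented in base R's save() and saveRDS() help pages):

## Both calls use gzip compression at level 6 by default
save(d, file = "d.Rdata")
saveRDS(d, "d.rds")

## With bzip2 or xz, save() defaults to the maximum compression level (9)
save(d, file = "d.Rdata", compress = "xz")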