A shallow benchmark of R data frame export/import methods

Publication date

30 August 2019

Protocol

There are several reasons why you would want to save a data frame to a file:

  • sending it or putting it online for further work or public sharing
  • temporarily storing long computation results for later use

In each case, the tradeoff between execution time and resulting file size is different: when working locally, the time taken to write and read is crucial; when sharing the data frame, the file size matters more.

Here we propose a quick (and not that clean) benchmark of some of the ways R can save data frames. This benchmark is based on three criteria: writing time, reading time and actual file size on disk.

The benchmark is produced from a randomly generated medium-size data frame with different column types: integers, real numbers, short and long strings, factors, logicals and dates. The resulting object has 8 million rows and takes around 350 MB of R memory. Not terribly large, but enough for the benchmark to take quite some time to run.
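
As an illustration, such a test data frame could be built along the following lines (a sketch only: the column definitions, the column names and the use of stringi are assumptions, not the original benchmark code):

```r
# Sketch: build a random test data frame similar to the one described above.
# Column definitions, names and the stringi helper are illustrative assumptions.
library(stringi)

n <- 8e6  # 8 million rows

df <- data.frame(
  int   = sample.int(1e5, n, replace = TRUE),                          # integers
  num   = rnorm(n),                                                    # real numbers
  short = stri_rand_strings(n, 5),                                     # short strings
  long  = stri_rand_strings(n, 50),                                    # long strings
  fact  = factor(sample(letters[1:10], n, replace = TRUE)),            # factor
  bool  = sample(c(TRUE, FALSE), n, replace = TRUE),                   # logical
  date  = as.Date("2019-01-01") + sample.int(365, n, replace = TRUE),  # dates
  stringsAsFactors = FALSE
)

format(object.size(df), units = "MB")  # in the few-hundred-MB range
```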

This object is saved to disk, then read back into R. We record the time taken to compress and write the object, the resulting file size, and the time taken to read it back. To mitigate variations due to other running processes, the operation is run seven times for each benchmark¹, and the median time is kept.
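
For reference, the measurement for one case (here RDS with gzip compression) could look roughly like the sketch below; this is assumed code mirroring the protocol, not the original script:

```r
# Sketch of the protocol for a single format.
# time_median() is an assumed helper: run the operation `times` times, keep the median.
time_median <- function(f, times = 7) {
  median(vapply(seq_len(times),
                function(i) system.time(f())[["elapsed"]],
                numeric(1)))
}

path <- tempfile(fileext = ".rds")

write_time <- time_median(function() saveRDS(df, path, compress = "gzip"))
size_mb    <- file.size(path) / 1024^2
read_time  <- time_median(function() readRDS(path))

c(write = write_time, size = size_mb, read = read_time)
```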

The following data formats have been tested (a sketch of representative calls follows the list):

  • CSV (base, readr, data.table, vroom and arrow functions)
  • Rdata (using save and load with the three available compression algorithms (gzip, bzip2 and xz) and, for each algorithm, the nine possible compression levels)
  • RDS (using saveRDS and readRDS with the same three compression algorithms as above)
  • feather (feather and arrow functions)
  • arrow (arrow functions)
  • parquet (arrow functions)
  • fst (fst functions)
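
To give an idea of the calls involved, here is a sketch using the functions listed above (file names are placeholders and df is the test data frame; this is illustrative code, not the original benchmark script):

```r
# CSV
data.table::fwrite(df, "df.csv")
dt <- data.table::fread("df.csv")
readr::write_csv(df, "df.csv")
tb <- readr::read_csv("df.csv")
vroom::vroom_write(df, "df.csv", delim = ",")
vr <- vroom::vroom("df.csv")

# Rdata and RDS (compress can be "gzip", "bzip2" or "xz")
save(df, file = "df.Rdata", compress = "xz")
load("df.Rdata")
saveRDS(df, "df.rds", compress = "gzip")
df2 <- readRDS("df.rds")

# feather, parquet and fst
arrow::write_feather(df, "df.feather")
ft <- arrow::read_feather("df.feather")
arrow::write_parquet(df, "df.parquet")
pq <- arrow::read_parquet("df.parquet")
fst::write_fst(df, "df.fst")
fs <- fst::fst("df.fst")  # lazy handle on the file
```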

Limitations

Obviously, there are several limitations to this benchmark:

  • results only give a trend; the numbers will obviously be different on another computer. Times are rounded to the second because the measurement is not precise enough to go further.
  • the test data may not be large enough to show compression gains above a certain compression level, or to reveal speed differences between some functions.
  • with different data types (only numerical variables for example), compression results may be different.
  • some of the read methods are lazy: data are not read all at once, but only when needed, so their measured read times are very low. This benchmark is not suitable for comparing lazy and non-lazy methods.

Results

| Format  | Package    | Functions                     | Compression / options | Write time (s) | Size (MB) | Read time (s) | Lazy? |
|---------|------------|-------------------------------|-----------------------|----------------|-----------|---------------|-------|
| CSV     | base       | write.csv / read.csv          |                       | 75             | 2357      | 73            |       |
| CSV     | readr      | write_csv / read_csv          |                       | 77             | 2265      | 32            |       |
| CSV     | data.table | fwrite / fread                |                       | 6              | 2238      | 6             |       |
| CSV     | vroom      | vroom_write / vroom           | altrep_opts = FALSE   | 19             | 2253      | 9             |       |
| CSV     | vroom      | vroom_write / vroom           |                       | 18             | 2253      | 2             | Lazy  |
| CSV     | arrow      | read_csv_arrow                |                       | NA             | NA        | 8             |       |
| Rdata   | base       | save / load                   | Uncompressed          | 7              | 2245      | 8             |       |
| Rdata   | base       | save / load                   | bzip2                 | 612            | 87        | 85            |       |
| Rdata   | base       | save / load                   | gzip                  | 45             | 107       | 14            |       |
| Rdata   | base       | save / load                   | xz                    | 301            | 93        | 20            |       |
| RDS     | base       | saveRDS / readRDS             | Uncompressed          | 7              | 2245      | 8             |       |
| RDS     | base       | saveRDS / readRDS             | bzip2                 | 580            | 87        | 70            |       |
| RDS     | base       | saveRDS / readRDS             | gzip                  | 46             | 107       | 13            |       |
| RDS     | base       | saveRDS / readRDS             | xz                    | 286            | 94        | 21            |       |
| Feather | feather    | write_feather / read_feather  |                       | 5              | 2128      | 4             |       |
| Feather | arrow      | write_feather / read_feather  |                       | 4              | 2105      | 5             |       |
| Arrow   | arrow      | write_arrow / read_arrow      |                       | 4              | 2105      | 5             |       |
| Parquet | arrow      | write_parquet / read_parquet  |                       | 4              | 87        | 6             |       |
| fst     | fst        | write_fst / fst               | default               | 2              | 214       | 0             | Lazy  |
| fst     | fst        | write_fst / fst               | Uncompressed          | 3              | 2137      | 0             | Lazy  |

Graph

Here is a plot which tries (and fails, I think) to summarize the results. Values have been standardised to a 0-100 scale. Thanks to g2r you can click on legend items to hide/show them.

Conclusions

What can we say by looking at this too-colorful table and not-very-readable plot?

  • CSV files are huge, much bigger than the actual object size in R memory (nothing surprising). But data.table functions are able to write and read them incredibly fast. vroom is a bit slower than data.table, but one of its main features, which has been disabled for this benchmark, is to lazily load data from the file, so the read time value may not do it justice².
  • There is no real difference between Rdata and RDS formats. Uncompressed versions are quite fast. gzip compression, which is the default, offers a good tradeoff between size and speed.
  • Some of the new formats on the block, namely feather and arrow, are extremely fast, at the expense of a big file size, almost the same as CSV.
  • parquet gives really impressive results: it is one of the fastest formats, with the smallest file size… what kind of black magic is this?
  • The lazy methods from vroom and fst seem really interesting too, in particular with really big data files (see the sketch below).
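
To make the lazy-reading point concrete, here is a small vroom sketch (an assumed usage example, not code from the benchmark; depending on the vroom version the argument is called altrep or, at the time this post was written, altrep_opts):

```r
lazy  <- vroom::vroom("df.csv")                  # returns quickly, values are parsed on access
eager <- vroom::vroom("df.csv", altrep = FALSE)  # parses everything up front

mean(lazy$num)  # first access to a column (from the sketch above) pays the parsing cost
```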

So, what format should you choose?

  • If you’re looking for a standard format for data sharing and do not fear big files, CSV written and read with data.table or vroom functions is very fast and readable everywhere.
  • If you want to store temporary results for your own use, uncompressed RDS is quite fast and robust, doesn’t need any additional package, and will handle any R data type without conversion.
  • If you don’t fear packages with specific install instructions (arrow needs additional libraries) nor using less stable functions, you should definitely take a look at the “new” formats, especially parquet (a short sketch follows this list).
  • fst is also a very good candidate, as it is really fast and provides lazy read methods which can be really interesting for big data frames.
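
For the fst and parquet recommendations, a short sketch of selective reads (column names are taken from the hypothetical data frame sketched earlier, not from the author's code):

```r
# fst: a lazy handle plus selective reads of rows and columns
fst::write_fst(df, "df.fst")                      # compress = 50 by default
handle <- fst::fst("df.fst")                      # opens the file, reads metadata only
dim(handle)                                       # dimensions without loading the data
part <- fst::read_fst("df.fst",
                      columns = c("int", "num"),  # hypothetical columns from the earlier sketch
                      from = 1, to = 1000)        # contiguous row range

# parquet: arrow can also restrict the columns that are read
pq <- arrow::read_parquet("df.parquet", col_select = c("int", "num"))
```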

Once again, your mileage may vary depending on your data and usage.

Notes

  1. Benchmarks with varying compression levels have only been run three times.

  2. For detailed benchmarking, see the vroom benchmarks vignette.