A shallow benchmark of R data frame export/import methods
There are several reasons why you would want to save a data frame to a file:
- sending it or putting it online for further work or public sharing
- temporarily storing long computation results for further use
In each case, the tradeoff between execution time and resulting file size will be different: when working locally, the time taken to write and read is crucial; when sharing the data frame, the file size matters more.
Here we propose a quick (and not that clean) benchmark of some of the ways R can save data frames. This benchmark is based on three criteria: writing time, reading time and actual file size on disk.
The benchmark is produced from a randomly generated medium-size data frame with different types of columns: integers, real numbers, short and long strings, factors, logicals and dates. The resulting object is 8 million rows and around 350 MB in terms of R memory usage. Not terribly large, but enough for the benchmark to take quite some time to run.
This object is saved to disk, then read back into R. We note the time taken to compress and write the object, the resulting file size, and the time taken to read it back. To mitigate variations due to other running processes, the operation is run seven times for each benchmark1, and the median time is reported.
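As an illustration, here is a sketch of how such a data frame and the timing could be set up. The column names and contents are illustrative, not the exact ones used in the benchmark, and the row count is scaled down from the 8 million used above:

```r
# Sketch of the benchmark setup; the real benchmark used 8 million rows,
# scaled down to 1e5 here. Column names and contents are illustrative.
n <- 1e5
df <- data.frame(
  int  = sample.int(1000L, n, replace = TRUE),
  num  = runif(n),
  chr  = sample(c("apple", "banana", "a much longer character string"), n, replace = TRUE),
  fct  = factor(sample(letters[1:5], n, replace = TRUE)),
  lgl  = sample(c(TRUE, FALSE), n, replace = TRUE),
  date = as.Date("2020-01-01") + sample.int(365L, n, replace = TRUE)
)
format(object.size(df), units = "MB")

# Median of seven runs, as described above
time_write <- median(replicate(7, system.time(saveRDS(df, "df.rds"))["elapsed"]))
size_mb    <- file.size("df.rds") / 1024^2
```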
The following data formats have been tested:
- CSV, with `write.csv`/`read.csv` (utils), `write_csv`/`read_csv` (readr), `fwrite`/`fread` (data.table) and `vroom_write`/`vroom` (vroom)
- `save`/`load`, with the three available compression algorithms (gzip, bzip2 and xz) and, for each algorithm, the nine possible compression levels
- `saveRDS`/`readRDS`, with the same three compression algorithms as above
- feather (feather and arrow packages), arrow, parquet (arrow) and fst
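For reference, the compression settings above correspond to arguments like these (file names are just examples):

```r
df <- data.frame(x = 1:10)

# save/load: choose the algorithm and, optionally, the compression level (1-9)
save(df, file = "df.RData", compress = "xz", compression_level = 6)
load("df.RData")

# saveRDS/readRDS: same three algorithms; compress = FALSE gives the
# uncompressed variant (the default is gzip)
saveRDS(df, "df.rds", compress = "bzip2")
df2 <- readRDS("df.rds")
saveRDS(df, "df_raw.rds", compress = FALSE)
```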
Obviously, there are several limitations to this benchmark:
- results only give a trend; the numbers will obviously be different on another computer. Time is rounded to the second because the measurement is not precise enough to go deeper.
- the file size may not be sufficient to show compression gain above a certain compression level, or to show speed differences between some functions.
- with different data types (only numerical variables for example), compression results may be different.
- some of the read methods are lazy: data are not read all at once, but only when needed. In these cases the read time is very low because of this behaviour. This benchmark is not suitable for comparing lazy and non-lazy methods.
| Format | Package | Functions | Compression | Write time (s) | Size (MB) | Read time (s) | Lazy? |
|--------|---------|-----------|-------------|----------------|-----------|---------------|-------|
| csv | utils | write.csv / read.csv | | | | | |
| csv | readr | write_csv / read_csv | | | | | |
| csv | data.table | fwrite / fread | | | | | |
| csv | vroom | vroom_write / vroom (altrep_opts = FALSE) | | | | | |
| csv | vroom | vroom_write / vroom | | | | | Lazy |
| RData | base | save / load | Uncompressed | | | | |
| RData | base | save / load | bzip2 | | | | |
| RData | base | save / load | gzip | | | | |
| RData | base | save / load | xz | | | | |
| rds | base | saveRDS / readRDS | Uncompressed | | | | |
| rds | base | saveRDS / readRDS | bzip2 | | | | |
| rds | base | saveRDS / readRDS | gzip | | | | |
| rds | base | saveRDS / readRDS | xz | | | | |
| feather | feather | write_feather / read_feather | | | | | |
| feather | arrow | write_feather / read_feather | | | | | |
| arrow | arrow | write_arrow / read_arrow | | | | | |
| parquet | arrow | write_parquet / read_parquet | | | | | |
| fst | fst | write_fst / fst | default | | | | Lazy |
| fst | fst | write_fst / fst | Uncompressed | | | | Lazy |
Here is a plot which tries (and fails, I think) to summarize the results. Values have been standardised to a 0-100 scale. Thanks to g2r you can click on legend items to hide/show them.
What can we say by looking at this too-colorful table and this not-very-readable plot?
- CSV files are huge, much bigger than the actual object size in R memory (nothing surprising). But `data.table` functions are able to write and read them incredibly fast.
- `vroom` is a bit slower than `data.table`, but one of its main features, which has been disabled for this benchmark, is to lazily load data from the file, so the read time value may not do it justice2.
- There is no real difference between the `RData` and `RDS` formats. Uncompressed versions are quite fast. `gzip` compression, which is the default, offers a good tradeoff between size and speed.
- Some of the new formats on the block, namely `feather` and `arrow`, are extremely fast, at the expense of a big file size. `parquet` gives really impressive results: it is one of the fastest formats, with the smallest file size… what kind of black magic is this?
- The lazy read methods from `vroom` and `fst` seem really interesting too, in particular with really big data files.
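To give an idea of what lazy access with fst looks like, here is a minimal sketch (the file name and data are made up):

```r
library(fst)

df <- data.frame(id = 1:1e5, value = runif(1e5))
write_fst(df, "df.fst")   # default compression

# Read only a slice of rows: the rest of the file is never loaded
first_rows <- read_fst("df.fst", from = 1, to = 10)

# Or use a proxy object and subset it like a data frame
ft <- fst("df.fst")
some_values <- ft[101:200, ]
```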
So, what format should you choose ?
- If you’re looking for a standard format for data sharing and do not fear big files, CSV with `data.table` or `vroom` functions is very fast and readable everywhere.
- If you want to store temporary results for your own use, uncompressed `RDS` is quite fast and robust, doesn’t need any additional package, and will handle any R data type without conversion.
- If you don’t fear packages with specific install instructions (`arrow` needs additional libraries) nor using less stable functions, you should definitely take a look at the “new” formats, especially `parquet`. `fst` is also a very good candidate, as it is really fast and provides lazy read methods which can be really interesting for big data frames.
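As a minimal example of the parquet workflow with the arrow package (file name and data are illustrative):

```r
library(arrow)

df <- data.frame(x = 1:100, y = runif(100))
write_parquet(df, "df.parquet")
df2 <- read_parquet("df.parquet")
```

The same package also provides `write_feather`/`read_feather` for the feather format, with the same kind of one-call interface.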
Once again, your mileage may vary depending on your data and usage.