There are several reasons why you would want to save R objects, such as data frames, to a file:
- sending them to somebody else, for further work or inspection
- putting them online for public sharing
- temporarily storing long computation results for later use
In each case, your tradeoff between execution time and resulting file size will be different.
Here we propose a quick (and not that clean) benchmark of some of the ways R can save data. This benchmark is based on three criteria:
- writing time
- reading time
- actual file size on disk
First, we randomly generate a medium-size data frame (10 million rows) with different types of columns:
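The generation code is not reproduced here, so the following is only a sketch of how such a data frame could be built; the column names and distributions are assumptions, not the exact ones used for the benchmark:

```r
# Hypothetical generation code: column names and distributions are
# illustrative, not the benchmark's actual ones.
n <- 1e7
df <- data.frame(
  num1  = rnorm(n),                                   # numeric
  num2  = runif(n),                                   # numeric
  int1  = sample.int(1000L, n, replace = TRUE),       # integer
  fac1  = factor(sample(letters, n, replace = TRUE)), # factor
  char1 = sample(month.name, n, replace = TRUE),      # character
  stringsAsFactors = FALSE
)
format(object.size(df), units = "MB")
```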
The resulting object size is 343.3 MB (in terms of R memory usage). Not terribly large, but enough for the benchmark to take quite some time to run.
This object is saved to disk, and then read back into R. We record the time taken to compress and write the object, the resulting file size, and the time taken to read it back. To mitigate variations due to other processes on my laptop, each operation has been run four times and the median time has been kept. Furthermore, the benchmark has been run with as few other processes as possible, and no active network connection.
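As an illustration, the timing could be done with a small helper like this one (a sketch, not the actual benchmark code):

```r
# Run an expression several times and keep the median elapsed time.
median_time <- function(expr, times = 4) {
  e <- substitute(expr)
  env <- parent.frame()
  elapsed <- replicate(times, system.time(eval(e, env))["elapsed"])
  median(elapsed)
}

write_time <- median_time(saveRDS(df, "df.rds"))
size_mb    <- file.size("df.rds") / 1024^2
read_time  <- median_time(readRDS("df.rds"))
```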
The following data formats have been tested:
- CSV file, saved with write.csv or readr::write_csv, and read with read.csv, readr::read_csv or data.table::fread
- Rdata file, saved and read with save and load, with the three available compression algorithms (gzip, bzip2 and xz) and, for each algorithm, the nine possible compression levels
- RDS file, saved and read with saveRDS and readRDS, with the three compression algorithms above (there is no compression level option for RDS files)
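For reference, here is a sketch of the corresponding calls (file names are illustrative):

```r
# CSV: see the comparison table below for the exact combinations tested
write.csv(df, "df.csv", row.names = FALSE)
df <- read.csv("df.csv")

# Rdata: both the compression algorithm and the level can be chosen
save(df, file = "df.RData", compress = "xz", compression_level = 3)
load("df.RData")

# RDS: only the algorithm can be chosen, not the level
saveRDS(df, "df.rds", compress = "bzip2")
df <- readRDS("df.rds")
```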
Of course, there are several limitations to this benchmark:
- results only give a trend, the numbers will obviously be different on another computer
- the file size may not be sufficient to show compression gain above a certain compression level
- with different data types (only numerical variables for example), compression results may be different
With these limitations in mind, let's look at the results anyway.
Here are the results for the CSV file format:
|Format|Method / Compression|Write time (s)|Size (MB)|Read time (s)|
|------|--------------------|--------------|---------|-------------|
|CSV|write.csv + read.csv|41|2885|105|
|CSV|readr::write_csv + readr::read_csv|81|2748|22|
|CSV|readr::write_csv + data.table::fread|92|2748|13|
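The three combinations above, spelled out (a sketch; file names are illustrative):

```r
library(readr)
library(data.table)

# Base R
write.csv(df, "df.csv", row.names = FALSE)
df1 <- read.csv("df.csv")

# readr for both writing and reading
write_csv(df, "df.csv")
df2 <- read_csv("df.csv")

# readr for writing, data.table::fread for reading
write_csv(df, "df.csv")
df3 <- fread("df.csv")
```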
And the results for Rdata and RDS. Yes, I know, I shouldn't be allowed to play with colors:
|Format|Method / Compression|Compression level|Write time (s)|Size (MB)|Read time (s)|
|------|--------------------|-----------------|--------------|---------|-------------|
And here is a plot which tries (and fails, I must say) to summarize all this. Shameless self-promotion: it has been generated with the wonderfully fabulous scatterD3 package.
So, what can we say by looking at these too-colorful tables and this not-very-readable plot?
- First, CSV files are huge. They can be quite fast to read with data.table::fread, but they stay huge, much bigger than the actual object size in R memory (nothing surprising).
- There is no difference between the Rdata and RDS formats. The only difference is that you can't specify a compression level for RDS, so R uses the default ones (6 for gzip, 9 for bzip2 and xz), but for these levels the results are strictly equivalent.
- Reading times are not affected by compression levels. gzip is the fastest, xz is fast too, bzip2 is the slowest.
- bzip2 and xz are able to give smaller files at high compression levels, but they are slow, especially when writing.
- gzip seems a good compromise between file size and read/write speed.
But the good news here is that the default values for these R functions seem perfectly suitable. If you use save or saveRDS without any further parameters, they will use gzip compression with level 6, which seems a reasonable tradeoff between speed and size. If you use bzip2 or xz, R will use the maximum compression level 9, which seems sound: if you choose these compression methods, it is likely that you favor file size over writing speed.
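A minimal illustration of these defaults, based on the documented behavior of save and saveRDS:

```r
# Defaults: gzip compression; for save(), compression_level defaults
# to 6 for gzip and to 9 for bzip2 or xz.
saveRDS(df, "df.rds")                            # gzip
save(df, file = "df.RData")                      # gzip, level 6
save(df, file = "df.RData", compress = "bzip2")  # bzip2, level 9
save(df, file = "df.RData", compress = "xz")     # xz, level 9
```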