Format | Package | Functions | Compression | Write time (s) | Size (MB) | Read time (s) | Lazy?
---|---|---|---|---|---|---|---
CSV | base | write.csv / read.csv | | 75 | 2357 | 73 |
CSV | readr | write_csv / read_csv | | 77 | 2265 | 32 |
CSV | data.table | fwrite / fread | | 6 | 2238 | 6 |
CSV | vroom | vroom_write / vroom (altrep_opts = FALSE) | | 19 | 2253 | 9 |
CSV | vroom | vroom_write / vroom | | 18 | 2253 | 2 | Lazy
CSV | arrow | read_csv_arrow | | NA | NA | 8 |
Rdata | base | load / save | Uncompressed | 7 | 2245 | 8 |
Rdata | base | load / save | bzip2 | 612 | 87 | 85 |
Rdata | base | load / save | gzip | 45 | 107 | 14 |
Rdata | base | load / save | xz | 301 | 93 | 20 |
RDS | base | saveRDS / readRDS | Uncompressed | 7 | 2245 | 8 |
RDS | base | saveRDS / readRDS | bzip2 | 580 | 87 | 70 |
RDS | base | saveRDS / readRDS | gzip | 46 | 107 | 13 |
RDS | base | saveRDS / readRDS | xz | 286 | 94 | 21 |
Feather | feather | write_feather / read_feather | | 5 | 2128 | 4 |
Feather | arrow | write_feather / read_feather | | 4 | 2105 | 5 |
Arrow | arrow | write_arrow / read_arrow | | 4 | 2105 | 5 |
Parquet | arrow | write_parquet / read_parquet | | 4 | 87 | 6 |
fst | fst | write_fst / fst | default | 2 | 214 | 0 | Lazy
fst | fst | write_fst / fst | Uncompressed | 3 | 2137 | 0 | Lazy
A shallow benchmark of R data frame export/import methods
Protocol
There are several reasons why you would want to save a data frame to a file:
- sending it or putting it online for further work or public sharing
- storing long computation results temporarily for further use

In each case, the tradeoff between execution time and resulting file size will be different. When the file is used locally, the time taken to write and read it is crucial. When sharing the data frame, the file size matters more.
Here we propose a quick (and not that clean) benchmark of some of the ways R can save data frames. The benchmark is based on three criteria: writing time, reading time and actual file size on disk.
The benchmark uses a randomly generated, medium-sized data frame with different types of columns: integers, real numbers, short and long strings, factors, logicals and dates. The resulting object has 8 million rows and takes around 350 MB of R memory. Not terribly large, but enough for the benchmark to take quite some time to run.
This object is saved to disk, then read back into R. We record the time used to compress and write the object, the resulting file size, and the time taken to read it back. To mitigate variations due to other running processes, the operation is run seven times for each benchmark¹, and the median time is reported.
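As an illustration, here is a minimal sketch of how such a data frame and timing loop could be built. The original benchmark code is not reproduced in this post, so the column definitions, the `median_time` helper and the file names below are assumptions, not the actual code used.

```r
## A minimal sketch of the protocol, not the original benchmark code:
## the exact column definitions and timing helper are assumptions.
set.seed(1337)

n <- 8e6  # around 8 million rows, as described above

long_strings <- replicate(100, paste(sample(letters, 50, replace = TRUE), collapse = ""))

df <- data.frame(
  int   = sample.int(1000L, n, replace = TRUE),                        # integers
  num   = rnorm(n),                                                    # real numbers
  short = sample(letters, n, replace = TRUE),                          # short strings
  long  = sample(long_strings, n, replace = TRUE),                     # long strings
  fact  = factor(sample(c("a", "b", "c"), n, replace = TRUE)),         # factor
  logic = sample(c(TRUE, FALSE), n, replace = TRUE),                   # logical
  date  = as.Date("2020-01-01") + sample.int(365, n, replace = TRUE),  # dates
  stringsAsFactors = FALSE
)

## Evaluate an expression several times and keep the median elapsed time (seconds)
median_time <- function(expr, times = 7) {
  e   <- substitute(expr)
  env <- parent.frame()
  median(vapply(seq_len(times),
                function(i) system.time(eval(e, env))[["elapsed"]],
                numeric(1)))
}

## Example round trip with base CSV functions
write_time <- median_time(write.csv(df, "bench.csv", row.names = FALSE))
size_mb    <- file.size("bench.csv") / 1024^2
read_time  <- median_time(read.csv("bench.csv"))
```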
The following data formats have been tested (representative read and write calls are sketched after this list):

- `CSV` (`base`, `readr`, `data.table`, `vroom` and `arrow` functions)
- `Rdata` (using `save` and `load` with the three available compression algorithms (`gzip`, `bzip2` and `xz`) and, for each algorithm, the nine possible compression levels)
- `RDS` (using `saveRDS` and `readRDS` with the same three compression algorithms as above)
- `feather` (`feather` and `arrow` functions)
- `arrow` (`arrow` functions)
- `parquet` (`arrow` functions)
- `fst` (`fst` functions)
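For reference, here is a sketch of the kind of write/read calls involved for some of these formats. File names are placeholders and default arguments are used unless noted; this is not the exact code run for the benchmark.

```r
# CSV
data.table::fwrite(df, "df.csv")
df2 <- data.table::fread("df.csv")

vroom::vroom_write(df, "df.tsv")
df2 <- vroom::vroom("df.tsv")        # vroom's lazy, ALTREP-backed reading

# RDS, with and without compression
saveRDS(df, "df.rds", compress = FALSE)
saveRDS(df, "df_gz.rds", compress = "gzip")
df2 <- readRDS("df.rds")

# Rdata
save(df, file = "df.Rdata", compress = "xz")
load("df.Rdata")

# feather / parquet (arrow package)
arrow::write_feather(df, "df.feather")
df2 <- arrow::read_feather("df.feather")
arrow::write_parquet(df, "df.parquet")
df2 <- arrow::read_parquet("df.parquet")

# fst
fst::write_fst(df, "df.fst")
df2 <- fst::fst("df.fst")            # lazy: columns are only read when accessed
```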
Limitations
Obviously, there are several limitations to this benchmark:
- results only give a trend; the numbers will obviously differ on another computer. Times are rounded to the second because the measurement is not precise enough to go further.
- the file size may not be large enough to show compression gains above a certain compression level, or to show speed differences between some functions.
- with different data types (for example, only numerical variables), compression results may differ.
- some of the read methods are lazy: data are not read all at once, but only when needed, so their read time is very low. This benchmark is not suitable for comparing lazy and non-lazy methods (see the small `fst` sketch after this list).
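To give an idea of what "lazy" means here, the sketch below uses `fst` (file and column names are made up): opening the file is almost instantaneous, and data are actually read from disk only when a subset is accessed.

```r
fst::write_fst(df, "df.fst")

ft <- fst::fst("df.fst")     # near-instant: only metadata is read at this point
dim(ft)                      # dimensions are known without loading the data

x     <- ft$num              # the 'num' column is read from disk only now
first <- ft[1:1000, ]        # ...or just the first 1000 rows of all columns
```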
Results
Graph
Here is a plot which tries (and fails, I think) to summarize the results. Values have been standardised to a 0-100 scale. Thanks to g2r you can click on legend items to hide/show them.
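The exact transformation used for the plot is not shown in the post; a simple min-max rescaling of each metric, as sketched below, is one way such a 0-100 standardisation could be done (the `rescale_100` helper and the `results` column names are made up).

```r
## Hypothetical 0-100 min-max rescaling of each metric before plotting
rescale_100 <- function(x) {
  100 * (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

## 'results' would be a data frame holding the table above
results_scaled <- within(results, {
  write_time <- rescale_100(write_time)
  size       <- rescale_100(size)
  read_time  <- rescale_100(read_time)
})
```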
Conclusions
What can we say by looking at this too-colorful table and not-very-readable plot?

- CSV files are huge, much bigger than the actual object size in R memory (nothing surprising). But `data.table` functions are able to write and read them incredibly fast. `vroom` is a bit slower than `data.table`, but one of its main features, which has been disabled for this benchmark, is to lazily load data from the file, so the read time value may not do it justice².
- There is no real difference between `Rdata` and `RDS` formats. Uncompressed versions are quite fast. `gzip` compression, which is the default, offers a good tradeoff between size and speed.
- Some of the new formats on the block, namely `feather` and `arrow`, are extremely fast, at the expense of a big file size, almost the same as `CSV`.
- `parquet` gives really impressive results: it is one of the fastest formats, with the smallest file size… what kind of black magic is this?
- The lazy methods from `vroom` and `fst` seem really interesting too, in particular with really big data files.
So, what format should you choose?

- If you're looking for a standard format for data sharing and do not fear big files, CSV with `data.table` or `vroom` functions is very fast and readable everywhere.
- If you want to store temporary results for your own use, uncompressed `RDS` is quite fast and robust, doesn't need any additional package, and will handle any data type without conversion (both of these cases are sketched after this list).
- If you don't fear packages with specific install instructions (`arrow` needs additional libraries) nor using less stable functions, you should definitely take a look at the "new" formats, especially `parquet`.
- `fst` is also a very good candidate, as it is really fast and provides lazy read methods, which can be really interesting for big data frames.
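As a minimal illustration of the two most common cases above (file names are placeholders):

```r
## Sharing: a standard, readable-everywhere CSV, written and read very fast
data.table::fwrite(df, "results.csv")
shared <- data.table::fread("results.csv")

## Local temporary storage: uncompressed RDS, no extra package required
saveRDS(df, "tmp_results.rds", compress = FALSE)
df <- readRDS("tmp_results.rds")
```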
Once again, your mileage may vary depending on your data and usage.
Notes
1. Benchmarks with varying compression levels have only been run three times.
2. For detailed benchmarking, see the vroom benchmarks vignette.