Format | Package | Functions | Compression | Write time (s) | Size (MB) | Read time (s) | Lazy?
---|---|---|---|---|---|---|---
CSV | base | write.csv / read.csv | | 75 | 2357 | 73 |
CSV | readr | write_csv / read_csv | | 77 | 2265 | 32 |
CSV | data.table | fwrite / fread | | 6 | 2238 | 6 |
CSV | vroom | vroom_write / vroom (altrep_opts = FALSE) | | 19 | 2253 | 9 |
CSV | vroom | vroom_write / vroom | | 18 | 2253 | 2 | Lazy
CSV | arrow | read_csv_arrow | | NA | NA | 8 |
Rdata | base | load / save | Uncompressed | 7 | 2245 | 8 |
Rdata | base | load / save | bzip2 | 612 | 87 | 85 |
Rdata | base | load / save | gzip | 45 | 107 | 14 |
Rdata | base | load / save | xz | 301 | 93 | 20 |
RDS | base | saveRDS / readRDS | Uncompressed | 7 | 2245 | 8 |
RDS | base | saveRDS / readRDS | bzip2 | 580 | 87 | 70 |
RDS | base | saveRDS / readRDS | gzip | 46 | 107 | 13 |
RDS | base | saveRDS / readRDS | xz | 286 | 94 | 21 |
Feather | feather | write_feather / read_feather | | 5 | 2128 | 4 |
Feather | arrow | write_feather / read_feather | | 4 | 2105 | 5 |
Arrow | arrow | write_arrow / read_arrow | | 4 | 2105 | 5 |
Parquet | arrow | write_parquet / read_parquet | | 4 | 87 | 6 |
fst | fst | write_fst / fst | default | 2 | 214 | 0 | Lazy
fst | fst | write_fst / fst | Uncompressed | 3 | 2137 | 0 | Lazy
A shallow benchmark of R data frame export/import methods
Protocol
There are several reasons why you would want to save a data frame to a file:
- sending it or putting it online for further work or public sharing
- storing long computation results temporarily for further use

In each case, the tradeoff between execution time and resulting file size will be different. When the file is used locally, the time taken to write and read it is crucial. When sharing the data frame, the file size matters more.
Here we propose a quick (and not that clean) benchmark of some of the ways R can save data frames. The benchmark is based on three criteria: writing time, reading time and actual file size on disk.
The benchmark uses a randomly generated, medium-sized data frame with different types of columns: integers, real numbers, short and long strings, factors, logicals and dates. The resulting object has 8 million rows and takes around 350 MB of R memory. Not terribly large, but enough for the benchmark to take quite some time to run.
This object is saved to disk, then read back into R. We record the time used to compress and write the object, the resulting file size, and the time taken to read it back. To mitigate variations due to other running processes, the operation is run seven times for each benchmark¹, and the median time is reported.
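As an illustration, here is a minimal sketch of how such a data frame and timing loop could be built. The original benchmark code is not reproduced in this post, so the column definitions, the `median_time` helper and the file names below are assumptions, not the actual code used.

```r
## A minimal sketch of the protocol, not the original benchmark code:
## the exact column definitions and timing helper are assumptions.
set.seed(1337)

n <- 8e6  # around 8 million rows, as described above

long_strings <- replicate(100, paste(sample(letters, 50, replace = TRUE), collapse = ""))

df <- data.frame(
  int   = sample.int(1000L, n, replace = TRUE),                        # integers
  num   = rnorm(n),                                                    # real numbers
  short = sample(letters, n, replace = TRUE),                          # short strings
  long  = sample(long_strings, n, replace = TRUE),                     # long strings
  fact  = factor(sample(c("a", "b", "c"), n, replace = TRUE)),         # factor
  logic = sample(c(TRUE, FALSE), n, replace = TRUE),                   # logical
  date  = as.Date("2020-01-01") + sample.int(365, n, replace = TRUE),  # dates
  stringsAsFactors = FALSE
)

## Evaluate an expression several times and keep the median elapsed time (seconds)
median_time <- function(expr, times = 7) {
  e   <- substitute(expr)
  env <- parent.frame()
  median(vapply(seq_len(times),
                function(i) system.time(eval(e, env))[["elapsed"]],
                numeric(1)))
}

## Example round trip with base CSV functions
write_time <- median_time(write.csv(df, "bench.csv", row.names = FALSE))
size_mb    <- file.size("bench.csv") / 1024^2
read_time  <- median_time(read.csv("bench.csv"))
```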
The following data formats have been tested (representative read and write calls are sketched after this list):

- `CSV` (`base`, `readr`, `data.table`, `vroom` and `arrow` functions)
- `Rdata` (using `save` and `load` with the three available compression algorithms (`gzip`, `bzip2` and `xz`) and, for each algorithm, the nine possible compression levels)
- `RDS` (using `saveRDS` and `readRDS` with the same three compression algorithms as above)
- `feather` (`feather` and `arrow` functions)
- `arrow` (`arrow` functions)
- `parquet` (`arrow` functions)
- `fst` (`fst` functions)
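For reference, here is a sketch of the kind of write/read calls involved for some of these formats. File names are placeholders and default arguments are used unless noted; this is not the exact code run for the benchmark.

```r
# CSV
data.table::fwrite(df, "df.csv")
df2 <- data.table::fread("df.csv")

vroom::vroom_write(df, "df.tsv")
df2 <- vroom::vroom("df.tsv")        # vroom's lazy, ALTREP-backed reading

# RDS, with and without compression
saveRDS(df, "df.rds", compress = FALSE)
saveRDS(df, "df_gz.rds", compress = "gzip")
df2 <- readRDS("df.rds")

# Rdata
save(df, file = "df.Rdata", compress = "xz")
load("df.Rdata")

# feather / parquet (arrow package)
arrow::write_feather(df, "df.feather")
df2 <- arrow::read_feather("df.feather")
arrow::write_parquet(df, "df.parquet")
df2 <- arrow::read_parquet("df.parquet")

# fst
fst::write_fst(df, "df.fst")
df2 <- fst::fst("df.fst")            # lazy: columns are only read when accessed
```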
Limitations
Obviously, there are several limitations to this benchmark:
- results only give a trend; the numbers will obviously differ on another computer. Times are rounded to the second because the measurement is not precise enough to go further.
- the file size may not be large enough to show compression gains above a certain compression level, or to show speed differences between some functions.
- with different data types (for example, only numerical variables), compression results may differ.
- some of the read methods are lazy: data are not read all at once, but only when needed, so their read time is very low. This benchmark is not suitable for comparing lazy and non-lazy methods (see the small `fst` sketch after this list).
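To give an idea of what "lazy" means here, the sketch below uses `fst` (file and column names are made up): opening the file is almost instantaneous, and data are actually read from disk only when a subset is accessed.

```r
fst::write_fst(df, "df.fst")

ft <- fst::fst("df.fst")     # near-instant: only metadata is read at this point
dim(ft)                      # dimensions are known without loading the data

x     <- ft$num              # the 'num' column is read from disk only now
first <- ft[1:1000, ]        # ...or just the first 1000 rows of all columns
```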
Results
Graph
Here is a plot which tries (and fails, I think) to summarize the results. Values have been standardised to a 0-100 scale. Thanks to g2r you can click on legend items to hide/show them.
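The exact transformation used for the plot is not shown in the post; a simple min-max rescaling of each metric, as sketched below, is one way such a 0-100 standardisation could be done (the `rescale_100` helper and the `results` column names are made up).

```r
## Hypothetical 0-100 min-max rescaling of each metric before plotting
rescale_100 <- function(x) {
  100 * (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

## 'results' would be a data frame holding the table above
results_scaled <- within(results, {
  write_time <- rescale_100(write_time)
  size       <- rescale_100(size)
  read_time  <- rescale_100(read_time)
})
```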
Conclusions
What can we say by looking at this too-colorful table and not-very-readable plot?

- CSV files are huge, much bigger than the actual object size in R memory (nothing surprising). But `data.table` functions are able to write and read them incredibly fast. `vroom` is a bit slower than `data.table`, but one of its main features, which has been disabled for this benchmark, is to lazily load data from the file, so the read time value may not do it justice².
- There is no real difference between `Rdata` and `RDS` formats. Uncompressed versions are quite fast. `gzip` compression, which is the default, offers a good tradeoff between size and speed.
- Some of the new formats on the block, namely `feather` and `arrow`, are extremely fast, at the expense of a big file size, almost the same as `CSV`.
- `parquet` gives really impressive results: it is one of the fastest formats, with the smallest file size… what kind of black magic is this?
- The lazy methods from `vroom` and `fst` seem really interesting too, in particular with really big data files.
So, what format should you choose?

- If you're looking for a standard format for data sharing and do not fear big files, CSV with `data.table` or `vroom` functions is very fast and readable everywhere.
- If you want to store temporary results for your own use, uncompressed `RDS` is quite fast and robust, doesn't need any additional package, and will handle any data type without conversion (both of these cases are sketched after this list).
- If you don't fear packages with specific install instructions (`arrow` needs additional libraries) nor using less stable functions, you should definitely take a look at the "new" formats, especially `parquet`.
- `fst` is also a very good candidate, as it is really fast and provides lazy read methods, which can be really interesting for big data frames.
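As a minimal illustration of the two most common cases above (file names are placeholders):

```r
## Sharing: a standard, readable-everywhere CSV, written and read very fast
data.table::fwrite(df, "results.csv")
shared <- data.table::fread("results.csv")

## Local temporary storage: uncompressed RDS, no extra package required
saveRDS(df, "tmp_results.rds", compress = FALSE)
df <- readRDS("tmp_results.rds")
```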
Once again, your mileage may vary depending on your data and usage.
Notes
1. Benchmarks with varying compression levels have only been run three times.
2. For detailed benchmarking, see the vroom benchmarks vignette.