Channel: Yet Another Math Programming Consultant

R: The RData File Format

R has a quite efficient way to store data in a file using its own RData file format. You can save objects to a file using save() and load them back using load(). The file format is largely undocumented, and as a result it is not much used as a way to exchange data with other software. In many cases CSV files are used for this. Elsewhere I make the argument for using a SQLite database for this purpose.

So what is this RData file format? It is a binary format that is not easy to inspect, but there is an option to save a file in ASCII:

> ivec <- 1:3
> str(ivec)
 int [1:3] 1 2 3
> save(ivec,file="ivec.ascii",ascii=T)

So what does this file look like? Here is an annotated listing:

RDA2        Header: file type
A           Header: ASCII format
2           Header: format version 2
197123      Header: R version that wrote the file (0x030203, i.e. 3.2.3)
131840      Header: minimal R version able to read the file (0x020300, i.e. 2.3.0)
1026        LISTSXP object: whole thing is packaged in a dotted pair list
1           SYMSXP object: symbol
262153      CHARSXP object: string
4           Length of string
ivec        String: symbol name
13          INTSXP: integer vector
3           Length of integer vector
1           First element
2           Second element
3           Third element
254         NILVALUESXP: end of information
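
Several integers in the listing pack more than one field. The following is a minimal sketch in Python (not the language of my own writer), assuming the layout used by R's serialization code: a version number is major*65536 + minor*256 + patch, and the low 8 bits of an item's flag word hold the type code. The helper names unpack_version and unpack_flags are hypothetical.

```python
def unpack_version(v):
    """Split a packed R version integer into (major, minor, patch)."""
    return (v >> 16, (v >> 8) & 0xFF, v & 0xFF)


def unpack_flags(flags):
    """Split an item's flag word into its type code and flag bits."""
    return {
        "type": flags & 0xFF,                 # type code, e.g. 13 = INTSXP
        "is_object": bool(flags & (1 << 8)),
        "has_attr": bool(flags & (1 << 9)),
        "has_tag": bool(flags & (1 << 10)),   # set for the named list node
        "levels": flags >> 12,                # extra bits, e.g. string encoding
    }


print(unpack_version(197123))   # (3, 2, 3)
print(unpack_version(131840))   # (2, 3, 0)
print(unpack_flags(1026))       # type 2 (LISTSXP), has_tag is True
print(unpack_flags(262153))     # type 9 (CHARSXP), levels is 64
```

This explains why the list node shows up as 1026 rather than 2 (type 2 plus the tag bit 1024), and why the string shows up as 262153 rather than 9.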

Using this information we can reverse-engineer how to write R objects to an RData file. E.g. writing a string vector looks like:

[image: code listing for writing a string vector]

(The tRDataBase name reflects this is a base class; we derive tRDataAscii, tRDataBinary and tRDataNetwork from this).
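
As a rough illustration of the idea (a Python sketch, not the tRDataBase code itself), the ASCII layout from the annotated listing can be emitted line by line. The function ascii_rdata and its helper are hypothetical names. The type code 16 for a string vector (STRSXP) comes from R's internals rather than from the listing, and the sketch assumes strings made of letters and digits only, with no missing values; the output has not been checked against R here.

```python
HEADER = ["RDA2", "A", "2", "197123", "131840"]  # copied from the listing

LISTSXP_WITH_TAG = 1026   # dotted pair list node that carries a name
SYMSXP = 1
CHARSXP_ASCII = 262153
INTSXP = 13
STRSXP = 16               # assumption: R's type code for character vectors
NILVALUE = 254


def _symbol(name):
    """Lines for a symbol: SYMSXP, then a CHARSXP holding the name."""
    return [str(SYMSXP), str(CHARSXP_ASCII), str(len(name)), name]


def ascii_rdata(name, values):
    """Return the text of an ASCII RData file holding one named vector.

    values is a list of ints, or a list of strings containing only
    letters and digits (no escaping is performed).
    """
    lines = HEADER + [str(LISTSXP_WITH_TAG)] + _symbol(name)
    if all(isinstance(v, int) for v in values):
        lines += [str(INTSXP), str(len(values))]
        lines += [str(v) for v in values]
    else:
        lines += [str(STRSXP), str(len(values))]
        for s in values:
            # each element is its own CHARSXP: flags, length, text
            lines += [str(CHARSXP_ASCII), str(len(s)), s]
    lines.append(str(NILVALUE))
    return "\n".join(lines) + "\n"


print(ascii_rdata("svec", ["abc", "de"]), end="")
```

Calling ascii_rdata("ivec", [1, 2, 3]) reproduces the annotated listing line for line, which is a convenient sanity check before trying other object types.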

When we save objects without the “ascii=TRUE” flag, R uses a compressed binary network format. The idea behind a network format is to write all binary data in a standardized big-endian (network) byte order. This allows a binary file written on one machine (e.g. a little-endian Intel architecture) to be read on a machine with a different byte order (although there are not that many big-endian computer architectures left). The whole stream is then compressed using gzip.
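
A minimal Python sketch of this binary variant, assuming the same sequence of items as in the ASCII listing, with every integer written as 4 bytes in big-endian order and the stream starting with the RDX2 magic followed by an "X" format marker. The names be_int and binary_rdata are hypothetical, and the result has not been checked against R here.

```python
import gzip
import struct


def be_int(n):
    """Encode n as a 4-byte signed integer in big-endian (network) order."""
    return struct.pack(">i", n)


def binary_rdata(name, values):
    """Return gzip-compressed bytes for one named integer vector."""
    sym = name.encode("ascii")
    body = b"RDX2\n"                       # file magic for the network format
    body += b"X\n"                         # serialization format marker
    body += be_int(2)                      # format version
    body += be_int(197123)                 # R version fields, as in the listing
    body += be_int(131840)
    body += be_int(1026)                   # LISTSXP with tag
    body += be_int(1)                      # SYMSXP
    body += be_int(262153)                 # CHARSXP holding the symbol name
    body += be_int(len(sym)) + sym
    body += be_int(13)                     # INTSXP
    body += be_int(len(values))
    body += b"".join(be_int(v) for v in values)
    body += be_int(254)                    # NILVALUESXP
    return gzip.compress(body)


data = binary_rdata("ivec", [1, 2, 3])
print(len(gzip.decompress(data)), "bytes before compression")
```

Because struct.pack(">i", ...) fixes the byte order explicitly, the same bytes come out regardless of the machine the script runs on, which is the point of a network format.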

Using an RDB2 header I can write a pure native binary format (that is, without reordering bytes to network byte order). It looks like R has decided not to support this format any more:

> load("test.bin")
Warning message:
file ‘test.bin’ has magic number 'RDB2'
Use of save versions prior to 2 is deprecated

So binary files always use the network byte ordering and have an RDX2 header.
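
To tell the three variants apart from outside R, one can look at the first bytes of the file. The following Python sketch uses a hypothetical helper, rdata_format. It only undoes gzip compression, while R can also write RData files compressed with bzip2 or xz, which this sketch reports as unknown.

```python
import gzip

MAGIC = {
    b"RDA2": "ASCII",
    b"RDB2": "native binary",
    b"RDX2": "binary, network byte order",
}


def rdata_format(raw):
    """Classify the raw bytes of an RData file by their magic number."""
    if raw[:2] == b"\x1f\x8b":             # gzip signature: decompress first
        raw = gzip.decompress(raw)
    return MAGIC.get(raw[:4], "unknown")


print(rdata_format(b"RDA2\nA\n2\n"))
print(rdata_format(gzip.compress(b"RDX2\nX\n")))
```

For a file on disk, pass the contents read in binary mode, e.g. rdata_format(open("ivec.ascii", "rb").read()).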

Notes
  1. The load() function works perfectly fine with remote RData files:
    > load(url("http://www.amsterdamoptimization.com/downloads/rvec.rdata"),verbose=T)
    Loading objects:
    x
  2. The goal of this exercise is to be able to generate .RData data sets from other environments. We don’t use R itself for this but rather write .RData files directly. Another approach would be to launch R, import the data set (e.g. from a CSV file) and then call save() to generate the .RData file. When doing this from a different programming language, it is possible to automate this using R.dll. This is in fact how this interface in F# works. In my setup I don’t need an R DLL and write .RData files directly from the Delphi and C programming languages.
  3. It is time for RData files to become the standard for Data Transfer.
