Channel: Yet Another Math Programming Consultant

R: Factors vs Strings on large data sets

When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.

Here is an illustration of how strings and factors are stored in an .Rdata file:

image

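Each string carries 8 bytes of overhead on top of its characters, and this adds up over a long vector. A factor stores each distinct string only once, as a level, plus an integer vector of codes at 4 bytes per element, which takes much less space.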
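The per-element cost can be checked from within R. This is a minimal sketch and not the .Rdata writer discussed in this post: it relies on base R's serialize(), which to my understanding produces the same kind of binary stream that an uncompressed .Rdata file contains. The helper name serialized_bytes and the test label are made up for illustration.

```r
# Hypothetical helper: number of bytes base R needs to serialize an object
# (binary format, no compression).
serialized_bytes <- function(x) length(serialize(x, connection = NULL))

n1 <- 1000
n2 <- 2000
label <- "amuchlongernamefortesting1"

# Look at the growth in size per extra element, so that fixed costs
# (stream header, attributes, factor levels) cancel out.
per_string <- (serialized_bytes(rep(label, n2)) -
               serialized_bytes(rep(label, n1))) / (n2 - n1)
per_code   <- (serialized_bytes(factor(rep(label, n2))) -
               serialized_bytes(factor(rep(label, n1)))) / (n2 - n1)

cat("bytes per string element:", per_string, "\n")
cat("characters in the label: ", nchar(label), "\n")
cat("bytes per factor code:   ", per_code, "\n")
```

The difference between the first two numbers is the per-string overhead, and the last number is the cost of one integer code in the factor.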

When we generate a large dataframe using gdx2r we see the following.

image

The “short” data set is generated as follows:

set i /i1*i200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Exported to an .Rdata file (with StringsAsFactors=F) and loaded in R, this looks like:

> load("p.rdata")
> head(p)
   i  j  k     value
1 i1 i1 i1 0.1717471
2 i1 i1 i2 0.8432667
3 i1 i1 i3 0.5503754
4 i1 i1 i4 0.3011379
5 i1 i1 i5 0.2922121
6 i1 i1 i6 0.2240529
> str(p)
'data.frame': 8000000 obs. of 4 variables:
 $ i    : chr "i1" "i1" "i1" "i1" ...
 $ j    : chr "i1" "i1" "i1" "i1" ...
 $ k    : chr "i1" "i2" "i3" "i4" ...
 $ value: num 0.172 0.843 0.55 0.301 0.292 ...

The “large” data set uses the same structure with much longer labels:

set i /amuchlongernamefortesting1*amuchlongernamefortesting200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Here we export to an .Rdata file with StringsAsFactors=T:

> load("p2.rdata")
> head(p)
                           i                          j                          k     value
1 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting1 0.1717471
2 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting2 0.8432667
3 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting3 0.5503754
4 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting4 0.3011379
5 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting5 0.2922121
6 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting6 0.2240529
> str(p)
'data.frame': 8000000 obs. of 4 variables:
 $ i    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ k    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: num 0.172 0.843 0.55 0.301 0.292 ...

The file sizes and timings are indeed what we expected:

  • Without compression, the .Rdata files are much smaller when using factors. Longer strings make this effect more pronounced.
  • With compression, the .Rdata files are about the same size whether using strings or factors. But with factors the compression is faster (fewer bytes to compress).
  • Data frames with StringsAsFactors=T use less memory inside R. They also load faster.
  • Conclusion: making StringsAsFactors=T the default makes sense (this was also the default in R’s read.csv before R 4.0.0).

I updated the defaults:

image

