R: Factors vs Strings on large data sets

When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.

Here is an illustration of how things are stored in an .Rdata file:

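Each string carries 8 bytes of overhead on top of its characters. With millions of rows this adds up. An integer vector, which is what a factor stores for its codes, takes much less space: 4 bytes per element.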
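To get a feel for the per-string overhead, here is a minimal sketch that counts serialized bytes for the same column stored as strings and as a factor. It uses base R's `serialize()`, whose binary format is closely related to what `save()` writes, rather than my own .Rdata writer, and the toy column `x_chr` is made up for illustration.

```r
# Hypothetical toy column: 1,000 short labels drawn from three distinct values.
x_chr <- rep(c("i1", "i2", "i3"), length.out = 1000)
x_fac <- factor(x_chr)

# serialize(x, NULL) returns the bytes R would write as a raw vector
# (binary XDR format by default).
bytes_chr <- length(serialize(x_chr, NULL))
bytes_fac <- length(serialize(x_fac, NULL))

cat("character vector:", bytes_chr, "bytes\n")
cat("factor:          ", bytes_fac, "bytes\n")
```

The character vector writes every string in full, including its per-string header, while the factor writes one integer per element plus the three level strings once.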

When we generate a large dataframe using gdx2r we see the following.

The “short” data set is generated as follows:

set i /i1*i200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Exported to an .Rdata file (with StringsAsFactors=F) and loaded into R, this looks like:

> load("p.rdata")
> head(p)
   i  j  k     value
1 i1 i1 i1 0.1717471
2 i1 i1 i2 0.8432667
3 i1 i1 i3 0.5503754
4 i1 i1 i4 0.3011379
5 i1 i1 i5 0.2922121
6 i1 i1 i6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : chr  "i1" "i1" "i1" "i1" ...
 $ j    : chr  "i1" "i1" "i1" "i1" ...
 $ k    : chr  "i1" "i2" "i3" "i4" ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...

The “large” data set uses much longer element names:

set i /amuchlongernamefortesting1*amuchlongernamefortesting200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Here we export to an .Rdata file with StringsAsFactors=T:

> load("p2.rdata")
> head(p)
                           i                          j                          k     value
1 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting1 0.1717471
2 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting2 0.8432667
3 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting3 0.5503754
4 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting4 0.3011379
5 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting5 0.2922121
6 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ k    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...
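The `str` output shows what a factor really is: a vector of integer codes plus a short table of level strings. A small sketch in plain R reproduces that structure without needing the .Rdata file. The vector `k` here is a made-up stand-in for the real column, and passing `levels = labels` is my assumption for getting the levels in set order (by default `factor()` sorts them alphabetically).

```r
# Stand-in for the k column: the 200 long labels, repeated three times.
labels <- paste0("amuchlongernamefortesting", 1:200)
k <- rep(labels, times = 3)

# Keep the levels in set order instead of the default alphabetical order.
f <- factor(k, levels = labels)

typeof(f)                # "integer": the data part is just integer codes
nlevels(f)               # 200 distinct level strings, stored once
head(as.integer(f), 10)  # codes 1..10, as in the str() output
identical(as.character(f), k)  # the original strings are fully recoverable
```

Each long name is stored once in the levels, and every row only pays for a 4-byte code.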

The timings are indeed what we expected:

Uncompressed, the .Rdata files are much smaller when using factors. Longer strings make this effect more pronounced.
Compressed, the .Rdata files are about the same size whether we use strings or factors. But with factors the compression itself is faster (fewer bytes to compress).
Data frames with StringsAsFactors=T use less memory inside R. They also load faster.
Conclusion: making StringsAsFactors=T the default makes sense (just as R’s read.csv did before R 4.0.0).
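The size and memory observations can be checked with a rough sketch in base R. This uses R's own `save()` instead of gdx2r, and a 50×50×50 data frame as a smaller stand-in for the 200×200×200 one, so the absolute numbers will differ from mine. The helper `size_on_disk` is a made-up name for illustration.

```r
# Smaller stand-in for the "large" data set: 50^3 = 125,000 rows.
set.seed(1)
labels <- paste0("amuchlongernamefortesting", 1:50)
p_chr <- expand.grid(i = labels, j = labels, k = labels,
                     stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)
p_chr$value <- runif(nrow(p_chr))

# Same data with the three index columns converted to factors.
p_fac <- p_chr
p_fac[c("i", "j", "k")] <- lapply(p_chr[c("i", "j", "k")], factor)

# Hypothetical helper: save a data frame to a temp file and return its size in bytes.
size_on_disk <- function(df, compress) {
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))
  save(df, file = f, compress = compress)
  file.size(f)
}

sizes <- c(
  chr_raw = size_on_disk(p_chr, compress = FALSE),
  fac_raw = size_on_disk(p_fac, compress = FALSE),
  chr_gz  = size_on_disk(p_chr, compress = TRUE),
  fac_gz  = size_on_disk(p_fac, compress = TRUE)
)
print(sizes)

# In-memory footprint as reported by object.size() (assumes a 64-bit build of R).
mem <- c(chr = as.numeric(object.size(p_chr)),
         fac = as.numeric(object.size(p_fac)))
print(mem)
```

Timing `save()` and `load()` with `system.time()` on the same two data frames is the natural next step for checking the speed claims.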

I updated the defaults:
