Reading CSV files in R: read.csv vs read

There are a number of very fast CSV file readers available in R and Python. Lets have a quick test to see how they compare.

Generating CSV file

I generated a very simple, but large CSV file with 100 million records using a GAMS script as follows:

set
i /a1*a100/
j /b1*b100/
k /c1*c100/
l /d1*d100/
;

parameter d(i,j,k,l);
d(i,j,k,l) = uniform(0,1);

$setenv gdxcompress 1
execute_unload"d.gdx",d;
execute"gdxdump d.gdx output=d.csv symb=d delim=comma format=csv";

The generated CSV file looks like:

D:\tmp\csv>head d.csv
"i","j","k","l","Val"
"a1","b1","c1","d1",0.171747132
"a1","b1","c1","d2",0.843266708
"a1","b1","c1","d3",0.550375356
"a1","b1","c1","d4",0.301137904
"a1","b1","c1","d5",0.292212117
"a1","b1","c1","d6",0.224052867
"a1","b1","c1","d7",0.349830504
"a1","b1","c1","d8",0.856270347
"a1","b1","c1","d9",0.067113723

D:\tmp\csv>dir d.*
Volume in drive D is My Passport
Volume Serial Number is 74B7-6DCC

Directory of D:\tmp\csv

12/08/2016 03:42 PM     3,656,869,678 d.csv
12/08/2016 03:30 PM       806,199,476 d.gdx
               2 File(s) 4,463,069,154 bytes
               0 Dir(s) 1,099,214,491,648 bytes free

D:\tmp\csv>

We also see the CSV file is much larger than the intermediate (compressed) GAMS GDX file.

R read.csv

This is the default CSV reader in R.

> system.time(d<-read.csv("d.csv"))
   user  system elapsed 
1361.61   50.56 1434.39

R read_csv

read_csv is from the readr package, and it is much faster for large CSV files:

> system.time(d<-read_csv("d.csv"))
Parsed with column specification:
cols(
  i = col_character(),
  j = col_character(),
  k = col_character(),
  l = col_character(),
  Val = col_double()
)
|================================================================================| 100% 3487 MB
   user  system elapsed 
 186.23    5.66  196.20

Would it help to read a compressed CSV file?

> system.time(d<-read_csv("d2.csv.gz"))
Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed
Timing stopped at: 57.53 4.43 62.29

Bummer. I have no idea what went wrong here. May be we hit some size limit (note the CSV file is larger than 2 gb; other compression formats gave the same result).

Python pandas.read_csv

Quite fast:

t0=pc()
df=pd.read_csv("d.csv")
print(pc()-t0)

158.2270488541103

The paratext library should be even faster.

References

readr 1.0.0, https://blog.rstudio.org/2016/08/05/readr-1-0-0/
Damian Eads, ParaText: CSV parsing at 2.5 GB per second, http://www.wise.io/tech/paratext

Reading CSV files in R: read.csv vs read_csv

Generating CSV file

R read.csv

R read_csv

Python pandas.read_csv

References

Trending Articles

RAMAYAMPET Mandal Sarpanch | Upa-Sarpanch | Ward member Mobile Numbers Medak...

लड़कियां सेक्स के दौरान क्यों करती है उह! आह!लड़कियां सेक्स के दौरान क्यों करती...

Neem Baba Extra Questions Answer Class 6 English Poorvi

Throw Back: 4×4 — Sikilitele (Ft Castro) Prod by JQ

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

Lowe faces four theft charges

Practice Sheet of Right form of verbs for HSC Students

Mafia, Murder & Mayhem In The Motor City: Detroit Mob Hit Timeline (1937-2007)

The 10 Tennessee Cities With The Largest Black Population For 2021

Materials Around Us Class 6 Worksheet Science Chapter 6

デスクトップヒープの枯渇

Best Suvichar in Hindi |बेस्ट सुविचार |शुभ विचार हिंदी में

Kanulanu Thaake Lyrics and translation | Manam (2014)

Korean Sex Porn Videos: XXX Videos & Free Porn Movies

Teen Shot In Miami Drive-By Dies From Injuries

Download: IQ Muzatasha feat Shy D & Pmj – Ulesi NiFertilizer Yamavuto

Mahakal Attitude Status

Property developer set up cannabis factory to help pay off debts...

♡

KB: How to troubleshoot issues when adding a Hyper-V host in System Center...