There are a number of very fast CSV file readers available in R and Python. Let's run a quick test to see how they compare.
Generating CSV file
I generated a very simple, but large CSV file with 100 million records using a GAMS script as follows:
set |
The generated CSV file looks like:
D:\tmp\csv>head d.csv
D:\tmp\csv>dir d.*
 Directory of D:\tmp\csv
12/08/2016  03:42 PM     3,656,869,678 d.csv
D:\tmp\csv>
We also see the CSV file is much larger than the intermediate (compressed) GAMS GDX file.
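For readers without GAMS, here is a minimal Python sketch of the same idea: write a simple CSV file, write a gzip-compressed copy, and compare the sizes on disk. The three-column layout (`i`, `j`, `value`), the helper name `write_csv`, and the row count are my own assumptions for illustration; they are not taken from the GAMS script above, and the numbers will not match the 100 million record file.

```python
import csv
import gzip
import os
import tempfile


def write_csv(path, n_rows, compress=False):
    """Write n_rows records in a made-up (i, j, value) layout to path."""
    opener = gzip.open if compress else open
    with opener(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["i", "j", "value"])  # hypothetical header
        for k in range(n_rows):
            writer.writerow([f"i{k % 100}", f"j{k // 100}", k * 0.5])


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "d.csv")
    packed = os.path.join(tmp, "d.csv.gz")
    write_csv(plain, 10_000)                  # small stand-in for the real file
    write_csv(packed, 10_000, compress=True)  # same records, gzip-compressed
    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)

print(f"plain: {plain_size:,} bytes, gzip: {packed_size:,} bytes")
```

Regular, repetitive records like these compress well, which is consistent with the compressed GDX file being much smaller than the CSV file.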
R read.csv
This is the default CSV reader in R.
> system.time(d<-read.csv("d.csv"))
R read_csv
read_csv is from the readr package, and it is much faster for large CSV files:
Would it help to read a compressed CSV file?
> system.time(d<-read_csv("d2.csv.gz"))
Bummer. I have no idea what went wrong here. Maybe we hit some size limit (note that the CSV file is larger than 2 GB; other compression formats gave the same result).
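One way to narrow this down would be to check the compressed file independently of readr, for example by streaming it and counting the records. The sketch below does that in Python with the standard `gzip` and `csv` modules. It is only a diagnostic idea, not an explanation of the failure above; the helper name `count_records` and the tiny sample file are assumptions for illustration, and it assumes the file has a header row.

```python
import csv
import gzip
import os
import tempfile


def count_records(path):
    """Stream a gzip-compressed CSV and count data rows (header excluded)."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Small stand-in file; the real d2.csv.gz is not reproduced here.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(sample, "wt", newline="") as f:
        w = csv.writer(f)
        w.writerow(["i", "value"])
        w.writerows([k, k * k] for k in range(1000))
    n = count_records(sample)

print(n)
```

If the count on the real file matches the number of generated records, the archive itself is intact and the problem is more likely in the reader.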
Python pandas.read_csv
Quite fast:
t0=pc()
158.2270488541103
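The timing pattern used here can be sketched as a small helper. I am assuming that `pc` refers to `time.perf_counter`; the helper `timed` and the stand-in reader `read_rows` are hypothetical names, and the standard library `csv` module is swapped in so the example runs without pandas.

```python
import csv
import io
from time import perf_counter as pc  # assumption: this is what pc() refers to


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    t0 = pc()
    result = fn(*args, **kwargs)
    return result, pc() - t0


def read_rows(text):
    """Stand-in reader: parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# 5000 data rows plus one header row, built in memory
sample = "i,value\n" + "".join(f"{k},{k * 2}\n" for k in range(5000))
rows, seconds = timed(read_rows, sample)
print(len(rows), "rows in", seconds, "seconds")
```

With pandas installed, the same helper could be used as `df, seconds = timed(pandas.read_csv, "d.csv")`.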
The paratext library should be even faster.
References
- readr 1.0.0, https://blog.rstudio.org/2016/08/05/readr-1-0-0/
- Damian Eads, ParaText: CSV parsing at 2.5 GB per second, http://www.wise.io/tech/paratext