GAMS has three easy ways to populate a SQLite database:
- Using the tool gdx2sqlite. This tool populates a SQLite database with data from a GDX file. This means we first have to export GAMS data to a GDX file. As there is quite some file I/O going on here (writing GDX file, reading GDX file, writing database), I would expect this to be slower than the next method.
- The new GAMS-connect facility. This does not use intermediate files, and directly copies records from in-memory data. This should be the fastest.
- Old fashioned CSV files. We first export data as a GDX file, and then use gdxdump to convert the data to a CSV file. Then sqlite can import the CSV file, and populate the database. There is much file I/O here, so this should be slow.
---- 86 PARAMETER timings elapsed time (seconds)
gdx2sqlite 6.329, GAMSConnect 39.480, CSV 17.053
GDX2SQLITE v 0.7
GDX Library 41.4.0 caab8bc0 Dec 14, 2022 WEI x86 64bit/MS Window
Current directory:C:\Users\erwin\Downloads
InputFile:pdata.gdx
OutputFile:p1.sqlite
Symbols:1
Uels:50
Loading Uels
Processing Symbols
p(6250000 records)
Inserts:6250000
Elapsed:5.96 seconds
Inserts/second:1048130.14
Done
About a million inserts per second. Most other databases are way slower. That may also mean that these timing issues are likely not to be very pronounced when using other databases: the inserts may be much slower.
Some notes
Another smaller issue is that GAMS-connect generates slightly strange column names:
C:\Users\erwin\Downloads>sqlite3 p2.sqlite
SQLite version 3.8.52014-06-0414:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p (
i1_0 TEXT,
i2_1 TEXT,
i3_2 TEXT,
i4_3 TEXT,
value FLOAT
);
sqlite>
As the names for the GAMS indices are already unique, there is no need to mangle them.
Finally, I am not sure how to tell GAMS-connect to use a database in the current working directory. I think we always need to specify a full path using URL notation (with a lot of / characters).
The CSV method always uses TEXT as column type:
C:\Users\erwin\Downloads>sqlite3 p3.sqlite
SQLite version 3.8.52014-06-0414:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p(
"i1" TEXT,
"i2" TEXT,
"i3" TEXT,
"i4" TEXT,
"Val" TEXT
);
sqlite>
In SQLite that is not a problem: column types are merely hints. Double-precision numbers are stored as floats. SQLite can have a different type for each value. Putting it differently: types are associated with values instead of columns [2]. However, it may cause problems with other software. If we want to force the column type, we can do something like:
$onecho > import.sql CREATE TABLE p(i1 TEXT, i2 TEXT, i3 TEXT, i4 TEXT, val FLOAT); .import --csv --skip 1 p.csv p $offecho |
Conclusion
SQLite can be very fast. Achieving a million inserts per second is a phenomenal performance. Only gdx2sqlite seems to achieve this. I would have guessed that the new tool GAMS-connect would be a little bit faster.
GAMS-connect is quite new, so it is understandable that things are not fully optimized yet. Future releases may be much faster.
References
- SQLAlchemy, The Python SQL Toolkit and Object Relational Mapper, https://www.sqlalchemy.org/.
- Data types in SQLite, https://www.sqlite.org/datatype3.html
Appendix: GAMS code
$onText
Populate a SQLite database with a large parameter Number of records: 6.25m
$offText
*---------------------------------------------------- * clean up before we start *---------------------------------------------------- $call rm -f p1.sqlite $call rm -f p2.sqlite $call rm -f p3.sqlite $call rm -f p.csv
*---------------------------------------------------- * random data *---------------------------------------------------- Set i /index1*index50/; alias(i,i1,i2,i3,i4); Parameter p(i1,i2,i3,i4); p(i1,i2,i3,i4) = uniform(0,1);
*---------------------------------------------------- * timing *----------------------------------------------------
Parameter timings(*) 'elapsed time (seconds)'; Scalar t 'current time';
*---------------------------------------------------- * method1: gdx2sqlite via GDX file *----------------------------------------------------
t = timeElapsed;
execute_unload "pdata",p; execute "gdx2sqlite -i pdata.gdx -o p1.sqlite";
timings('gdx2sqlite') = timeElapsed - t;
*---------------------------------------------------- * method2: new GAMS Connect facility *----------------------------------------------------
t = timeElapsed;
embeddedCode Connect:
- GAMSReader: symbols: - name: p - PandasSQLWriter: connection: "sqlite:////users//erwin//downloads//p2.sqlite" symbols: - name: p tableName: p endEmbeddedCode
timings('GAMSConnect') = timeElapsed - t;
*---------------------------------------------------- * method3: CSV files *----------------------------------------------------
$onecho > import.sql .mode csv .import p.csv p $offecho
t = timeElapsed;
execute_unload "pdata",p; execute "gdxdump pdata.gdx symb=p format=csv output=p.csv"; execute "sqlite3 p3.sqlite < import.sql";
timings('CSV') = timeElapsed - t;
display timings; |