Channel: Yet Another Math Programming Consultant

Compression and large tables (Excel and CSV)





Here is an interesting little experiment: load a large CSV file into Excel. My original PowerPoint slide was not as complete as it should have been. Here is more size info:



I argue that a column-store database can compress data better. Let's see if I can put some meat on the bones.
  • The data file is psd_alldata.csv, which comes from [1]. The file is 190 MB and has 1.97 million records (rows).
  • The downloaded zipped version is psd_alldata_csv.zip. This file is 9.8 MB.
  • When we try to load this into an Excel sheet, only about 1 million rows are loaded (a worksheet holds at most 1,048,576 rows). This incomplete xlsx file (zipped XML) is 50 MB (Partial load to sheet.xlsx). My guess is that, to save time, Excel does not use a high compression level when saving spreadsheets.
  • When we load this into the Excel data model (Power Query data), the database engine compresses the data heavily. When this is saved to an xlsx file, it is just 5.4 MB (psd_alldata.xlsx).
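To make the comparison easier to read, the sizes in the list above can be turned into rough compression ratios. This is only a back-of-the-envelope sketch: the variable names are mine, the sheet row count is taken to be Excel's worksheet limit of 1,048,576 rows, and the CSV share that landed on the sheet is estimated by assuming all rows are about the same size.

```python
# Sizes in MB as reported in the list above.
csv_mb = 190.0         # psd_alldata.csv
zip_mb = 9.8           # psd_alldata_csv.zip
xlsx_sheet_mb = 50.0   # Partial load to sheet.xlsx (incomplete)
xlsx_model_mb = 5.4    # psd_alldata.xlsx (data model)

total_rows = 1_970_000  # approximate row count of the CSV file
sheet_rows = 1_048_576  # assumption: the sheet was filled up to Excel's row limit

# Assumption: rows are roughly equal in size, so the part of the CSV that
# made it onto the sheet is proportional to the number of rows loaded.
csv_on_sheet_mb = csv_mb * sheet_rows / total_rows

ratios = {
    "zip of csv": csv_mb / zip_mb,
    "xlsx sheet (partial)": csv_on_sheet_mb / xlsx_sheet_mb,
    "xlsx data model": csv_mb / xlsx_model_mb,
}

for name, ratio in ratios.items():
    print(f"{name:22s} {ratio:5.1f} : 1")
```

Under these assumptions the worksheet version shrinks the data by only about a factor of two, while the data model version ends up smaller than the zip file of the CSV.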
So indeed, the in-memory database in Excel compresses the data much better than plain zipped XML: 5.4 MB for all 1.97 million rows versus 50 MB for roughly the first million.
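One way to get a feel for why a column layout can help is to compress the same table twice, once stored row by row and once stored column by column. The sketch below is a toy illustration only: it uses a small synthetic table (not the PSD data), made-up commodity and country names, and plain zlib rather than whatever encoding the Excel data model actually uses, which I have not inspected.

```python
import random
import zlib

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic table (hypothetical data, not the PSD file), sorted by
# commodity, country and year, with a random numeric value per record.
commodities = [f"Commodity{i}" for i in range(10)]
countries = [f"Country{i}" for i in range(20)]
years = range(1960, 2020)

rows = [
    (commodity, country, str(year), str(random.randint(0, 99999)))
    for commodity in commodities
    for country in countries
    for year in years
]

# Row-wise layout: one comma-separated line per record, as in a CSV file.
row_bytes = "\n".join(",".join(record) for record in rows).encode()

# Column-wise layout: all values of the first column, then all values of
# the second column, and so on. Same values, same number of separators.
col_bytes = "\n".join(",".join(column) for column in zip(*rows)).encode()

row_size = len(zlib.compress(row_bytes, 9))
col_size = len(zlib.compress(col_bytes, 9))

print(f"uncompressed:        {len(row_bytes)} bytes (both layouts)")
print(f"row-wise, zlib:      {row_size} bytes")
print(f"column-wise, zlib:   {col_size} bytes")
```

In the column-wise layout the repetitive columns (commodity, country, year) turn into long runs that the compressor handles almost for free, so only the value column costs real space. A real column store typically goes further with techniques such as dictionary and run-length encoding, which this sketch does not attempt.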


References


  1. US Department of Agriculture, Foreign Agricultural Service, FAS Home / Market and Trade Data / PSD Online / Reports and Data / PSD Data Sets, https://apps.fas.usda.gov/psdonline/app/index.html#/app/downloads
