Channel: Yet Another Math Programming Consultant

CSV readers mutilating my data


R and CSV files

When I deal with regional codes such as FIPS[1] and HUC[2], CSV file readers often mutilate my regions. Here is an example in R
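(The original post shows the R session as a screenshot, which is not reproduced here. The input file is of this general shape; the FIPS codes below are real county codes, the values are made up for illustration.)

```csv
fips,value
01001,10
06037,20
```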


The leading zeros of the 5-digit FIPS codes are gone: the CSV reader interpreted the column as integers. This type conversion happens regardless of whether the column is quoted. Obviously, these codes were never integers: writing out integers would not have produced leading zeros in the first place.

We get a warning about a missing end-of-line on the last line. That is not so important. However, there is no warning about the much more pressing issue: the data is being mangled.

The warning message can be fixed by inserting a newline after the last line. Under Windows, that newline would be the two characters CR-LF (Carriage Return + Linefeed). This is R, so we can also eliminate the warning message by changing the first line! Predictability is just boring.



We can fix the most important problem -- don't mutilate my data -- using the colClasses option.



However, I am more interested in the default behavior here. Why? I receive a lot of data sets (including derived data such as shapefiles) with damaged region codes, so this appears to be a structural problem, and there is good reason to believe it is caused by CSV files somewhere in the workflow. It becomes a real issue when combining data from different sources, some with correct region codes and some without.
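The kind of damage a CSV step inflicts on a pipeline can be sketched in Python with pandas (an illustration with made-up values; the FIPS codes are real county codes):

```python
import io

import pandas as pd

# A CSV fragment with 5-digit FIPS codes; the leading zeros are significant.
raw = "fips,value\n01001,10\n06037,20\n"

# Default read: pandas infers the fips column as an integer type,
# silently dropping the leading zeros.
df = pd.read_csv(io.StringIO(raw))
print(df["fips"].tolist())  # [1001, 6037]

# Writing the frame back out bakes the damage in: any downstream reader now
# sees 4-digit codes with no hint that they were ever 5 digits.
print(df.to_csv(index=False))
```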


Excel and CSV files

Excel is doing a bit better. It doesn't truncate data without notice but gives a proper warning:


When reading into Excel's data model, we need to remove the Changed Type step:

After Changed Type step 


Dropping Changed Type step

  

Python 

The Pandas CSV reader is ruthless.



The fix is to use the dtype option: pandas.read_csv(r"\tmp\csv\data2.csv", dtype=str)
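A self-contained version of that fix, using an in-memory buffer instead of the post's file path (the data itself is illustrative):

```python
import io

import pandas as pd

raw = "fips,value\n01001,10\n06037,20\n"

# dtype=str forces every column to be read as text, so the codes survive.
fixed = pd.read_csv(io.StringIO(raw), dtype=str)
print(fixed["fips"].tolist())  # ['01001', '06037']
```

dtype can also be a per-column dict, e.g. dtype={"fips": str}, if the other columns should keep their inferred types.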

 
The duckdb embedded database [3] is doing much better:




duckdb infers from the leading zeros that this column is not an integer. Note that duckdb is also available in R.

Julia


No mercy for my data.

 



To fix this: CSV.read(raw"\tmp\csv\data2.csv", DataFrame; types=String)

SQLite



SQLite handles leading zeros the same way duckdb does.
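The sqlite3 CLI's .import reads every field as text. The same effect can be sketched with Python's standard-library sqlite3 and csv modules (a sketch of the idea, not the CLI's actual code path):

```python
import csv
import io
import sqlite3

raw = "fips,value\n01001,10\n06037,20\n"

# Parse the CSV: the csv module returns plain strings, no type inference at all.
rows = list(csv.DictReader(io.StringIO(raw)))

# Store the codes in a TEXT column: they come back exactly as written.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (fips TEXT, value INTEGER)")
conn.executemany("INSERT INTO data VALUES (:fips, :value)", rows)
print(conn.execute("SELECT fips FROM data").fetchall())  # [('01001',), ('06037',)]
```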


Conclusion

The only CSV readers that import my example CSV file correctly out of the box are the database systems SQLite and duckdb.

From this simple example, we can conclude that computing is really still in the Stone Age. CSV readers happily discard leading zeros.


References

  1. FIPS county code, https://en.wikipedia.org/wiki/FIPS_county_code
  2. Hydrologic Unit Maps, https://water.usgs.gov/GIS/huc.html
  3. duckdb, https://duckdb.org/

  

