February 5, 2023

NewsRoomUG

Technology Room

Why isn’t there an honest file format for tabular information?

3 min read

Tabular information is in all places. I help studying and writing tabular information in numerous codecs in all 3 of my software program utility. It is a vital a part of my data transformation software. However all of the tabular information codecs suck. There doesn’t appear to be something that’s moderately house environment friendly, easy and fast to parse and textual content based mostly (not binary) so you’ll be able to view and edit it with an ordinary editor.

Most tabular information presently will get exchanged as: CSV, Tab separated, XML, JSON or Excel. And they’re all extremely sub-optimal for the job.

CSV is a large number. One quote within the unsuitable place and the file is invalid. It’s troublesome to parse effectively utilizing a number of cores, because of the quoting (you’ll be able to’t begin parsing from half approach by a file). Completely different quoting schemes are in use. You don’t know what encoding it’s in. Use of separators and line endings are inconsistent (generally comma, generally semicolon). Writing a parser to deal with all of the completely different dialects is by no means trivial. Microsoft Excel and Apple Numbers don’t even agree on methods to interpret some edge circumstances for CSV.

Tab separated is a bit higher than CSV. However can’t retailer tabs and nonetheless has points with line endings, encodings and so forth.

XML and JSON are tree buildings and never appropriate for effectively storing tabular information (plus different points).

There may be Parquet. It is vitally environment friendly with it’s columnar storage and compression. However it’s binary, so can’t be considered or edited with customary instruments, which is a ache.

Don’t even get me began on Excel’s proprietary, ghastly binary format.

Why can’t we have now a format the place:

  • Encoding is all the time UTF-8
  • Values saved in row main order (row 1, row2 and so forth)
  • Columns are separated by u001F (ASCII unit separator)
  • Rows are separated by u001E (ASCII file separator)
  • Er, that’s all the specification.

No escaping. If you wish to put u001F or u001E in your information – robust you’ll be able to’t. Use a unique format.

It will be moderately compact, environment friendly to parse and simple to manually edit (Notepad++ exhibits the unit separator as a ‘US’ image). You could possibly write a quick parser for it in minutes. Typing u001F or u001E in some editors is likely to be a faff, however it’s hardly a showstopper.

It might be known as one thing like “unicode separated worth” (hat tip to @fakeunicode on Twitter for the title) or “unit separated worth” with file extension .usv. Perhaps a unique extension might used when values are saved in column main order (column1, column 2 and so forth).

Is there nothing like this already? Perhaps there may be and I simply haven’t heard of it. If not, shouldn’t there be?

And sure I’m conscious of the related XKCD cartoon ( https://xkcd.com/927/ ).

** Edit 4-Could-2022 **

“Javascript” -> “JSON” in para 5.

It has been pointed on the above provides you with a single line of textual content in an editor, which isn’t nice for human readability. A fast repair for this might be to make the file delimiter a u001E character adopted by an LF character. Any LF that comes instantly after an u001E can be ignored when parsing. Any LF not instantly after an u001E is a part of the information. I don’t find out about different editors, however it’s simple to view and edit in Notepad++.

Copyright © All rights reserved. | Newsphere by AF themes.