For my recent project, fsql, I want to detect (preferably cheaply) whether a (short) piece of text is JSON, YAML, Perl, CSV, TSV, or LTSV. The JSON/YAML/Perl text would contain an AOA (array of arrays) or AOH (array of hashes).
File::LibMagic gives me "text/plain" all the time, so that's useless.
Here's the algorithm I'm currently using. It's far from perfect, but perfection is not the goal here. Good enough (or even half as good as that) is okay, since most of the time the file type will be inferred from the filename extension. And it's okay to be wrong as long as we are right often enough.
1. If the first line contains a tab character, it's either TSV or LTSV. The type is LTSV if the first line also contains a colon (:).
2. If the text contains =>, it's probably Perl (but this is useless for AOA, since an AOA has no key/value pairs).
3. If the text contains a bareword element in a comma-separated list (e.g. foo, bar), it's probably YAML, since YAML allows unquoted strings.
4. If we have something like "key":value, then it's probably JSON, since YAML normally requires whitespace after the colon.
I'm not particularly happy with #2. Some other ideas which can be added:
* A comment (#) signifies YAML or Perl, since JSON and the other formats don't allow comments.
* Dangling commas (e.g. [1, 2, 3, ]) are not allowed in JSON but are allowed in YAML and Perl (BTW, I've recently read that Go even requires them in multi-line literals).
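* YAML requires whitespace after the colon that follows a plain (unquoted) key in a mapping, as mentioned above. A comma in a flow list, on the other hand, does not need to be followed by whitespace, so it is not a useful signal.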
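The comment and dangling-comma ideas could be checked with something like the sketch below. Again this is only an illustration in Python, with a hypothetical function name (rules_out_json) and hypothetical regular expressions. It does not try to skip quoted strings, so a # or a ", ]" inside a string value gives a false positive.

```python
import re

# "#" at the start of a line or after whitespace, treated as a comment.
_COMMENT = re.compile(r"(?m)(?:^|\s)#")

# A comma followed only by optional whitespace and a closing bracket/brace.
_DANGLING_COMMA = re.compile(r",\s*[\]}]")


def rules_out_json(text):
    """Return True if the text has a comment or a dangling comma.

    Either one suggests YAML or Perl rather than JSON.
    """
    return bool(_COMMENT.search(text) or _DANGLING_COMMA.search(text))
```

A check like this only narrows the choice to "YAML or Perl"; it would have to be combined with the other rules to pick between the two.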
BTW, since the last announcement, fsql has become more usable and convenient. There is now an -a option to quickly add table files and autodetect their type. fsql now by default outputs results in the same format as the input tables, allowing you to filter CSV or other types more conveniently. And there is a --show-schema option to let you see the schema before you write SQL queries.