Detecting JSON/YAML/Perl/CSV/TSV/LTSV
On my recent project, fsql, I want to detect (preferably cheaply) whether a (short) piece of text is JSON, YAML, Perl, CSV, TSV, or LTSV. The JSON/YAML/Perl would contain an AOA (array of arrays) or AOH (array of hashes).
File::LibMagic gives me "text/plain" all the time, so that's useless.
Here's the algorithm I'm currently using. It's far from perfect, but perfection is not a goal here. Good enough (or even half as good as that) is okay, since most of the time the file type will be inferred from the filename extension. And it's okay to be wrong, as long as we're right often enough.
1. If the first line contains a tab character, it's either TSV or LTSV. The type is LTSV if the first line also contains a colon (:).
2. If the text contains =>, it's probably Perl (but this is useless for AOA).
3. If the text contains a bare word element in a comma-separated list, e.g. foo, bar, it's probably YAML, since YAML allows that kind of thing.
4. If we have something like "key":value, then it's probably JSON, since YAML requires whitespace after the colon.
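In code, the heuristics above boil down to something like this (a simplified sketch, not the actual fsql implementation; the pattern details are mine):

```perl
# Simplified sketch of the detection heuristics above (not fsql's actual code).
sub detect_format {
    my ($text) = @_;
    my ($first_line) = split /\n/, $text, 2;

    if ($first_line =~ /\t/) {                    # 1. tab on the first line
        return $first_line =~ /:/ ? 'ltsv' : 'tsv';
    }
    return 'perl' if $text =~ /=>/;               # 2. fat comma
    return 'yaml' if $text =~ /\b\w+, \w+\b/;     # 3. bare words in a comma-separated list
    return 'json' if $text =~ /"[^"]*":\S/;       # 4. no whitespace after the colon
    return undef;   # otherwise fall back to detection from the filename extension
}
```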
I'm not particularly happy with #2. Some other ideas which can be added:
* A comment (#) signifies YAML or Perl, since JSON and the other formats don't allow comments.
* Dangling commas (e.g. [1, 2, 3, ]) are not allowed in JSON but are allowed in YAML and Perl (BTW, I've recently read that this is even required in Go).
* YAML requires whitespace after a comma in a list (likewise for the above-mentioned whitespace after a key name in a mapping).
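Expressed as (deliberately loose) regexes, these extra signals might look like this; the patterns are mine:

```perl
# Loose patterns for the extra signals above; quoted strings are not excluded.
sub extra_signals {
    my ($text) = @_;
    return {
        comment        => scalar( $text =~ /^\s*#/m ),      # comment line: YAML or Perl, not JSON
        dangling_comma => scalar( $text =~ /,\s*[\]\}]/ ),  # trailing comma: not JSON
        comma_no_space => scalar( $text =~ /,\S/ ),         # no space after a comma: leans JSON (per the note above)
    };
}
```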
BTW, since the last announcement, fsql has become more usable and convenient. There is now an --add (-a) option to quickly add table files and autodetect their type. fsql now by default outputs results in the same format as the input tables, allowing you to filter CSV or other types more conveniently. And there is a --show-schema option to let you see the schema before you write SQL queries.
You may want to take a two-step approach: first reject the guaranteed negatives, then take the remaining set and compare piecewise; for the formats that remain, find a test that suggests which one the text is more likely to be.
Consider the following sample (assuming that ⇥ represents a literal tab character):
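(A hypothetical stand-in for the sample:)

```json
[{"id⇥name": "1 => one", "tags": "foo, bar, baz"}]
```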
It's JSON, but it contains a tab on the first line, contains a colon, contains =>, and has something that looks like it might be a list of barewords.

If you want something vaguely reliable, the best way seems to be to start with a list of all these formats, and use your heuristics to reorder the list so the most likely candidates are in front. Then step through the list, attempting to parse the data as each format in an eval { ... } block. The first format which doesn't throw an exception wins!
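A minimal sketch of that approach (module choices and candidate order are illustrative; fsql may do it differently):

```perl
use JSON::XS ();
use YAML::XS ();

# Try each candidate parser in turn; the first one that doesn't die wins.
sub guess_format {
    my ($text) = @_;
    my @candidates = (
        [ json => sub { JSON::XS::decode_json($_[0]) } ],
        [ yaml => sub { YAML::XS::Load($_[0]) } ],
        # ... perl, csv, tsv, ltsv parsers would go here ...
    );
    # Heuristics could reorder @candidates here before trying them.
    for my $cand (@candidates) {
        my ($name, $parse) = @$cand;
        return $name if eval { $parse->($text); 1 };
    }
    return undef;   # nothing parsed cleanly
}
```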
Would assert_valid_json from JSON::Parse be any use? It validates JSON without creating Perl structures, so it's up to ten times faster than JSON::XS. For invalid JSON, it stops reading at the very first byte which isn't valid JSON, and returns the exact byte location as an error. The bad byte can then be used to determine what other structure the text might be. (The undocumented variable $JSON::Parse::json_diagnostics makes it specify the error byte in JSON.)

@Ben: JSON::Parse looks interesting, bookmarked it for later.
@Paul, @Toby: Thanks for the ideas; for now I'll wait for more fail cases to see if detection improvement is really needed.
You can see it as a classification problem and try feature detection.
A feature can be a matching pattern, e.g. reserved words:
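For example (illustrative patterns only; the real thing would be more careful):

```perl
# Count how many characteristic patterns ("reserved words") of each
# format match; the format with the highest score is the best guess.
my %features = (
    json => [ qr/"[^"]*"\s*:/, qr/\b(?:true|false|null)\b/ ],
    perl => [ qr/=>/,          qr/\bundef\b/                ],
    yaml => [ qr/^\s*-\s+\S/m, qr/^\s*\w+:\s+\S/m           ],
);

sub feature_scores {
    my ($text) = @_;
    my %score;
    for my $fmt (keys %features) {
        $score{$fmt} = grep { $text =~ $_ } @{ $features{$fmt} };
    }
    return %score;   # e.g. (json => 2, perl => 0, yaml => 1)
}
```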
The above is very simple (it also matches inside quoted strings, giving false positives); it's only there to show the principle.
The method has the advantage that the patterns and grammars can be incomplete and redundant.
If the patterns do not match false positives, the surviving grammars have an overlap score of exactly 1.
I don't know of any other validators like this for Perl. If you want to throw the first part of the JSON at it and see if just the chunk is correct JSON, that can be done too by catching the "unexpected end of input" error. Please see this example.
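Something along these lines, assuming the error text for a truncated chunk contains "Unexpected end of input" as described:

```perl
use JSON::Parse qw(assert_valid_json);

# True if $chunk is valid JSON, or a valid prefix of JSON that was
# merely cut off (assumes the truncation error mentions
# "Unexpected end of input", per the comment above).
sub json_or_json_prefix {
    my ($chunk) = @_;
    return 1 if eval { assert_valid_json($chunk); 1 };
    return 1 if $@ =~ /unexpected end of input/i;
    return 0;
}
```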
I really like R's sqldf (https://code.google.com/p/sqldf/).
I tried fsql...
Hmm, I'm still expecting Perl's sqldf equivalent.
CSV, TSV, and LTSV don't have a lot of "patterns" though.
From a quick glance, DBD::AnyData and DBD::RAM might be closer to what you want, presumably a library to perform SQL queries against your data structures in memory. fsql is written to be a CLI tool and is strictly file-based.