Detecting JSON/YAML/Perl/CSV/TSV/LTSV

For my recent project, fsql, I want to detect (preferably cheaply) whether a (short) piece of text is JSON, YAML, Perl, CSV, TSV, or LTSV. The JSON, YAML, or Perl text would contain an AOA (array of arrays) or AOH (array of hashes).

File::LibMagic gives me "text/plain" all the time, so that's useless.

Here's the algorithm I'm currently using. It's far from perfect, but perfection is not the goal here. Good enough (or even half as good as that) is okay, since most of the time the file type will be inferred from the filename extension. And it's okay to be wrong as long as we are right often enough.

1. If the first line contains a tab character, it's either TSV or LTSV. The type is LTSV if the first line also contains a colon (:).

2. If text contains => it's probably Perl (but this is useless for AOA).

3. If text contains a bare word element in a comma-separated list, e.g. foo, bar, it's probably YAML, since YAML allows that kind of thing.

4. If we have something like "key":value then it's probably JSON, since YAML requires a whitespace after the colon.
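The four rules above can be sketched roughly as follows. This is only an illustration, not the actual fsql code: the function name guess_format and the exact regexes are my assumptions, and like the rules themselves they are easy to fool (for example by commas or colons inside quoted strings).

```perl
use strict;
use warnings;

# Hypothetical helper (not from fsql): apply the four rules in order and
# return a lowercase format name, or nothing when no rule fires.
sub guess_format {
    my ($text) = @_;

    # Rule 1: a tab on the first line means TSV or LTSV; a colon decides.
    my ($first_line) = $text =~ /\A([^\n]*)/;
    if ($first_line =~ /\t/) {
        return $first_line =~ /:/ ? 'ltsv' : 'tsv';
    }

    # Rule 2: a fat comma suggests Perl (no help for an AOA).
    return 'perl' if $text =~ /=>/;

    # Rule 3: a bare word as a list element, e.g. [foo, bar], suggests YAML.
    # Caveat: JSON's true/false/null would also match this crude pattern.
    return 'yaml' if $text =~ /[\[,]\s*[A-Za-z_]\w*\s*[,\]]/;

    # Rule 4: "key":value with no whitespace after the colon suggests JSON.
    return 'json' if $text =~ /"[^"]*":\S/;

    return;
}

print guess_format(qq([{"a":1,"b":2}])), "\n";
print guess_format("[foo, bar]"), "\n";
```

When no rule fires, the caller would have to fall back to something else, such as the filename extension or a default like CSV.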

I'm not particularly happy with #2. Some other ideas which can be added:

* Comment (#) signifies YAML or Perl, since JSON and the other formats don't allow for comments.

* Dangling commas (e.g. [1, 2, 3, ]) are not allowed in JSON but allowed in YAML and Perl (BTW, I've recently read that this is even required in Go).

* YAML requires a whitespace after comma in a list (likewise for the above-mentioned whitespace after key name in a mapping).
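The first two ideas could be checked with something like the sketch below. The helper name not_json_hints and its regexes are my own assumptions, and both patterns will also fire on a # or a ", ]" that sits inside a quoted string, so they are hints rather than proof.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of clues which suggest that the
# text is NOT JSON (so more likely YAML or Perl).
sub not_json_hints {
    my ($text) = @_;
    my @hints;

    # A line whose first non-blank character is '#' looks like a comment.
    push @hints, 'comment' if $text =~ /^\s*#/m;

    # A comma directly before a closing bracket or brace is a dangling comma.
    push @hints, 'trailing_comma' if $text =~ /,\s*[\]\}]/;

    return @hints;
}

print join(",", not_json_hints("[1, 2, 3, ]")), "\n";
```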

BTW, since the last announcement, fsql has become more usable and convenient. There is now an --add (-a) option to quickly add table files and autodetect their type. By default, fsql now outputs the result in the same format as the input tables, allowing you to filter CSV or other types more conveniently. And there is a --show-schema option to let you see the schema before you write SQL queries.

10 Comments

You may want to take a two-step approach: first reject the guaranteed negatives, then, for each pair of formats that remain, find a test that suggests which of the two the text is more likely to be.

Consider the following sample (assuming that ⇥ represents a literal tab character):

{"foo, bar, baz":⇥"=>"}

It's JSON, but it contains a tab on the first line, contains a colon, contains =>, and has something that looks like it might be a list of barewords.

If you want something vaguely reliable, the best way seems to be to start with a list of all these formats, and use your heuristics to re-order the list so the most likely candidates are in front. Then step through the list, attempting to parse the data as each format, in an eval { ... } block. The first format which doesn't throw an exception wins!
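A minimal sketch of that loop might look like this. Only the JSON entry uses a real parser (JSON::PP, which ships with Perl); the TSV entry is a deliberately naive stand-in, and the names %parser_for and first_parsable are made up for the illustration. A real version would plug in a proper parser for each format.

```perl
use strict;
use warnings;
use JSON::PP ();

# One parser per format; each must die on input it cannot handle.
my %parser_for = (
    json => sub { JSON::PP->new->decode($_[0]) },
    tsv  => sub {    # naive stand-in, not a real TSV parser
        my @rows = map { [ split /\t/, $_, -1 ] } split /\n/, $_[0];
        die "no tab-separated rows\n" unless @rows && @{ $rows[0] } > 1;
        return \@rows;
    },
);

# Try the candidates in the given (heuristically sorted) order and return
# (format, data) for the first parser that yields a data structure.
sub first_parsable {
    my ($text, @candidates) = @_;
    for my $format (@candidates) {
        my $parse = $parser_for{$format} or next;
        my $data  = eval { $parse->($text) };
        return ($format, $data) if ref $data;
    }
    return;
}

my ($format) = first_parsable(qq({"foo, bar, baz":\t"=>"}), qw(json tsv));
print "$format\n";
```

One thing the sketch makes visible: a lenient format such as TSV accepts almost any text containing a tab, so strict formats probably belong near the front of the list no matter what the heuristics say.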

Would "assert_valid_json" from JSON::Parse be any use? It validates JSON without creating Perl structures, so it's up to ten times faster than JSON::XS. For invalid JSON, it stops reading at the very first byte which isn't valid JSON, and returns the exact byte location as an error. Then the bad byte can be used to determine what other structure it might be. (The undocumented variable $JSON::Parse::json_diagnostics makes it specify the error byte in JSON.)

You can see it as a classification problem and try feature detection.

A feature can be a matching pattern, e.g. reserved words:

my $patterns = {
  'true'  => qr/ \b true \b /xms,
  'false' => qr/ \b false \b /xms,
  'null'  => qr/ \b null \b /xms,
  'undef' => qr/ \b undef \b /xms,
  'Yes'   => qr/ \b Yes \b /xms,
  'No'    => qr/ \b No \b /xms,
};

# grammars have features
my $grammars = {
  JSON => [ qw/ true false null / ],
  Perl => [ qw/ true false undef / ],
  YAML => [ qw/ Yes No / ],
};

my $features = {};

my $example = 'some string ....';

for my $pattern (keys %$patterns) {
  if ($example =~ m/$patterns->{$pattern}/) {
    $features->{$pattern}++;
  }
}

use Set::Similarity::Overlap;

my $scores = {};

for my $grammar (keys %$grammars) {
  $scores->{$grammar} =
    'Set::Similarity::Overlap'->similarity(
      [keys %$features],
      $grammars->{$grammar},
    );
}

The above is deliberately simplistic (for example, it also matches inside quoted strings, which gives false positives); it's only meant to show the principle.

The method has the advantage that the patterns and grammars can be incomplete or redundant.

If the patterns do not match any false positives, the surviving grammars have an overlap score of exactly 1.
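To see where the score of 1 comes from, here is a small stand-alone sketch of the overlap coefficient, assuming the usual definition: the size of the intersection divided by the size of the smaller set. The function name overlap is mine and this is not the module's code.

```perl
use strict;
use warnings;

# Overlap coefficient of two lists treated as sets:
# (size of intersection) / (size of the smaller set); 0 if either is empty.
sub overlap {
    my ($list_a, $list_b) = @_;
    my %in_a = map { $_ => 1 } @$list_a;
    my %in_b = map { $_ => 1 } @$list_b;
    my $size_a  = scalar keys %in_a;
    my $size_b  = scalar keys %in_b;
    my $common  = grep { $in_b{$_} } keys %in_a;
    my $smaller = $size_a < $size_b ? $size_a : $size_b;
    return $smaller ? $common / $smaller : 0;
}

# Suppose the text matched the 'true' and 'null' patterns.
my @found = qw(true null);
print overlap(\@found, [qw(true false null)]),  "\n";   # JSON grammar
print overlap(\@found, [qw(true false undef)]), "\n";   # Perl grammar
print overlap(\@found, [qw(Yes No)]),           "\n";   # YAML grammar
```

With those two features, the JSON grammar contains every feature found and scores 1, the Perl grammar shares only one of the two and scores 0.5, and the YAML grammar scores 0.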

JSON::Parse looks interesting, bookmarked it for later.

I don't know of any other validators like this for Perl. If you want to throw the first part of the JSON at it and see if just the chunk is correct JSON, that can be done too by catching the "unexpected end of input" error. Please see this example.

I really like R's sqldf (https://code.google.com/p/sqldf/).
I tried fsql...
Hmm, I'm still waiting for a Perl equivalent of sqldf.

About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on perlmonks. My twitter is stevenharyanto (but I don't tweet much). Follow me on github: sharyanto.