Detecting JSON/YAML/Perl/CSV/TSV/LTSV

By Steven Haryanto on April 29, 2014 11:21 PM

On my recent project, fsql, I want to detect (preferably cheaply) whether a (short) piece of text is JSON, YAML, Perl, CSV, TSV, or LTSV. The JSON/YAML/Perl would contain an AOA (array of arrays) or AOH (array of hashes).

File::LibMagic gives me "text/plain" all the time, so that's useless.

Here's the algorithm I'm currently using. It's far from perfect, but perfection is not the goal here. Good enough (or even half as good as that) is okay, since most of the time the file type will be inferred from the filename extension. And it's okay to be wrong as long as we're right often enough.

1. If the first line contains a tab character, it's either TSV or LTSV. The type is LTSV if the first line also contains a colon (:).

2. If text contains => it's probably Perl (but this is useless for AOA).

3. If text contains a bare word element in a comma-separated list, e.g. foo, bar, it's probably YAML, since YAML allows that kind of thing.

4. If we have something like "key":value then it's probably JSON, since YAML requires a whitespace after the colon.
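As a rough illustration, the four rules above could be sketched in Perl like this. The function name guess_format and the exact regexes are assumptions made for this example, not fsql's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a best-guess format name, or nothing
# when no rule fires (the caller would then fall back to something else).
sub guess_format {
    my ($text) = @_;
    my ($first_line) = $text =~ /\A([^\n]*)/;

    # Rule 1: a tab on the first line means TSV or LTSV;
    # a colon on that line tips it to LTSV.
    if ($first_line =~ /\t/) {
        return $first_line =~ /:/ ? 'ltsv' : 'tsv';
    }

    # Rule 2: a fat comma suggests Perl.
    return 'perl' if $text =~ /=>/;

    # Rule 3: a bareword element in a comma-separated list suggests YAML.
    return 'yaml' if $text =~ /[\[,]\s*[A-Za-z_]\w*\s*[,\]]/;

    # Rule 4: "key":value with no whitespace after the colon suggests JSON.
    return 'json' if $text =~ /"[^"\n]*":\S/;

    return;
}

print guess_format("a:1\tb:2\n"), "\n";
```

The rules are tried in the order listed, so an early rule can shadow a later one. Rule 3 in particular is crude: a CSV header such as a,b,c would also trip it.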

I'm not particularly happy with #2. Some other ideas which can be added:

* Comment (#) signifies YAML or Perl, since JSON and the other formats don't allow for comments.

* Dangling commas (e.g. [1, 2, 3, ]) are not allowed in JSON but are allowed in YAML and Perl (BTW, I've recently read that this is even required in Go, in multi-line composite literals).

* YAML requires whitespace after a comma in a list (likewise for the above-mentioned whitespace after the colon in a mapping).
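These three ideas could be sketched as negative hints, i.e. checks that rule formats out rather than in. The helper name ruled_out and its regexes are hypothetical, and the whitespace-after-comma rule is taken at face value from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hashref of format => 1 for every format
# the text is probably NOT, according to the three ideas above.
sub ruled_out {
    my ($text) = @_;
    my %no;

    # A line starting with '#' looks like a comment, which leaves
    # only YAML or Perl.
    if ($text =~ /^\s*#/m) {
        $no{$_} = 1 for qw(json csv tsv ltsv);
    }

    # A dangling comma before a closing bracket or brace rules out JSON.
    $no{json} = 1 if $text =~ /,\s*[\]\}]/;

    # A comma followed directly by something other than whitespace or a
    # closing bracket rules out YAML (per the rule stated above).
    $no{yaml} = 1 if $text =~ /,[^\s\]\}]/;

    return \%no;
}

my $no = ruled_out('[1, 2, 3, ]');
print join(', ', sort keys %$no), "\n";
```

Like the main rules, these regexes also look inside quoted strings, so a '#' or a comma in string data can mislead them.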

BTW, since the last announcement, fsql has become more usable and convenient. There is now an --add (-a) option to quickly add table files and autodetect their type. By default, fsql now outputs results in the same format as the input tables, allowing you to filter CSV or other types more conveniently. And there is a --show-schema option to let you see the schema before you write SQL queries.

10 Comments

Paul "LeoNerd" Evans | April 30, 2014 3:39 AM | Reply

You may want to take a two-step approach: first reject the guaranteed negatives, then compare the remaining candidates pairwise, finding for each pair a test that suggests which format is more likely.
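A minimal sketch of that two-step idea, assuming AOA/AOH data and tables with at least two columns. The names %could_be, @tie_breakers and narrow_down, and the individual checks, are all hypothetical.

```perl
use strict;
use warnings;

# Step 1: cheap necessary conditions. A format that fails its check is a
# guaranteed negative (under the assumptions stated above).
my %could_be = (
    json => sub { $_[0] =~ /\A\s*\[/ },
    perl => sub { $_[0] =~ /\A\s*\[/ },
    yaml => sub { 1 },    # no cheap way to rule YAML out
    csv  => sub { $_[0] =~ /\A[^\n]*,/ },
    tsv  => sub { $_[0] =~ /\A[^\n]*\t/ },
    ltsv => sub { $_[0] =~ /\A[^\n]*\t/ && $_[0] =~ /\A[^\n]*:/ },
);

# Step 2: pairwise tests; each returns the more likely of its two formats.
my @tie_breakers = (
    [ 'json', 'perl', sub { $_[0] =~ /=>/   ? 'perl' : 'json' } ],
    [ 'json', 'yaml', sub { $_[0] =~ /":\S/ ? 'json' : 'yaml' } ],
    [ 'tsv',  'ltsv', sub { $_[0] =~ /:/    ? 'ltsv' : 'tsv'  } ],
);

sub narrow_down {
    my ($text) = @_;

    # Keep only the formats that survive step 1.
    my %alive = map { $_ => 1 } grep { $could_be{$_}->($text) } keys %could_be;

    # Let each applicable pairwise test eliminate its loser.
    for my $tb (@tie_breakers) {
        my ($x, $y, $pick) = @$tb;
        next unless $alive{$x} && $alive{$y};
        my $winner = $pick->($text);
        delete $alive{ $winner eq $x ? $y : $x };
    }

    my @left = sort keys %alive;
    return @left;
}

print join(', ', narrow_down('[{"a":1}]')), "\n";
```

More than one candidate can survive when no pairwise test covers the remaining pair, so the caller still needs a final preference order.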

Toby Inkster | April 30, 2014 6:36 AM | Reply

Consider the following sample (assuming that ⇥ represents a literal tab character):

{"foo, bar, baz":⇥"=>"}

It's JSON, but it contains a tab on the first line, contains a colon, contains =>, and has something that looks like it might be a list of barewords.

If you want something vaguely reliable, the best way seems to be to start with a list of all these formats, and use your heuristics to re-order the list so the most likely candidates are in front. Then step through the list, attempting to parse the data as each format, in an eval { ... } block. The first format which doesn't throw an exception wins!
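The try-each-parser idea might look roughly like this. Only the JSON entry uses a real library (JSON::PP, which ships with Perl); the TSV parser is a hypothetical stand-in, and the names %parser and detect_by_parsing are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP ();

# Each parser must die on input it cannot handle.
my %parser = (
    json => sub { JSON::PP->new->decode($_[0]) },
    tsv  => sub {
        # Stand-in TSV check: at least two columns, same count on every row.
        my @rows = map { [ split /\t/, $_, -1 ] } split /\n/, $_[0];
        die "no rows\n" unless @rows;
        for my $row (@rows) {
            die "ragged row\n" if @$row != @{ $rows[0] };
        }
        die "single column\n" if @{ $rows[0] } < 2;
        return \@rows;
    },
);

# @ordered holds the candidate formats, most likely first.
sub detect_by_parsing {
    my ($text, @ordered) = @_;
    for my $fmt (@ordered) {
        my $data = eval { $parser{$fmt}->($text) };
        return ($fmt, $data) if defined $data;
    }
    return;
}

my ($fmt) = detect_by_parsing("{\"foo, bar, baz\":\t\"=>\"}", qw(json tsv));
print "$fmt\n";
```

The ordering matters because lax formats accept almost anything: the one-line sample above also passes the stand-in TSV check, so putting tsv first would pick the wrong format.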

Ben Bullock | April 30, 2014 7:09 AM | Reply

Would "assert_valid_json" from JSON::Parse be any use? It validates JSON without creating Perl structures, so it's up to ten times faster than JSON::XS. For invalid JSON, it stops reading at the very first byte which isn't valid JSON, and returns the exact byte location as an error. Then the bad byte can be used to determine what other structure it might be. (The undocumented variable $JSON::Parse::json_diagnostics makes it specify the error byte in JSON.)

Steven Haryanto | April 30, 2014 9:39 PM | Reply

@Ben: JSON::Parse looks interesting, bookmarked it for later.

@Paul, @Toby: Thanks for the ideas, for now I'll wait for more fail cases to see if detection improvement is really needed.

Helmut Wollmersdorfer | April 30, 2014 10:04 PM | Reply

You can see it as a classification problem and try feature detection.

A feature can be a matching pattern, e.g. reserved words:

my $patterns = {
  'true'  => qr/ \b true \b /xms,
  'false' => qr/ \b false \b /xms,
  'null'  => qr/ \b null \b /xms,
  'undef' => qr/ \b undef \b /xms,
  'Yes'   => qr/ \b Yes \b /xms,
  'No'    => qr/ \b No \b /xms,
};

# grammars have features
my $grammars = {
  JSON => [ qw/ true false null / ],
  Perl => [ qw/ true false undef / ],
  YAML => [ qw/ Yes No / ],
};

my $features = {};
my $example = 'some string ....';

for my $pattern (keys %$patterns) {
  if ($example =~ m/$patterns->{$pattern}/) {
    $features->{$pattern}++;
  }
}

use Set::Similarity::Overlap;

my $scores = {};

for my $grammar (keys %$grammars) {
  $scores->{$grammar} =
    'Set::Similarity::Overlap'->similarity(
      [keys %$features],
      $grammars->{$grammar},
    );
}

The above is deliberately simple (for instance, it also matches inside quoted strings, which gives false positives); it's only meant to show the principle.

The method has the advantage that the patterns and grammars can be incomplete or redundant.

If the patterns produce no false positives, the surviving grammars have an overlap score of exactly 1.
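To see why a surviving grammar scores exactly 1, here is a core-only sketch of the score, assuming it is the standard overlap coefficient (shared elements divided by the size of the smaller set). The function name overlap is hypothetical and is not the Set::Similarity::Overlap API.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Overlap coefficient: |A intersect B| / min(|A|, |B|).
sub overlap {
    my ($set_a, $set_b) = @_;
    my %in_a = map { $_ => 1 } @$set_a;
    my %in_b = map { $_ => 1 } @$set_b;

    my $smaller = min(scalar(keys %in_a), scalar(keys %in_b));
    return 0 unless $smaller;

    # Count the elements of A that also occur in B.
    my $common = grep { $in_b{$_} } keys %in_a;
    return $common / $smaller;
}

# Features found in a sample that contains "true" and "null".
my @found = qw(true null);
printf "JSON %.1f\n", overlap(\@found, [qw(true false null)]);
printf "Perl %.1f\n", overlap(\@found, [qw(true false undef)]);
printf "YAML %.1f\n", overlap(\@found, [qw(Yes No)]);
```

With these inputs both found features belong to the JSON grammar, so JSON scores 2/2 = 1, Perl shares only "true" and scores 1/2, and YAML shares nothing and scores 0.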

Helmut Wollmersdorfer | May 1, 2014 3:47 AM | Reply

Just a test, if posting works.

Ben Bullock replied to comment from Steven Haryanto | May 1, 2014 8:22 AM | Reply

"JSON::Parse looks interesting, bookmarked it for later."

I don't know of any other validators like this for Perl. If you want to throw the first part of the JSON at it and see if just the chunk is correct JSON, that can be done too by catching the "unexpected end of input" error. Please see this example.

https://me.yahoo.com/a/_4SYpzMbx8FkwKtRdxoaAb3yeyaQ#19673 | May 1, 2014 6:49 PM | Reply

I really like R's sqldf ( https://code.google.com/p/sqldf/ ).
I tried fsql...
Hmm, I'm still hoping for a Perl equivalent of sqldf.

Steven Haryanto replied to comment from Helmut Wollmersdorfer | May 1, 2014 11:09 PM | Reply

CSV, TSV, and LTSV don't have a lot of "patterns" though.

Steven Haryanto replied to comment from https://me.yahoo.com/a/_4SYpzMbx8FkwKtRdxoaAb3yeyaQ#19673 | May 1, 2014 11:13 PM | Reply

From a quick glance, DBD::AnyData and DBD::RAM might be closer to what you want, which is presumably a library for performing SQL queries against data structures in memory. fsql is written to be a CLI tool and is strictly file-based.

About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on perlmonks. My twitter is stevenharyanto (but I don't tweet much). Follow me on github: sharyanto.
