Introducing Sah, another data validation framework

This blog post introduces Sah, my data validation framework (or data validation language and validator generator, to be more exact). The very first work on Sah began almost 4 years ago as Data::Schema. The name change to Sah and the first release of Data::Sah happened in late 2011.

To validate data, you first write a Sah schema. Sah schemas are themselves data structures and are very similar to JSON schemas, except that they are more featureful and (at least to me) more convenient to write.

The design and implementation principles are mostly laziness and DRY:


  • I want concise syntax for common things; I don't want to have to type a lot, especially when it is mostly the same thing each time;
  • I want to use it for everything, from validating function arguments to web requests; I don't want to have to use two different validation languages (e.g. one for function arguments, another for class attributes, another for JSON data);
  • I want to generate JavaScript client-side validation code from the schema; I don't want to have to write JavaScript code manually;
  • If someday server-side validation needs to be performed in languages other than Perl, I want to generate that code too instead of having to write it manually;
  • I want good default error messages; most of the time I don't want to have to write error messages manually;
  • I want translation of error messages.

You can easily see the above pattern of "I don't want to have to do XXX". It's my favorite mantra nowadays.

Some examples of Sah schemas (in Perl notation):


["int", {min => 1}]   # an optional integer, greater than or equal to 1
["int", {min => 1, max => 10}]   # an optional integer between 1 and 10
["int", min => 1, max => 10]   # shortcut notation for the above
["int", req=>1, min => 1, max => 10]   # a required integer between 1 and 10
["int*", min => 1, max => 10]   # * after type name is shortcut notation for req=>1
["int*"]   # a required integer, with no extra clauses
"int*"   # shortcut notation for the above
["array*", of => ["int*" => between => [0,10]]]  # a slightly more complex example
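
All of the shortcut forms above describe the same thing: a type name plus a hash of clauses, with the trailing * folded into req => 1. As a rough illustration only (this is not Data::Sah's actual normalizer, and normalize_schema_sketch is a hypothetical name), the expansion could be sketched like this:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: expand the shortcut notations
# into the long form [$type, \%clauses]. Nested schemas (e.g. the value of
# an "of" clause) are left untouched.
sub normalize_schema_sketch {
    my ($schema) = @_;

    # "int*" is shorthand for ["int*"]
    $schema = [$schema] unless ref($schema) eq 'ARRAY';

    my ($type, @rest) = @$schema;

    # [type, {clauses}] is the long form; [type, k => v, ...] is the flat form
    my %clauses = (@rest == 1 && ref($rest[0]) eq 'HASH') ? %{ $rest[0] } : @rest;

    # a trailing * on the type name is shorthand for req => 1
    $clauses{req} = 1 if $type =~ s/\*\z//;

    return [$type, \%clauses];
}
```

With this sketch, "int*", ["int*"] and ["int", req => 1] all come out as ["int", {req => 1}], and ["int*", min => 1, max => 10] comes out as ["int", {min => 1, max => 10, req => 1}].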

Sah is available on CPAN as two distributions (and certainly more in the future): Sah contains the specification for the schema language, and Data::Sah is the compiler. There will be plugins and extensions in the future.

Speed

Instead of evaluating your schema against input data directly, Data::Sah compiles the schema into Perl code, and that generated code is what validates the data. A compiled validator has the advantage of being one or two orders of magnitude faster than an interpreted one. A casual benchmark done about three months ago demonstrates this.
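
To make the difference concrete, here is a toy sketch, not Data::Sah's real code generator. It handles only the min and max clauses, and both function names are hypothetical. The interpreter inspects the clause hash on every call; the compiler inspects it once, builds a Perl expression as a string, and evals that into a sub.

```perl
use strict;
use warnings;

# Interpreted: inspect the clauses every time a value is checked.
sub validate_interpreted {
    my ($clauses, $data) = @_;
    return 1 unless defined $data;    # skip if undef
    return 0 if exists $clauses->{min} && $data < $clauses->{min};
    return 0 if exists $clauses->{max} && $data > $clauses->{max};
    return 1;
}

# Compiled: inspect the clauses once and generate a specialized sub.
# (Hypothetical name; assumes the clause values are plain numbers.)
sub compile_validator_sketch {
    my ($clauses) = @_;
    my @terms;
    # "+ 0" forces the clause values to numbers before they are
    # interpolated into the generated source
    push @terms, sprintf('($data >= %s)', $clauses->{min} + 0) if exists $clauses->{min};
    push @terms, sprintf('($data <= %s)', $clauses->{max} + 0) if exists $clauses->{max};
    my $expr = @terms ? join(' && ', @terms) : '1';
    my $src  = "sub { my (\$data) = \@_; !defined(\$data) ? 1 : ($expr) ? 1 : 0 }";
    my $code = eval $src or die "compile failed: $@";
    return $code;
}
```

For the clauses {min => 1, max => 100}, both validate_interpreted($clauses, 50) and compile_validator_sketch($clauses)->(50) return 1, but the compiled sub never looks at the clause hash again.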

Since I want to validate function arguments, speeds on the order of thousands of validations per second (like those of Data::Domain, Data::Verifier, or even Data::FormValidator) are inadequate. Functions sometimes need to be called much more often than that. Data::Sah is faster than all the other validation modules that I tested, except Params::Validate (PV). PV is very barebones, though: even some simple checks require you to supply a callback routine, which degrades performance rather quickly.

An example of Perl code generated by Data::Sah, for schema [int => min => 1, max=>100, div_by => 3]:

$ perl -MData::Sah -E'$sah = Data::Sah->new; $c = $sah->get_compiler("perl"); $res = $c->expr_validator_sub(schema => [int => min=>1, max=>100, div_by=>3]); say $res'
do {
    require Scalar::Util;
    sub {
        my($data) = @_;
        my $_sahv_res = 
        
            # skip if undef
            (!defined($data) ? 1 : 
            
            (# check type 'int'
            (Scalar::Util::looks_like_number($data) =~ /^(?:1|2|9|10|4352)$/)
            
            &&
            
            (# clause: div_by
            ($data % 3 == 0))
            
            &&
            
            (# clause: min
            ($data >= 1))
            
            &&
            
            (# clause: max
            ($data <= 100))));
        
        return($_sahv_res);
    }}

JavaScript output

Compilation can also produce other targets, like JavaScript. A simple demo will be provided in subsequent blog posts, as I will be writing a form rendering/processing library soon.

The previous example, outputting JS instead of Perl:

$ perl -MData::Sah -E'$sah = Data::Sah->new; $c = $sah->get_compiler("js"); $res = $c->expr_validator_sub(schema => [int => min=>1, max=>100, div_by=>3]); say $res'
function(data) {
    var tmp_data = [];
    var _sahv_res = 
    
        // skip if undef
        (!!(data === undefined || data === null) ? true : 
        
        (// check type 'int'
        (typeof(data)=='number' && Math.round(data)==data || parseInt(data)==data)
        
        &&
        
        (tmp_data[0] = typeof(data)=='number' ? data : parseFloat(data), true)
        
        &&
        
        (// clause: div_by
        (tmp_data[0] % 3 == 0))
        
        &&
        
        (// clause: min
        (tmp_data[0] >= 1))
        
        &&
        
        (// clause: max
        (tmp_data[0] <= 100))
        
        &&
        
        // clause: max
        ((tmp_data).pop(), true)));
    
    return(_sahv_res);
}

(Translatable) error messages

Producing error messages (and a human-language specification) from the schema is another compilation process.


$ perl -MData::Sah -E'$sah = Data::Sah->new; $c = $sah->get_compiler("human"); $res = $c->compile(schema => [int => min=>1, max=>100, div_by=>3]); say $res->{result}'
integer, must be divisible by 3, must be at least 1, must be at most 100

The Perl and JS compilers utilize the human compiler to produce their error messages.


$ perl -MData::Sah -E'$sah = Data::Sah->new; $v = $sah->gen_validator([int => min=>1, max=>100, div_by=>3], {return_type=>"str"}); say "$_ => ", $v->($_) for ("x", 2, -3, 3)'
x => Input is not of type integer
2 => Must be divisible by 3
-3 => Must be at least 1
3 => 

Setting the output language can be done easily using the LANG environment variable or the lang argument.

$ LANG=id_ID perl -MData::Sah -E'$sah = Data::Sah->new; $v = $sah->gen_validator([int => min=>1, max=>100, div_by=>3], {return_type=>"str"}); say "$_ => ", $v->($_) for ("x", 2, -3, 3)'
x => Masukan tidak bertipe bilangan bulat
2 => Harus dapat dibagi oleh 3
-3 => Harus minimal 1
3 => 

A custom error message can also be specified in the schema to override the default message.
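
To picture how one schema can yield default messages in several languages, here is a minimal sketch. It is an assumption-laden toy, not how Data::Sah implements translation: the catalog, the helper name first_error_sketch, and the restriction to the div_by and min clauses are all made up for illustration. The phrases themselves are taken from the output shown above, and the input is assumed to already be an integer.

```perl
use strict;
use warnings;

# Toy message catalog: one sprintf template per clause, per language.
my %MESSAGES = (
    en_US => { div_by => 'Must be divisible by %s',    min => 'Must be at least %s' },
    id_ID => { div_by => 'Harus dapat dibagi oleh %s', min => 'Harus minimal %s'    },
);

# Hypothetical helper: return the first error message for $data in the
# requested language, or an empty string if $data passes both clauses.
sub first_error_sketch {
    my ($lang, $clauses, $data) = @_;
    my $msg = $MESSAGES{$lang} or die "no messages for language '$lang'";
    return sprintf($msg->{div_by}, $clauses->{div_by})
        if exists $clauses->{div_by} && $data % $clauses->{div_by} != 0;
    return sprintf($msg->{min}, $clauses->{min})
        if exists $clauses->{min} && $data < $clauses->{min};
    return '';
}
```

The clause checks are written once; only the template looked up for the failing clause depends on the language.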

Summary

The Data::Sah module is still early in development. Many clauses are not yet implemented and some functionality is still missing, such as support for schemas that contain expressions, functions, and other definitions. But I'm already using it in the Perinci framework: I write the schemas for function arguments once, in the function metadata, and the same schemas are then used both to validate function arguments and to generate usage information.

4 Comments

Hi

You have 'int*' as a shortcut for ['int', req => 1].

I use ' rather than " because the latter is optically dense.

I would argue that's a mistake, since your use of '*' clashes with '*' used in regexps.

Just make ['int'] mean ['int', required => 1]. I.e. make required the default.

And yes, don't abbreviate. That style of programming went out (or should have) 40 years ago :-).

Cheers
Ron

Fwiw Perl 6 uses a trailing ! for required parameters, ? for optional ones: http://feather.perl6.nl/syn/S06.html#Required_parameters

About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on PerlMonks. My Twitter is stevenharyanto (but I don't tweet much). Follow me on GitHub: sharyanto.