March 2016 Archives

YAML patches done

I've finished most of the needed YAML work, esp. for YAML::XS.

libyaml upstream now has a patch with the new options NonStrict and IndentlessMap: https://github.com/yaml/libyaml/pull/8

The YAML::XS part is in https://github.com/ingydotnet/yaml-libyaml-pm/pull/43

YAML::XS now writes proper YAML, which can be read with YAML.pm and passes the CPAN::Meta validation tests. See https://github.com/Perl-Toolchain-Gang/CPAN-Meta/pull/107

For CPAN::Meta I've added validation tests for all existing YAML loaders, so you can see what's going on, and which version conforms or fails. YAML::XS, YAML::Tiny, CPAN::Meta::YAML and YAML::Syck now pass; YAML fails.

I also have patches for Parse::CPAN::Meta and CPAN-Meta-YAML to use the new versions, but only in cperl, no PR yet.

I've also started merging libsyck from upstream into YAML::Syck. YAML::Syck had come up with some horrible private extensions, which I made mergeable upstream. This work is still ongoing at https://github.com/rurban/syck/commits/0.71 and https://github.com/rurban/YAML-Syck/commits/merge-upstream.

TODO (optional)

  • YAML.pm would need a patch to read old YAML::XS with $IndentlessMap=1.
  • YAML::Syck
  • syck

More on YAML, syck looks much better

In my last YAML post I said libsyck is not maintained anymore. I had a look, and this is wrong. Even though _why does not work on it anymore (he recently came back, btw), libsyck is maintained and has made some progress, which is not reflected in the perl part, YAML::Syck.

It is a mess, I admit, but easier to fix than the YAML::XS mess. So I took libsyck upstream, which is at 0.70, and merged it with our changes, which are at 0.61. Our perl-specific changes are a complete mess, so I cleaned them up into a new 0.71 that should be acceptable upstream.

Merge back various changes from upstream (my own WIP version 0.71):

  • const char*, recompiled grammar, ...
  • alloc +1 for the final \0
  • add proper type casts

Sanitize various unmergeable hacks into proper flags, which can be set perl-specific:

  • add emitter->nocomplexkey flag, default=0, 1 for perl.
  • rename syckemit2quoted1 to syckemit1quotedesc
  • and rename scalar2quote1 to scalar1quoteesc (JSON singlequote as single-quoted with dq-like escapes)

Remove some other unmergeable hacks:

  • syck_base64enc requires an ending \n

YAML::Syck has many advantages over YAML::XS. It supports reading from and writing to file streams, which means it does not need to slurp each file into a buffer and process that. It can process streamable buffers. libyaml can do that too, but YAML::XS never implemented it. I only added a LoadFile method, but not DumpFile.

YAML::XS doesn't really use the nice architecture libyaml provides. Instead it does its own perl-specific callbacks, bypassing many advantages of libyaml.

libsyck is much better written than libyaml, no question about that. It has far fewer bugs and many more options, but it got stuck at YAML 1.1. Does anybody really need YAML 1.2? I haven't checked the changes yet.

My changes (still WIP) are at:

So now I'm pondering whether to convince everybody to ditch YAML and YAML::XS completely in favor of YAML::Syck. Let's see how this will turn out... In fact it's only a tiny patch to CPAN, and I can do that on my own, since CPAN is in core.

My core integration for YAML::XS is at:

What I need now is a good YAML test suite which merges the validators required by core (CPAN::Meta) and various interop tests as I did with Cpanel::JSON::XS, esp. roundtrips. I also need to add the perl module back to syck to give it into sane hands (this might be tricky as it involves testing with ruby, php, python, ...), do benchmarks and go over the tickets.

What I know is that YAML.pm processing my cpan prefs is ~10x slower than YAML::Syck. The current performance is unacceptable, and so is YAML::XS emitting unindented seq elements for a map child. Maybe I have to fork YAML::XS into a Cpanel::YAML::XS, but most of the fixes need to be done in libyaml itself, so let's see how fixing syck turns out.

On YAML and YAML::XS inconsistencies

Personally I'd prefer YAML over JSON for local config data anytime, even if JSON is secure by default and YAML is insecure. YAML is readable and writable. It's better than .ini, .json and .xml.

But Houston, we have a problem, and have had it for a long time. I'll fix it.

We have the unique advantage that the spec author and maintainer, Ingy, is from the perl world and maintains the two standard libraries: YAML, the PP (pure perl) variant, and YAML::XS, the fast XS variant based on LibYAML.

This would be an advantage if those two libraries agreed on their interpretation and implementation of the specs. They do not.

Historically, the YAML library is used as the default reader for CPAN .yml preferences, and CPAN::Meta::YAML, a fork of YAML::Tiny which is in core, is used to read and write the package META.yml files.

The basic idea is to use the fastest library available and use a PP fallback for systems which don't have the fast variant. perl5 core does not ship a proper fast library for JSON and YAML, so you have to stick to the 10x slower PP variants. cperl will ship with YAML::XS and Cpanel::JSON::XS in core, so there this problem is gone.
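The "fastest available, else fallback" selection can be sketched in a few lines. This is only an illustration in Python, not how CPAN or cperl pick their YAML or JSON backend; the function name `pick_backend` and the module names in the preference list are assumptions made up for this example.

```python
import importlib.util


def pick_backend(candidates):
    """Return the first top-level module name in `candidates` that is installed.

    Returns None if none of them can be found. Only top-level names are
    supported in this sketch.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Hypothetical preference list: an optional fast module first,
# then a module that always ships with the language as the fallback.
backend = pick_backend(["some_fast_parser_that_may_be_missing", "json"])
print(backend)
```

The order of the list is the whole policy: whatever comes first and is installed wins, so shipping the fast variant in core makes the fallback a rare case.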

But we still have the YAML problem:

YAML, the default reader for CPAN, refuses to read .yml files produced by YAML::XS. You can only set YAML::Syck as yaml_module in ~/.cpan/CPAN/MyConfig.pm; using YAML::XS will get you into trouble. But YAML::Syck is not maintained anymore. It was written by _why the lucky stiff, also the author of potion, the VM for p2. It still kinda works, and it behaves better than YAML::XS, but it would be better to replace libsyck by libyaml after all, and get the YAML maintainers to fix their mess.

The fault is in the YAML::XS (i.e. LibYAML) dumper and in the YAML loader.

YAML::XS writes .yml files which YAML cannot read. YAML supports scalars, arrays (called sequences) and hashes (called maps). The current problem is the interpretation of section 6.1, Indentation Spaces, of the current spec version 1.2.

YAML::XS writes the elements of a sequence without indent, while YAML and all other YAML libraries expect an indent.

I.e. YAML::XS writes for {author => ['perl5-porters@perl.org']}

author:
- perl5-porters@perl.org

while all other libraries and the spec insist on at least a space before the -, the seq sibling.

author:
    - perl5-porters@perl.org

"Each node must be indented further than its parent node. All sibling nodes must use the exact same indentation level. However the content of each sibling node may be further indented independently." http://yaml.org/spec/1.2/spec.html#id2777534
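The two output styles differ only in the padding in front of each "- " entry. Here is a minimal sketch of that difference in Python, handling only a map whose values are lists of plain scalars. It is not the YAML::XS or libyaml emitter; the function name and the `indentless` parameter are assumptions for this illustration, loosely named after the IndentlessMap option.

```python
def dump_map_of_seqs(data, indentless=False, indent=4):
    """Emit a mapping of key -> list of plain scalars as block-style YAML.

    indentless=True puts "- item" at the same column as the parent key;
    indentless=False indents every "- item" by `indent` spaces.
    No quoting or escaping is done in this sketch.
    """
    pad = "" if indentless else " " * indent
    lines = []
    for key, items in data.items():
        lines.append(f"{key}:")
        for item in items:
            lines.append(f"{pad}- {item}")
    return "\n".join(lines) + "\n"


meta = {"author": ["perl5-porters@perl.org"]}
print(dump_map_of_seqs(meta, indentless=True), end="")
print(dump_map_of_seqs(meta, indentless=False), end="")
```

The first call prints the unindented form, the second the indented form, both for the same data.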

But in the meantime all other YAML loaders came to accept Ingy's interpretation of the seq indentation level, and do accept the missing seq indent. Only YAML does not. YAML throws a MAP error. This is certainly a YAML loader bug.

Remember that YAML is the default reader in the CPAN config. All it needs to do is to load the yaml, and that is broken.

All this has been known for a long time. Szabo wrote about the inconsistencies, and p5p put a variant of the better YAML::Tiny into core as CPAN::Meta::YAML. This is fine, but in the long run a fast library in core is preferred, and that's what I'm doing for cperl.

So what needs to be done:

  1. Change yaml_module in ~/.cpan/CPAN/MyConfig.pm to either CPAN::Meta::YAML, YAML::Syck or YAML::XS. All these can read those YAML files. YAML cannot, until it's fixed.

  2. Fix YAML::XS to dump seq elements with indentation, as all the other YAML libraries do, and as the spec says. I'm working on that.

  3. Fix YAML to accept seq elements without indentation, to be able to read old YAML::XS files. I'm working on that.

  4. Fix YAML::XS to accept spec-violating elements in a new NonStrict mode, because the other libraries write those elements, and a YAML loader should be optionally non-fatal on illegal control chars, illegal utf-8 characters and such. All other YAML loaders silently replace illegal elements with undef. I'm working on that in https://github.com/ingydotnet/yaml-libyaml-pm/issues/44

Ingy insists that all other libraries are broken and produce wrong YAML. That would be acceptable if the libraries and the spec were at least consistent. They are not. And historically all successful YAML readers are non-fatal.

cpanel_json_xs now has the options yaml, yaml-xs, yaml-tiny, and yaml-syck to use those libraries for reading and writing from the command line. This way you can easily prove the various inconsistencies.

cpanel_json_xs -f yaml -t yaml-xs <META.yml >XSMETA.yml
cpanel_json_xs -f yaml -t yaml    <XSMETA.yml

YAML Error: Invalid element in map
  Code: YAML_LOAD_ERR_BAD_MAP_ELEMENT

And you can try all other variants, which mostly work.

For YAML::XS the following needs to be done: With NonStrict allow character errors (control, unicode), throw a warning, replace by the partial read or undef, and continue parsing. This way you lose data, but NonStrict is optional and a fallback for local configuration files, which are better read partially than not at all. We cannot lose everything on roundtrips.
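The intended behaviour can be sketched for a single scalar. This is an illustration in Python only, not the YAML::XS or libyaml implementation; the function name `load_scalar`, the `non_strict` parameter and the exact set of characters treated as illegal are assumptions for this example, and None stands in for perl's undef.

```python
import re
import warnings

# Assumed "illegal" set for this sketch: C0 controls except tab, LF, CR, plus DEL.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def load_scalar(raw: bytes, non_strict: bool = False):
    """Decode one scalar. Strict mode raises on the first character error.

    Non-strict mode warns, keeps the part read before the error, and
    returns None if nothing usable was read.
    """
    damaged = False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        if not non_strict:
            raise
        warnings.warn(f"invalid UTF-8 at byte {err.start}, keeping partial read")
        text = raw[:err.start].decode("utf-8")  # bytes before the error are valid
        damaged = True
    match = _ILLEGAL.search(text)
    if match:
        if not non_strict:
            raise ValueError(f"control character at offset {match.start()}")
        warnings.warn(f"control character at offset {match.start()}, keeping partial read")
        text = text[:match.start()]
        damaged = True
    if damaged and not text:
        return None
    return text
```

With `non_strict=True` a damaged value degrades to its readable prefix or to None and loading continues, while the default still dies on the first error.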

About Reini Urban

Working at cPanel on cperl, B::C (the perl-compiler), parrot, B::Generate, cygwin perl and more guts, keeping the system alive.