YAML::PP Grant Report August/September 2017
Over the last two months I have worked more than 100 hours on YAML::PP, a project I started in January.
It aims to parse YAML 1.2.
I got the confirmation for the TPF grant application at the end of August, but I had already started logging my work earlier in August, so I'll include that work here.
Some of the things listed here aren't yet in the latest release 0.004, so you might want to check out the code on GitHub.
First I'd like to give a short introduction to why there is a need for a new YAML framework.
Introduction
For details you can always have a look at the YAML Test Matrix at matrix.yaml.io.
YAML.pm
Ingy's YAML.pm was the first implementation, and it was written for YAML 1.0. Although it was never adapted to 1.1 or 1.2, it will parse a lot of simple YAML files.
Because it also has a lot of configuration options, it's used very often.
You can sort keys when dumping, you can set the indentation, and since version 1.20 you can also preserve the original sorting. Since version 1.23, numbers are loaded as "real" numbers instead of strings.
But only a limited subset of YAML can be parsed with it. If you are used to writing YAML 1.1/1.2, you might get errors and have to adjust your YAML until YAML.pm can parse it.
Trailing comments and multiline quoted strings are not supported, and neither are zero-indented sequences like in this example:
---
key1:
- a
- b
- c
key2: string
It would be too much for this article to list everything, but I'd like to write a document at some point that lists the most important incompatibilities.
YAML::XS
Ingy's YAML::XS combines Kirill Simonov's libyaml with XS code to load YAML into a data structure.
The C libyaml library parses YAML and provides a list of events. The XS code takes these events, resolves aliases and tags and creates a data structure out of it. (For dumping, libyaml takes a list of events and creates a YAML stream.)
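To make the split between parser events and data construction more concrete, here is a minimal sketch in Python. It is not the code of libyaml or YAML::XS: the event names and the (name, value) tuple layout are invented for this illustration, and resolving aliases and tags is left out. The events correspond to the zero-indented example from the YAML.pm section.

```python
_NO_KEY = object()  # marker: the current mapping is still waiting for a key


class _Frame:
    """One open collection on the constructor's stack (hypothetical helper)."""

    def __init__(self, container):
        self.container = container
        self.key = _NO_KEY


def construct(events):
    """Turn a flat list of (event_name, value) tuples into nested data."""
    top = _Frame([])  # collects the finished root node
    stack = [top]

    def add(node):
        frame = stack[-1]
        if isinstance(frame.container, list):
            frame.container.append(node)
        elif frame.key is _NO_KEY:
            frame.key = node  # first node of a pair is the key
        else:
            frame.container[frame.key] = node
            frame.key = _NO_KEY

    for name, value in events:
        if name == "scalar":
            add(value)
        elif name == "mapping_start":
            stack.append(_Frame({}))
        elif name == "sequence_start":
            stack.append(_Frame([]))
        elif name in ("mapping_end", "sequence_end"):
            # Close the collection and hand it to its parent.
            add(stack.pop().container)
        else:
            raise ValueError(f"unknown event: {name}")
    return top.container[0]


events = [
    ("mapping_start", None),
    ("scalar", "key1"),
    ("sequence_start", None),
    ("scalar", "a"), ("scalar", "b"), ("scalar", "c"),
    ("sequence_end", None),
    ("scalar", "key2"),
    ("scalar", "string"),
    ("mapping_end", None),
]
data = construct(events)
print(data)  # {'key1': ['a', 'b', 'c'], 'key2': 'string'}
```

The point of the design is that the constructor never looks at YAML text. It only sees events, so the same constructor could sit behind any parser that produces them.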
libyaml is based on YAML 1.1, and since there aren't that many parsers out there using YAML 1.2, and 1.2 is actually not that different from 1.1, you will be able to parse most common YAML documents with it. PyYAML (Python) and lubyk/yaml (Lua) are based on it.
YAML::XS didn't implement the standard tags like !!str, !!map and !!seq, so it will die if your input contains those.
YAML::XS doesn't have as many options as YAML.pm, and since the data constructor is written in C/XS, it's harder to implement them than in YAML.pm.
YAML::Syck
YAML::Syck uses libsyck, which is based on YAML 1.0. If in doubt, you should use YAML::XS instead. There are also some cases where libsyck coredumps.
YAML::Tiny
YAML::Tiny is, as you can guess from its name, supposed to parse only a subset of YAML, so it can't do aliases and tags, for example. For simple YAML, it works well and is fast.
Common problems
YAML is designed to be able to serialize arbitrary objects. YAML.pm, YAML::XS and YAML::Syck will load nodes that have an appropriate tag into classes. YAML.pm and YAML::XS provide no way to disable that, so you should never load untrusted YAML. YAML::Syck lets you deactivate this with $YAML::Syck::LoadBlessed = 0.
PyYAML, for example, has a safe_load function for that.
Because Perl has no booleans, all these modules will load YAML booleans into 0 or 1. That means a plain "true" in YAML will result in a 1 in Perl, which is sufficient for many use cases.
If you dump the loaded data again, only YAML::XS will be able to preserve the booleans by using some special Perl internal, but you can't modify them, and I don't know a way to insert new booleans.
That means you usually cannot create real booleans in YAML output. Also, if you have a schema, for example JSON Schema, that you want to validate your data against, it won't recognize booleans. Jan Henning Thorsen's JSON::Validator does some extra logic to account for that.
JSON::XS/JSON::PP/Mojo::JSON are using JSON::PP::Boolean objects for that.
Enter YAML::PP
My first goal was to see if I'm able to write a parser that can parse YAML 1.2. I've written two parsers so far, HTML::Template::Compiled and Parse::BBCode. While these weren't exactly trivial to write, they were comparatively easy. YAML introduces a couple of new problems. Indentation-based parsing is very different from open/close tag parsing, and YAML has a few other rules that make it more complicated to parse.
While making progress on that, my next goal was to provide an API like libyaml does. YAML::PP::Parser provides events that the constructor takes and turns into data. With that, the new constructor should also be able to work with other parser backends. So at some point I want it to be able to use libyaml as a backend.
I also wanted to have boolean support and support for syntax highlighting.
Work done in August and September
YAML::PP::Emitter
I added YAML::PP::Emitter to YAML::PP.
It takes events from the Dumper. It's already able to emit all kinds of structures. The one exception is block scalars, because they have some special rules. The format (layout) is mostly like that of libyaml/YAML::XS::Dump.
YAML::PP::Dumper
I added YAML::PP::Dumper which creates events for the Emitter.
It can dump arrays, hashes, strings and booleans (via JSON::PP::Boolean). Dumping of strings is very simple so far. Anything not matching a-zA-Z0-9.- will be double-quoted.
It also dumps references as aliases and can dump cyclic structures.
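The simple quoting rule for strings can be sketched in a few lines. This is a Python illustration, not the Dumper's Perl code, and it rests on two assumptions of mine: that "not matching a-zA-Z0-9.-" means the whole string must consist of one or more of these characters to stay plain, and that JSON string escaping is an acceptable stand-in for YAML double-quoted escaping. The name dump_scalar is made up.

```python
import json
import re

# Assumption: a string stays unquoted only if it consists entirely of
# one or more characters from a-zA-Z0-9.-
PLAIN = re.compile(r"[a-zA-Z0-9.-]+")


def dump_scalar(string):
    """Return the string as a plain scalar if possible, else double-quoted."""
    if PLAIN.fullmatch(string):
        return string
    # Stand-in for YAML double-quoting: JSON escapes quotes, backslashes
    # and control characters in a compatible way for common input.
    return json.dumps(string)


for value in ["abc-1.0", "hello world", "a: b", ""]:
    print(dump_scalar(value))
```

With this rule, anything containing a space, a colon or any other special character ends up quoted, and so does the empty string. That is more quoting than strictly necessary, but it is safe.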
Many little bugfixes
I fixed many little bugs in the parser, too many to list here in detail.
Unicode
Input can now also contain literal Unicode characters.
Refactoring
Because I actually learned YAML while implementing it, I did a lot of refactoring.
Fun fact:
% git log | grep -i refactor | wc -l
34
I still need to refactor more to be able to implement some features.
For example, flow style has some special cases:
---
[a, b, c]: value
This is a block mapping which has a flow sequence as a key. Another example is:
---
{ [a, b, c]: value }
Taking the first example, while parsing I first get a sequence. Once I see the colon, I know that this is actually a mapping key, so before firing the sequence events, I have to insert a mapping event. This means I have to save events somewhere and can only fire them when I know whether it is actually a mapping key or not.
This is mostly theoretical for Perl, since Perl hashes can only have strings as keys. Still, I have to implement it, and it doesn't make sense to implement flow style without taking this into account.
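The idea of holding back events can be sketched like this. It is a toy in Python, not how YAML::PP does or will do it: the input is already split into tokens, only a single top-level flow sequence with an optional ": value" is handled, and the function name and event names are invented for the example.

```python
def parse_line(tokens):
    """Parse '[a, b, c]' optionally followed by ': value' into events."""
    if not tokens or tokens[0] != "[":
        raise ValueError("expected a flow sequence")

    # Collect the sequence events, but do not fire them yet.
    buffered = [("sequence_start", None)]
    pos = 1
    while pos < len(tokens) and tokens[pos] != "]":
        if tokens[pos] != ",":
            buffered.append(("scalar", tokens[pos]))
        pos += 1
    if pos == len(tokens):
        raise ValueError("missing closing ]")
    buffered.append(("sequence_end", None))
    pos += 1

    events = []
    if pos < len(tokens) and tokens[pos] == ":":
        if pos + 1 >= len(tokens):
            raise ValueError("missing value after ':'")
        # Only now do we know the sequence was a mapping key, so the
        # mapping event has to go in front of the buffered events.
        events.append(("mapping_start", None))
        events.extend(buffered)
        events.append(("scalar", tokens[pos + 1]))
        events.append(("mapping_end", None))
    else:
        events.extend(buffered)
    return events


for event in parse_line(["[", "a", ",", "b", ",", "c", "]", ":", "value"]):
    print(event)
```

Without the trailing colon, the same buffered events are fired unchanged, which is why they cannot be emitted while the sequence is still being read.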
YAML::PP::Lexer
Before parsing, the YAML::PP::Lexer turns the input into tokens like ANCHOR, ALIAS, SCALAR, DOUBLEQUOTED etc. This makes parsing a bit easier.
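As a rough illustration of what such a token stream looks like, here is a tiny regex-based tokenizer in Python. It is only a sketch and not the YAML::PP::Lexer: it handles a single line without a trailing newline, the patterns are far simpler than real YAML requires, and the COLON and WS token names are my own additions next to the ANCHOR, ALIAS, SCALAR and DOUBLEQUOTED names mentioned above.

```python
import re

# Order matters: the first alternative that matches wins.
TOKEN_SPEC = [
    ("ANCHOR", r"&\S+"),
    ("ALIAS", r"\*\S+"),
    ("DOUBLEQUOTED", r'"(?:[^"\\]|\\.)*"'),
    ("COLON", r":"),          # hypothetical token name
    ("WS", r"[ \t]+"),        # hypothetical token name
    ("SCALAR", r"[^\s:]+"),   # very simplified plain scalar
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(line):
    """Split one line into (token_name, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at column {pos}: {line[pos:]!r}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize('key: &a "some text"'):
    print(token)
```

Because every piece of input ends up in exactly one token, whitespace included, joining the token texts gives back the original line. That property is what makes a token stream usable for syntax highlighting.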
YAML::PP::Highlight
With the new Lexer, I am able to create syntax highlighted YAML. The distribution contains a little script that outputs ANSI colored YAML. YAML::PP::Highlight can also create HTML.
All test cases as HTML
At YAML-PP-p5/test-suite.html you can see all test cases from YAML Test Suite, highlighted with YAML::PP::Highlight, and run through the Loader and Dumper.
YAML::PP::Reader / Load Files
You can now also load files in addition to strings. While strings must be utf-8 decoded, files will be opened with utf-8 decoding automatically.
This is not on CPAN yet.
Line Numbers
YAML::PP now keeps track of line numbers and will show them in error messages. They might be incorrect in some cases though.
yaml-test-suite
I added 30 tests to yaml-test-suite and fixed or added data to existing tests.
TPC in Amsterdam
I did a 45 minute talk on "The State of the YAML" in August at TPCiA. It contained a lot of theory on the edge cases. Feedback showed that it was probably a bit too much theory. If you're interested in some weird YAML examples, you can find the slides here.
I'm thinking of doing a more practical talk at the London Perl Workshop this year.
YAML::XS
I applied a pull request from vti++ which fixes a bug in YAML::XS::Dump. The Dumper was modifying original data, converting numbers to strings.
I added a test for this.
Future
As explained above, flow style needs to be implemented. I have some ideas in my head on how to do it.
A long term goal is to be able to round-trip a complete YAML stream including comments. The only framework that I'm aware of that can do this is Ruamel (Python). I haven't tested it yet.
Tags and loading of objects need to be implemented. I have some ideas on that. I don't just want a choice to load or not load objects. This should be highly configurable. Also the user should be able to choose if they want to load the Failsafe, the JSON or the Core Schema.
Thanks
Ingy made some useful suggestions about the API when we were at the Toolchain Summit in Lyon this year. So I split the Loader into Loader and Constructor. And the parser doesn't read from a string, but gets a Reader object and simply calls readline on it. This way you can easily add a file reader or any reader you wish, for example if you want to read from a socket.
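The reader idea, sketched in Python rather than Perl: the consumer only ever calls readline, so anything that offers that method will do. The StringReader class and the count_document_starts function are made up for this example and are not part of YAML::PP. The function is a trivial stand-in for a parser.

```python
import io


class StringReader:
    """Hypothetical reader that hands out a string line by line."""

    def __init__(self, text):
        self._lines = text.splitlines(keepends=True)
        self._index = 0

    def readline(self):
        if self._index >= len(self._lines):
            return ""  # empty string signals end of input, as file objects do
        line = self._lines[self._index]
        self._index += 1
        return line


def count_document_starts(reader):
    """Stand-in for a parser: its only access to input is reader.readline()."""
    count = 0
    while True:
        line = reader.readline()
        if line == "":
            break
        if line.rstrip("\n") == "---":
            count += 1
    return count


text = "---\na: 1\n---\nb: 2\n"
print(count_document_starts(StringReader(text)))  # 2
print(count_document_starts(io.StringIO(text)))   # 2
```

The second call works without any adapter because io.StringIO already has a readline method. A file handle or a socket wrapper would fit in the same way.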
Felix "flyx" Krause, author of the quite complete NimYAML parser, tirelessly answers questions about the specification on IRC.
Thanks also to all who told me in person or on the TPF grant application site that they endorse this work and think I can do it.
Help!
You can help in various ways.
If you have YAML data that you think should be valid, but YAML::PP can't parse it (or the other way around), please create an issue or send it to me (of course, you have to take into account the features not implemented yet).
You can simply do:
% yamlpp5-highlight < file.yaml
# ANSI colored YAML
% yamlpp5-load < file.yaml
# Data::Dumper output
% yamlpp5-load-dump < file.yaml
# Load and Dump back into YAML
% yamlpp5-events < file.yaml
# Show parsing events in yaml-test-suite format
I'm happy to receive suggestions and comments on the API.
If you want to know more or join development, join #yaml on irc.perl.org, or for general YAML questions, on irc.freenode.net.
Great work, thanks for your report. I hope your module will be able to parse the 1.2 specification. Could you estimate a date for that?
Of course that depends on how much time I get to work on this in the future.
If it goes as expected, I would say before Christmas =)