Pure-Perl XML

In the past I sometimes used XML::Tiny and found it perfect for the job. Admittedly, I only had to deal with small XML documents that were under my control, so I knew I could do without a full-fledged XML parser.

On the flip side, this rating by Aristotle has always bugged me. I respect Aristotle's opinion a lot, so it was enough to make me look for alternatives... just in case XML::Tiny failed me (which hasn't happened so far, anyway).

I've been quite disappointed by the suggested XML::Parser::Lite, though. Here's a little example script condensing my findings:

#!/usr/bin/env wrapperl
use strict;
use warnings;

use XML::Parser::Lite;
use XML::Parser;
use XML::Tiny ();

my $xml = <<'END';
<?xml version="1.0"?>
<what>
   <ever>&lt;<![CDATA[&foo <=> &bar]]>&gt;</ever>
</what>
END

$" = '], [';
my ($what_ever, $collect);
my %handler_for = (
   Init => sub { $collect = $what_ever = ''; },
   Start => sub { $collect = ($_[1] eq 'ever'); },
   Char => sub { $what_ever .= $_[1] if $collect; },
   End => sub { $collect = '' },
);

print "perl $]\n";
print "XML----------------\n$xml-------------------\n";

for my $class (qw< XML::Parser XML::Parser::Lite >) {
   my $version = do {
      no strict 'refs';
      ${$class . '::VERSION'};
   };
   print "$class $version\n";
   my $parser = $class->new();
   $parser->setHandlers(%handler_for);
   $parser->parse($xml);
   print "   what/ever => [$what_ever]\n";
} ## end for my $class (qw< XML::Parser XML::Parser::Lite >)

open my $fh, '<', \$xml or die "$!";
my $doc = XML::Tiny::parsefile($fh);
print "XML::Tiny $XML::Tiny::VERSION\n";
print "   what/ever => [$doc->[0]{content}[0]{content}[0]{content}]\n";

(If you're wondering what that thing in the hash-bang is, you can read about it here.)

I threw XML::Parser in just to have a control group. Let's run it:

perl 5.018001
XML----------------
<?xml version="1.0"?>
<what>
   <ever>&lt;<![CDATA[&foo <=> &bar]]>&gt;</ever>
</what>
-------------------
XML::Parser 2.44
   what/ever => [<&foo <=> &bar>]
XML::Parser::Lite 0.721
   what/ever => [&lt;&gt;]
XML::Tiny 2.06
   what/ever => [<&foo <=> &bar>]

So, it seems that CDATA sections aren't handled well by XML::Parser::Lite: the CDATA content is missing from the output, and the entities around it come back undecoded. This is a bit surprising, given that the module is presented as the implementation of a complete XML parser.

The module is based on this article from 1998, which seems to support CDATA (at least on a shallow inspection). Maybe the translation into Perl code failed at some point?

Update 1 (2016-01-24 09:02:25): Looking at the big regexp in XML::Parser::Lite, it matches the CDATA section but simply discards it. Compare the following lines:

my $CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
#...
my $PI_CE = "($Name(?:$PI_Tail))>(?{${package}::_xmldecl(\$5)})";

While there is a callback in $PI_CE to call _xmldecl(), there is no callback in $CDATA_CE, so the regexp accepts the CDATA section but just throws it away. XML::Parser::LiteCopy seems to address this via a CData handler.
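To make the mechanism concrete, here is a minimal sketch of how a regexp branch with an embedded (?{ ... }) code block reports what it matched, while a branch without one consumes its input silently. This is a deliberately simplified tokenizer written for illustration, not the actual XML::Parser::Lite regexp: the function names, the event labels and the patterns themselves are all assumptions of mine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Simplified stand-in for a regexp-driven tokenizer. This is NOT the
# XML::Parser::Lite regexp; names and event labels are hypothetical.
our @events;

# The CDATA branch matches but carries no code block: nothing is recorded.
sub tokens_dropping_cdata {
   my ($xml) = @_;
   @events = ();
   1 while $xml =~ m{
        <\?(\w+)[^?]*\?>        (?{ push @events, 'pi:'   . $^N })
      | <!\[CDATA\[(.*?)\]\]>
      | <(/?\w+)>               (?{ push @events, 'tag:'  . $^N })
      | ([^<]+)                 (?{ push @events, 'char:' . $^N })
   }gxs;
   return @events;
}

# Same pattern, but the CDATA branch now has its own code block.
sub tokens_keeping_cdata {
   my ($xml) = @_;
   @events = ();
   1 while $xml =~ m{
        <\?(\w+)[^?]*\?>        (?{ push @events, 'pi:'    . $^N })
      | <!\[CDATA\[(.*?)\]\]>   (?{ push @events, 'cdata:' . $^N })
      | <(/?\w+)>               (?{ push @events, 'tag:'   . $^N })
      | ([^<]+)                 (?{ push @events, 'char:'  . $^N })
   }gxs;
   return @events;
}

my $sample = '<?xml version="1.0"?><ever>&lt;<![CDATA[&foo <=> &bar]]>&gt;</ever>';

print 'without callback: ', join(', ', tokens_dropping_cdata($sample)), "\n";
print 'with callback:    ', join(', ', tokens_keeping_cdata($sample)),  "\n";
```

In the first run the CDATA section produces no event at all, so only the two undecoded entities survive as character data, much like the XML::Parser::Lite result above. In the second run the extra code block is enough to surface the CDATA content as an event of its own.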

10 Comments

Have you considered and/or tried Mojo::DOM? If you have decided against it can I ask why?

It really is too bad that people don't realize how small the Mojolicious namespace (the web framework) is even relative to the Mojo namespace (the web toolkit), which is pretty tiny itself.

When fat packing is the goal then yes you may indeed be right. More often the complaint is simply "why would I want to pull in a web framework when I want to parse xml?" to which my answer is "why not".

To your other point. I'm sorry to hear that you weren't happy with your experience with asking a question. Like all communities we have our up and down days. I hope you'll try is again sometime if you need help, I personally as well as the community as a whole, have been trying to make sure the community id's welcoming.

And those typos are what I get for trying to comment from my phone :p

I should point out that I wrote XML::Tiny with the explicit aim of annoying and trolling the hell out of people who take the religion of XML far too damned seriously.

It worked.

> I really can't see it as a general purpose library of functions, I didn't see it advertised as such anywhere

While this has always been a goal of the Mojo namespace, perhaps we haven't been clear/loud enough about it. We have made several changes in our documentation to reflect this, most prominently and explicitly in our official mission statement seen here at http://mojolicious.org/perldoc/Mojolicious/Guides/Contributing#Mission-statement but also sprinkled elsewhere.


About Flavio Poletti

I blog about Perl.