Posting utf8 data using LWP::UserAgent

Yesterday we had some trouble setting up a client that was supposed to post some utf8 JSON to a web API. The problem was that the data kept showing up as latin-1. After lots of "fun" with hexdump, wireshark, the debugger and Devel::Peek, we verified that we were in fact passing properly utf8 encoded JSON to LWP::UserAgent, like so:

    my $response = $ua->post(
        $self->api_base_url . '/' . $args->{action},
        'Content-type'   => 'application/json;charset=utf-8',
        Content          => $args->{data},
    );

Still we didn't receive utf8 on the server.

After adding encode_utf8 in random places, it worked. So I consulted Google and found a posting with this advice:

If you want the string UTF8 encoded, then say so explicitly:

    $req = HTTP::Request->new(POST => $endpoint);
    $req->content_type("text/plain; charset='utf8'");
    $req->content(Encode::encode_utf8($utf8));

It seems that HTTP::Request, when stitching together the actual text representation of the request, will downgrade the content to latin1, no matter what you say in Content-Type, unless one explicitly calls encode_utf8 on the content.
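To make that concrete, here is a minimal sketch (placeholder URL; it assumes the downgrade behaviour described above and a reasonably recent HTTP::Request and Encode) showing the difference between handing over a decoded character string and an explicitly encoded byte string:

    use strict;
    use warnings;
    use Encode qw(encode_utf8);
    use HTTP::Request;

    my $chars = "L\x{e9}on";    # decoded character string, latin-1 range only

    my $req = HTTP::Request->new(POST => 'http://example.com/api');
    $req->content_type('application/json;charset=utf-8');

    $req->content($chars);                    # ends up as latin-1 bytes
    printf "latin-1: %vd\n", $req->content;   # 76.233.111.110

    $req->content(encode_utf8($chars));       # explicit utf-8 octets
    printf "utf-8:   %vd\n", $req->content;   # 76.195.169.111.110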

Now that I think about it again, I guess we fell into the trap that Perl uses something very much like utf8 for its internal string representation (which is why we can get the utf8 encoded data out of the database, send it to the client, and it gets displayed correctly). But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.
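The core of the confusion is the difference between a decoded character string and its utf8 encoded byte representation; a tiny sketch (my own illustration, nothing specific to our code):

    use Encode qw(encode_utf8);

    my $chars = "\x{263A}";            # one character (WHITE SMILING FACE)
    my $bytes = encode_utf8($chars);   # the same thing as three utf-8 octets

    print length($chars), "\n";        # 1 -- counted in characters
    print length($bytes), "\n";        # 3 -- counted in bytes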

BTW, the working code we're using now looks like this:

    # encode_utf8() is exported by the core Encode module
    my $response = $ua->post(
        $self->api_base_url . '/' . $args->{action},
        'Content-type'   => 'application/json;charset=utf-8',
        Content          => encode_utf8($args->{data}),
    );

12 Comments

Hmm?

---------------

#!/home/ben/software/install/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use utf8;
my $utf8 = 'さる';
my $ua = LWP::UserAgent->new ();
my $response =
$ua->post('http://mikan/b/cgi/cgi.cgi',
'Content-type' => 'application/json;charset=utf-8',
Content => $utf8,
);
print $response->decoded_content ();

----------

$ ./test-lwp.pl
HTTP::Message content must be bytes at /home/ben/software/install/lib/perl5/site_perl/5.10.0/HTTP/Request/Common.pm line 91

> But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.

It's not "sometimes" - you should always decode in the input (when reading files, databases or web request data) and encode in the output (writing files, databases or web response). It's that simple.

HTTP::Request, in your use case, is presumably the output side, so you have to explicitly encode the strings into utf-8.

However, as the other commenter says, if your JSON data were a decoded string you should be getting an error, but it seems you aren't. That means your JSON data is already encoded (like you say in the post), so you're sending doubly encoded UTF-8 to the server. I wonder if the server-side decoder does something evil or /too smart/ to hide that bug.
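Double encoding looks roughly like this (a contrived example, not your actual data):

use Encode qw(encode_utf8);

my $chars = "\x{e9}";               # the character é
my $once  = encode_utf8($chars);    # C3 A9       -- correct utf-8
my $twice = encode_utf8($once);     # C3 83 C2 A9 -- doubly encoded mojibake

printf "once:  %vX\n", $once;
printf "twice: %vX\n", $twice;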

If I remove the line

use utf8;

from the example code which I posted, I get back the output (the UTF-8 encoded Japanese text さる) without problems (this cgi.cgi is just something which reflects back its standard input). If I add a call to encode_utf8 in there, I get garbage. If I use both use utf8 and encode_utf8, I get the output back without problems.


> I looked at its source, and HTTP::Message calls utf8::downgrade on the content, which explains why my utf8 string was downgraded to latin1.

Right, but utf8::downgrade does NOT do anything if the target strings are already encoded. Try this:


use Encode;
my $str = "\x{30c6}\x{30b9}\x{30c8}"; # test in Unicode
my $bytes = encode_utf8($str); # utf-8 encoded bytes
my $copy = $bytes; # copy it
utf8::downgrade($copy);
print "OK\n" if $bytes eq $copy;

> I still don't quite get what is going on here, even though I have quite a good understanding of encodings, utf8 etc (or so I thought...)

What happens here is that your JSON data is an upgraded Unicode string whose characters all happen to be in the latin-1 range. Because they're in the latin-1 range, HTTP::Message successfully downgrades them to latin-1, which is unfortunate.

However it should NOT be considered a bug in HTTP::Message. The problem is on your side (of course!): your JSON data holds upgraded (decoded) strings, when it should hold encoded byte strings instead.

I'm curious how you generate that JSON data, but it could be reproduced with something like:


use Encode;
use JSON;
my $json = JSON::encode_json({ foo => "L\x{e9}on" }); # utf-8 encoded octets
$json = decode_utf8($json);                           # accidentally decoded back into characters

Maybe your JSON generating code reads the JSON data from a file with binmode on it, or reads JSON saved in an XML file with encoding="utf-8", or you're generating the JSON using Template-Toolkit. I don't know.

Anyway, now $json *happens to* contain utf8 octet strings, and you take that as "it is correctly encoded in utf-8". That's wrong. It's just a side effect of how perl stores its data internally as utf-8: the upgraded strings are merely marked as possibly containing high-bit characters, which in this case are actually all less than 255 (aka in the latin-1 range).

So, calling encode_utf8 on $json makes it *correctly* encoded into utf-8 octets. Nothing's wrong with your end result, but the actual problem is that your generated JSON data is in a bad state in the first place.
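A quick way to see that state (again a contrived example, building on the snippet above) is to check the internal flag with utf8::is_utf8:

use Encode qw(encode_utf8 decode_utf8);
use JSON;

my $json = decode_utf8( JSON::encode_json({ foo => "L\x{e9}on" }) );

print utf8::is_utf8($json)   ? "flag on\n" : "flag off\n";   # on:  upgraded string
my $octets = encode_utf8($json);
print utf8::is_utf8($octets) ? "flag on\n" : "flag off\n";   # off: real octets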


It's not "sort of broken", the contents of the string $y in the case

use utf8;
my $x = 'さる';
my $y = encode_utf8 $x;

are the same as in $z in

no utf8;
my $z = 'さる';

Try it:

#!/home/ben/software/install/bin/perl
use warnings;
use strict;
use Encode;
use utf8;
my $x = 'さる';
my $y = encode_utf8 $x;
no utf8;
my $z = 'さる';
if ($y eq $z) {
    print "same\n";
}
else {
    print "different\n";
}
[ben@mikan] ~ 527 $ ./moo.pl 
same

Anyway, what Tatsuhiko Miyagawa said is correct:


It's not "sometimes" - you should always decode in the input (when reading files, databases or web request data) and encode in the output (writing files, databases or web response). It's that simple.

You should probably consult the Moose developers about the problem you have. JSON is data, so it should be encoded into utf-8 bytes instead of character strings.

This gotcha often happens in web API programming, since creating URIs and HTTP requests requires some *data*, which should explicitly be encoded into utf-8 or other encodings.
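For URIs the same idea applies; a small sketch (hypothetical query string, assuming URI::Escape is installed):

use utf8;
use URI::Escape qw(uri_escape_utf8);

my $term  = 'さる';                          # character string
my $query = 'q=' . uri_escape_utf8($term);   # q=%E3%81%95%E3%82%8B
print "$query\n";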

I have a similar problem. I use the JSON module from CPAN along with LWP. It was a bit confusing, because it works if I use latin1 in the JSON, but the problem seems to be that I must tell the perl parser with the utf8 pragma that I want to use utf8 in my source code...

This works (maybe it helps someone...)


use utf8;
use JSON;
use LWP::UserAgent;

$requestValues->{jsonrpc} = "2.0";
$requestValues->{params}{text} = "ÄÜÖ";
...
$ua = LWP::UserAgent->new;
...
$req->content_type('application/json; charset=utf-8');
$req->content(encode_json($requestValues));
...
or
...
$req->content(JSON->new->utf8->encode($requestValues));
...

