Posting utf8 data using LWP::UserAgent
Yesterday we had some trouble setting up a client that should post some utf8 JSON to a web API. The problem was that the data kept showing up as latin-1. After lots of "fun" with hexdump, wireshark, the debugger and Devel::Peek, we verified that we were in fact passing properly utf8-encoded JSON to LWP::UserAgent like so:
my $response = $ua->post(
    $self->api_base_url . '/' . $args->{action},
    'Content-type' => 'application/json;charset=utf-8',
    Content        => $args->{data},
);
Still we didn't receive utf8 on the server.
After adding encode_utf8 in random places, it worked. So I dug into Google and found this posting containing this advice:
If you want the string UTF8 encoded, then say so explicitly:
$req = HTTP::Request->new(POST => $endpoint);
$req->content_type("text/plain; charset='utf8'");
$req->content(Encode::encode_utf8($utf8));
It seems that HTTP::Request, when stitching together the actual text representation of the request, will downgrade the content to latin1, no matter what you say in Content-Type, unless one explicitly calls encode_utf8 on the content.
Now that I think about it again, I guess we've fallen into the trap that Perl uses something very much like utf8 for its internal string representation (which is why we can get the utf8-encoded data out of the database, send it to the client, and it gets displayed correctly). But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.
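The trap can be made visible with utf8::is_utf8 and Encode (a minimal sketch; the テスト string is just an example):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $chars = "\x{30c6}\x{30b9}\x{30c8}";   # "テスト": three *characters*
my $bytes = encode_utf8($chars);          # the same text as nine UTF-8 *bytes*

print length($chars), "\n";               # 3
print length($bytes), "\n";               # 9

# The internal UTF-8 flag tells the two states apart:
print utf8::is_utf8($chars) ? "chars: flagged\n" : "chars: plain\n";
print utf8::is_utf8($bytes) ? "bytes: flagged\n" : "bytes: plain\n";
```

Both strings hold "the same text", but only the second one is safe to hand to something that expects bytes on the wire.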
BTW, the working code we're using now looks like this:
my $response = $ua->post(
    $self->api_base_url . '/' . $args->{action},
    'Content-type' => 'application/json;charset=utf-8',
    Content        => encode_utf8($args->{data}),
);
Hmm?
---------------
#!/home/ben/software/install/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use utf8;
my $utf8 = 'さる';
my $ua = LWP::UserAgent->new ();
my $response =
$ua->post('http://mikan/b/cgi/cgi.cgi',
'Content-type' => 'application/json;charset=utf-8',
Content => $utf8,
);
print $response->decoded_content ();
----------
$ ./test-lwp.pl
HTTP::Message content must be bytes at /home/ben/software/install/lib/perl5/site_perl/5.10.0/HTTP/Request/Common.pm line 91
> But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.
It's not "sometimes" - you should always decode on input (when reading files, databases or web request data) and encode on output (when writing files, databases or web responses). It's that simple.
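A minimal sketch of that rule, using Encode directly (the byte string below is the UTF-8 encoding of さる):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Input boundary: decode incoming bytes into a character string.
my $bytes_in = "\xe3\x81\x95\xe3\x82\x8b";   # the six UTF-8 bytes of "さる"
my $text     = decode_utf8($bytes_in);
print length($text), "\n";                   # 2 characters

# ...everything in between works on characters...

# Output boundary: encode back into bytes just before writing.
my $bytes_out = encode_utf8($text);
print length($bytes_out), "\n";              # 6 bytes again
```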
HTTP::Request, in your use case, is the output, so you have to explicitly encode the strings into utf-8.
However, as the other commenter says, if your JSON data were a decoded string you should be getting an error, but it seems you don't. That means your JSON data is already encoded (like you say in the post), so you're sending doubly encoded UTF-8 to the server. I wonder if the server-side decoder does evil things or something /too smart/ to hide that bug.
Sorry, I don't quite get what you're trying to say. To me it seems that you're reproducing the problem I described.
But please also see my reply to miyagawa.
I know that I always have to en/decode at the borders of my programs. But I thought that my data was already properly encoded.
When I first added the final encode_utf8, I expected to see double-encoded data, but I didn't.
HTTP::Message says: Note that the content should be a string of bytes. Strings in perl can contain characters outside the range of a byte. The Encode module can be used to turn such strings into a string of bytes.
I looked at its source, and HTTP::Message calls utf8::downgrade on the content, which explains why my utf8 string was downgraded to latin1.
I still don't quite get what is going on here, even though I have quite a good understanding of encodings, utf8 etc. (or so I thought...)
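What utf8::downgrade does (and when it fails) can be checked directly; its optional second argument makes a failed downgrade return false instead of dying:

```perl
use strict;
use warnings;

# Characters <= 255 survive a downgrade to the latin-1 representation:
my $latin1 = "caf\x{e9}";          # "café", all codepoints fit in one byte
utf8::upgrade($latin1);            # force the internal UTF-8 representation
print utf8::downgrade($latin1, 1) ? "downgraded\n" : "failed\n";   # downgraded

# Characters > 255 cannot be represented in latin-1:
my $wide = "\x{30c6}";             # "テ"
print utf8::downgrade($wide, 1) ? "downgraded\n" : "failed\n";     # failed
```

This is exactly the latin-1 "smoking gun" from the original post: content in the latin-1 range downgrades silently; wide characters make HTTP::Message die instead.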
If I remove the line use utf8; from the example code which I posted, I get back the output (the UTF-8 encoded Japanese text さる) without problems (this cgi.cgi is just something which reflects back its standard input). If I add an encode_utf8 call in there, I get garbage. If I turn on both the use utf8 and the encode_utf8, I get the output back without problems.
OK, I think if you include utf8 characters in your source code without doing use utf8, Perl will treat them as bytes, not characters. As HTTP::Message only accepts bytes (not characters), this works. But it is sort of broken, because you shouldn't use utf8 in your source code without use utf8.
> I looked at its source, and HTTP::Message calls utf8::downgrade on the content. Which explains why my utf8 string was downgraded to latin1.
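The use utf8 pragma is lexically scoped, so its effect on literals can be seen in a single file (a sketch; the \x escapes spell out the raw source bytes perl sees without the pragma):

```perl
use strict;
use warnings;

# Without "use utf8", perl reads a UTF-8 literal in the source byte by byte:
my $as_bytes = "\xe3\x81\x95\xe3\x82\x8b";   # the raw source bytes of "さる"
print length($as_bytes), "\n";                # 6 (bytes)

{
    use utf8;    # lexically scoped: literals in this block are characters
    my $as_chars = "さる";
    print length($as_chars), "\n";            # 2 (characters)
}
```

The 6-byte version "works" with HTTP::Message only by accident: it already happens to be the byte string the wire needs.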
Right, but utf8::downgrade does NOT do anything if the target strings are already encoded. Try this:
use Encode;
my $str = "\x{30c6}\x{30b9}\x{30c8}"; # test in Unicode
my $bytes = encode_utf8($str); # utf-8 encoded bytes
my $copy = $bytes; # copy it
utf8::downgrade($copy);
print "OK\n" if $bytes eq $copy;
> I still don't quite get what is going on here, even though I have quite a good understanding of encodings, utf8 etc (or so I though...)
What happens here is: your JSON data is an upgraded (decoded) Unicode string whose characters all happen to fall in the latin-1 range. Because they're in latin-1 range, HTTP::Message successfully downgrades them into latin-1, which is unfortunate.
However it should NOT be considered as a bug in HTTP::Message. The problem is on your side (of course!) - your JSON data has the upgraded (decoded) strings. It should be encoded byte strings instead.
I'm curious how you generate that JSON data, but it could be reproduced with something like:
use Encode;
use JSON;
my $json = JSON::encode_json({ foo => "L\x{e9}on" });
$json = decode_utf8($json);
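This reproduction runs as-is with the core JSON::PP module, which behaves like JSON here (a sketch; the foo key is just the example from the comment):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use JSON::PP qw(encode_json);

# encode_json already returns UTF-8 *bytes* -- correct output as-is:
my $json = encode_json({ foo => "L\x{e9}on" });
print utf8::is_utf8($json) ? "characters\n" : "bytes\n";   # bytes

# Accidentally decoding it again yields an upgraded character string,
# which HTTP::Message will then happily downgrade to latin-1:
$json = decode_utf8($json);
print utf8::is_utf8($json) ? "characters\n" : "bytes\n";   # characters

# A final encode_utf8 repairs the state -- the fix from the original post:
my $fixed = encode_utf8($json);
```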
Maybe your JSON generating code reads the JSON data from a file with binmode on it, or reading JSON saved in an XML file with encoding="utf-8", or you're generating the JSON using Template-Toolkit. I don't know.
Anyway, now $json *happens to* contain utf8 octet strings, and you take that as "it is correctly encoded in utf-8". That's wrong. It's just a side effect of how perl internally stores its data in utf-8: the flag is just a mark that the string may contain high-bit characters, which in this case are actually all
So, calling encode_utf8 on $json makes it *correctly* encoded into utf-8 octets. Nothing's wrong with your end result, but the actual problem is that your generated JSON data is in the bad state in the first place.
> which in this case are actually all
Er, something is cut off: I meant 'which in this case are actually all less than 255 (aka in latin-1 range)'.
It's not "sort of broken", the contents of the string $y in the case are the same as in $z in
Try it:
Anyway, what Tatsuhiko Miyagawa said is correct:
I suspected something like this (but had trouble putting it into words :-)
Anyway, I'm generating the JSON using MooseX::Storage (::Format::JSON), so it's kind of hard to actually change the way the JSON is generated. The object I'm serializing is filled with data coming from Postgres, where it's stored in utf8. The DB connection uses pg_enable_utf8.
You should probably consult the Moose developers about the problem you have. JSON is data, so it should be encoded into utf-8 bytes instead of strings.
This gotcha often happens in the web API programming, since creating URI and HTTP requests require some *data*, which should explicitly be encoded into utf-8 or other encodings.
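The same applies to URIs: URI::Escape on CPAN provides uri_escape_utf8 for exactly this. A core-only sketch of what it does (the sub name escape_param is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Encode the characters to UTF-8 bytes first, then percent-escape every
# byte outside the unreserved set -- the same job URI::Escape's
# uri_escape_utf8 does for you:
sub escape_param {
    my $bytes = encode_utf8(shift);
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

print escape_param("L\x{e9}on"), "\n";   # L%C3%A9on
```

Skipping the encode step here would percent-escape the internal representation instead of the UTF-8 bytes, which is the same class of bug as the post describes.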
I have a similar problem. I use the JSON module from CPAN along with LWP. It was a bit confusing, because it works if I use latin1 in JSON, but the problem seems to be that I must tell the perl parser with the utf8 pragma that I want to use utf8 in my source code...
This works (maybe it helps someone...)
$requestValues->{jsonrpc} = "2.0";
$requestValues->{params}{text} = "ÄÜÖ";
...
$ua = LWP::UserAgent->new;
...
$req->content_type('application/json; charset=utf-8');
$req->content(encode_json($requestValues));
...
or
...
$req->content(JSON->new->utf8->encode($requestValues));
...
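Both spellings in the comment above produce the same UTF-8 byte string; with the core JSON::PP module (same API as JSON) this can be checked directly:

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json);

my $requestValues = { text => "\x{c4}\x{dc}\x{d6}" };    # "ÄÜÖ" as characters

# encode_json is a shortcut for JSON::PP->new->utf8->encode:
my $a = encode_json($requestValues);
my $b = JSON::PP->new->utf8->encode($requestValues);

print $a eq $b ? "identical\n" : "different\n";          # identical
print utf8::is_utf8($a) ? "characters\n" : "bytes\n";    # bytes
```

Either way the result is already a byte string, so it can go straight into $req->content without a further encode_utf8.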