Posting utf8 data using LWP::UserAgent
Yesterday we had some trouble setting up a client that should post some utf8 JSON to a web API. The problem was that the data kept showing up as latin-1. After lots of "fun" with hexdump, wireshark, the debugger and Devel::Peek, we verified that we were in fact passing properly utf8-encoded JSON to LWP::UserAgent like so:
my $response = $ua->post(
    $self->api_base_url . '/' . $args->{action},
    'Content-type' => 'application/json;charset=utf-8',
    Content        => $args->{data},
);
Still we didn't receive utf8 on the server.
After adding encode_utf8 in random places, it worked. So I dug into Google and found this posting containing this advice:
If you want the string UTF8 encoded, then say so explicitly:
$req = HTTP::Request->new(POST => $endpoint);
$req->content_type("text/plain; charset='utf8'");
$req->content(Encode::encode_utf8($utf8));
It seems that HTTP::Request, when stitching together the actual text representation of the request, will downgrade the content to latin1, no matter what you say in Content-Type, unless one explicitly calls encode_utf8 on the content.
Now that I think about it again, I guess we've fallen into the trap that Perl uses something very much like utf8 for its internal string representation (which is why we can get the utf8-encoded data out of the database, send it to the client, and it gets displayed correctly). But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.
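The trap can be made visible with utf8::is_utf8 and Encode (a minimal sketch; the テスト string is just an example):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $chars = "\x{30c6}\x{30b9}\x{30c8}";   # "テスト": three *characters*
my $bytes = encode_utf8($chars);          # the same text as nine UTF-8 *bytes*

print length($chars), "\n";               # 3
print length($bytes), "\n";               # 9

# The internal UTF-8 flag tells the two states apart:
print utf8::is_utf8($chars) ? "chars: flagged\n" : "chars: plain\n";
print utf8::is_utf8($bytes) ? "bytes: flagged\n" : "bytes: plain\n";
```

Both strings hold "the same text", but only the second one is safe to hand to something that expects bytes on the wire.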
BTW, the working code we're using now looks like this:
my $response = $ua->post(
    $self->api_base_url . '/' . $args->{action},
    'Content-type' => 'application/json;charset=utf-8',
    Content        => encode_utf8($args->{data}),
);
Hmm?
---------------
#!/home/ben/software/install/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use utf8;
my $utf8 = 'さる';
my $ua = LWP::UserAgent->new ();
my $response =
$ua->post('http://mikan/b/cgi/cgi.cgi',
'Content-type' => 'application/json;charset=utf-8',
Content => $utf8,
);
print $response->decoded_content ();
----------
$ ./test-lwp.pl
HTTP::Message content must be bytes at /home/ben/software/install/lib/perl5/site_perl/5.10.0/HTTP/Request/Common.pm line 91
> But sometimes you just have to be very explicit, especially when dealing with picky modules like HTTP::Request.
It's not "sometimes" - you should always decode on input (when reading files, databases or web request data) and encode on output (when writing files, databases or web responses). It's that simple.
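A minimal sketch of that rule, using Encode directly (the byte string below is the UTF-8 encoding of さる):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Input boundary: decode incoming bytes into a character string.
my $bytes_in = "\xe3\x81\x95\xe3\x82\x8b";   # the six UTF-8 bytes of "さる"
my $text     = decode_utf8($bytes_in);
print length($text), "\n";                   # 2 characters

# ...everything in between works on characters...

# Output boundary: encode back into bytes just before writing.
my $bytes_out = encode_utf8($text);
print length($bytes_out), "\n";              # 6 bytes again
```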
HTTP::Request, in your use case, is the output, so you have to explicitly encode the strings into utf-8.
However, as the other commenter says, if your JSON data were a decoded string you should be getting an error, but it seems you don't. That means your JSON data is already encoded (like you say in the post), so you're sending doubly encoded UTF-8 to the server. I wonder if the server-side decoder does evil things or something /too smart/ to hide that bug.
Sorry, I don't quite get what you're trying to say. To me it seems that you're reproducing the problem I described.
But please also see my reply to miyagawa.
I know that I always have to en/decode at the borders of my programs. But I thought that my data was already properly encoded.
When I first added the final encode_utf8, I expected to see double-encoded data, but I didn't.
HTTP::Message says: Note that the content should be a string of bytes. Strings in perl can contain characters outside the range of a byte. The Encode module can be used to turn such strings into a string of bytes.
I looked at its source, and HTTP::Message calls utf8::downgrade on the content, which explains why my utf8 string was downgraded to latin1.
I still don't quite get what is going on here, even though I have quite a good understanding of encodings, utf8 etc. (or so I thought...)
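What utf8::downgrade does (and when it fails) can be checked directly; its optional second argument makes a failed downgrade return false instead of dying:

```perl
use strict;
use warnings;

# Characters <= 255 survive a downgrade to the latin-1 representation:
my $latin1 = "caf\x{e9}";          # "café", all codepoints fit in one byte
utf8::upgrade($latin1);            # force the internal UTF-8 representation
print utf8::downgrade($latin1, 1) ? "downgraded\n" : "failed\n";   # downgraded

# Characters > 255 cannot be represented in latin-1:
my $wide = "\x{30c6}";             # "テ"
print utf8::downgrade($wide, 1) ? "downgraded\n" : "failed\n";     # failed
```

This is exactly the latin-1 "smoking gun" from the original post: content in the latin-1 range downgrades silently; wide characters make HTTP::Message die instead.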
If I remove the line use utf8; from the example code which I posted, I get back the output (the UTF-8 encoded Japanese text さる) without problems (this cgi.cgi is just something which reflects back its standard input). If I add an encode_utf8 call in there, I get garbage. If I turn on both the use utf8 and the encode_utf8, I get the output back without problems.
OK, I think if you include utf8 characters in your source code without doing use utf8, Perl will treat them as bytes, not characters. As HTTP::Message only accepts bytes (not characters), this works. But it is sort of broken, because you shouldn't use utf8 in your source code without use utf8.
> I looked at its source, and HTTP::Message calls utf8::downgrade on the content. Which explains why my utf8 string was downgraded to latin1.
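The use utf8 pragma is lexically scoped, so its effect on literals can be seen in a single file (a sketch; the \x escapes spell out the raw source bytes perl sees without the pragma):

```perl
use strict;
use warnings;

# Without "use utf8", perl reads a UTF-8 literal in the source byte by byte:
my $as_bytes = "\xe3\x81\x95\xe3\x82\x8b";   # the raw source bytes of "さる"
print length($as_bytes), "\n";                # 6 (bytes)

{
    use utf8;    # lexically scoped: literals in this block are characters
    my $as_chars = "さる";
    print length($as_chars), "\n";            # 2 (characters)
}
```

The 6-byte version "works" with HTTP::Message only by accident: it already happens to be the byte string the wire needs.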
Right, but utf8::downgrade does NOT do anything if the target strings are already encoded. Try this:
use Encode;
my $str = "\x{30c6}\x{30b9}\x{30c8}"; # test in Unicode
my $bytes = encode_utf8($str); # utf-8 encoded bytes
my $copy = $bytes; # copy it
utf8::downgrade($copy);
print "OK\n" if $bytes eq $copy;
> I still don't quite get what is going on here, even though I have quite a good understanding of encodings, utf8 etc (or so I though...)
What happens here is: your JSON data is an upgraded (decoded) Unicode string whose characters all happen to fall in the latin-1 range. Because they're in latin-1 range, HTTP::Message successfully downgrades them into latin-1, which is unfortunate.
However it should NOT be considered as a bug in HTTP::Message. The problem is on your side (of course!) - your JSON data has the upgraded (decoded) strings. It should be encoded byte strings instead.
I'm curious how you generate that JSON data, but it could be reproduced with something like:
use Encode;
use JSON;
my $json = JSON::encode_json({ foo => "L\x{e9}on" });
$json = decode_utf8($json);
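This reproduction runs as-is with the core JSON::PP module, which behaves like JSON here (a sketch; the foo key is just the example from the comment):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use JSON::PP qw(encode_json);

# encode_json already returns UTF-8 *bytes* -- correct output as-is:
my $json = encode_json({ foo => "L\x{e9}on" });
print utf8::is_utf8($json) ? "characters\n" : "bytes\n";   # bytes

# Accidentally decoding it again yields an upgraded character string,
# which HTTP::Message will then happily downgrade to latin-1:
$json = decode_utf8($json);
print utf8::is_utf8($json) ? "characters\n" : "bytes\n";   # characters

# A final encode_utf8 repairs the state -- the fix from the original post:
my $fixed = encode_utf8($json);
```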
Maybe your JSON generating code reads the JSON data from a file with binmode on it, or reading JSON saved in an XML file with encoding="utf-8", or you're generating the JSON using Template-Toolkit. I don't know.
Anyway, now $json *happens to* contain utf8 octet strings, and you take that as "it is correctly encoded in utf-8". That's wrong. It's just a side effect of how perl internally stores its data in utf-8: the flag is just a mark that the string may contain high-bit characters, which in this case are actually all
So, calling encode_utf8 on $json makes it *correctly* encoded into utf-8 octets. Nothing's wrong with your end result, but the actual problem is that your generated JSON data is in the bad state in the first place.
> which in this case are actually all
Er, something is cut off: I meant 'which in this case are actually all less than 255 (aka in latin-1 range)'.
It's not "sort of broken", the contents of the string $y in the case are the same as in $z in
Try it:
Anyway, what Tatsuhiko Miyagawa said is correct:
I suspected something like this (but had trouble putting it into words :-)
Anyway, I'm generating the JSON using MooseX::Storage (::Format::JSON), so it's kind of hard to actually change the way the JSON is generated. The object I'm serializing is filled with data coming from Postgres, where it's stored in utf8. The DB connection uses pg_enable_utf8.
You should probably consult the Moose developers about the problem you have. JSON is data, so it should be encoded into utf-8 bytes instead of strings.
This gotcha often happens in the web API programming, since creating URI and HTTP requests require some *data*, which should explicitly be encoded into utf-8 or other encodings.
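The same applies to URIs: URI::Escape on CPAN provides uri_escape_utf8 for exactly this. A core-only sketch of what it does (the sub name escape_param is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Encode the characters to UTF-8 bytes first, then percent-escape every
# byte outside the unreserved set -- the same job URI::Escape's
# uri_escape_utf8 does for you:
sub escape_param {
    my $bytes = encode_utf8(shift);
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

print escape_param("L\x{e9}on"), "\n";   # L%C3%A9on
```

Skipping the encode step here would percent-escape the internal representation instead of the UTF-8 bytes, which is the same class of bug as the post describes.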
I have a similar problem. I use the JSON module from CPAN along with LWP. It was a bit confusing, because it works if I use latin1 in JSON, but the problem seems to be that I must tell the perl parser with the utf8 pragma that I want to use utf8 in my source code...
This works (maybe it helps someone...)
$requestValues->{jsonrpc} = "2.0";
$requestValues->{params}{text} = "ÄÜÖ";
...
$ua = LWP::UserAgent->new;
...
$req->content_type('application/json; charset=utf-8');
$req->content(encode_json($requestValues));
...
or
...
$req->content(JSON->new->utf8->encode($requestValues));
...
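Both spellings in the comment above produce the same UTF-8 byte string; with the core JSON::PP module (same API as JSON) this can be checked directly:

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json);

my $requestValues = { text => "\x{c4}\x{dc}\x{d6}" };    # "ÄÜÖ" as characters

# encode_json is a shortcut for JSON::PP->new->utf8->encode:
my $a = encode_json($requestValues);
my $b = JSON::PP->new->utf8->encode($requestValues);

print $a eq $b ? "identical\n" : "different\n";          # identical
print utf8::is_utf8($a) ? "characters\n" : "bytes\n";    # bytes
```

Either way the result is already a byte string, so it can go straight into $req->content without a further encode_utf8.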