December 2023 Archives

Migrating from DBD::mysql to DBD::MariaDB

DBD::mysql has long provided the interface to connect to the MySQL database from Perl code. But as old as it is, it was implemented with some critical bugs that cannot simply be corrected without breaking mountains of existing code relying on those bugs. For this same reason, though DBD::MariaDB corrects these bugs, care must be taken when migrating existing code to avoid subtle breakage.

This blog post is far too short to explain Unicode and encodings like UTF-8; for anyone seeking a more solid grasp on the concepts, I recommend a read through The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

In particular, like much XS code of its time, DBD::mysql suffers "The Unicode Bug" by improperly ignoring the (poorly-named) internal UTF8 flag of strings it works with, which for XS code is a necessary indication of how the string was stored. Perl's string implementation has two internal formats: a downgraded format, where the characters of the string are stored directly as the bytes representing each ordinal, and an upgraded format, where the characters of the string are stored roughly as the UTF-8 encoding of the Unicode characters representing each ordinal (roughly, because it is more permissive than UTF-8 in order to allow encoding any character allowed in a Perl string).

The reasons for these two distinct formats involve both legacy and performance, and the system works rather well, as long as everything interfacing with these internals pays attention to which format the string has been stored in, as indicated by the "UTF8 flag". And importantly, Perl does not provide guarantees as to when it will or will not upgrade or downgrade a string (converting its internal format while keeping the logical contents of the string the same), and does so as needed or as an optimization; this behavior may change between versions of Perl or based on what features are enabled, and commonly occurs during various normal string operations.

The practical effect of this is that strings passed into MySQL that are considered string-equal in Perl (according to the eq operator and other common mechanisms) may get stored or retrieved as different strings, due to factors that may be surprising, such as string concatenation operations, or the version of Perl in use, or whether the unicode_strings feature was enabled when the string was created or modified.

An attempt was made to correct this in DBD::mysql 4.042, but was reverted in the next version due to the danger of subtly breaking existing code in a way that cannot be easily caught or explained. Subsequently, PALI released DBD::MariaDB as a fork of DBD::mysql with these fixes, and development continues separately on that fork.

DBD::MariaDB is not specific to the MariaDB database, and in fact both DBD::mysql and DBD::MariaDB support interfacing with either MySQL or MariaDB clients and servers for the most part, as their protocols are compatible. However, DBD::mysql recently has removed support for building against MySQL clients older than 8.0 or newer MariaDB clients, providing additional impetus to migrate to DBD::MariaDB.

The mysql_enable_utf8(mb4) option

In DBD::mysql, there is an option called mysql_enable_utf8 (and the preferred mysql_enable_utf8mb4 which will be omitted here for brevity) which purports to encode and decode parameters from UTF-8 and tell the database server to expect and return parameters in UTF-8, allowing the code to work with general Unicode text. However, this option suffers "The Unicode Bug" whether it is enabled or disabled. In common cases, it works just enough to fool the user into thinking their code will continue working correctly.

If a string contains codepoints with an ordinal of 256 or greater, it can be safely assumed to be a text string, because byte strings by definition cannot include ordinals of 256 or greater. Such characters that force interpretation as Unicode text are commonly called "wide characters", and these strings can only be stored in Perl's internal upgraded string format, which happens to produce the bytes of the string's UTF-8 encoding. Characters in the ASCII range of 0-127 are identical to the byte ordinals of their UTF-8 encoding (a precise way to say that the encode operation does not affect them), so this distinction is practically meaningless for ASCII strings.

But in between ASCII and wide characters are the ordinals from 128-255 that prove ambiguous; both byte strings and text strings can contain these ordinals, and when encoded or decoded to UTF-8, a valid but different string will result. Perl can use either of its internal representations to store such a string, and DBD::mysql will send it however Perl has stored it without checking.

A string of bytes treated as text and encoded to UTF-8 will generally result in another valid string of bytes, and it is possible (though rare, outside of ASCII) that a string of text treated as bytes can be interpreted as valid UTF-8 and decoded to another valid string of text. This is commonly how text data ends up as "mojibake" or mangled characters, or a byte string ends up unusable. There are a couple common ways that this can result in corrupted or incorrect data using DBD::mysql:

 use utf8;
 my $str1 = "résumé";
 warn length $str1; # 6, thanks to "use utf8"
 utf8::downgrade $str1;
 warn length $str1; # still 6
 # but mysql_enable_utf8 would wrongly send the Latin-1 encoding
 # as a binary string

 my $str2 = "\N{U+2603}\N{U+2603}\N{U+2603}"; # snowman party
 warn length $str2; # 3, regardless
 use Encode 'encode';
 my $str3 = encode 'UTF-8', $str2;
 warn length $str3; # 9,
 # but mysql_enable_utf8 would wrongly send the same as $str2

Binary data can suffer the opposite issue; in most cases, it will be stored in Perl's downgraded internal string format which simply stores the bytes as-is, and in this case the string DBD::mysql sends will match the intended binary data, even with mysql_enable_utf8 enabled. But such a string may be upgraded in the normal course of Perl string operations, in which case the binary data DBD::mysql sends to the database server will be mangled.

 use File::Slurper 'read_binary';
 my $binary_data = read_binary $filepath;
 warn length $binary_data; # file size
 utf8::upgrade $binary_data;
 warn length $binary_data; # still the file size
 # but mysql_enable_utf8 would send an incorrect (likely longer) string

When retrieving strings, DBD::mysql knows whether it is a text or binary column, and so (with mysql_enable_utf8) returns text data as an upgraded Perl string, and binary data as a downgraded Perl string. In the common case, this also works, but in particular if binary data was mangled upon storage, that mangled data will be returned as stored, rather than reversing the mangle.

It is possible to work around this mysql_enable_utf8 bug to write reliable code with DBD::mysql if necessary, as explained by the mysql_enable_utf8 documentation in DBD::mysql, where the examples shown also will seamlessly convert to DBD::MariaDB (though the upgrades and downgrades are unnecessary for DBD::MariaDB).

The DBD::MariaDB correction

In DBD::MariaDB, there is no such option as mysql_enable_utf8 (or mysql_enable_utf8mb4); input parameters are always treated as Unicode strings and encoded to UTF-8 unless indicated otherwise, and text data in the response is always decoded from UTF-8 and returned as a Unicode string. (Note that this still works fine if your tables do not use utf8/mb4 charsets for storage, because MySQL/MariaDB convert automatically between the connection charset and the storage charsets.) This along with correcting "The Unicode Bug" in the XS code results in much more reliable behavior, but code that only accidentally worked due to the mysql_enable_utf8 bugs must be updated for use with DBD::MariaDB or it may consistently mangle the data.

Due to the ambiguities previously discussed, these issues cannot be automatically detected or fixed. Instead, all instances of inserting text and particularly binary data must be audited to ensure that the data is being provided as the database driver expects. Text strings are expected to be provided in decoded Unicode character form; for example, whether the string is upgraded or downgraded, a string representing 3 Unicode characters (codepoints) should have a length() of 3 when provided to the database query.

 use utf8;
 my $str1 = "résumé"; # length 6
 my $str2 = "\N{U+2603}\N{U+2603}\N{U+2603}"; # length 3
 $dbh->do('INSERT INTO `some_table` SET `x`=?, `y`=?', undef, $str1, $str2);

For binary data, DBD::MariaDB requires that the parameter be marked as binary, so that it does not encode it to UTF-8, as the database driver does not have a way of intuiting whether the query will ultimately interpret the data as binary or text. This marking is done using the parameter binding interface:

 my $sth = $dbh->prepare($query);
 $sth->bind_param(1, $text_string);
 $sth->bind_param(2, $binary_data, DBI::SQL_BINARY);
 $sth->execute;

Unfortunately, there is no interface built into DBI to allow specifying parameter types in the more convenient methods such as do() and selectall_arrayref(); the full prepare/bind_param/execute flow must be used. I commonly write a wrapper function to execute these steps for my application in a single function call while being able to specify parameters as binary. Alternatively, Mojo::mysql provides a similar wrapper, and can transparently use DBD::MariaDB in place of DBD::mysql.

For more information, check out the DBD::MariaDB Unicode documentation.

About Grinnz

user-pic I blog about Perl.