Branchless UTF-8 Length
I recently read an article about Aha! – A Hacker’s Assistant, a superoptimizer used to find branchless algorithms with brute force. There's a problem for which I always wanted a short branchless solution: finding the length of a UTF-8 byte sequence without a lookup table. So I gave Aha a try.
The length of a UTF-8 byte sequence is determined by its first byte. The possible sequences are:
- 1-byte sequences start with a byte in the range 0x00-0x7F
- 2-byte sequences start with a byte in the range 0xC0-0xDF
- 3-byte sequences start with a byte in the range 0xE0-0xEF
- 4-byte sequences start with a byte in the range 0xF0-0xF7
(There are a couple of other restrictions but I'm only interested in valid UTF-8 strings and don't care about the results for invalid sequences.)
Aha is straight-forward to use. You write a simple C fragment that contains the function you want to make branchless, compile Aha using that fragment, and run the resulting program. You only have to provide a command-line parameter with the number of operations Aha should try. Other options can be set in a C header file. Since Aha uses brute force, it's only useful for short algorithms. Exhausting all programs with three operations takes under a second. Four operations about 1-10 minutes. Five operations probably in the range of hours (I didn't have to try).
Since the length of a UTF-8 byte sequence is determined by the upper four bits of the first byte, I figured it would be better to provide only these bits as input to my function. So I ended up with the following C fragment:
#include "aha.h"
int userfun(int x) {
if (x == 0xC || x == 0xD) return 2;
if (x == 0xE) return 3;
if (x == 0xF) return 4;
return 1;
}
Via the config.h
file I told Aha to only try values 0-7 and 0xC-0xF. As intermediates I chose the range -16 to 16, for shift intermediates values 1-3. Unfortunately, Aha didn't find any programs with three operations, but a couple with four operations. Many of the programs are essentially the same, but here are the different classes of solutions (I use >>>
for signed (arithmetic) right shift):
((x - 10) >>> (16 - x)) + 2
This is probably the nicest result. Only a single signed right shift and three adds/subtracts. Two operations can also be executed in parallel. The final addition of a constant can sometimes come for free. The "undefined" input values 0x8 and 0x9 yield 1, 0xA and 0xB yield 2.
((10 >> (15 - x)) + 7) >> 2
Two right shifts and two adds/subs. Values 0x08-0x0B all yield 1 which might be useful in some situations. All operations must run sequentially.
(x >> 3) - (-3 >>> (15 - x))
Two right shifts and two adds/subs. Values 0x08-0x0B all yield 2. Two operations can be executed in parallel.
It's interesting that all solutions rely on right-shifting by a value derived from x
. Try the above formulas with a few sample values and you'll see how it works.
The expression 15 - x
can also be computed as x ^ 15
which should be slightly faster on platforms without reverse subtract. Here's small Perl script that verifies the results and adjusts the expressions to work directly with the first byte of a sequence:
#!/usr/bin/perl
use strict;
use warnings;
use integer; # Enables signed right shift.
for my $c (0..255) {
my $hi = $c >> 4;
my $len1 = ((10 >> ($hi ^ 15)) + 7) >> 2;
my $len2 = (($c - 160) >> (20 - $hi)) + 2;
my $len3 = ($c >> 7) - (-3 >> ($hi ^ 15));
printf("%02X: %d %d %d\n", $c, $len1, $len2, $len3);
}
Leave a comment