## Branchless UTF-8 Length

I recently read an article about Aha! – A Hacker’s Assistant, a superoptimizer used to find branchless algorithms with brute force. There's a problem for which I always wanted a short branchless solution: finding the length of a UTF-8 byte sequence without a lookup table. So I gave Aha a try.

The length of a UTF-8 byte sequence is determined by its first byte. The possible sequences are:

• 1-byte sequences start with a byte in the range 0x00-0x7F
• 2-byte sequences start with a byte in the range 0xC0-0xDF
• 3-byte sequences start with a byte in the range 0xE0-0xEF
• 4-byte sequences start with a byte in the range 0xF0-0xF7

(There are a couple of other restrictions but I'm only interested in valid UTF-8 strings and don't care about the results for invalid sequences.)

Aha is straight-forward to use. You write a simple C fragment that contains the function you want to make branchless, compile Aha using that fragment, and run the resulting program. You only have to provide a command-line parameter with the number of operations Aha should try. Other options can be set in a C header file. Since Aha uses brute force, it's only useful for short algorithms. Exhausting all programs with three operations takes under a second. Four operations about 1-10 minutes. Five operations probably in the range of hours (I didn't have to try).

Since the length of a UTF-8 byte sequence is determined by the upper four bits of the first byte, I figured it would be better to provide only these bits as input to my function. So I ended up with the following C fragment:

``````#include "aha.h"

int userfun(int x) {
if (x == 0xC || x == 0xD) return 2;
if (x == 0xE) return 3;
if (x == 0xF) return 4;
return 1;
}
``````

Via the `config.h` file I told Aha to only try values 0-7 and 0xC-0xF. As intermediates I chose the range -16 to 16, for shift intermediates values 1-3. Unfortunately, Aha didn't find any programs with three operations, but a couple with four operations. Many of the programs are essentially the same, but here are the different classes of solutions (I use `>>>` for signed (arithmetic) right shift):

``````((x - 10) >>> (16 - x)) + 2
``````

This is probably the nicest result. Only a single signed right shift and three adds/subtracts. Two operations can also be executed in parallel. The final addition of a constant can sometimes come for free. The "undefined" input values 0x8 and 0x9 yield 1, 0xA and 0xB yield 2.

``````((10 >> (15 - x)) + 7) >> 2
``````

Two right shifts and two adds/subs. Values 0x08-0x0B all yield 1 which might be useful in some situations. All operations must run sequentially.

``````(x >> 3) - (-3 >>> (15 - x))
``````

Two right shifts and two adds/subs. Values 0x08-0x0B all yield 2. Two operations can be executed in parallel.

It's interesting that all solutions rely on right-shifting by a value derived from `x`. Try the above formulas with a few sample values and you'll see how it works.

The expression `15 - x` can also be computed as `x ^ 15` which should be slightly faster on platforms without reverse subtract. Here's small Perl script that verifies the results and adjusts the expressions to work directly with the first byte of a sequence:

``````#!/usr/bin/perl
use strict;
use warnings;
use integer; # Enables signed right shift.

for my \$c (0..255) {
my \$hi   = \$c >> 4;
my \$len1 = ((10 >> (\$hi ^ 15)) + 7) >> 2;
my \$len2 = ((\$c - 160) >> (20 - \$hi)) + 2;
my \$len3 = (\$c >> 7) - (-3 >> (\$hi ^ 15));
printf("%02X: %d %d %d\n", \$c, \$len1, \$len2, \$len3);
}
``````