Branchless UTF-8 Length

I recently read an article about Aha! – A Hacker’s Assistant, a superoptimizer used to find branchless algorithms with brute force. There's a problem for which I always wanted a short branchless solution: finding the length of a UTF-8 byte sequence without a lookup table. So I gave Aha a try.

The length of a UTF-8 byte sequence is determined by its first byte. The possible sequences are:

  • 1-byte sequences start with a byte in the range 0x00-0x7F
  • 2-byte sequences start with a byte in the range 0xC0-0xDF
  • 3-byte sequences start with a byte in the range 0xE0-0xEF
  • 4-byte sequences start with a byte in the range 0xF0-0xF7

(There are a couple of other restrictions but I'm only interested in valid UTF-8 strings and don't care about the results for invalid sequences.)

Aha is straight-forward to use. You write a simple C fragment that contains the function you want to make branchless, compile Aha using that fragment, and run the resulting program. You only have to provide a command-line parameter with the number of operations Aha should try. Other options can be set in a C header file. Since Aha uses brute force, it's only useful for short algorithms. Exhausting all programs with three operations takes under a second. Four operations about 1-10 minutes. Five operations probably in the range of hours (I didn't have to try).

Since the length of a UTF-8 byte sequence is determined by the upper four bits of the first byte, I figured it would be better to provide only these bits as input to my function. So I ended up with the following C fragment:

#include "aha.h"

int userfun(int x) {
    if (x == 0xC || x == 0xD) return 2;
    if (x == 0xE) return 3;
    if (x == 0xF) return 4;
    return 1;

Via the config.h file I told Aha to only try values 0-7 and 0xC-0xF. As intermediates I chose the range -16 to 16, for shift intermediates values 1-3. Unfortunately, Aha didn't find any programs with three operations, but a couple with four operations. Many of the programs are essentially the same, but here are the different classes of solutions (I use >>> for signed (arithmetic) right shift):

((x - 10) >>> (16 - x)) + 2

This is probably the nicest result. Only a single signed right shift and three adds/subtracts. Two operations can also be executed in parallel. The final addition of a constant can sometimes come for free. The "undefined" input values 0x8 and 0x9 yield 1, 0xA and 0xB yield 2.

((10 >> (15 - x)) + 7) >> 2

Two right shifts and two adds/subs. Values 0x08-0x0B all yield 1 which might be useful in some situations. All operations must run sequentially.

(x >> 3) - (-3 >>> (15 - x))

Two right shifts and two adds/subs. Values 0x08-0x0B all yield 2. Two operations can be executed in parallel.

It's interesting that all solutions rely on right-shifting by a value derived from x. Try the above formulas with a few sample values and you'll see how it works.

The expression 15 - x can also be computed as x ^ 15 which should be slightly faster on platforms without reverse subtract. Here's small Perl script that verifies the results and adjusts the expressions to work directly with the first byte of a sequence:

use strict;
use warnings;
use integer; # Enables signed right shift.

for my $c (0..255) {
    my $hi   = $c >> 4;
    my $len1 = ((10 >> ($hi ^ 15)) + 7) >> 2;
    my $len2 = (($c - 160) >> (20 - $hi)) + 2;
    my $len3 = ($c >> 7) - (-3 >> ($hi ^ 15));
    printf("%02X: %d %d %d\n", $c, $len1, $len2, $len3);

Leave a comment

About Nick Wellnhofer

user-pic I'm a freelancer looking for contract work in Munich, Germany, or remotely. Check out my profiles on MetaCPAN, GitHub, LinkedIn and StackOverflow.