Faster UTF-8 Validation

A while back, I received a pull request suggesting that I update the performance comparison with Encode.pm in my module, Unicode::UTF8. When I originally wrote Unicode::UTF8, Encode.pm used its own UTF-8 validation implementation. Since then, Karl Williamson has done extensive work improving Perl, and Encode.pm now relies on those validation routines based on Björn Höhrmann’s UTF-8 DFA decoder. It’s an elegant piece of code, widely adopted across many projects.

That said, I wasn’t particularly motivated to revisit the comparison, so I decided instead to look into faster scalar C implementations. While it has been suggested that Unicode::UTF8 should gain a SIMD implementation, and that may well be worthwhile, I wanted to first explore improvements to the scalar path, which would still be required as a fallback.

After some searching, I came across a tweet by Russ Cox showing performance numbers for a shift-based DFA implementation in Go, along with a link to a gist by Per Vognsen describing the technique.

It turned out that this shift-based DFA approach was quite popular a few years ago. Implementations appeared in several programming languages and even in some RDBMSs. However, I couldn’t find any reusable code, so I decided to implement it as a header-only C library, updating a UTF-8 validator I originally wrote in 2017.

I won’t go into the details of shift-based DFA encoding here. For a thorough explanation, I recommend Per Vognsen’s article, and I cover my own implementation in more detail here.

The main difference from a traditional DFA, such as Björn Höhrmann’s implementation, lies in how table lookups are handled. A conventional byte-class + transition-table DFA requires two lookups per byte: one to map the byte to a character class, and another to determine the next state. The shift-based DFA combines both steps into a single lookup, at the cost of a larger table. Both DFAs have a serial chain dependency where each byte’s state transition depends on the previous one, which limits instruction-level parallelism.
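To make the single-lookup step concrete, here is a toy, self-contained shift-based DFA validator. The state names, table builder, and function names below are my own illustration of the technique, not the library’s actual code: nine states are packed into a 64-bit row at 6-bit offsets, a state’s value is its own bit offset, and the inner loop is just one load and one shift per byte.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy shift-based DFA (illustrative; not the library's actual table).
 * Nine states, each stored as a 6-bit field inside a 64-bit row; a
 * state's value is its own bit offset (6 * state index). */
enum { ACC = 0, CS1, E0_, E1_, ED_, F0_, F1_, F4_, REJ, NSTATES };

static uint64_t row[256];

/* Classic UTF-8 transition function: rejects overlongs, surrogates
 * (ED A0..BF), and anything above U+10FFFF, like Hoehrmann's DFA. */
static int next_state(int s, int b) {
    switch (s) {
    case ACC:
        if (b <= 0x7F) return ACC;              /* ASCII               */
        if (b >= 0xC2 && b <= 0xDF) return CS1; /* 2-byte lead         */
        if (b == 0xE0) return E0_;              /* E0: cont in A0..BF  */
        if (b == 0xED) return ED_;              /* ED: cont in 80..9F  */
        if (b >= 0xE1 && b <= 0xEF) return E1_; /* generic 3-byte lead */
        if (b == 0xF0) return F0_;              /* F0: cont in 90..BF  */
        if (b >= 0xF1 && b <= 0xF3) return F1_; /* generic 4-byte lead */
        if (b == 0xF4) return F4_;              /* F4: cont in 80..8F  */
        return REJ;             /* stray continuation, C0/C1, F5..FF   */
    case CS1: return (b >= 0x80 && b <= 0xBF) ? ACC : REJ;
    case E0_: return (b >= 0xA0 && b <= 0xBF) ? CS1 : REJ;
    case E1_: return (b >= 0x80 && b <= 0xBF) ? CS1 : REJ;
    case ED_: return (b >= 0x80 && b <= 0x9F) ? CS1 : REJ;
    case F0_: return (b >= 0x90 && b <= 0xBF) ? E1_ : REJ;
    case F1_: return (b >= 0x80 && b <= 0xBF) ? E1_ : REJ;
    case F4_: return (b >= 0x80 && b <= 0x8F) ? E1_ : REJ;
    default:  return REJ;                       /* REJ is a sink       */
    }
}

static void build_table(void) {
    for (int b = 0; b < 256; b++) {
        uint64_t r = 0;
        for (int s = 0; s < NSTATES; s++)
            r |= (uint64_t)(6 * next_state(s, b)) << (6 * s);
        row[b] = r;
    }
}

/* The entire inner loop: one load, one shift, no data-dependent branches. */
static int toy_utf8_valid(const unsigned char *p, size_t n) {
    uint64_t state = 6 * ACC;
    for (size_t i = 0; i < n; i++)
        state = row[p[i]] >> (state & 63);
    return (state & 63) == 6 * ACC;
}
```

Compare this with the conventional scheme, where the inner loop is `state = trans[state][class_of[byte]]`: two dependent loads instead of one, with the same serial chain through `state`.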

Two functions are provided for UTF-8 validation:

utf8_valid has no data-dependent branches. Each DFA step is a table lookup combined with a bitwise shift, with no conditional branches on the input byte value. Execution time scales with byte count, not byte values.

utf8_valid_ascii adds an ASCII fast path. There are different approaches to this: Perl, for instance, scans a leading ASCII prefix and falls back to DFA validation upon encountering a non-ASCII byte, never re-entering the fast path. I am not particularly fond of that approach, since even predominantly ASCII content today tends to contain non-ASCII bytes. Instead, I opted for a 16-byte chunk fast path that skips the DFA for chunks consisting entirely of ASCII bytes, and resumes DFA validation in a non-clean state when a non-ASCII byte is encountered.
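As an illustration of the chunked fast path (my own sketch; the library’s actual loop differs), a 16-byte chunk is pure ASCII exactly when no byte has its high bit set, which two 64-bit loads can test at once:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a 16-byte ASCII check (illustrative, not the library's code):
 * the chunk is all-ASCII iff no byte has its high bit (0x80) set. */
static int chunk_is_ascii(const unsigned char *p) {
    uint64_t lo, hi;
    memcpy(&lo, p, 8);     /* memcpy keeps unaligned reads well-defined */
    memcpy(&hi, p + 8, 8);
    return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}
```

In the validation loop, a chunk that passes this test while the DFA sits in the accept state can be skipped outright; a chunk that fails is fed through the DFA byte by byte, and the next chunk is tested again, so the fast path can re-engage after isolated non-ASCII runs.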

The implementation showed promising results, but can we go faster? Yes! By splitting the UTF-8 stream in two, we can break the serial chain dependency and run two independent DFA chains in a single interleaved loop. Since the chains share no data dependencies, the CPU’s out-of-order engine can overlap their shift operations on cores with multiple shift-capable execution ports. I did try three interleaved streams but saw no improvement on the hardware available to me; the added overhead of splitting the stream further likely offsets any potential gain.
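The two-stream idea can be sketched as follows. This is my own illustration rather than the library’s code, and it makes one concrete choice for handling the seam: nudge the midpoint forward past continuation bytes so no multibyte character straddles the split, then step both chains in one loop. A compact toy table builder is included so the sketch stands alone.

```c
#include <stdint.h>
#include <stddef.h>

/* Compact toy shift-DFA table (illustrative; not the library's code).
 * States: 0 accept, 1..7 partial sequences, 8 reject; a state's stored
 * value is its bit offset (6 * index) within the 64-bit row. */
static uint64_t row[256];

static int nxt(int s, int b) {
    if (s == 0) {                     /* accept state: classify the byte */
        if (b < 0x80) return 0;       /* ASCII                           */
        if (b < 0xC2) return 8;       /* stray continuation or C0/C1     */
        if (b < 0xE0) return 1;       /* 2-byte lead                     */
        if (b == 0xE0) return 2;
        if (b == 0xED) return 4;
        if (b < 0xF0) return 3;       /* generic 3-byte lead             */
        if (b == 0xF0) return 5;
        if (b < 0xF4) return 6;       /* generic 4-byte lead             */
        if (b == 0xF4) return 7;
        return 8;                     /* F5..FF                          */
    }
    /* per-state continuation ranges (overlongs/surrogates excluded)     */
    static const int lo[9] = {0, 0x80, 0xA0, 0x80, 0x80, 0x90, 0x80, 0x80, 0x100};
    static const int hi[9] = {0, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0xBF, 0x8F, -1};
    static const int to[9] = {0, 0,    1,    1,    1,    3,    3,    3,    8};
    return (b >= lo[s] && b <= hi[s]) ? to[s] : 8;
}

static void build_table(void) {
    for (int b = 0; b < 256; b++) {
        uint64_t r = 0;
        for (int s = 0; s < 9; s++)
            r |= (uint64_t)(6 * nxt(s, b)) << (6 * s);
        row[b] = r;
    }
}

/* Two independent chains in one interleaved loop. The midpoint is moved
 * past continuation bytes (10xxxxxx) so no character straddles the seam;
 * on valid input that is at most three extra bytes. */
static int toy_utf8_valid2(const unsigned char *p, size_t n) {
    size_t half = n / 2;
    while (half < n && (p[half] & 0xC0) == 0x80)
        half++;
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0, j = half;
    while (i < half && j < n) {       /* both chains in flight           */
        s0 = row[p[i++]] >> (s0 & 63);
        s1 = row[p[j++]] >> (s1 & 63);
    }
    while (i < half) s0 = row[p[i++]] >> (s0 & 63);
    while (j < n)    s1 = row[p[j++]] >> (s1 & 63);
    return ((s0 & 63) | (s1 & 63)) == 0;
}
```

A real implementation would process equal-sized blocks per iteration and handle the tails more carefully, but the sketch shows the essential point: the two `>>` chains carry no dependency on each other, so their latencies overlap.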

Whether utf8_valid_ascii is profitable depends heavily on the content. The following benchmark output illustrates this with ASCII-heavy content.

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7-1.9 is common for 2-byte-heavy text, and ~2.7-3.0 for CJK-heavy text. The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the multiplier shows throughput relative to the slowest implementation.

The benchmark includes two reference implementations: hoehrmann and utf8_valid_old, the previous scalar implementation in the C library.

en.txt: 80 KB; 82K code points; 1.00 units/point
  U+0000..U+007F          82K  99.9%
  U+0080..U+07FF           18   0.0%
  U+0800..U+FFFF           49   0.1%
  hoehrmann                    572 MB/s
  utf8_valid_old              3677 MB/s  (6.42x)
  utf8_valid                  6466 MB/s  (11.30x)
  utf8_valid_ascii           41001 MB/s  (71.63x)

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    572 MB/s
  utf8_valid_old              1536 MB/s  (2.68x)
  utf8_valid                  6600 MB/s  (11.53x)
  utf8_valid_ascii            8581 MB/s  (15.00x)

Output from multibyte-heavy content:

ja.txt: 176 KB; 65K code points; 2.79 units/point
  U+0000..U+007F           7K  10.7%
  U+0080..U+07FF           30   0.0%
  U+0800..U+FFFF          58K  89.3%
  hoehrmann                    572 MB/s
  utf8_valid_old              2046 MB/s  (3.58x)
  utf8_valid_ascii            4200 MB/s  (7.34x)
  utf8_valid                  6492 MB/s  (11.34x)

ar.txt: 25 KB; 14K code points; 1.81 units/point
  U+0000..U+007F           3K  18.9%
  U+0080..U+07FF          12K  81.1%
  hoehrmann                    572 MB/s
  utf8_valid_old               984 MB/s  (1.72x)
  utf8_valid_ascii            4163 MB/s  (7.28x)
  utf8_valid                  6486 MB/s  (11.34x)

utf8_valid reaches approximately 0.71 cycles per byte on wide-issue cores, which is pretty good for a scalar C implementation.

I couldn’t stop at just a validator. With spare bits left in the DFA table, I went ahead and implemented a complete UTF-8 library covering validation, decoding, navigation, and transcoding. The code is available on GitHub.

Unicode::UTF8 internally uses utf8_valid_ascii in its encode and decode routines.

Next steps: I have no further ideas for improving the scalar implementation, and I am satisfied with the performance as it stands. A SIMD implementation may be worth exploring in the future, but the more immediate goal would be to get this incorporated into Perl core.

Enjoy! P.S. Now I can disregard the PR and tune up the numbers in the comparison ;o)

About Christian Hansen

I blog about Perl.