Faster UTF-8 Validation
A couple of months ago, I received a pull request suggesting that I update the performance comparison with Encode.pm in my module, Unicode::UTF8. When I originally wrote Unicode::UTF8,
Encode.pm used its own UTF-8 validation implementation. In recent years, Karl Williamson has done extensive work improving Perl's Unicode handling, and Encode.pm now relies on the core validation routines, which are based on Björn Höhrmann's UTF-8 DFA decoder. It's an elegant piece of code, widely adopted across many projects.
That said, I wasn’t particularly motivated to revisit the comparison, so I decided instead to look into faster scalar C implementations. While it has been suggested that Unicode::UTF8 should gain a SIMD implementation, and that may well be worthwhile, I wanted to first explore improvements to the scalar path, which would still be required as a fallback.
After some searching, I came across a tweet by Russ Cox showing performance numbers for a shift-based DFA implementation in Go, along with a link to a gist by Per Vognsen describing the technique.
It turned out that this shift-based DFA approach was quite popular a few years ago. Several implementations appeared in different programming languages and even in some RDBMSs. However, I couldn’t find any reusable code, so I decided to implement it as a header-only C library, updating a UTF-8 validator I originally wrote in 2017.
I won’t go into the details of shift-based DFA encoding here. For a thorough explanation, I recommend Per Vognsen’s article, and I cover my own implementation in more detail here.
The main difference from a traditional DFA, such as Björn Höhrmann's implementation, lies in how table lookups are handled. A conventional byte-class + transition-table DFA requires two lookups per byte: one to map the byte to a character class, and another to determine the next state. The shift-based DFA combines both steps into a single lookup, at the cost of a larger table. Both DFAs carry a serial dependency chain: each byte's state transition depends on the result of the previous one, which limits instruction-level parallelism.
Two functions are provided for UTF-8 validation:
utf8_valid has no data-dependent branches. Each DFA step is a table lookup combined with a bitwise shift, with no conditional branches on the input byte value. Execution time scales with byte count, not byte values.
utf8_valid_ascii adds an ASCII fast path. There are different approaches to this: Perl, for instance, scans a leading ASCII prefix and falls back to DFA validation upon encountering a non-ASCII byte, never re-entering the fast path. I am not particularly fond of that approach, since even predominantly ASCII content today tends to contain non-ASCII bytes. Instead, I opted for a 16-byte chunk fast path that skips the DFA for chunks consisting entirely of ASCII bytes, and resumes DFA validation in a non-clean state when a non-ASCII byte is encountered.
The implementation showed promising results, but can we go faster? Yes! By splitting the UTF-8 stream in two, we can break the serial dependency chain and run two independent DFA chains in a single interleaved loop. Since the chains share no data dependencies, the CPU's out-of-order engine can overlap their shift operations on cores with multiple shift-capable execution ports. I did try three interleaved streams but saw no improvement on the hardware available to me; the added overhead of splitting the stream further likely offsets any potential gain.
Whether utf8_valid_ascii is profitable depends heavily on the content. The following benchmark output illustrates this with ASCII-heavy content.
units/point (encoded bytes per code point) is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7-1.9 is common for 2-byte-heavy text, and ~2.7-3.0 for CJK-heavy text. The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the multiplier shows throughput relative to the slowest implementation.
The benchmark includes two reference implementations: hoehrmann and utf8_valid_old, the previous scalar implementation in the C library.
en.txt: 80 KB; 82K code points; 1.00 units/point
    U+0000..U+007F    82K   99.9%
    U+0080..U+07FF     18    0.0%
    U+0800..U+FFFF     49    0.1%
    hoehrmann           572 MB/s
    utf8_valid_old     3677 MB/s  (6.42x)
    utf8_valid         6466 MB/s (11.30x)
    utf8_valid_ascii  41001 MB/s (71.63x)
sv.txt: 94 KB; 93K code points; 1.04 units/point
    U+0000..U+007F    90K   96.4%
    U+0080..U+07FF     3K    3.5%
    U+0800..U+FFFF    171    0.2%
    hoehrmann           654 MB/s
    utf8_valid_old     2053 MB/s  (3.14x)
    utf8_valid         6540 MB/s (10.00x)
    utf8_valid_ascii   9200 MB/s (14.07x)
Output from multibyte-heavy content:
ja.txt: 176 KB; 65K code points; 2.79 units/point
    U+0000..U+007F     7K   10.7%
    U+0080..U+07FF     30    0.0%
    U+0800..U+FFFF    58K   89.3%
    hoehrmann           572 MB/s
    utf8_valid_old     2046 MB/s  (3.58x)
    utf8_valid_ascii   4200 MB/s  (7.34x)
    utf8_valid         6492 MB/s (11.34x)
ar.txt: 25 KB; 14K code points; 1.81 units/point
    U+0000..U+007F     3K   18.9%
    U+0080..U+07FF    12K   81.1%
    hoehrmann           572 MB/s
    utf8_valid_old      984 MB/s  (1.72x)
    utf8_valid_ascii   4163 MB/s  (7.28x)
    utf8_valid         6486 MB/s (11.34x)
utf8_valid reaches approximately 0.71 cycles per byte on wide-issue cores; that's pretty good for a scalar C implementation.
I couldn’t stop at just a validator. With spare bits left in the DFA table, I went ahead and implemented a complete UTF-8 library covering validation, decoding, navigation, and transcoding. The code is available on GitHub.
Unicode::UTF8 internally uses utf8_valid_ascii in its encode and decode routines.
Next steps: I have no further ideas for improving the scalar implementation, and I am satisfied with the performance as it stands. A SIMD implementation may be worth exploring in the future, but the more immediate goal would be to get this incorporated into Perl core.
Enjoy!

PS: Now I can disregard the PR and tune up the numbers in the comparison ;o)