Faster UTF-8 Validation
A while back, I received a pull request suggesting that I update the performance comparison with Encode.pm in my module, Unicode::UTF8. When I originally wrote Unicode::UTF8, Encode.pm used its own UTF-8 validation implementation. Since then, Karl Williamson has done extensive work improving Perl, and Encode.pm now relies on those validation routines based on Björn Höhrmann’s UTF-8 DFA decoder. It’s an elegant piece of code, widely adopted across many projects.
That said, I wasn’t particularly motivated to revisit the comparison, so I decided instead to look into faster scalar C implementations. While it has been suggested that Unicode::UTF8 should gain a SIMD implementation, and that may well be worthwhile, I wanted to first explore improvements to the scalar path, which would still be required as a fallback.
After some searching, I came across a tweet by Russ Cox showing performance numbers for a shift-based DFA implementation in Go, along with a link to a gist by Per Vognsen describing the technique.
It turned out that this shift-based DFA approach was quite popular a few years ago. Several implementations appeared in different programming languages and even in some RDBMs. However, I couldn’t find any reusable code, so I decided to implement it as a header-only C library, updating a UTF-8 validator I originally wrote in 2017.
I won’t go into the details of shift-based DFA encoding here. For a thorough explanation, I recommend Per Vognsen’s article, and I cover my own implementation in more detail here.
The main difference from a traditional DFA, such as Björn Höhrmann’s implementation, lies in how table lookups are handled. A conventional byte-class + transition-table DFA requires two lookups per byte: one to map the byte to a character class, and another to determine the next state. The shift-based DFA combines both steps into a single lookup, at the cost of a larger table. Both DFAs have a serial chain dependency where each byte’s state transition depends on the previous one, which limits instruction-level parallelism.
Two functions are provided for UTF-8 validation:
utf8_valid has no data-dependent branches. Each DFA step is a table lookup combined with a bitwise shift, with no conditional branches on the input byte value. Execution time scales with byte count, not byte values.
utf8_valid_ascii adds an ASCII fast path. There are different approaches to this: Perl, for instance, scans a leading ASCII prefix and falls back to DFA validation upon encountering a non-ASCII byte, never re-entering the fast path. I am not particularly fond of that approach, since even predominantly ASCII content today tends to contain non-ASCII bytes. Instead, I opted for a 16-byte chunk fast path that skips the DFA for chunks consisting entirely of ASCII bytes, and resumes DFA validation in a non-clean state when a non-ASCII byte is encountered.
The implementation showed promising results, but can we go faster? Yes! By splitting the UTF-8 stream in two, we can break the serial chain dependency and run two independent DFA chains in a single interleaved loop. Since the chains share no data dependencies, the CPU’s out-of-order engine can overlap their shift operations on cores with multiple shift-capable execution ports. I did try three interleaved streams but saw no improvement on the hardware available to me, the added overhead of splitting the stream further likely offsets any potential gain.
Whether utf8_valid_ascii is profitable depends heavily on the content. The following benchmark output illustrates this with ASCII-heavy content.
The benchmark includes two reference implementations: hoehrmann and utf8_valid_old, the previous scalar implementation in the C library.
en.txt: 80 KB; 82K code points; 1.00 units/point
U+0000..U+007F 82K 99.9%
U+0080..U+07FF 18 0.0%
U+0800..U+FFFF 49 0.1%
hoehrmann 572 MB/s
utf8_valid_old 3677 MB/s (6.42x)
utf8_valid 6466 MB/s (11.30x)
utf8_valid_ascii 41001 MB/s (71.63x)
sv.txt: 94 KB; 93K code points; 1.04 units/point
U+0000..U+007F 90K 96.4%
U+0080..U+07FF 3K 3.5%
U+0800..U+FFFF 171 0.2%
hoehrmann 572 MB/s
utf8_valid_old 1536 MB/s (2.68x)
utf8_valid 6600 MB/s (11.53x)
utf8_valid_ascii 8581 MB/s (15.00x)
units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7-1.9 is common for 2-byte-heavy text, and ~2.7-3.0 for CJK-heavy text. The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the multiplier shows throughput relative to the slowest implementation.
Output from multibyte-heavy content:
ja.txt: 176 KB; 65K code points; 2.79 units/point
U+0000..U+007F 7K 10.7%
U+0080..U+07FF 30 0.0%
U+0800..U+FFFF 58K 89.3%
hoehrmann 572 MB/s
utf8_valid_old 2046 MB/s (3.58x)
utf8_valid_ascii 4200 MB/s (7.34x)
utf8_valid 6492 MB/s (11.34x)
ar.txt: 25 KB; 14K code points; 1.81 units/point
U+0000..U+007F 3K 18.9%
U+0080..U+07FF 12K 81.1%
hoehrmann 572 MB/s
utf8_valid_old 984 MB/s (1.72x)
utf8_valid_ascii 4163 MB/s (7.28x)
utf8_valid 6486 MB/s (11.34x)
utf8_valid reaches approximately 0.71 cycles per byte on wide-issue cores, that’s pretty good for a C scalar implementation. See the Performance section in the GitHub repository for a more in-depth analysis and comparison across different architectures and compiler optimizations.
I couldn’t stop at just a validator. With spare bits left in the DFA table, I went ahead and implemented a complete UTF-8 library covering validation, decoding, navigation, and transcoding. The code is available on GitHub.
Unicode::UTF8 internally uses utf8_valid_ascii in its encode and decode routines.
Next steps: I have no further ideas for how to improve the scalar DFA implementation, and I am satisfied with the performance. A SIMD implementation may be worth exploring in the future, but the more immediate goal would be to get this incorporated into Perl core.
Enjoy! PS Now I can disregard the PR and tune up the numbers in the comparison ;o)
Calculating U.S Federal holidays with Time::Moment
Time::Moment comes with Time::Moment::Adjusters which currently only provides…
Simple and efficient formatting of relative date/time using Time::Moment
#!/usr/bin/perl
use warnings;
use Carp qw[];
use Time::Moment 0.19 qw[];
sub YEAR () { 365.2425 }
sub MONTH () { YEAR / 12 }
sub DAY () { 1 }
sub HOUR () { DAY / 24 }
sub MINUTE () { HOUR / 60 }
sub SECOND () { MINUTE / 60 }
sub ago {
@_ == 1 or Carp::croak(q/Usage: ago(moment)/);
my ($moment) = @_;
my $now = Time::Moment->now;
($now->compare($moment) >= 0)
or Carp::croak(q/Given moment is in the future/);
my $d = $now->mjd - $moment->mjd;
if ($d 0.75 * DAY) {
if ($d…
Let's talk about Time::Moment and round-trip of strings
In my previous blog post I wrote a lot more about Time::Moment, than appeared in the post (could have been my mistake due to a preview and error and a incomplete copy and a paste, but still very inconvenient). So I have decided to break down my original blogpost in several blogposts.
Time::Moment implements a subset of ISO 8601 (known as 4.3 Date and time of day, 4.3.2 Complete representatio…
Search this blog
Recent entries
- Introducing Time::Str
- Faster UTF-8 Validation
- Calculating U.S Federal holidays with Time::Moment
- Simple and efficient formatting of relative date/time using Time::Moment
- Let's talk about Time::Moment and round-trip of strings
- A bit more about Time::Moment
- Time::Moment vs DateTime
- What if sv_utf8_upgrade() used heuristic encoding?
- Coping with double encoded UTF-8
I blog about Perl.