Reading UTF-8 at GB/s
A couple of months ago I wrote about a UTF-8 library I implemented in C, which Unicode::UTF8 uses under the hood: Faster UTF-8 validation.
I then updated PerlIO::utf8_strict
to use the same library. PerlIO::utf8_strict is a joint project with Leon
Timmermans, who is the PerlIO wizard, while I know a bit about UTF-8.
Unfortunately, we didn’t see the throughput we expected. The bottleneck turned
out to be in the read operator itself, specifically in how it counts UTF-8
sequences. I filed perl #24511
describing the limitation. The issue has since been identified, and Karl
Williamson already has a work-in-progress PR that addresses it:
perl #24521.
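To make "counting UTF-8 sequences" concrete, here is a minimal sketch in C. It is not how Perl's read operator or the library is implemented, and the function name utf8_count_code_points is made up for illustration. It assumes the buffer already holds well-formed UTF-8 and relies on the fact that every encoded code point contains exactly one byte that is not a continuation byte (10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Counts code points in a buffer that is assumed to be well-formed UTF-8.
 * Each code point contributes exactly one non-continuation byte, so we
 * count the bytes whose top two bits are not 10. */
size_t utf8_count_code_points(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (s[i] & 0xC0) != 0x80;
    return count;
}
```

A loop like this has no data-dependent branches, so compilers can usually vectorize it, which is one reason counting does not have to be a bottleneck.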
That fix won’t help existing Perls, though. So I decided to implement a drop-in
replacement in Unicode::UTF8: read_utf8. It reads and validates UTF-8 in one
pass and is very fast: in the benchmarks below it is 7-16x faster than read()
through an :encoding(UTF-8) layer, and roughly 3.6-9.7x faster than read()
through :utf8. The only caveat is that it does not support tied filehandles.
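As a rough illustration of "validate and count in one pass", the following C sketch walks a buffer once, checks each sequence against the well-formed byte ranges from the Unicode Standard, and counts code points as it goes. It is a plain scalar sketch, not the code inside Unicode::UTF8 or read_utf8, and the name utf8_validate_count is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Returns the number of code points in s[0..len), or -1 if the buffer is
 * not well-formed UTF-8 (overlong forms, surrogates, values above
 * U+10FFFF, stray continuation bytes, or a truncated sequence). */
long utf8_validate_count(const unsigned char *s, size_t len)
{
    size_t i = 0;
    long count = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t need;                        /* continuation bytes expected */

        if (b < 0x80) {                     /* ASCII */
            i++;
            count++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;       /* reject values > U+10FFFF */
        } else {
            return -1;                      /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i - 1 < need)
            return -1;                      /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return -1;
        for (size_t k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return -1;

        i += need + 1;
        count++;
    }
    return count;
}
```

Doing both jobs in the same loop means the data is touched only once, instead of being validated by one pass and counted by another.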
Benchmarks
perl:          5.042001 (linux 6.17.0-14-generic)
Encode:        3.21
Unicode::UTF8: 0.71

ar.txt: 25 KiB; 14K code points; 1.81 units/point
  U+0000..U+007F     3K  18.9%
  U+0080..U+07FF    12K  81.1%
  scalar:encoding(UTF-8)    308 MB/s
  scalar:utf8               686 MB/s  (2.23x)
  scalar + read_utf8       3538 MB/s  (11.48x)

el.txt: 102 KiB; 59K code points; 1.77 units/point
  U+0000..U+007F    14K  23.1%
  U+0080..U+07FF    45K  76.9%
  U+0800..U+FFFF     38   0.1%
  scalar:encoding(UTF-8)    298 MB/s
  scalar:utf8               679 MB/s  (2.28x)
  scalar + read_utf8       3705 MB/s  (12.43x)

en.txt: 80 KiB; 82K code points; 1.00 units/point
  U+0000..U+007F    82K  99.9%
  U+0080..U+07FF     18   0.0%
  U+0800..U+FFFF     49   0.1%
  scalar:encoding(UTF-8)    326 MB/s
  scalar:utf8               395 MB/s  (1.21x)
  scalar + read_utf8       3844 MB/s  (11.79x)

ja.txt: 176 KiB; 65K code points; 2.79 units/point
  U+0000..U+007F     7K  10.7%
  U+0080..U+07FF     30   0.0%
  U+0800..U+FFFF    58K  89.3%
  scalar:encoding(UTF-8)    511 MB/s
  scalar:utf8              1031 MB/s  (2.02x)
  scalar + read_utf8       3732 MB/s  (7.30x)

lv.txt: 135 KiB; 127K code points; 1.09 units/point
  U+0000..U+007F   117K  92.0%
  U+0080..U+07FF     9K   7.1%
  U+0800..U+FFFF     1K   0.9%
  scalar:encoding(UTF-8)    227 MB/s
  scalar:utf8               429 MB/s  (1.89x)
  scalar + read_utf8       3780 MB/s  (16.66x)

ru.txt: 148 KiB; 85K code points; 1.78 units/point
  U+0000..U+007F    19K  22.6%
  U+0080..U+07FF    66K  77.0%
  U+0800..U+FFFF    364   0.4%
  scalar:encoding(UTF-8)    301 MB/s
  scalar:utf8               683 MB/s  (2.27x)
  scalar + read_utf8       3682 MB/s  (12.22x)

sv.txt: 94 KiB; 93K code points; 1.04 units/point
  U+0000..U+007F    90K  96.4%
  U+0080..U+07FF     3K   3.5%
  U+0800..U+FFFF    171   0.2%
  scalar:encoding(UTF-8)    253 MB/s
  scalar:utf8               410 MB/s  (1.62x)
  scalar + read_utf8       3667 MB/s  (14.50x)
The benchmark is available in the Unicode::UTF8 repository.
Enjoy!