Reading UTF-8 at GB/s

A couple of months ago I wrote about a UTF-8 library I implemented in C, which Unicode::UTF8 uses under the hood: Faster UTF-8 validation.

I then updated PerlIO::utf8_strict to use the same library. PerlIO::utf8_strict is a joint project with Leon Timmermans, who is the PerlIO wizard, while I know a bit about UTF-8.

Unfortunately, we didn’t see the throughput we expected. The bottleneck turned out to be in the read operator itself, specifically in how it counts UTF-8 sequences. I filed perl #24511 describing the limitation. The issue has since been identified, and Karl Williamson already has a work-in-progress PR that addresses it: perl #24521.
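To give a feel for what "counting UTF-8 sequences" involves, here is a minimal C sketch. It is not Perl's implementation, nor the code from either issue, and the function name is made up for illustration. It assumes the buffer already holds well-formed UTF-8, in which case each code point contributes exactly one byte that is not a continuation byte (10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a buffer that is assumed
 * to contain well-formed UTF-8. Continuation bytes match 10xxxxxx, so
 * every other byte (ASCII or a lead byte) starts a new code point. */
static size_t utf8_count_code_points(const unsigned char *buf, size_t len) {
    size_t count = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* The comparison yields 1 for ASCII and lead bytes, 0 otherwise. */
        count += (buf[i] & 0xC0) != 0x80;
    }
    return count;
}
```

A loop like this has no data-dependent branches, which is one reason counting can be made cheap when the input is already known to be valid.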

That fix won’t help existing Perls, though. So I decided to implement a drop-in replacement in Unicode::UTF8: read_utf8. It reads and validates UTF-8 in one pass, and in the benchmarks below it is 7-16x faster than read() with the :encoding(UTF-8) layer. The only caveat is that it does not support tied filehandles.
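The idea of validating and counting in a single pass can be sketched in C as follows. This is a plain scalar illustration under my own assumptions, not the code in Unicode::UTF8; the function name and signature are hypothetical. It accepts up to max_chars well-formed sequences, following the well-formed byte ranges from the Unicode Standard, and reports how many bytes they occupy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scan up to max_chars well-formed UTF-8 sequences
 * from buf[0..len). Stores the number of code points accepted in *chars
 * and returns the number of bytes they occupy. Scanning stops at the
 * first ill-formed or truncated sequence. */
static size_t utf8_scan(const unsigned char *buf, size_t len,
                        size_t max_chars, size_t *chars) {
    size_t i = 0, n = 0;
    assert(chars != NULL);
    while (i < len && n < max_chars) {
        unsigned char b = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t need, k;
        if (b < 0x80) { i++; n++; continue; }   /* ASCII */
        if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;           /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;           /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;           /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;           /* reject > U+10FFFF */
        } else {
            break;                              /* 0x80..0xC1, 0xF5..0xFF */
        }
        if (len - i <= need) break;             /* truncated sequence */
        if (buf[i + 1] < lo || buf[i + 1] > hi) break;
        for (k = 2; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80) break;
        if (k <= need) break;                   /* bad continuation byte */
        i += need + 1;
        n++;
    }
    *chars = n;
    return i;
}
```

A caller that wants N characters can fill a byte buffer, call a routine like this with max_chars set to N, and hand back the accepted prefix, so no second pass over the data is needed to count characters.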

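Benchmarks

For each file, units/point is the average number of UTF-8 code units (bytes) per code point, and the factor in parentheses is the speedup relative to the :encoding(UTF-8) row.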

perl:          5.042001 (linux 6.17.0-14-generic)
Encode:        3.21
Unicode::UTF8: 0.71

ar.txt: 25 KiB; 14K code points; 1.81 units/point
  U+0000..U+007F                3K  18.9%
  U+0080..U+07FF               12K  81.1%
  scalar:encoding(UTF-8)         308 MB/s
  scalar:utf8                    686 MB/s  (2.23x)
  scalar + read_utf8            3538 MB/s  (11.48x)

el.txt: 102 KiB; 59K code points; 1.77 units/point
  U+0000..U+007F               14K  23.1%
  U+0080..U+07FF               45K  76.9%
  U+0800..U+FFFF                38   0.1%
  scalar:encoding(UTF-8)         298 MB/s
  scalar:utf8                    679 MB/s  (2.28x)
  scalar + read_utf8            3705 MB/s  (12.43x)

en.txt: 80 KiB; 82K code points; 1.00 units/point
  U+0000..U+007F               82K  99.9%
  U+0080..U+07FF                18   0.0%
  U+0800..U+FFFF                49   0.1%
  scalar:encoding(UTF-8)         326 MB/s
  scalar:utf8                    395 MB/s  (1.21x)
  scalar + read_utf8            3844 MB/s  (11.79x)

ja.txt: 176 KiB; 65K code points; 2.79 units/point
  U+0000..U+007F                7K  10.7%
  U+0080..U+07FF                30   0.0%
  U+0800..U+FFFF               58K  89.3%
  scalar:encoding(UTF-8)         511 MB/s
  scalar:utf8                   1031 MB/s  (2.02x)
  scalar + read_utf8            3732 MB/s  (7.30x)

lv.txt: 135 KiB; 127K code points; 1.09 units/point
  U+0000..U+007F              117K  92.0%
  U+0080..U+07FF                9K   7.1%
  U+0800..U+FFFF                1K   0.9%
  scalar:encoding(UTF-8)         227 MB/s
  scalar:utf8                    429 MB/s  (1.89x)
  scalar + read_utf8            3780 MB/s  (16.66x)

ru.txt: 148 KiB; 85K code points; 1.78 units/point
  U+0000..U+007F               19K  22.6%
  U+0080..U+07FF               66K  77.0%
  U+0800..U+FFFF               364   0.4%
  scalar:encoding(UTF-8)         301 MB/s
  scalar:utf8                    683 MB/s  (2.27x)
  scalar + read_utf8            3682 MB/s  (12.22x)

sv.txt: 94 KiB; 93K code points; 1.04 units/point
  U+0000..U+007F               90K  96.4%
  U+0080..U+07FF                3K   3.5%
  U+0800..U+FFFF               171   0.2%
  scalar:encoding(UTF-8)         253 MB/s
  scalar:utf8                    410 MB/s  (1.62x)
  scalar + read_utf8            3667 MB/s  (14.50x)

The benchmark is available in the Unicode::UTF8 repository.

Enjoy!

About Christian Hansen

I blog about Perl.