Word counting

There is a Perl module (by coincidence, also Portuguese-authored) named Text::ExtractWords that does more or less the same job as the Unix command toilet^H^H^H^H^H^Hwc. It returns a hash mapping each word to its occurrence count.

The module is not bad. It is written in C, which makes it quite fast compared with pure Perl code on big strings. Unfortunately, it has one main limitation: Unicode. Although it supports a 'locale' configuration parameter, that does not seem to affect how it handles Unicode characters: it still looks at them as single ASCII characters.

I do not have any experience dealing with Unicode from C. I remember looking at some of the 'w' functions (wchar.h) but not getting really good results. When I have more time I will probably look into it.

But for now, I need a way to compute a word histogram from a Unicode Perl variable. I am doing it by splitting the string on whitespace and, for each element, incrementing its entry in a hash.
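For reference, the naive approach described above might look roughly like this. It is only a minimal sketch: the helper name word_histogram and the sample string are made up for illustration, and the input is assumed to be an already-decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words in a decoded string.
sub word_histogram {
    my ($text) = @_;
    my %count;
    # split ' ' (a literal single-space string) splits on runs of
    # whitespace and ignores leading whitespace.
    $count{$_}++ for split ' ', $text;
    return \%count;
}

# Sample text with non-ASCII characters written as escapes ("maçã", "pêra").
my $hist = word_histogram("ma\x{E7}\x{E3} p\x{EA}ra ma\x{E7}\x{E3}\nuva");

binmode STDOUT, ':encoding(UTF-8)';
printf "%s\t%d\n", $_, $hist->{$_} for sort keys %$hist;
```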

It works. But it is slow.

This raises two different questions:

  • is there any faster way to do this from Perl?
  • is there any other module that I can use to perform this task?


First thought: chop your data set into chunks and process in parallel if you have a multi-processor system.

Second thought: profile it and see where the bottleneck is. If your data is huge, I wonder if splitting is causing Perl to spend a lot of time doing memory allocation.

I just did a test on a naive counting routine like you describe and for a large dataset (>10 MB), reading line-by-line and then splitting was faster than slurping the file and splitting. YMMV.
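A line-by-line version of the counting routine could be sketched like this. The helper name histogram_from_fh is hypothetical, and an in-memory filehandle stands in for a real file so the example is self-contained; whether it actually beats slurping depends on the data, so it is worth timing on your own files.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build the histogram one line at a time,
# so only a single line's worth of fields is held in memory at once.
sub histogram_from_fh {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{$_}++ for split ' ', $line;
    }
    return \%count;
}

# UTF-8 encoded bytes standing in for the contents of a file on disk.
my $bytes = encode('UTF-8', "um dois um\ntr\x{EA}s dois um\n");

# The :encoding(UTF-8) layer decodes each line into characters as it is read.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line_hist = histogram_from_fh($fh);
close $fh;
```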

A "while (m//g) { ... }" loop will probably be quite a bit faster than "split." Also, if your data is coming from a file, memory mapping the file (like with Tim Bray's Wide Finder) will be fast.
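A rough sketch of the match-loop variant follows. The helper name histogram_regex is made up, and treating a "word" as any run of non-whitespace is an assumption chosen to mirror what splitting on whitespace does. The core Benchmark module can be used to check whether it really is faster on your data.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string with a global match instead of
# building the full list of fields with split first.
sub histogram_regex {
    my ($text) = @_;
    my %count;
    # In scalar context, /g resumes where the previous match ended,
    # so each iteration picks up the next run of non-whitespace.
    while ($text =~ /(\S+)/g) {
        $count{$1}++;
    }
    return \%count;
}

my $regex_hist = histogram_regex("to be or not to be");
```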

UTF-8 ought not break that. 0x20 will always be space in UTF-8, and never part of any other character. So if it splits on 0x20 (or any other ASCII character), it shouldn't wind up splitting Unicode characters.

As long as it passes octets >= 0x80 through unchanged, it shouldn't break UTF-8, even if it tries to fold uppercase/lowercase together.
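The point about UTF-8 can be illustrated from Perl itself: splitting the raw octets on ASCII whitespace bytes and decoding the resulting words afterwards gives intact words. This is just a sketch with made-up sample data, not a description of what Text::ExtractWords does internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw UTF-8 octets, as read from a file without any decoding layer.
# The text is "maçã pêra maçã"; every non-ASCII character is two bytes here.
my $octets = encode('UTF-8', "ma\x{E7}\x{E3} p\x{EA}ra ma\x{E7}\x{E3}");

# Split on ASCII whitespace bytes only. Bytes >= 0x80 are never matched,
# so multi-byte characters are never cut in half.
my %byte_count;
$byte_count{$_}++ for grep { length } split /[ \t\r\n]+/, $octets;

# Decode the keys afterwards so Perl treats them as character strings.
my %char_count;
for my $word (keys %byte_count) {
    $char_count{ decode('UTF-8', $word) } = $byte_count{$word};
}
```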

In order to get good results, you'd have to first normalize the Unicode to one of the standard normalization forms (otherwise, things like combining characters will give duplicate words). And you may have to fiddle around with results a little to tell Perl they're UTF-8.
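To show why normalization matters, here is a small sketch using the core Unicode::Normalize module. The sample text is invented: it contains the same word twice, once with a precomposed "é" and once with "e" followed by a combining acute accent. The choice of NFC rather than another form is only an assumption for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "café" twice: precomposed U+00E9, then "e" + combining acute U+0301.
my $text = "caf\x{E9} cafe\x{301}";

my (%raw, %normalized);
$raw{$_}++        for split ' ', $text;        # two distinct keys
$normalized{$_}++ for split ' ', NFC($text);   # one key, counted twice

printf "without normalization: %d distinct words\n", scalar keys %raw;
printf "with NFC:              %d distinct words\n", scalar keys %normalized;
```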

Alternatively, of course, you could just call the Unix command.

About Alberto Simões

I blog about Perl. D'uh!