There is a Perl module (by coincidence, also Portuguese authored) named Text::ExtractWords that does more or less the same as the unix command toilet^H^H^H^H^H^Hwc: it returns a hash of words, mapping each word to its occurrence count.
The module is not bad: it is written in C, making it quite fast compared with pure Perl code on big strings. Unfortunately, it has one main limitation: unicode. Although it supports a 'locale' configuration parameter, that parameter does not seem to affect its handling of unicode characters; it treats them as sequences of single-byte ASCII characters.
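For reference, this is roughly how the module is used, if I recall its interface correctly (the config keys shown are the ones I remember from its documentation; treat them as an assumption):

```perl
use strict;
use warnings;
use Text::ExtractWords qw(words_count);

my %hash;
my $text = "the quick brown fox jumps over the lazy dog";

# Fill %hash with word => count pairs; minwordlen/maxwordlen
# filter words by length (config keys assumed from memory).
words_count(\%hash, $text, { minwordlen => 1, maxwordlen => 32 });

# %hash now maps e.g. 'the' => 2, 'fox' => 1, ...
```

It is with accented input ("maçã", "pêra", and friends) that this breaks down, since the C code walks the string byte by byte.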
I do not have any experience dealing with unicode from C. I remember looking at some of the wide-character functions (the 'w' functions from wchar.h) but not getting really good results. When I have more time I will probably look into it.
But for now, I need a way to compute a word histogram from a unicode Perl variable. I am doing it by splitting the string on whitespace and, for each element, incrementing its count in a hash.
It works. But it is slow.
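The approach above looks roughly like this (a sketch; the function name is mine, and splitting purely on whitespace means punctuation stays attached to words):

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8

# Build a word => count histogram by splitting on whitespace.
sub word_histogram {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split /\s+/, $text;
    delete $count{''};    # split can yield a leading empty field
    return \%count;
}

my $hist = word_histogram("maçã pêra maçã");
# $hist->{'maçã'} is 2, $hist->{'pêra'} is 1
```

Because the string is a proper Perl unicode string, accented words survive intact; the cost is that all the work happens in pure Perl, one hash operation per word.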
This raises two different questions:
- is there any faster way to do this from Perl?
- is there any other module that I can use to perform this task?