OCaml, Unicode, and Hashtables

So, OCaml does not support Unicode out of the box. The “string” type is a sequence of 8-bit bytes and that’s that. Some find this to be a major failing of the language in general, and it is a pain in the ass. However, there is a Unicode library for OCaml, Camomile, which fills the gap.

In the project that I have been working on, I had to read a Unicode file into OCaml and create a “seen” hash, just as you would normally do in Perl. However, because OCaml doesn’t support Unicode natively, you cannot simply use the generic hashtable type “('a, 'b) t”, where 'a stands for an arbitrary key type and 'b for an arbitrary value type. Both are filled in by type inference, based on how you use the table. This won’t work because the generic Hashtbl compares keys with the built-in structural equality, which knows nothing about Unicode comparison rules.

All is not lost, however! This is where one of the most powerful features of OCaml comes into its own: the functor. Modules in OCaml can be parametrized so that the user can build a version of a module that meets his or her needs. For hashtables, one can parametrize the table on the key type. For instance:

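module UTF8Hash = Hashtbl.Make (struct
  (* Keys are UTF-8 encoded strings. *)
  type t = Camomile.UTF8.t

  (* Two keys are equal when UTF8.compare reports no difference. *)
  let equal a b = Camomile.UTF8.compare a b = 0

  (* Reuse the generic hash function. *)
  let hash = Hashtbl.hash
end)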

Using the Hashtbl.Make functor, I set the key type to UTF8.t and defined equal in terms of a comparison that makes sense for UTF-8 strings. I left the hash function itself alone, as I thought it would probably do the right thing and I didn’t want to get into it, so I simply reused Hashtbl.hash.
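To make the “seen” hash idea concrete, here is a minimal sketch that uses only the standard library. It assumes that keys are plain UTF-8 encoded strings (to my knowledge Camomile.UTF8.t is an alias for string, but treat that as an assumption) and uses String.compare as a stand-in for Camomile.UTF8.compare. The names Utf8Seen and dedup are made up for illustration.

```ocaml
(* Stand-in for the Camomile-based module: keys are UTF-8 encoded strings.
   String.compare replaces Camomile.UTF8.compare in this sketch. *)
module Utf8Seen = Hashtbl.Make (struct
  type t = string
  let equal a b = String.compare a b = 0
  let hash = Hashtbl.hash
end)

(* Keep the first occurrence of each word, using a "seen" table. *)
let dedup words =
  let seen = Utf8Seen.create 16 in
  List.filter
    (fun w ->
      if Utf8Seen.mem seen w then false
      else begin
        (* The value carries no information, so unit is enough. *)
        Utf8Seen.add seen w ();
        true
      end)
    words

let () =
  let words = [ "café"; "thé"; "café" ] in
  List.iter print_endline (dedup words)
```

The table here has type unit Utf8Seen.t: only membership matters, so the value is just ().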

Doing this creates a hashtable type of 'a UTF8Hash.t. The 'a is now the type of the value, since the type of the key is already fixed. In the same way, you can create a hashtable that has an arbitrarily complex key type. As long as you can write an equal function, and a hash function that returns the same hash for any two keys that equal treats as the same, you should be fine.
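As a rough illustration of a custom key equality, here is a sketch with case-insensitive ASCII keys. It is not Unicode-aware and the module name CaseInsensitive is hypothetical. The point is only that once equal is looser than byte equality, hash has to be computed on the same normalized form, otherwise equal keys could end up in different buckets.

```ocaml
(* Hypothetical example: keys that compare equal regardless of ASCII case. *)
module CaseInsensitive = Hashtbl.Make (struct
  type t = string

  let equal a b = String.lowercase_ascii a = String.lowercase_ascii b

  (* Hash the normalized form so that equal keys get the same hash. *)
  let hash s = Hashtbl.hash (String.lowercase_ascii s)
end)

let counts = CaseInsensitive.create 16

let () =
  CaseInsensitive.replace counts "OCaml" 1;
  (* Same key as far as [equal] is concerned, so this overwrites the value. *)
  CaseInsensitive.replace counts "ocaml" 2;
  Printf.printf "%d binding(s), value %d\n"
    (CaseInsensitive.length counts)
    (CaseInsensitive.find counts "OCAML")
```

Reusing Hashtbl.hash unchanged, as in UTF8Hash, is safe only when equal never considers two different byte sequences to be the same key.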

2 Comments

Since Camomile.UTF8.compare is implemented as Pervasives.compare, there is no difference between UTF8Hash and the polymorphic Hashtbl.

So, what did you intend here? If you want code point comparison, Pervasives.compare is fine, since binary comparison of UTF-8 strings gives you code point comparison. (That is an advantage of using UTF-8.)

If you want canonical comparison, that is, a comparison that is aware of the presence of combining characters, you must use UNF.canon_compare.

About cyocum

Celticist, Computer Scientist, Nerd, sometimes a poet…