Hailo: A Perl rewrite of MegaHAL
MegaHAL has numerous problems that we sought to solve:
- It keeps its entire brain in memory. For our use case, a chatbot on a small IRC channel whose logs had reached around 200,000 lines, that meant it was starting to take up around 600 MB of resident memory.
- Its tokenizer is implemented with C's ctype.h functions, which read things on a byte-by-byte basis. It handles non-ASCII input really badly, especially since it internally normalizes tokens by capitalizing them with toupper() before storing them.
- It limits a brain to 2^16 words, or whatever the range of short happens to be on your system.
- It would regularly corrupt its entire brain, especially as it got larger, necessitating a complete reload. I never found out why this was but I think it has something to do with it rarely checking the return values of functions like
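To make the tokenizer and word-limit problems concrete, here is a minimal sketch in Python (not MegaHAL's actual C code). It assumes UTF-8 input, an ASCII-only uppercasing step that works byte by byte, and a 16-bit unsigned word index. The function names `bytewise_upper` and `to_u16` are made up for this illustration.

```python
def bytewise_upper(text: str) -> str:
    """Uppercase a string the way a byte-oriented, ASCII-only toupper() loop would.

    bytes.upper() only touches the ASCII letters a-z, so every byte of a
    multi-byte UTF-8 character passes through unchanged.
    """
    return text.encode("utf-8").upper().decode("utf-8")


def to_u16(index: int) -> int:
    """Truncate an index to 16 bits, as storing it in a 16-bit unsigned short would."""
    return index & 0xFFFF


if __name__ == "__main__":
    # ASCII letters are normalized, the accented letter is not.
    print(bytewise_upper("café"))   # CAFé
    print("café".upper())           # CAFÉ (Unicode-aware, for comparison)

    # The same word in two capitalizations ends up as two different tokens.
    print(bytewise_upper("école") == bytewise_upper("École"))  # False

    # Assuming a 16-bit index, word number 65536 collides with word number 0.
    print(to_u16(65536))            # 0
```

Whether a real C build mangles, skips or merely fails to normalize non-ASCII bytes depends on the locale and on how `char` is handled, so treat this as the mildest version of the problem.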
I managed to hack it so that #2 and #3 weren't an issue, but that left the major issues of its memory use and stability unresolved. Hinrik and I started writing a replacement, now called Hailo (HAL + failo, see this), which:
- Is a pluggable Moose-based Markov engine written in pure Perl.
- Has pluggable tokenizer, engine and storage backends. The default is to split the input into words and store it in SQLite, but it also has an in-memory engine and an alternative tokenizer which makes it easy to do things like generate Web 2.0 company names (I've already done so).
- Hovers at around 45 MB of resident memory where MegaHAL would use around 600 MB. Almost all of that memory is used by Moose and the other dependencies, which we used liberally.
- Is much faster than MegaHAL was. We're able to generate around 200 replies per second on a database made up of around 200,000 IRC lines.
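The pluggable design is easier to picture with a toy version. The sketch below is in Python, not Perl, and is not Hailo's code, schema or API: every class, function and table name in it is hypothetical. It shows a first-order Markov chain whose tokenizer and storage are swappable parts, with transition counts kept either in SQLite or in a plain dictionary.

```python
import random
import sqlite3
from typing import Callable, List, Optional, Tuple

START, END = "<s>", "</s>"  # sentinel tokens marking line boundaries


def word_tokenizer(line: str) -> List[str]:
    """Split on whitespace: suitable for chat lines."""
    return line.split()


def char_tokenizer(line: str) -> List[str]:
    """Split into characters: suitable for generating made-up names."""
    return list(line)


class SQLiteStorage:
    """Keeps transition counts in SQLite so the brain need not live in memory."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS transition ("
            " prev_token TEXT NOT NULL, next_token TEXT NOT NULL,"
            " seen INTEGER NOT NULL, PRIMARY KEY (prev_token, next_token))"
        )

    def add(self, prev: str, nxt: str) -> None:
        cur = self.db.execute(
            "UPDATE transition SET seen = seen + 1"
            " WHERE prev_token = ? AND next_token = ?", (prev, nxt))
        if cur.rowcount == 0:  # first time this pair is seen
            self.db.execute(
                "INSERT INTO transition VALUES (?, ?, 1)", (prev, nxt))
        self.db.commit()

    def followers(self, prev: str) -> List[Tuple[str, int]]:
        return self.db.execute(
            "SELECT next_token, seen FROM transition WHERE prev_token = ?",
            (prev,)).fetchall()


class DictStorage:
    """Same interface, but everything stays in a Python dict."""

    def __init__(self) -> None:
        self.counts: dict = {}

    def add(self, prev: str, nxt: str) -> None:
        row = self.counts.setdefault(prev, {})
        row[nxt] = row.get(nxt, 0) + 1

    def followers(self, prev: str) -> List[Tuple[str, int]]:
        return list(self.counts.get(prev, {}).items())


class MarkovEngine:
    def __init__(self, tokenizer: Callable[[str], List[str]], storage,
                 joiner: str = " ", rng: Optional[random.Random] = None) -> None:
        self.tokenizer, self.storage, self.joiner = tokenizer, storage, joiner
        self.rng = rng or random.Random()

    def learn(self, line: str) -> None:
        tokens = [START] + self.tokenizer(line) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            self.storage.add(prev, nxt)

    def reply(self, max_tokens: int = 50) -> str:
        out, current = [], START
        for _ in range(max_tokens):
            rows = self.storage.followers(current)
            if not rows:
                break
            tokens, weights = zip(*rows)
            # Pick the next token in proportion to how often it followed.
            current = self.rng.choices(tokens, weights=weights, k=1)[0]
            if current == END:
                break
            out.append(current)
        return self.joiner.join(out)
```

Swapping `word_tokenizer` for `char_tokenizer` (with `joiner=""`) turns the same engine from a chat bot into a name generator, and swapping `SQLiteStorage` for `DictStorage` trades low memory use for speed. A real engine would use a higher-order chain than this one.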
If you're interested then you can:
- Follow the best of failo on Twitter or Identi.ca. Any old quotes are from MegaHAL, new ones from Hailo.
- Chat with it on #failo on Freenode.
- Run your own Hailo-based IRC bot; it's easy with POE::Component::IRC::Plugin::Hailo.
Lastly, I'd like to highly recommend Moose. This is the first significant thing we've written in Moose and it made the whole process at least 5x easier than it otherwise would have been. It's really nice when your program has a command-line interface automatically generated from your class definition and you avoid the tedium of manual OO management.
The downside is that most of the 50MB memory usage can be attributed to Moose and related modules, and a cold start of the module can take up to 1 second. But the ease of maintenance is well worth it for this sort of program, which is meant to be long-running.