June 2019 Archives

Searching Perldocs

Search is a hard problem. It is the task of getting users to what they want to find, even if they don't know exactly what that is. Its requirements vary widely based on the kinds of things people will want to find and the kinds of people who want to find them. It's also an expected feature of almost any site on the web that is more complex than a single page. So shortly after putting together a demo of Perldoc Browser, which would become the backend for perldoc.pl, I needed to make it searchable.

Since most people searching perldocs on the web would be familiar with the old perldoc.perl.org search interface, I started by meeting those expectations. First and foremost, typing the name of any documentation page brings you to that page. This is more navigation than search, and quite straightforward. The database simply keeps a list of all such pages, and checks for name matches before doing any further searching. Like the -f option to the perldoc command line utility, it can also bring you to a function's documentation extracted from perlfunc, and, as of a later addition, to a variable's documentation from perlvar like the -v option.
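The lookup order described above can be sketched roughly as follows. This is only an illustration, not Perldoc Browser's actual code: the in-memory sets stand in for the database lists, and the names `PAGES`, `FUNCTIONS`, `VARIABLES`, and `resolve` are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical stand-ins for the lists the database would hold.
PAGES = {"perlfunc", "perlvar", "perlop"}
FUNCTIONS = {"open", "print", "sprintf"}   # entries documented in perlfunc
VARIABLES = {"$_", "@ARGV", "%ENV"}        # entries documented in perlvar


def resolve(query: str) -> Optional[Tuple[str, str]]:
    """Return (kind, name) for an exact name match, or None to fall through to search."""
    name = query.strip()
    if name in PAGES:
        return ("page", name)
    if name in FUNCTIONS:
        return ("function", name)   # comparable to perldoc -f
    if name in VARIABLES:
        return ("variable", name)   # comparable to perldoc -v
    return None                     # no exact match: run the full-text search
```

Only when `resolve` returns `None` does the more expensive full-text search need to run.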

Then came the hard part. What if a user wanted to find something that didn't have its own documentation page, and wasn't a function or variable? Or what if they didn't know its exact name? This is where full-text search becomes necessary. Full-text search is the practice of matching search terms against the full text of many documents rather than just their names, to find the most relevant results. Google is a great full-text search engine, but it serves the entire Internet, and as a result is not especially optimized for what perldocs may contain or what Perl users may look for.

I first considered the SQLite and PostgreSQL full-text search capabilities. These are each great feats of engineering, but they are each designed to search in a particular way which turned out to be too restrictive for perldocs. The SQLite FTS5 extension provides a full-text search of documents by creating virtual tables indexing documents from other tables, but it is fairly limited in configurability and in particular does not support skipping stop-words. Stop-word support is an important feature for searching documents with prose, because it allows ignoring the words that appear commonly but add no value to the result, such as "the" and "and". Without this feature, searching "foo and bar" will probably just find the page with the most "and"s.
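To make the stop-word problem concrete, here is a toy sketch. It scores documents by raw term counts, which is far cruder than what any real engine does, and the documents, the stop-word list, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "and", "a", "of", "to"})

# Made-up documents: one is prose-heavy, one actually mentions foo and bar.
DOCS = {
    "intro": "the cat and the dog and the bird and the fish",
    "foobar": "foo calls bar",
}


def score(query, text, stop_words=frozenset()):
    """Count occurrences in text of each query term, skipping stop-words."""
    terms = [t for t in re.findall(r"\w+", query.lower()) if t not in stop_words]
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sum(counts[t] for t in terms)


def best(query, stop_words=frozenset()):
    """Return the name of the highest-scoring document."""
    return max(DOCS, key=lambda name: score(query, DOCS[name], stop_words))
```

With these sample documents, `best("foo and bar")` picks "intro" because of its three "and"s, while `best("foo and bar", STOP_WORDS)` picks "foobar".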

PostgreSQL's full-text search is more fully featured and supports stop-words, but unfortunately it had a few critical issues that caused me to look elsewhere. As I had encountered when using it previously, it tokenizes both documents and search terms with a few odd rules: it will turn any "dot.separated.word" into a "hostname" term instead of into individually searchable terms, and similarly "slash/separated/words" will become a path or URL term. For perldocs, this created many cases of unsearchable terms or a confusing lack of results. This is easy to work around, though a bit ugly, but I could not find a simple way to disable the behavior. A more problematic issue for perldocs is that it also tokenizes things that look like HTML <tags> separately, and the ts_headline function, which is used to display a snippet of where the search terms match, would omit them entirely. This creates a lot of confusion, in particular with the Perl readline operator (<>), since in many cases it looks like an HTML tag. Finding no simple way to configure or disable these tokenization options, I decided it was time to try a much more complex solution.

I had used Elasticsearch before, though for very different requirements, and was already familiar with how extensively it lets you customize every aspect of the tokenizing and searching process. Setting up an Elasticsearch node or cluster is no simple task, but once again it paid off by doing what is needed and more. In addition to the standard English stop-words filter, I was able to apply a stemming filter which allows "variable" to match "variables", and an ASCII-folding filter which allows Unicode words to be searchable via the most similar ASCII representation. My favorite discovery was the word delimiter filter, which is what allows "heredoc" to match "here-doc" as well as "perl" to match "perl5". Additionally, by using the simple whitespace tokenizer, Perl's myriad symbols are included in the tokenized text and search terms rather than ignored as they are in the other full-text search implementations I tried. This means "$foo" will find uses of that variable specifically in the documentation; it will also find the word "foo" alone or in "@foo" at a lower relevance, thanks to the word delimiter filter. Most importantly, operators like <<>> with no word characters at all are searchable!
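The combined effect of the whitespace tokenizer and the word delimiter filter can be imitated in a few lines. This is a simplified pure-Python imitation, not Elasticsearch itself and not the configuration perldoc.pl uses: it assumes the original token is kept alongside its parts, handles ASCII only, and leaves out stemming, stop-words, and relevance scoring. The names `analyze` and `matches` are hypothetical.

```python
import re
from typing import List


def analyze(text: str) -> List[str]:
    """Split on whitespace, then emit each token plus its letter and digit parts."""
    terms = []
    for token in text.lower().split():
        terms.append(token)  # keep the original token, symbols included
        # Runs of ASCII letters or digits are parts; anything else is a delimiter.
        parts = re.findall(r"[a-z]+|[0-9]+", token)
        for part in parts:
            if part != token:
                terms.append(part)
        # Join the letter parts so that "here-doc" also yields "heredoc".
        words = [p for p in parts if p.isalpha()]
        if len(words) > 1:
            joined = "".join(words)
            if joined != token:
                terms.append(joined)
    return terms


def matches(query: str, document: str) -> bool:
    """True if the query and the document share at least one analyzed term."""
    return bool(set(analyze(query)) & set(analyze(document)))
```

Under these assumptions, `analyze("$foo")` keeps the sigil form and adds the bare word, so a search for "$foo" can also reach "@foo", while a token such as "<<>>" has no letter or digit parts and survives only as itself.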

Once I was satisfied with the capabilities and customizability of the search backend, there were a few more considerations. The perldoc utility has a great -q option, which searches the perlfaq documents for specific questions and answers. Searching the perlfaq documents as a whole is not sufficient; the user needs to be directed to the question and answer relevant to their query. So the FAQ questions are indexed individually, and these results are included as their own results section.
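One way to index FAQ questions individually is to split each perlfaq document at its question headings before indexing. The sketch below assumes each question is a =head2 heading in the POD source and treats the text up to the next heading as its answer; the sample text is made up, and `split_faq` is a hypothetical helper rather than Perldoc Browser's own code.

```python
import re
from typing import List, Tuple

# Made-up sample in the shape of a perlfaq POD document.
SAMPLE = """\
=head1 DESCRIPTION

Made-up sample text.

=head2 How do I frobnicate a widget?

Call frobnicate().

=head2 Why is my widget empty?

You forgot to fill it.

=head1 AUTHOR

Nobody.
"""


def split_faq(pod: str) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs, one per =head2 heading."""
    entries = []
    question = None
    answer_lines = []
    for line in pod.splitlines():
        heading = re.match(r"=head2\s+(.*)", line)
        if heading or line.startswith("=head1"):
            # Any heading closes the entry that was being collected.
            if question is not None:
                entries.append((question, "\n".join(answer_lines).strip()))
            question = heading.group(1).strip() if heading else None
            answer_lines = []
        elif question is not None:
            answer_lines.append(line)
    if question is not None:
        entries.append((question, "\n".join(answer_lines).strip()))
    return entries
```

Each resulting pair can then be indexed as its own small document, so a hit points at one question instead of a whole perlfaq page.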

The last feature I added was not a feature of perldoc, but a solution to a problem that I and others in the Perl community encounter often: discovering when a particular change was made in Perl. Whether you want to find out when a feature was added or when the behavior of an operator changed, the perldelta documents generally have this information, but as with the FAQs, it can be hard to find what you need among the whole set of documents. So I created another index and result set for perldelta, which brings you to the perldelta sections most relevant to the search query.

Perldoc Browser still supports all three full-text search backends through simple plugins, so it's easy to run your own instance with serverless SQLite searching. Making perldocs simple to search is anything but simple, but thanks to the flexibility of Elasticsearch I can almost always find what I need on perldoc.pl, and I hope you will too.

About Grinnz

I blog about Perl.