Stupid Lucene Tricks: Storing Non-Documents

Lucene's search capabilities are so powerful that it is tempting to store more than documents -- and that is OK. Here are some hints to make storing non-documents easier:

  • Do you want to allow phrase searches on your fields? Phrase searching has a drawback when you keep the synonyms for a field value in that same field for ease of searching (which may well be the right strategy for the Lucene default field). For example, if you are indexing information about sugar beets, you could end up with many synonyms about the "sugariness" of the beets when you care mostly about the "beetiness" of the beets. An index that keeps only the unique words of each field value would reduce the number of extraneous words about sugar, but at the cost of disallowing phrase searches.
  • What kind of tokenizing do you want to use? Especially for English, there are a number of different tokenizers to choose from. You can also pre-tokenize your inputs, letting Lucene's tokenizer handle the rest of the work. Remember to use the same tokenizing scheme when you search, or the search results may not be what you want.
  • Synonymization is more important with non-documents than with documents, because the natural redundancy of language gives documents a good start at synonymization without any extra effort. A non-document may have little to no synonymization, or it may have a lot, depending on the subject field. In either case, your indexing code will have to handle the synonymization itself.
  • If your field is just a fixed string, a fixed number, etc. (i.e., not a sentence, paragraph, or page), that's OK. You just won't get the particular advantages of full-text indexing. You can still search on that field just fine with Lucene.
  • Whether to make a field retrievable depends on how resource-constrained your environment is. It is pretty fast to retrieve the values from a Lucene search hit, but in many applications that means you have two copies of each set of values. (In a pre-built app, like an embedded app or a smartphone app, it is fairly easy to keep one copy of each set of values, while a dynamically updated program may take more planning, especially if the natural backing store is a relational database.)
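
The pre-tokenizing, synonym, and unique-word hints above can be sketched in a few lines of plain Python. This is only an illustration of the preprocessing you might do before handing a field value to Lucene; it does not call Lucene at all. The synonym table and the function names (`tokenize`, `expand`, `unique_terms`, `field_value`, `query_terms`) are made up for this example, and the tokenizing rule (lowercase, split on anything that is not a letter or digit) is just one assumed choice among many.

```python
import re

# Hypothetical synonym table for illustration only; a real application
# would build or load its own for its subject field.
SYNONYMS = {
    "sugar": ["sucrose", "sweetener"],
    "beet": ["beetroot"],
}


def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def expand(tokens):
    """Follow each token with its synonyms from the (assumed) table."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.extend(SYNONYMS.get(tok, []))
    return out


def unique_terms(tokens):
    """Drop repeated terms, keeping first-seen order.

    Word positions are lost here, which is why this style of field
    cannot support phrase searches.
    """
    return list(dict.fromkeys(tokens))


def field_value(text):
    """Build the pre-tokenized, synonym-expanded, unique-word field text."""
    return " ".join(unique_terms(expand(tokenize(text))))


def query_terms(text):
    """Tokenize a query with the same scheme used at index time."""
    return unique_terms(tokenize(text))


if __name__ == "__main__":
    print(field_value("Sugar beet, sugar beet pulp"))
    print(query_terms("Sugar-Beet"))
```

The string returned by `field_value` is what you would index, presumably with an analyzer that does little more than split on whitespace, since the real tokenizing has already been done. Running the query through `query_terms` keeps the search side on the same tokenizing scheme as the index side.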

I feel like I should have more tricks here, but I don't. Please add yours in the comments -- if I get enough, I can write another, more comprehensive list of Lucene tricks for indexing non-documents.

About Mark Leighton Fisher

Perl/CPAN user since 1992.