Stupid Lucene Tricks: Exact Match, Starts With, Ends With

Out of the box, Lucene does not provide exact field matches, such as matching "Acer Negundo Ab" and only "Acer Negundo Ab" (not also "Acer Negundo Ab IgG"). Nor does Lucene provide "Starts With" or "Ends With" functionality. Fortunately, there are workarounds.

The trick is in the indexing. When indexing a field in Lucene, you can enclose the field's value in known delimiters, for example "lucenematch Acer Negundo Ab lucenematch" (where "lucenematch" is the delimiter string). As long as the delimiter appears only as a delimiter and never in the data itself, a phrase search for "lucenematch Acer Negundo Ab lucenematch" on that field will return only documents whose field value is exactly "Acer Negundo Ab", and not those whose value is "Acer Negundo Ab IgG" (indexed as "lucenematch Acer Negundo Ab IgG lucenematch").
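Here is a minimal Python sketch of the idea, not real Lucene code. The names wrap_for_index and contains_phrase are made up for illustration, and the whitespace-and-lowercase matching is only a toy stand-in for what an analyzer plus a phrase query would do. It also assumes your analyzer keeps the delimiter as a token of its own, which is why a plain word like "lucenematch" is a safer choice than a punctuation character.

```python
DELIMITER = "lucenematch"  # assumed never to occur in real field values


def wrap_for_index(value: str, delimiter: str = DELIMITER) -> str:
    """Surround a field value with the delimiter before it is indexed."""
    if delimiter.lower() in value.lower().split():
        # The trick only works if the delimiter never shows up in the data.
        raise ValueError(f"value already contains the delimiter {delimiter!r}")
    return f"{delimiter} {value.strip()} {delimiter}"


def contains_phrase(indexed_value: str, phrase: str) -> bool:
    """Toy phrase match: the phrase tokens must occur contiguously, in order."""
    doc = indexed_value.lower().split()
    query = phrase.lower().split()
    n = len(query)
    if n == 0:
        return False
    # Slide a window of n tokens over the document tokens.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))
```

With this in place, contains_phrase(wrap_for_index("Acer Negundo Ab"), "lucenematch Acer Negundo Ab lucenematch") is true, while the same phrase against wrap_for_index("Acer Negundo Ab IgG") is false, because "IgG" sits between "Ab" and the closing delimiter.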

"Starts With" and "Ends With" searches are an extension of this technique. A "Starts With" search just requires adding the delimiter in front of your search term (like using "lucenematch Acer" when searching for terms starting with "Acer"). Similarly, "Negundo Ab lucenematch" will search for terms ending in "Negundo Ab".
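If you send queries through Lucene's classic query parser, these searches are just quoted phrases on the field. The following Python sketch builds such query strings. The field name "name" and the helper names are hypothetical, and the escaping is deliberately minimal (backslashes and double quotes only), so treat it as a starting point rather than a complete escaper.

```python
DELIMITER = "lucenematch"  # same delimiter that was used at indexing time


def _quote(phrase: str) -> str:
    """Wrap a phrase in double quotes, escaping backslashes and quotes."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def starts_with_query(field: str, term: str, delimiter: str = DELIMITER) -> str:
    """Phrase query for values that begin with the given term."""
    phrase = f"{delimiter} {term.strip()}"
    return f"{field}:{_quote(phrase)}"


def ends_with_query(field: str, term: str, delimiter: str = DELIMITER) -> str:
    """Phrase query for values that end with the given term."""
    phrase = f"{term.strip()} {delimiter}"
    return f"{field}:{_quote(phrase)}"
```

For example, starts_with_query("name", "Acer") gives name:"lucenematch Acer", and ends_with_query("name", "Negundo Ab") gives name:"Negundo Ab lucenematch".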

To make it easier for your users, you could:

  1. Define '^' as the first character of a search phrase meaning "Starts With".
  2. Define '$' as the last character of a search phrase meaning "Ends With".
  3. Pre-parse your search phrase, substituting '^' and '$' appropriately.

For example, with this workaround in place your users could search on "^prostate specific" to find all terms starting with "prostate specific" (using "lucenematch prostate specific" underneath). Similarly, "Negundo Ab$" will search for terms ending in "Negundo Ab" (using "Negundo Ab lucenematch" underneath).
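The pre-parse step can be quite small. Below is a Python sketch of it, under a few assumptions of my own: the function name preparse is made up, only a leading '^' and a trailing '$' are treated as special, and using both at once is taken to mean an exact match. The result is the phrase you would then hand to Lucene as a phrase search.

```python
DELIMITER = "lucenematch"  # same delimiter that was used at indexing time


def preparse(user_query: str, delimiter: str = DELIMITER) -> str:
    """Translate a leading '^' / trailing '$' into the index delimiter."""
    phrase = user_query.strip()
    starts = phrase.startswith("^")
    ends = phrase.endswith("$")
    if starts:
        phrase = phrase[1:]
    if ends:
        phrase = phrase[:-1]
    phrase = phrase.strip()
    if not phrase:
        raise ValueError("search phrase is empty")

    parts = []
    if starts:
        parts.append(delimiter)  # anchor at the start of the field value
    parts.append(phrase)
    if ends:
        parts.append(delimiter)  # anchor at the end of the field value
    return " ".join(parts)
```

So preparse("^prostate specific") yields "lucenematch prostate specific", preparse("Negundo Ab$") yields "Negundo Ab lucenematch", and a phrase with neither marker passes through unchanged.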

(The Right (and hard) Way to do this would involve extending the internals of Lucene, using term positions to find the start and end of a field's value.)

2 Comments

Hi Mark

Thanx for the tips.

And I've just put photos of my garden on-line. E.g.: Acer negundo. Nice coincidence.

Cheers
Ron

Did you have to change your SOLR tokenizer for this to work? It looks like the StandardTokenizer is eating my appended '|' character, so SOLR search requests seem to be oblivious to its presence. (I have confirmed that my appended | went into the SOLR data index)

About Mark Leighton Fisher

Perl/CPAN user since 1992.