Writing a simple search engine

I'm about to start building a small search engine that will be used to search through a collection of 366 books, I should be able to use controls as exclude words, exact matches and a thesaurus search.

Indexing

First step would be to index the contents, I reckon that in here the bulk of the work should be done here to speed things up in the actual search.

Step 1: Merge the contents of all the books which are separated into individual text files into a single flat text file, which would look something like this:

1 1 Merchant of Syracuse, plead no more;
1 2 I am not partial to infringe our laws:
...

Each line will have three columns, the first column it's the book id, the second column the line id and the last column the actual content of the line. This way I have quick access to all contents of the books.

Step 2: Next step will be to make a copy of this file but with every formatting stripped out and every letter converted to lower case:

1 1 merchant of syracuse plead no more
1 2 i am not partial to infringe our laws
...

The reason I want to do this, is that I want my search to NOT be case-sensitive, so normalizing everything to lower case makes sense, again, I want to pre-process as much as I can on indexing.

Step 3: Now that I have my normalized text, I want to index every single word found on the text, i'll use the normalized version to do this, to get a result like so:

merchant 1 65 384 28364
laws 34 56 736 2534
...

For each word I'll get a list of every single line id on which I can find that word. This way when I search for a word all I have to do is find the line ids for the specific word and in no time I can have the exact locations.


Searching

We first parse the query string and get the actual words we are looking for, at this step it might be necessary to include words from our thesaurus.

Once we have our word list to be found then we move on to actually look into our word index for all those lines where we can find the words.

The idea is to end with something like this:

{
  'hello' => [13, 234, 435],
  'world' => [56,836,39]
  ...
}

Next step would be to process the stop words, for this all we have to do is remove each line from the list whose lines also include the stop words.

For example, if a stop word is included on line 13, then from the list above on 'hello' we would have to remove line 13 since it's not longer a candidate.

I want to rate the results so I can show first the results that more closely match the results, so I'm thinking on a rate system like this:

+5 for every different word in the query found
+1 for every repeated ocurrance of the word in the line

We'll process all lines and rate them.


Exact Search

We haven't used any regular expressions for the searches, but we will have to use them for the exact search. In case we find inside the query something inside quotes, then we'll have to perform an exact search, that means that we'll have to do a regexp search directly on the index file.

/\b(EXACT MATCH)\b/

I haven't found another possible solution for this.


Displaying the results

The display is really easy, all we need to do is find from our unformated index file (the first one we created with all the contents) the lines that we need, and display them in the rate order.


Things to consider

  1. I need to look into CPAN and see if there are any modules that can help me right now or give me more ideas, mainly the ones that deal with language.
  2. I'm still not sure on how to store the index files. I'm leaning towards keeping everything in plain text files, but a database might make more sense.
  3. I'm not sure also about how to handle word variations, for example nouns in plural or verb tense. I'm thinking on searching for all word variations on the actual search, for example you search for "run", then it might actually search for: run, ran, runs, etc.

I appreciate any comments or suggestions on how I might actually make the process better.

3 Comments

I think you might want to look into ElasticSearch. :)

I've just been working my way through the tutorial docs for the recently released Lucy search tool - very easy to use.

Leave a comment

About Uriel Lizama

user-pic I love Perl and I'm fortunate enough to make a living coding in it.