Naive Search with JavaScript

February 2, 2012

On a recent project I had the need to implement a basic search functionality into a web interface. This isn’t unusual, but what was different is that the data in this project was fairly static, and was kept in JSON on disk (and cached into memory). Additionally, there would be a point where it was moved off of our servers, onto an entirely different stack (hence the JSON).

There was minimal server side processing so far, and I did not want to add more overhead that would need to be ported. So, I decided to implement my search in JavaScript (with some help from Python). My idea was to do very basic string matching on a pre-built index. In this article I am going to lay out my implementation, but on a dummy data set.

The Process

This search breaks down into three steps:

  1. Build Search Index
  2. Perform Search
  3. Connect To Interface

Before we start writing that, let’s cover the data.

The Data

Here is the data I will be sifting through. To keep it simple, I’m just searching through some sentences with associated ID’s. It’s short and simple, but with some tweaks you can apply this to bigger data sets, as I did.

My data set will be represented in JSON for portability and easy interpretation from JavaScript.

data.json (92278f)

The Index

To facilitate easy matching, I’m going to build a search index that will be a dictionary, with individual words (tokens) as the keys, and arrays of ID’s as the values.

I’ve decided on Python for my index builder. Handling JSON from Python is easy, though if you have a version older than 2.6 you will need to use a different import, such as simplejson.

The core of this functionality is the tokenizer. To build this, we need to determine our rules. Since this is a simple search, I’m going to tokenize on word boundaries, it will be case insensitive, and I will only accept the characters A-Z, single quote and dash inside of a word.

Here is my implementation of the tokenizer, that follows these rules:

build_index.py (92278f)

The remainder of the Python is simply opening and parsing the JSON.

build_index.py (92278f)

Run that, and we get the dictionary we wanted:

output.txt (92278f)

JavaScript

Now that we have an index to work with, let’s convert it to JSON. This just takes a little tweak to the Python.

build_index.py (89ba77 )

That gives us a JSON file to work with.

index.json (89ba77)

Now we’re going to load that JSON with an AJAX call, so we have the index to work with.

search.js (89ba77)

Once we have that data, finding matches is just a matter of looking them up in Search.index. Below I’ve used a loop that will search through an array of terms, and then accumulate the matches.

search.js (89ba77)

It works!
It works!

Hook It Up

Just a little bit remains to connect this thing. First we need to port over the tokenizer from Python to get a consistent result. Pretty easy port.

search.js (216a15)

Tokenize!
JavaScript tokenize in action.

Last we just need to apply our functions to some inputs, which we hook up with some jQuery.

index.html (216a15)

Hooked Up

Make It Better

This is not, by any means, a complete system. You would want to grab the results and match up the ID’s to a more useful output. Additionally, you could drop the AJAX call and use a script tag to bring in the index.

There are plenty of improvements you can make, and I’d love to know if you make them!

For reference, here is the complete source: https://gist.github.com/1673557.

Categories: Geek
Tags: , , ,

Leave A Comment

Your email will not be published.