Full-text search in Air Mozilla with PostgreSQL

Peter Bengtsson


In a previous post I explained why and how we migrated Air Mozilla to use PostgreSQL as the default database. We did this so we can leverage PostgreSQL’s powerful full-text search feature.

First, off on a tangent we go… Why not use the popular and also powerful full-text master ElasticSearch? Surely, since it’s built on top of Apache Lucene it’s bound to have some amazing full-text search and indexing features. I’m sure it does, but we don’t need them.

All we want to do is find records whose title, description or short_description contain words that share a stem with the search words (so that a search for “community” also matches “communities”). We also want highlighting so we can display a neat search results page with the matches emphasized (something that isn’t easy to do with regular expressions in Python when the results come back).

PostgreSQL can do all of that and it’s fast. Very fast! By far, the biggest win of using the same database we already connect the Django ORM to is that we simply don’t have to worry about indexing. Like, at all. All you do is set this up as a migration:
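A minimal sketch of the SQL such a migration could run, assuming the events live in a table called main_event (a hypothetical name) and that one GIN index per searched column is wanted. The index naming scheme and the helper functions are made up for illustration.

```python
def fts_index_name(table, column, language):
    """Build a predictable index name (hypothetical naming scheme)."""
    return "{}_{}_{}_fts_idx".format(table, column, language)


def fts_index_sql(table, column, language="english"):
    """Return a CREATE INDEX statement for a GIN full-text index on one column.

    The identifiers are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    return (
        "CREATE INDEX {name} ON {table} "
        "USING gin(to_tsvector('{language}', {column}))"
    ).format(
        name=fts_index_name(table, column, language),
        table=table,
        language=language,
        column=column,
    )


TABLE = "main_event"  # assumed table name
COLUMNS = ("title", "short_description", "description")

FORWARDS_SQL = [fts_index_sql(TABLE, column) for column in COLUMNS]
BACKWARDS_SQL = [
    "DROP INDEX {}".format(fts_index_name(TABLE, column, "english"))
    for column in COLUMNS
]

# In a current Django project these lists could be handed to
# migrations.RunSQL(FORWARDS_SQL, BACKWARDS_SQL); older projects would run
# the same statements through whatever migration tool they use.
for statement in FORWARDS_SQL:
    print(statement)
```

PostgreSQL can only use such an expression index when the query calls to_tsvector with the same explicit configuration name, which is why 'english' is spelled out rather than left to the server default.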

At the moment Air Mozilla only has English content, but some day there might be more languages. How to add indexes for different languages is pretty clear; you run the same migration as above with different languages named.

Having these indexes in the database means that any inserts, updates or deletions automatically update the full-text index for these columns. We don’t have to worry about this at all, at any point in the ORM code. It just works!

Now, let’s explain how the search works. A user types in a search query. E.g. “community”.

What we want to do is to return an ORM QuerySet that:

  • contains all events that the user is allowed to see depending on privacy or publishing workflow criteria and
  • whose title or short_description or description contains the search term.

And we want the results ranked so that matches in the title count “higher” than matches in the short_description or description. So let’s add that to the filtering:
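A minimal sketch of the filtering step, assuming the raw SQL is attached to the QuerySet with Django's extra(where=..., params=...). The helper name fts_where is hypothetical, and the Event model and its privacy filtering are only hinted at in comments.

```python
def fts_where(term, columns=("title", "short_description", "description"),
              language="english"):
    """Return (sql, params) matching `term` against any of the columns.

    plainto_tsquery turns plain user input into a tsquery, so the search
    term is always passed as a parameter and never interpolated.
    """
    clause = " OR ".join(
        "to_tsvector('{lang}', {col}) @@ plainto_tsquery('{lang}', %s)".format(
            lang=language, col=column
        )
        for column in columns
    )
    # One %s placeholder per column, so the term is repeated that many times.
    return "(" + clause + ")", [term] * len(columns)


where_sql, where_params = fts_where("community")
print(where_sql)

# With Django available, and `events` already restricted to what the user
# may see (that part is not shown), this could be applied as:
#   events = events.extra(where=[where_sql], params=where_params)
```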

Now, that satisfies the “where part”. Next, we need to do something about the ranking, so we extend the code with this:
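A sketch of the ranking step, assuming PostgreSQL's ts_rank_cd function and Django's extra(select=..., select_params=..., order_by=...). The annotation names rank_title, rank_short_desc and rank_desc are made up for this example.

```python
from collections import OrderedDict


def rank_select(term, language="english"):
    """Return (select, select_params) with one rank annotation per column."""
    template = (
        "ts_rank_cd(to_tsvector('{lang}', {col}), "
        "plainto_tsquery('{lang}', %s))"
    )
    # An ordered mapping keeps each %s placeholder aligned with its
    # position in select_params.
    select = OrderedDict([
        ("rank_title", template.format(lang=language, col="title")),
        ("rank_short_desc", template.format(lang=language, col="short_description")),
        ("rank_desc", template.format(lang=language, col="description")),
    ])
    return select, [term] * len(select)


select, select_params = rank_select("community")

# Sorting on the title rank first is what makes title matches count "higher".
ORDER_BY = ["-rank_title", "-rank_short_desc", "-rank_desc"]

# With Django available this could be applied as:
#   events = events.extra(select=select, select_params=select_params,
#                         order_by=ORDER_BY)
```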

Last but not least, we want to let PostgreSQL work out the highlighting of matches so you can show extracts on the search result page with the matched words emphasized. So you extend select with some more code to look like this:
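A sketch of the highlighting annotations, assuming PostgreSQL's ts_headline function. The names title_highlit and desc_highlit come from the post; which column feeds desc_highlit is an assumption here.

```python
from collections import OrderedDict


def highlight_select(term, language="english"):
    """Return (select, select_params) for the two highlighted extracts."""
    template = "ts_headline('{lang}', {col}, plainto_tsquery('{lang}', %s))"
    select = OrderedDict([
        ("title_highlit", template.format(lang=language, col="title")),
        # Assumption: the extract is taken from the short description.
        ("desc_highlit", template.format(lang=language, col="short_description")),
    ])
    return select, [term] * len(select)


highlights, highlight_params = highlight_select("community")

# These entries would be merged into the same select mapping (and the
# parameters appended in the same order) as the rank annotations before
# calling events.extra(select=..., select_params=...).
```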

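And there you have it. Note that PostgreSQL inserts HTML markup (by default, b tags around each match) into these title_highlit and desc_highlit extra annotations. It also strips some simple tags from the original text, but the PostgreSQL documentation cautions that the output of ts_headline is not guaranteed to be free of markup, so only display it in raw form in the Django template code when the source text is trusted. It can look like this in the search results template: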
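A sketch of preparing the highlights for display. This is a swapped-in, more defensive variant rather than the approach described in this post: PostgreSQL's documentation cautions that ts_headline output is not guaranteed to be safe for direct inclusion in web pages, so the example asks PostgreSQL for plain-text sentinel markers, escapes the extract in Python, and only then turns the markers into b tags. The marker strings and helper names are hypothetical.

```python
import html

# Hypothetical sentinel markers; they must not occur in real event text
# and must not contain characters that html.escape would rewrite.
START, STOP = "[[HL]]", "[[/HL]]"


def safe_headline_sql(column, language="english"):
    """ts_headline call that wraps matches in sentinels instead of tags."""
    options = 'StartSel="{}", StopSel="{}"'.format(START, STOP)
    return (
        "ts_headline('{lang}', {col}, plainto_tsquery('{lang}', %s), '{opts}')"
    ).format(lang=language, col=column, opts=options)


def highlight_to_html(extract):
    """Escape the extract, then replace the sentinels with real markup."""
    escaped = html.escape(extract)
    return escaped.replace(START, "<b>").replace(STOP, "</b>")


example = "<i>Our</i> [[HL]]community[[/HL]] & you"
print(highlight_to_html(example))

# In Django the result could be wrapped with
# django.utils.safestring.mark_safe and printed in the template as
# {{ event.title_html }}, where title_html is a hypothetical attribute.
```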

In plain PostgreSQL SQL there are actually ways to “combine” the rank calculation with the “where criteria” so that the search query only has to be spelled out once. However, that’s way out of scope for the Django ORM API and even though it’s possible to achieve, the code will quickly get messy.
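For illustration, one such combination in plain SQL puts the parsed query in the FROM clause so both the match and the rank refer to it. This is a sketch only, assuming a main_event table (a hypothetical name) and searching the title column alone.

```python
def combined_search_sql(table="main_event", column="title", language="english"):
    """Build a single statement that filters and ranks with one tsquery."""
    return (
        "SELECT e.*, ts_rank_cd(to_tsvector('{lang}', e.{col}), query) AS rank "
        "FROM {table} AS e, plainto_tsquery('{lang}', %s) AS query "
        "WHERE to_tsvector('{lang}', e.{col}) @@ query "
        "ORDER BY rank DESC"
    ).format(lang=language, col=column, table=table)


COMBINED_SQL = combined_search_sql()
print(COMBINED_SQL)

# The search term is now needed only once as a parameter. With Django this
# could be run through the model manager's raw() method, for example
# Event.objects.raw(COMBINED_SQL, ["community"]), at the cost of losing
# the chainable QuerySet API.
```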

So, how long does it take to do this query? On my laptop, with a snapshot of the production database containing over 600 events, that big query takes 30-35 milliseconds. That’s fast enough.

2 responses

  1. Michael Greenwood wrote:

    I love Air Mozilla!

  2. Ishwerdas wrote:

    I am in love with this blog, that’s what I needed.