Writing a custom query parser for a search engine.
Why we wrote our own custom query parser
By Ganesh Swami, April 22, 2014.

One of our most frequently asked for and used feature is a custom query parser. Beyond simple prefix and term matching, we can also parse custom queries that lets you add filters and restrict the search to specific fields. Let’s look into why we built this and how it works…

The Challenge

First, we’ll talk about how a search engine is different from a relational data store like MySQL or PostgreSQL. In a relational database, you have a schema that defines columns with specific types. Types include varchar, char, integer, etc.

In a search engine, beyond basic types, you can have something like the n-gram type. An n-gram data structure divides a word like “introduction” into a set of uni-grams, bi-grams, tri-grams, etc. This is very useful for matching autocomplete terms.

n-gram data structure

To increase the range of matches, it’s easier to preprocess the input being indexed to account for word stems, extended latin characters (like ñ), spelling corrections and synonyms. We index the possible variations of the input text into a multi-field and during query time, we include all of these fields in the search automatically. All of this is transparent to the user and is done out of the box.

The challenge is to devise a query language as close to natural language as possible and rewrite it to an abstract syntax tree that can then be transformed to perform arbitrarily complex queries.

The solution

While designing the parser, we strove to be as close to the Google query syntax as possible. Google’s syntax has a few rules:

  • the query is parsed into terms and operators
  • terms can be exact or phrases
  • exact terms are within double quotes, eg: “john smith”
  • phrases match one or more words in any order
  • boolean operators like and, or and not can be used with parenthesis to group queries
  • you can use + and - to mark specific terms as present or absent
  • you can restrict the search to specific fields with a prefix

A product search example

product database

Say you have a product search engine with the fields: SKU, name, description, price, category and availability. Here are some queries you can use:

Match any

gold ring

Match exact

"gold ring"

Restrict to a single category

"gold ring" category:ring

Restrict to one or more categories

"gold ring" category:ring category:necklace
"gold ring" (category:ring or category:necklace)

Terms present

"gold ring" +engraved color:red

What’s next

The query parser is fully documented. Read about it on the API reference page.

Photo Credit: mendhak.