Three ways to index data into Elasticsearch from your database
3 ways to Index data into Elasticsearch without melting your Database
By Ganesh Swami, September 28, 2014.

So, your boss just walked up to you and asked you to use Elasticsearch for full-text search. How would you go about introducing Elasticsearch in your stack? Some ideas…

Why you need to think about this

There are two specific problems you are trying to solve:

  1. An initial bulk load of your database so that you can search it.
  2. Keep abreast of changes in your database, so that you are not searching stale content.

In the simplest case, you can do the equivalent of SELECT ALL to pull data from your database and push it to Elasticsearch. While this works for small data sets, anything more than a few million rows will bring your database to its knees.

1. Database Triggers

A database trigger can be setup to fire off an index operation everytime a row changes. Though this seems like a valid option, some teams steer clear of triggers because:

  • triggers run in the database process and can have undesirable side-effects
  • you need to be careful keeping the trigger code in sync with the application code
  • there’s no concept of version control for triggers

2. Fork the write path

Writing to Elasticsearch and your database at the same time

Another option is to update Elasticsearch while writing to the database. If you decide to do this, you’ll have to be careful to not:

  • introduce another dependency in the critical path. If Elasticsearch were to go down or taken down for maintenence, you do not want your application to fail.
  • slow down the write path, so you want to schedule the update for later.

3. Bulk load from a database replica

The third option is to setup a replica of the primary database just for reads. As this is a predictable work-load, you can make better resource provisioning decisions. The best practice is to have a column updated_at for each table and scan the table periodically looking for updates. These updates can be then be bulk loaded into Elasticsearch.

Bottom Line

Any search application is tolerant of some latency between saving a record and taking a few minutes for the updates to show up in search results. You’ll want to take your rate of updates, indexing latency and size of data into consideration before putting Elasticsearch into production.