Elasticsearch for Ruby on Rails: A Tutorial to the Chewy Gem
Elasticsearch provides a powerful, scalable tool for indexing and querying massive amounts of structured data, built on top of the Apache Lucene library.
Building on the foundation of Elasticsearch and the Elasticsearch-Ruby client, we’ve developed and released our own improvement (and simplification) of the Elasticsearch application search architecture that also provides tighter integration with Rails. We’ve packaged it as a Ruby gem named Chewy.
This post discusses how we accomplished this, including the technical obstacles that emerged during implementation.
Arkadiy is a senior Ruby on Rails developer. He enjoys working with databases and open-source initiatives on GitHub.
Elasticsearch provides a powerful, RESTful HTTP interface for indexing and querying data, built on top of the Apache Lucene library. Right out of the box, it provides scalable, efficient, and robust search, with UTF-8 support. It’s a powerful tool for indexing and querying massive amounts of structured data and, here at Toptal, it powers our platform search and will soon be used for autocompletion as well. We’re huge fans.
Since our platform is built using Ruby on Rails, our integration of Elasticsearch takes advantage of the elasticsearch-ruby project (a Ruby integration framework for Elasticsearch that provides a client for connecting to an Elasticsearch cluster, a Ruby API for Elasticsearch’s REST API, and various extensions and utilities). Building on this foundation, we’ve developed and released our own improvement (and simplification) of the Elasticsearch application search architecture, packaged as a Ruby gem that we’ve named Chewy (with an example app available here).
Chewy extends the Elasticsearch-Ruby client, making it more powerful and providing tighter integration with Rails. In this Elasticsearch guide, I discuss (through usage examples) how we accomplished this, including the technical obstacles that emerged during implementation.
Just a couple of quick notes before proceeding to the guide:
- Both Chewy and a Chewy demo application are available on GitHub.
- For those interested in more “under the hood” info about Elasticsearch, I’ve included a brief write-up as an Appendix to this post.
Why Chewy?
Despite Elasticsearch’s scalability and efficiency, integrating it with Rails didn’t turn out to be quite as simple as anticipated. At Toptal, we found ourselves needing to significantly augment the basic Elasticsearch-Ruby client to make it more performant and to support additional operations.
And thus, the Chewy gem was born.
A few particularly noteworthy features of Chewy include:
- Every index is observable by all the related models. Most indexed models are related to each other, and sometimes it’s necessary to denormalize this related data and bind it to the same object (e.g., if you want to index an array of tags together with their associated article). Chewy allows you to specify an updatable index for every model, so corresponding articles will be reindexed whenever a relevant tag is updated.
- Index classes are independent of ORM/ODM models. With this enhancement, implementing cross-model autocompletion, for example, is much easier. You can just define an index and work with it in an object-oriented fashion. Unlike other clients, the Chewy gem removes the need to manually implement index classes, data import callbacks, and other components.
- Bulk import is everywhere. Chewy utilizes the bulk Elasticsearch API for full reindexing and index updates. It also utilizes the concept of atomic updates, collecting changed objects within an atomic block and updating them all at once.
- Chewy provides an AR-style query DSL. Being chainable, mergeable, and lazy, it allows queries to be produced in a more efficient manner.
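That last point, the chainable and lazy query DSL, is easy to picture with a toy criteria object. The sketch below is purely illustrative (ToyScope is a made-up class, not Chewy’s internals): each chain step returns a new scope without hitting Elasticsearch, and scopes can be merged.

```ruby
# A toy chainable, lazy criteria object, to show the flavor of the DSL.
# Purely illustrative: ToyScope is a hypothetical class, not Chewy's internals.
class ToyScope
  attr_reader :criteria

  def initialize(criteria = {})
    @criteria = criteria
  end

  # Each step returns a NEW scope; no request is executed while chaining.
  def query(value)
    ToyScope.new(criteria.merge(query: value))
  end

  def filter(value)
    ToyScope.new(criteria.merge(filter: value))
  end

  # Scopes can be merged, much like the Chewy scopes used later in this post.
  def merge(other)
    ToyScope.new(criteria.merge(other.criteria))
  end
end

scope  = ToyScope.new.query(match: {author: 'Tarantino'})
merged = scope.merge(ToyScope.new.filter(year: {gt: 1990}))
merged.criteria # => {query: {match: {author: 'Tarantino'}}, filter: {year: {gt: 1990}}}
```

Because nothing executes until a result is actually needed, an arbitrarily long chain still costs only a single request in the real DSL.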
OK, so let’s see how this all plays out in the gem…
The basic guide to Elasticsearch
Elasticsearch has several document-related concepts. The first is that of an index (the analogue of a database in an RDBMS), which consists of a set of documents, which can be of several types (where a type is a kind of RDBMS table). Every document has a set of fields. Each field is analyzed independently, and its analysis options are stored in the mapping for its type. Chewy utilizes this structure “as is” in its object model:
class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      title: {
        tokenizer: 'standard',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    field :title, analyzer: 'title'
    field :year, type: 'integer'
    field :author, value: ->{ author.name }
    field :author_id, type: 'integer'
    field :description
    field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      field :title, analyzer: 'title'
      field :year, type: 'integer'
      field :author, value: ->{ director.name }
      field :author_id, type: 'integer', value: ->{ director_id }
      field :description
      field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
    end
  end
end
Above, we defined an Elasticsearch index called entertainment with three types: book, movie, and cartoon. For each type, we defined some field mappings and a hash of settings for the whole index.
So, we’ve defined the EntertainmentIndex and we want to execute some queries. As a first step, we need to create the index and import our data:
EntertainmentIndex.create!
EntertainmentIndex.import
# EntertainmentIndex.reset! (which includes deletion,
# creation, and import) could be used instead
The .import method knows where to find its data because we passed in scopes when we defined our types; thus, it will import all the books, movies, and cartoons stored in the persistent storage.
With that done, we can perform some queries:
EntertainmentIndex.query(match: {author: 'Tarantino'}).filter{ year > 1990 }
EntertainmentIndex.query(match: {title: 'Shawshank'}).types(:movie)
EntertainmentIndex.query(match: {author: 'Tarantino'}).only(:id).limit(10).load
# the last one loads ActiveRecord objects for documents found
Now our index is almost ready to be used in our search implementation.
Rails integration
For integration with Rails, the first thing we need is to be able to react to RDBMS object changes. Chewy supports this behavior via callbacks defined with the update_index class method. update_index takes two arguments:

- A type identifier supplied in the "index_name#type_name" format
- A method name or block to execute, which represents a back-reference to the updated object or object collection
We need to define these callbacks for each dependent model:
class Book < ActiveRecord::Base
  acts_as_taggable
  belongs_to :author, class_name: 'Dude'

  # We update the book itself on-change
  update_index 'entertainment#book', :self
end

class Video < ActiveRecord::Base
  acts_as_taggable
  belongs_to :director, class_name: 'Dude'

  # Update video types when changed, depending on the category
  update_index('entertainment#movie') { self if movie? }
  update_index('entertainment#cartoon') { self if cartoon? }
end

class Dude < ActiveRecord::Base
  acts_as_taggable
  has_many :books
  has_many :videos

  # If an author or director was changed, all the corresponding
  # books, movies and cartoons are updated
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end
Since tags are also indexed, we next need to monkey-patch some external models so that they react to changes:
ActsAsTaggableOn::Tag.class_eval do
  has_many :books, through: :taggings, source: :taggable, source_type: 'Book'
  has_many :videos, through: :taggings, source: :taggable, source_type: 'Video'

  # Updating all tag-related objects
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

ActsAsTaggableOn::Tagging.class_eval do
  # Same goes for the intermediate model
  update_index('entertainment#book') { taggable if taggable_type == 'Book' }
  update_index('entertainment#movie') { taggable if taggable_type == 'Video' && taggable.movie? }
  update_index('entertainment#cartoon') { taggable if taggable_type == 'Video' && taggable.cartoon? }
end
At this point, every object save or destroy will update the corresponding Elasticsearch index type.
Atomicity
We still have one lingering problem. If we do something like books.map(&:save) to save multiple books, we’ll request an update of the entertainment index every time an individual book is saved. Thus, if we save five books, we’ll update the Chewy index five times. This behavior is acceptable in a REPL, but certainly not in controller actions, where performance is critical.
We address this issue with the Chewy.atomic block:
class ApplicationController < ActionController::Base
  around_action { |&block| Chewy.atomic(&block) }
end
In short, Chewy.atomic batches these updates as follows:

- Disables the after_save callback.
- Collects the IDs of saved books.
- On completion of the Chewy.atomic block, uses the collected IDs to make a single Elasticsearch index update request.
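The batching behavior is easy to picture with a toy, plain-Ruby indexer (ToyIndexer is a made-up class, not Chewy’s implementation): outside an atomic block, every update becomes its own request; inside one, the IDs are buffered and flushed as a single request.

```ruby
# Toy model of Chewy.atomic's batching behavior (hypothetical class).
class ToyIndexer
  attr_reader :requests

  def initialize
    @requests = []  # each element stands for one Elasticsearch request
    @buffer = nil
  end

  # Without a surrounding atomic block, every update is its own request.
  def update(id)
    @buffer ? @buffer << id : @requests << [id]
  end

  # Inside the block, IDs are collected and flushed once at the end.
  def atomic
    @buffer = []
    yield
    @requests << @buffer unless @buffer.empty?
  ensure
    @buffer = nil
  end
end

indexer = ToyIndexer.new
[1, 2, 3].each { |id| indexer.update(id) }                     # three requests
indexer.atomic { [4, 5, 6].each { |id| indexer.update(id) } }  # one request
indexer.requests # => [[1], [2], [3], [4, 5, 6]]
```

Wrapping every controller action in Chewy.atomic, as above, means a request that saves five books produces one index update instead of five.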
Searching
Now we’re ready to implement a search interface. Since our user interface is a form, the best way to build it is, of course, with FormBuilder and ActiveModel. (At Toptal, we use ActiveData to implement ActiveModel interfaces, but feel free to use your favorite gem.)
class EntertainmentSearch
  include ActiveData::Model

  attribute :query, type: String
  attribute :author_id, type: Integer
  attribute :min_year, type: Integer
  attribute :max_year, type: Integer
  attribute :tags, mode: :arrayed, type: String,
    normalize: ->(value) { value.reject(&:blank?) }

  # This accessor is for the form. It will have a single text field
  # for comma-separated tag input.
  def tag_list= value
    self.tags = value.split(',').map(&:strip)
  end

  def tag_list
    self.tags.join(', ')
  end
end
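The comma-separated round-trip can be exercised in plain Ruby, independent of ActiveData. In this sketch, TagHolder is a hypothetical stand-in for the search model, with the blank-rejection that the normalize option performs folded directly into the setter:

```ruby
# Plain-Ruby check of the comma-separated tag round-trip
# (TagHolder is a hypothetical stand-in for EntertainmentSearch).
class TagHolder
  attr_accessor :tags

  def tag_list=(value)
    # split on commas, trim whitespace, and drop empty entries
    self.tags = value.split(',').map(&:strip).reject(&:empty?)
  end

  def tag_list
    tags.join(', ')
  end
end

holder = TagHolder.new
holder.tag_list = ' thriller, crime drama ,western'
holder.tags     # => ["thriller", "crime drama", "western"]
holder.tag_list # => "thriller, crime drama, western"
```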
Query and filters tutorial
Now that we have an ActiveModel-like object that can accept and typecast attributes, let’s implement search:
class EntertainmentSearch
  ...

  def index
    EntertainmentIndex
  end

  def search
    # We can merge multiple scopes
    [query_string, author_id_filter,
      year_filter, tags_filter].compact.reduce(:merge)
  end

  # Using the query_string advanced query for the main query input
  def query_string
    index.query(query_string: {fields: [:title, :author, :description],
      query: query, default_operator: 'and'}) if query?
  end

  # Simple term filter for author id. `author_id` is already
  # typecast to an integer and ignored if empty.
  def author_id_filter
    index.filter(term: {author_id: author_id}) if author_id?
  end

  # For filtering on years, we use a range filter.
  # Returns nil if neither min_year nor max_year was passed to the model.
  def year_filter
    body = {}.tap do |body|
      body.merge!(gte: min_year) if min_year?
      body.merge!(lte: max_year) if max_year?
    end
    index.filter(range: {year: body}) if body.present?
  end

  # Same as `author_id_filter`, but using the `terms` filter.
  # Returns nil if no tags were passed in.
  def tags_filter
    index.filter(terms: {tags: tags}) if tags?
  end
end
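The compact.reduce(:merge) trick in the search method works because blank filters return nil and drop out, while the remaining scopes merge pairwise. Its shape can be seen with plain hashes standing in for Chewy scopes (the hash bodies here are illustrative, not the exact request Chewy builds):

```ruby
# nil scopes (filters whose attributes are blank) disappear via `compact`,
# and the remaining scopes are merged pairwise by `reduce(:merge)`.
scopes = [
  {query: {query_string: {query: 'Tarantino'}}},  # query_string
  nil,                                            # author_id_filter: attribute blank
  {filter: {range: {year: {gte: 1990}}}},         # year_filter
  nil                                             # tags_filter: attribute blank
]

merged = scopes.compact.reduce(:merge)
merged # => {query: {query_string: {query: "Tarantino"}}, filter: {range: {year: {gte: 1990}}}}
```

With real Chewy scopes the same pattern applies, except that merging two scopes combines their underlying request bodies rather than plain hash keys.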
Controllers and views
At this point, our model can perform search requests with passed attributes. Usage will look something like:
EntertainmentSearch.new(query: 'Tarantino', min_year: 1990).search
Note that in the controller, we want to load exact ActiveRecord objects instead of Chewy document wrappers:
class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])

    # In case we want to load real objects, we don't need any other
    # fields except for `:id` retrieved from the Elasticsearch index.
    # The Chewy query DSL supports the Kaminari gem and its corresponding API.
    # Also, we pass a scope for every requested type to the `load` method.
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  end
end
Now, it’s time to write up some HAML in entertainment/index.html.haml:
= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  = f.text_field :query
  = f.select :author_id, Dude.all.map { |d| [d.name, d.id] }, include_blank: true
  = f.text_field :min_year
  = f.text_field :max_year
  = f.text_field :tag_list
  = f.submit

- if @entertainments.any?
  %dl
    - @entertainments.each do |entertainment|
      %dt
        %h1= entertainment.title
        %strong= entertainment.class
      %dd
        %p= entertainment.year
        %p= entertainment.description
        %p= entertainment.tag_list
  = paginate @entertainments
- else
  Nothing to see here
Sorting
As a bonus, we’ll also add sorting to our search functionality.
Assume that we need to sort on the title and year fields, as well as by relevance. Unfortunately, the title One Flew Over the Cuckoo's Nest will be split into individual terms, so sorting by these disparate terms will be too random; instead, we’d like to sort by the entire title. The solution is to use a special title field and apply its own analyzer:
class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      ...
      sorted: {
        # The `keyword` tokenizer will not split our titles and
        # will produce the whole phrase as the term, which
        # can be sorted easily
        tokenizer: 'keyword',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    # We use the `multi_field` type to add the `title.sorted` field
    # to the type mapping, while still using just the `title`
    # field for search.
    field :title, type: 'multi_field' do
      field :title, index: 'analyzed', analyzer: 'title'
      field :sorted, index: 'analyzed', analyzer: 'sorted'
    end
    ...
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      # For videos as well
      field :title, type: 'multi_field' do
        field :title, index: 'analyzed', analyzer: 'title'
        field :sorted, index: 'analyzed', analyzer: 'sorted'
      end
      ...
    end
  end
end
In addition, we’re going to add both these new attributes and the sort processing step to our search model:
class EntertainmentSearch
  # We are going to use the `title.sorted` field for sorting
  SORT = {title: {'title.sorted' => :asc}, year: {year: :desc}, relevance: :_score}
  ...

  attribute :sort, type: String, enum: %w(title year relevance),
    default_blank: 'relevance'
  ...

  def search
    # We have added the `sorting` scope to the merge list
    [query_string, author_id_filter, year_filter,
      tags_filter, sorting].compact.reduce(:merge)
  end

  def sorting
    # The `sort` attribute holds one of the three possible values,
    # and the `SORT` mapping returns the actual sorting expression
    index.order(SORT[sort.to_sym])
  end
end
Finally, we’ll modify our form, adding a select box for the sort options:
= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  ...
  / `EntertainmentSearch.sort_values` will just return
  / the enum option content from the sort attribute definition.
  = f.select :sort, EntertainmentSearch.sort_values
  ...
Error handling
If your users perform incorrect queries like ( or AND, the Elasticsearch client will raise an error. To handle that, let’s make some changes to our controller:
class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
    @entertainments = []
    @error = e.message.match(/QueryParsingException\[([^;]+)\]/).try(:[], 1)
  end
end
Further, we need to render the error in the view:
...
- if @entertainments.any?
  ...
- else
  - if @error
    = @error
  - else
    Nothing to see here
Testing Elasticsearch queries
The basic testing setup is as follows:
- Start the Elasticsearch server.
- Clean up and create our indices.
- Import our data.
- Perform our query.
- Cross-reference the result with our expectations.
For step 1, it’s convenient to use the test cluster defined in the elasticsearch-extensions gem. Just add the following line to your project’s Rakefile after the gem is installed:
require 'elasticsearch/extensions/test/cluster/tasks'
Then, you’ll get the following Rake tasks:
$ rake -T elasticsearch
rake elasticsearch:start # Start Elasticsearch cluster for tests
rake elasticsearch:stop # Stop Elasticsearch cluster for tests
Elasticsearch and RSpec
First, we need to make sure that our index stays in sync with our data changes. Luckily, the Chewy gem comes with the helpful update_index RSpec matcher:
describe EntertainmentIndex do
  # No need to clean up Elasticsearch, as requests are
  # stubbed when the `update_index` matcher is used.
  describe 'Tag' do
    # We create several books with the same tag
    let(:books) { create_list :book, 2, tag_list: 'tag1' }

    specify do
      # We expect that after modifying the tag name...
      expect do
        ActsAsTaggableOn::Tag.where(name: 'tag1').update_attributes(name: 'tag2')
        # ...the corresponding type will be updated with the previously-created books.
      end.to update_index('entertainment#book').and_reindex(books,
        with: {tags: ['tag2']})
    end
  end
end
Next, we need to test that the actual search queries are performed properly and that they return the expected results:
describe EntertainmentSearch do
  # Just defining helpers to simplify testing
  def search attributes = {}
    EntertainmentSearch.new(attributes).search
  end

  # Import helper as well
  def import *args
    # We are using `import!` here to be sure all the objects are imported
    # correctly before examples run.
    EntertainmentIndex.import! *args
  end

  # Deletes and recreates the index before every example
  before { EntertainmentIndex.purge! }

  describe '#min_year, #max_year' do
    let(:book) { create(:book, year: 1925) }
    let(:movie) { create(:movie, year: 1970) }
    let(:cartoon) { create(:cartoon, year: 1995) }

    before { import book: book, movie: movie, cartoon: cartoon }

    # NOTE: The sample code below provides a clear usage example but is not
    # optimized code. Something along the following lines would perform better:
    # `specify { search(min_year: 1970).map(&:id).map(&:to_i)
    #   .should =~ [movie, cartoon].map(&:id) }`
    specify { search(min_year: 1970).load.should =~ [movie, cartoon] }
    specify { search(max_year: 1980).load.should =~ [book, movie] }
    specify { search(min_year: 1970, max_year: 1980).load.should == [movie] }
    specify { search(min_year: 1980, max_year: 1970).should == [] }
  end
end
Test cluster troubleshooting
Finally, here is a guide for troubleshooting your test cluster:
- To start, use an in-memory, one-node cluster. It will be much faster for specs. In our case: TEST_CLUSTER_NODES=1 rake elasticsearch:start
- There are some existing issues with the elasticsearch-extensions test cluster implementation itself related to the one-node cluster status check (it’s yellow in some cases and will never be green, so the green-status cluster start check will fail every time). The issue has been fixed in a fork, but hopefully it will be fixed in the main repo soon.
- For each dataset, group your requests in specs (i.e., import your data once and then perform several requests). Elasticsearch warms up for a long time and uses a lot of heap memory while importing data, so don’t overdo it, especially if you’ve got a bunch of specs.
- Make sure your machine has sufficient memory or Elasticsearch will freeze (we required around 5GB for each testing virtual machine and around 1GB for Elasticsearch itself).
Wrapping up
Elasticsearch is self-described as “a flexible and powerful open source, distributed, real-time search, and analytics engine.” It’s the gold standard in search technologies.
With Chewy, our Rails developers have packaged these benefits as a simple, easy-to-use, production-quality, open-source Ruby gem that provides tight integration with Rails. Elasticsearch and Rails – what an awesome combination!
Appendix: Elasticsearch internals
Here’s a very brief introduction to Elasticsearch “under the hood”…
Elasticsearch is built on Lucene, which itself uses inverted indices as its primary data structure. For example, if we have the strings “the dogs jump high”, “jump over the fence”, and “the fence was too high”, we get the following structure:
"the" [0, 0], [1, 2], [2, 0]
"dogs" [0, 1]
"jump" [0, 2], [1, 0]
"high" [0, 3], [2, 4]
"over" [1, 1]
"fence" [1, 3], [2, 1]
"was" [2, 2]
"too" [2, 3]
Thus, every term contains both references to, and positions in, the text. Furthermore, we choose to modify our terms (e.g., by removing stop-words like “the”) and apply phonetic hashing to every term (can you guess the algorithm?):
"DAG" [0, 1]
"JANP" [0, 2], [1, 0]
"HAG" [0, 3], [2, 4]
"OVAR" [1, 1]
"FANC" [1, 3], [2, 1]
"W" [2, 2]
"T" [2, 3]
If we then query for “the dog jumps”, it’s analyzed in the same way as the source text, becoming “DAG JANP” after hashing (“dog” has the same hash as “dogs”, as is true with “jumps” and “jump”).
We also add some logic between the individual words in the string (based on configuration settings), choosing between (“DAG” AND “JANP”) or (“DAG” OR “JANP”). The former returns the intersection of [0] & [0, 1] (i.e., document 0) and the latter the union [0] | [0, 1] (i.e., documents 0 and 1). The in-text positions can be used for scoring results and for position-dependent queries.
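The core of this appendix condenses into a few lines of Ruby: build the term => [document, position] structure, then answer AND/OR queries with set operations on the document IDs. (Stop-word removal and phonetic hashing are left out of this sketch.)

```ruby
# Build an inverted index: term => list of [doc_id, position] pairs,
# matching the table shown earlier in the appendix.
docs = ['the dogs jump high', 'jump over the fence', 'the fence was too high']
index = Hash.new { |hash, term| hash[term] = [] }
docs.each_with_index do |doc, doc_id|
  doc.split.each_with_index { |term, pos| index[term] << [doc_id, pos] }
end

index['jump']  # => [[0, 2], [1, 0]]
index['fence'] # => [[1, 3], [2, 1]]

# Doc-ID sets for each query term...
dogs = index['dogs'].map(&:first).uniq # => [0]
jump = index['jump'].map(&:first).uniq # => [0, 1]

# ...combined with AND (intersection) or OR (union):
dogs & jump # => [0]
dogs | jump # => [0, 1]
```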