An Elasticsearch Tutorial for .NET Developers
Elasticsearch is one of the most powerful full-text search engine solutions out there. Using the NEST package, you can easily leverage the power of Elasticsearch in your .NET projects.
In this article, Toptal Freelance Software Engineer Ivan Cesar shows how Elasticsearch can solve real-world full-text search problems in your .NET projects.
Ivan has a decade of work on projects of all sizes, mostly using .NET technologies. He also likes to take part in algorithm competitions.
Should a .NET developer use Elasticsearch in their projects? Although Elasticsearch is built on Java, I believe there are many reasons why it is worth a shot for full-text search in any project.
Elasticsearch, as a technology, has come a long way over the past few years. Not only does it make full-text search feel like magic, it offers other sophisticated features, such as text autocompletion, aggregation pipelines, and more.
If the thought of introducing a Java-based service to your neat .NET ecosystem makes you uncomfortable, then worry not, as once you have installed and configured Elasticsearch, you will be spending most of your time with one of the coolest .NET packages out there: NEST.
In this article, you will learn how you can use the amazing search engine solution, Elasticsearch, in your .NET projects.
Installing and Configuring
Installing Elasticsearch itself to your development environment comes down to downloading Elasticsearch and, optionally, Kibana.
When unzipped, a bat file like this comes in handy:
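rem Start Elasticsearch and Kibana in separate windows (adjust the paths to your install locations)
rem /d lets cd switch drives when the script is started from another drive
cd /d "D:\elastic\elasticsearch-5.2.2\bin"
start elasticsearch.bat
cd /d "D:\elastic\kibana-5.0.0-windows-x86\bin"
start kibana.bat
exit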
After starting both services, you can always check the local Kibana server (usually available at http://localhost:5601), play around with indexes and types, and search using pure JSON, as extensively described here.
The First Step
Being a thorough and good developer, with complete support and understanding from management, you start off by adding a unit test project and writing a SearchService with at least 90% code coverage.
The first step is clearly configuring the app.config file to provide a sort-of-connection string for the Elasticsearch server.
The cool thing about Elasticsearch is that it is completely free. But, I would still advise using the Elastic Cloud service provided by Elastic.co. The hosted service makes all the maintenance and configuration fairly easy. Even more, you have two weeks of free trial, which should be more than enough to try out all the examples here!
Since here we are running locally, a configuration key like this should do:
<add key="Search-Uri" value="http://localhost:9200" />
An Elasticsearch installation listens on port 9200 by default, but you can change it if you like.
ElasticClient and the NEST Package
ElasticClient is a nice little fellow which will do most of the work for us, and it comes with the NEST package.
Let us first install the package from NuGet, for example by running Install-Package NEST in the Package Manager Console.
To configure the client, something like this can be used:
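using System;
using System.Configuration; // ConfigurationManager (needs a reference to System.Configuration)
using Nest;
// Read the server address from app.config and build the client
var node = new Uri(ConfigurationManager.AppSettings["Search-Uri"]);
var settings = new ConnectionSettings(node);
settings.ThrowExceptions(alwaysThrow: true); // I like exceptions
settings.PrettyJson(); // Good for DEBUG
var client = new ElasticClient(settings);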
Indexing and Mapping
To be able to search something, we must store some data into ES. The term used is “indexing.”
The term “mapping” is used for mapping our data in the database to objects which will be serialized and stored in Elasticsearch. We will be using Entity Framework (EF) in this tutorial.
Generally, when using Elasticsearch, you are probably looking for a site-wide search engine solution. You will either use some sort of feed or digest, or Google-like search which returns all the results from various entities, such as users, blog entries, products, categories, events, etc.
These will probably not just be one table or entity in your database, but rather, you will want to aggregate diverse data and maybe extract or derive some common properties like title, description, date, author/owner, photo, and so on. Also, you probably won’t be able to fetch everything in one query: if you are using an ORM, you will have to write a separate query for each entity type, be it blog entries, users, products, categories, events, or something else.
I structured my projects by creating an index for each “big” type, e.g., blog post or product. Elasticsearch types can then be added for more specific types which fall under the same index. For instance, if an article can be a story, video article, or podcast, it would still be in the “article” index, but we would have those types within that index. However, it is still likely to be the same query in the database.
Keep in mind that, in the Elasticsearch 5.x used here, you do need at least one type for each index, probably a type which has the same name as the index.
To map your entities, you will want to create some additional classes. I usually use the DocumentSearchItemBase class, from which each of the specialized classes (BlogPostSearchItem, ProductSearchItem, and so on) will inherit.
I like to have mapper expressions within those classes. I can always modify the expressions if needed down the road.
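As a minimal sketch of what a mapper expression inside a search item class could look like: the BlogPost entity, its properties, and the MapAll helper below are hypothetical and only illustrate the idea. They are not taken from the article's project.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical EF entity; a real project would use its own model classes.
public class BlogPost
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public DateTime PublishedOn { get; set; }
}

// Common properties shared by everything that goes into the search index.
public abstract class DocumentSearchItemBase
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public DateTime Date { get; set; }
}

public class BlogPostSearchItem : DocumentSearchItemBase
{
    // The mapper lives next to the class it produces, so it is easy to find and change.
    // Because it is an expression, it can be handed to IQueryable.Select.
    public static readonly Expression<Func<BlogPost, BlogPostSearchItem>> Map =
        post => new BlogPostSearchItem
        {
            Id = post.Id,
            Title = post.Title,
            Description = post.Body,
            Date = post.PublishedOn
        };

    // Works against any IQueryable<BlogPost>, e.g. an EF DbSet or an in-memory list.
    public static IQueryable<BlogPostSearchItem> MapAll(IQueryable<BlogPost> posts) =>
        posts.Select(Map);
}
```

With this shape, changing how a blog post is projected into the index means touching only the Map expression; the indexing code simply calls BlogPostSearchItem.MapAll on whatever query it has.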
In one of my earliest projects with Elasticsearch, I wrote a fairly big SearchService class with mappings and indexing done with nice and lengthy switch-case statements: for each entity type I wanted to throw into Elasticsearch, there was a case with a query and mapping which did that.
However, throughout the process, I learned that it is not the best way, at least not for me.
A more elegant solution is to have some sort of smart IndexDefinition class and a specific index definition class for each index. This way, my base IndexDefinition class can store a list of all available indexes and some helper methods like required analyzers and status reports, while derived index-specific classes handle querying the database and mapping the data for each index specifically. This is useful especially when you have to add an additional entity to ES sometime later. It comes down to adding another SomeIndexDefinition class which inherits from IndexDefinition and requires you to just implement a few methods which query the data you will want in your index.
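One possible shape for such a class hierarchy is sketched below. Only the IndexDefinition name comes from the text; the members (Name, LoadDocuments, All, StatusReport), the two derived classes, and the placeholder data are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical base class: knows every index and defines what each one must implement.
public abstract class IndexDefinition
{
    // Name of the Elasticsearch index this definition is responsible for.
    public abstract string Name { get; }

    // Derived classes query the database and return documents ready for indexing.
    public abstract IEnumerable<object> LoadDocuments();

    // Central list of all indexes; adding a new entity means adding one entry here.
    public static IReadOnlyList<IndexDefinition> All { get; } = new IndexDefinition[]
    {
        new BlogPostIndexDefinition(),
        new ProductIndexDefinition()
    };

    // Simple status report helper: index name and number of documents it would load.
    public static IDictionary<string, int> StatusReport() =>
        All.ToDictionary(d => d.Name, d => d.LoadDocuments().Count());
}

public class BlogPostIndexDefinition : IndexDefinition
{
    public override string Name => "blogpost";

    // Placeholder data; a real implementation would run a database query here.
    public override IEnumerable<object> LoadDocuments() =>
        new[] { new { Title = "Hello Elasticsearch" } };
}

public class ProductIndexDefinition : IndexDefinition
{
    public override string Name => "product";

    // Placeholder data; a real implementation would run a database query here.
    public override IEnumerable<object> LoadDocuments() =>
        new[] { new { Title = "Wooden spoon" }, new { Title = "Wooden chair" } };
}
```

The indexing code can then loop over IndexDefinition.All without knowing anything about individual entities, which is what replaces the switch-case approach described earlier.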
The Elasticsearch Speak
At the core of everything you can do with Elasticsearch is its query language. Ideally, all you need to be able to communicate with Elasticsearch is to know how to construct a query object.
Behind the scenes, Elasticsearch exposes its functionalities as a JSON-based API over HTTP.
Although the API itself and structure of the query object is fairly intuitive, dealing with many real-life scenarios can still be a hassle.
Generally, a search request to Elasticsearch requires the following information:
- Which index and which types are searched
- Pagination information (how many items to skip, and how many items to return)
- A concrete type selection (when doing an aggregation, like we are about to do here)
- The query itself
- Highlight definition (Elasticsearch can automatically highlight hits if we want it to)
For instance, you may want to implement a search feature where only some of the users can see the premium content on your site, or you may want some content to be visible to only the “friends” of its authors, and so on.
Being able to construct the query object is at the core of the solutions to these problems, and it can really be a problem when trying to cover a lot of scenarios.
From all of the above, the most important and most difficult to set up is, naturally, the query segment—and here, we will be focusing mainly on that.
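Before focusing on the query segment, here is a rough sketch of how the pieces listed above could come together in a NEST search descriptor. The ArticleSearchItem class, the "article" index name, and the Build helper are assumptions for illustration. Type selection is left out because it differs between Elasticsearch versions.

```csharp
using System;
using Nest;

// Hypothetical document type used only for this sketch.
public class ArticleSearchItem
{
    public string Title { get; set; }
    public string Body { get; set; }
}

public static class SearchRequestSketch
{
    public static SearchDescriptor<ArticleSearchItem> Build(string text, int page, int pageSize)
    {
        return new SearchDescriptor<ArticleSearchItem>()
            .Index("article")              // which index is searched (the name is an assumption)
            .From(page * pageSize)         // pagination: how many items to skip
            .Size(pageSize)                // pagination: how many items to return
            .Query(q => q                  // the query itself
                .MultiMatch(m => m
                    .Query(text)
                    .Fields(f => f
                        .Field(d => d.Title)
                        .Field(d => d.Body))))
            .Highlight(h => h              // highlight definition
                .Fields(
                    f => f.Field(d => d.Title),
                    f => f.Field(d => d.Body)));
    }
}
```

A descriptor built this way can be handed to the client, for instance as client.Search<ArticleSearchItem>(s => SearchRequestSketch.Build("wood", 0, 10)).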
Queries are recursive constructs composed of BoolQuery and other queries, such as MatchPhraseQuery, TermsQuery, DateRangeQuery, and ExistsQuery. Those were enough to fulfill any basic requirements, and should be good for a start.
A MultiMatch query is quite important since it enables us to specify the fields on which we want to search and to tweak results a bit more, which we will return to later.
A MatchPhraseQuery can filter results by what would be a foreign key in conventional SQL databases or by static values such as enums, for instance, when matching results by a specific author (AuthorId), or matching all public articles (ContentPrivacy=Public).
TermsQuery would be translated as “in” in conventional SQL. For instance, it can return all articles written by one of the user’s friends or get products exclusively from a fixed set of merchants. As with SQL, one should not overuse this and put 10,000 members in this array since it will have a performance impact, but it generally handles reasonable amounts fairly well.
DateRangeQuery is self-documenting.
ExistsQuery is an interesting one: It enables you to ignore or return documents which do not have a specific field.
These, when combined with BoolQuery, allow you to define complex filtering logic.
Think of a blog site, for example, where blog posts can have an AvailableFrom field which denotes when they should become visible.
If we apply a filter like AvailableFrom <= Now, then we will not get documents which do not have that particular field at all (we aggregate data, and some documents might not have that field defined). To solve the problem, you would combine ExistsQuery with DateRangeQuery and wrap them within a BoolQuery with the condition that at least one element in the BoolQuery is fulfilled. Something like this:
BoolQuery
    Should (at least one of the following conditions should be fulfilled)
        DateRangeQuery with AvailableFrom condition
        Negated ExistsQuery for field AvailableFrom
Negating queries is not such a straightforward out-of-the-box job. But with the help of BoolQuery, it is possible nonetheless:
BoolQuery
    MustNot
        ExistsQuery
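Expressed with the NEST fluent API, the two outlines above could look roughly like the following sketch. The BlogPostSearchItem class here and the AvailabilityFilter helper are hypothetical; only the AvailableFrom field and the query structure come from the text.

```csharp
using System;
using Nest;

// Hypothetical document type; only AvailableFrom matters for this sketch.
public class BlogPostSearchItem
{
    public string Title { get; set; }
    public DateTime? AvailableFrom { get; set; }
}

public static class AvailabilityFilter
{
    // Matches documents that are already available OR have no AvailableFrom field at all.
    public static QueryContainer Build()
    {
        var q = new QueryContainerDescriptor<BlogPostSearchItem>();
        return q.Bool(b => b
            .Should(
                // DateRangeQuery: AvailableFrom <= now
                s => s.DateRange(r => r
                    .Field(f => f.AvailableFrom)
                    .LessThanOrEquals(DateMath.Now)),
                // Negated ExistsQuery: BoolQuery -> MustNot -> ExistsQuery
                s => s.Bool(nb => nb
                    .MustNot(mn => mn
                        .Exists(e => e.Field(f => f.AvailableFrom)))))
            // At least one of the Should clauses has to be fulfilled.
            .MinimumShouldMatch(1));
    }
}
```

The resulting QueryContainer can be combined with other queries using the && operator, in the same way filteringQuery is combined with the MultiMatch query later in this article.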
Automation and Testing
To make things easier, the recommended method is definitely writing tests as you go.
This way, you will be able to experiment more efficiently and—even more importantly—you will make sure that any new changes you introduce (like more complex filters) will not break the existing functionality. I explicitly did not want to say “unit tests,” since I’m not a fan of mocking something like Elasticsearch’s engine—the mock will almost never be a realistic approximation of how ES really behaves—hence, this could be integration tests, if you are a terminology fan.
Real-world Examples
After all the groundwork is done with indexing, mapping, and filtering, we are now ready for the most interesting part: tweaking the search parameters to yield better results.
In my last project, I used Elasticsearch to provide a user feed: all of the content aggregated to one place ordered by creation date and full text search with some of the options. The feed itself is quite straightforward; just ensure that there is a date field somewhere in your data and order by that field.
Search, on the other hand, will not work amazingly well out of the box. That is because, naturally, Elasticsearch cannot know what the important things are in your data. Let’s say that we have some data which (among other fields) has Title, Tags (array), and Body fields. The body field can be HTML content (to make things a bit more realistic).
Spelling Errors
The requirement: Our search should return results even if spelling errors occur or if the word ending is different. For instance, if there is an article with the title “Magnificent Things You Can Do with a Wooden Spoon,” when I search for “thing” or “wood,” I would still want to get a match.
To deal with this, we will have to be acquainted with analyzers, tokenizers, char filters, and token filters. Those are the transformations which are applied at the time of indexing.
- Analyzers need to be defined. This is done per index.
- Analyzers can be applied to some fields in our documents. This can be done using attributes or the fluent API. In our example, we are using attributes.
- Analyzers are a combination of filters, char filters, and tokenizers.
To fulfill the requirement (partial word match), we will create the “autocomplete” analyzer, which consists of:
- An English stopwords filter: removes all common words in English, such as “and” or “the.”
- Trim filter: removes white space around each token.
- Lowercase filter: converts all characters to lowercase. This does not mean that when we fetch our data, it will be converted to lowercase, but instead enables case-invariant search.
- Edge-n-gram tokenizer: this tokenizer enables us to have partial matches. For example, if we have a sentence “My granny has a wooden chair,” when looking for the term “wood,” we would still like to get a hit on that sentence. What edge-n-gram does is store “woo,” “wood,” “woode,” and “wooden” so that any partial word match with at least three letters is found. The parameters MinGram and MaxGram define the minimum and maximum number of characters to be stored. In our case, we will have a minimum of three and a maximum of 15 letters.
In the following snippet, all those are bound together:
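analysis
    // The analyzer combines the token filters and the tokenizer defined below
    .Analyzers(a => a
        .Custom("autocomplete", cc => cc
            .Filters("eng_stopwords", "trim", "lowercase")
            .Tokenizer("autocomplete")
        )
    )
    // Edge n-gram tokenizer: stores word prefixes of 3 to 15 characters
    .Tokenizers(tdesc => tdesc
        .EdgeNGram("autocomplete", e => e
            .MinGram(3)
            .MaxGram(15)
            .TokenChars(TokenChar.Letter, TokenChar.Digit)
        )
    )
    // Stop filter using the built-in English stopword list
    .TokenFilters(f => f
        .Stop("eng_stopwords", lang => lang
            .StopWords("_english_")
        )
    );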
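To make the effect of the edge n-gram settings more tangible, the following plain C# sketch mimics what such a tokenizer stores for a piece of text. It is only an illustration of the idea, not Elasticsearch's implementation: it does not apply the stopwords filter, and the EdgeNGramSketch class and its method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EdgeNGramSketch
{
    // Returns the prefixes of a single word whose length is between minGram and maxGram.
    public static IEnumerable<string> Prefixes(string word, int minGram, int maxGram)
    {
        for (var length = minGram; length <= Math.Min(maxGram, word.Length); length++)
            yield return word.Substring(0, length);
    }

    // Splits on anything that is not a letter or digit, lowercases the words,
    // then emits the edge n-grams of every word.
    public static List<string> Tokens(string text, int minGram = 3, int maxGram = 15)
    {
        var words = new string(text.Select(c => char.IsLetterOrDigit(c) ? c : ' ').ToArray())
            .ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        return words.SelectMany(w => Prefixes(w, minGram, maxGram)).ToList();
    }
}
```

For example, EdgeNGramSketch.Prefixes("wooden", 3, 15) yields "woo", "wood", "woode", and "wooden", which is why a search for "wood" can hit a document containing "wooden". Words shorter than the minimum, such as "my", produce no tokens at all.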
And, when we want to use this analyzer, we should just annotate the fields we want like this:
public class SearchItemDocumentBase
{
    ...
    [Text(Analyzer = "autocomplete", Name = nameof(Title))]
    public string Title { get; set; }
    ...
}
Now, let’s take a look at a few examples that demonstrate quite common requirements in almost any application with lots of content.
Cleaning HTML
The requirement: Some of our fields may have HTML text inside.
Naturally, you wouldn’t want searching for “section” to return something like “<section>…</section>” or “body” returning the HTML element “<body>.” To avoid that, during indexing, we will strip out the HTML and leave just the content inside.
Luckily, you are not the first one with that problem. Elasticsearch comes with a useful char filter for that:
analysis.Analyzers(a => a
    .Custom("html_stripper", cc => cc
        .Filters("eng_stopwords", "trim", "lowercase")
        .CharFilters("html_strip") // built-in char filter that removes HTML markup
        .Tokenizer("autocomplete")
    )
);
And to apply it:
[Text(Analyzer = "html_stripper", Name = nameof(HtmlText))]
public string HtmlText { get; set; }
Important Fields
The requirement: Matches in a title should be more important than matches within the content.
Luckily, Elasticsearch offers strategies to boost results if the match occurs in one field or the other. This is done within the search query construction by using the boost option:
const int titleBoost = 15;
.Query(qx => qx.MultiMatch(m => m
    .Query(searchRequest.Query.ToLower())
    .Fields(ff => ff
        .Field(f => f.Title, boost: titleBoost)
        .Field(f => f.Summary)
        ...
    )
    .Type(TextQueryType.BestFields)
) && filteringQuery)
As you can see, the MultiMatch query is very useful in situations like this, and situations like this are not that rare at all! Often, some fields are more important and some are not, and this mechanism enables us to take that into account.
It is not always easy to set boost values right away. You’ll need to play with this a bit to get the desired results.
Prioritizing Articles
The requirement: Some articles are more important than others. Either the author is more important, or the article itself has more likes/shares/upvotes/etc. More important articles should rank higher.
Elasticsearch allows us to implement our own scoring function. We simplified it by defining a field “Importance,” which is a double value (in our case, greater than 1). You can define your own importance function/factor and apply it similarly. You can also choose among multiple boost and scoring modes, whichever suits you best. This one worked nicely for us:
.Query(q => q
    .FunctionScore(fsc => fsc
        .BoostMode(FunctionBoostMode.Multiply)
        .ScoreMode(FunctionScoreMode.Sum)
        .Functions(f => f
            .FieldValueFactor(b => b
                .Field(nameof(SearchItemDocumentBase.Rating))
                .Missing(0.7)
                .Modifier(FieldValueFactorModifier.None)
            )
        )
        .Query(qx => qx.MultiMatch(m => m
            .Query(searchRequest.Query.ToLower())
            .Fields(ff => ff
                ...
            )
            .Type(TextQueryType.BestFields)
        ) && filteringQuery)
    )
)
In the sample project, which uses movie data, the field is named Rating: each movie has a rating, and we deduced the actor rating from the average of ratings for the movies they were cast in (not a very scientific method). We scaled that rating to a double value in the interval [0,1].
Full-word Matches
The requirement: Full-word matches should rank higher.
By now, we are getting fairly good results for our searches, but you might notice that some results which contain partial matches might rank higher than exact matches. To deal with that, we added an additional field in our document named “Keywords” which does not use an autocomplete analyzer, but instead uses a keyword tokenizer and provides a boost factor to push exact match results higher.
This field will match only if the exact word is matched. It will not match “wood” for “wooden” like the autocomplete analyzer does.
Wrap Up
This article should have given you an overview of how to set up Elasticsearch in your .NET project and how, with a little effort, to provide nice search-everywhere functionality.
The learning curve can be a bit steep, but it is worth it, especially when you tweak it just right and start getting great search results.
Always remember to add thorough test cases with expected results to make sure that you do not mess up parameters too much when introducing changes and playing around.
The full code for this article is available on GitHub, and uses data pulled from the TMDB database to show how search results are improving with each step.