Elasticsearch: 10 Tips to Get Started
Aug 15, 2017
Note: This article was originally published on August 15, 2017 and has been migrated from our previous blog. Some details — tools, libraries, benchmarks, industry context — may be outdated. For our latest perspective, see our recent posts.
We recently finished an innovative, data-driven project based on Elasticsearch. The aim was to find similarities between objects across sets. The sets were static, but they were a decent size (90+ million records) and search had to be fast (nearly instant) – so Elasticsearch was the best choice. During the project, I learned a few non-trivial rules and tweaks I’d like to share. They are primarily about quickly getting your cluster operational and robust, but I also included some hints on how to make it faster.

Both schema migrations and changes in your cluster setup take a lot of time, as they require moving data around (to another kind of index or to another instance).
Modifying the size of the cluster will not change the behavior of your application, while after a schema migration you would need to test whether everything still works the same (or improved as desired).
Of course, the chosen cluster setup still influences performance, but that is easier to evaluate once you’ve finished developing your application.
It’s better to default to the standard text type if you are not sure what kind of indexing to use for a particular field. You will probably need it in that form anyway.
The biggest problem I ran into was indexing a field as an integer or long just because it contained only digits in all documents, without giving it a second thought (I wasn’t sure at the time whether we would need to search on that field). If you need to search for a phrase and part of that phrase may be found in such fields, you won’t be able to use a single multi_match query; you will have to build a more complicated bool query manually.
Moreover, some digits-only fields are better suited for text search simply because of their nature. For example, when searching for a phone number or zip code, you are more likely to search using text fuzzing than a range.
Someone may suggest that keyword is better suited for such fields. But the point about multi_match searching applies to keyword indexing as well. You will probably prefer text simply because it is more convenient to multi_match one large phrase of which one token can be found in the field. Of course, there are cases where you will still want other mappings (range queries for numerical fields, exact phrase matching for keywords), but the point is: use text when in doubt.
For example, here is the query when you use only text fields:
{
  "multi_match": {
    "query": "girard 17",
    "fields": [
      "streetname",
      "streetnum"
    ]
  }
}
and the equivalent when you used keyword/integer fields:
{
  "bool": {
    "should": [
      {
        "match": {
          "streetname": "girard"
        }
      },
      {
        "match": {
          "streetnum": 17
        }
      }
    ]
  }
}
You may still prefer the second form for its flexibility, but the first one is much more readable and easier to debug.

Choose an analyzer for your text fields as soon as possible. In our case, the best choice was the English analyzer, with the standard analyzer as a fallback in case we decided English would not fit. This decision is important early on because you want to get the expected results immediately and avoid being surprised later when some obvious matches don’t appear. For example, with the English analyzer in place, you can find ‘cat’ in a document where only the word ‘cats’ appears, because the analyzer stems words and removes the trailing ‘s’. You can also customize an analyzer by adding more “stop words” (words like “the”) and defining synonyms. For example, if you are indexing addresses, you should probably use an analyzer like:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "street_synonym": {
          "type": "synonym",
          "synonyms": [
            "street, st",
            "avenue, ave",
            …
          ]
        }
      },
      "analyzer": {
        "address": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop",
            "street_synonym"
          ]
        }
      }
    }
  }
}
Remember to test your analyzer before you start indexing your data.
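For example (assuming the analyzer settings above were applied to a hypothetical index named my_addresses), you can inspect the tokens your analyzer produces with the _analyze API:

GET my_addresses/_analyze
{
  "analyzer": "address",
  "text": "17 Girard Avenue"
}

The response lists each token after lowercasing, stop-word removal, and synonym expansion, so you can verify that, say, “Avenue” is handled according to your synonym list before any documents are indexed.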
Sooner or later you will realize that you made some bad decisions in your mappings, or you may want to support a new kind of query. You will need to reindex your data at that point. There are a few things to take into consideration. Reindexing in Elasticsearch requires additional disk space roughly equal to the size of your index, so at the very least you need to be able to increase that space during the process.
Another thing is that your index name will change, so you will either need to change the index used by your application after reindexing is done, or use the aliases feature in Elasticsearch and switch which index the alias points to. The second approach is of course more robust, but you have to remember to always use the alias instead of the index name in your app.
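A sketch of that flow (the index names addresses_v1/addresses_v2 and the alias addresses are hypothetical): first copy the data with the _reindex API, then switch the alias:

POST _reindex
{
  "source": { "index": "addresses_v1" },
  "dest": { "index": "addresses_v2" }
}

POST _aliases
{
  "actions": [
    { "remove": { "index": "addresses_v1", "alias": "addresses" } },
    { "add": { "index": "addresses_v2", "alias": "addresses" } }
  ]
}

Because both actions run in a single _aliases request, the switch is atomic and your application never sees a missing alias.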
You probably already know that there are a number of query types in Elasticsearch, but if your aim is mostly text search, 90% of your needs will be met by just the term query and the bool query (the way of combining subqueries into a bigger query), so it is important to learn all the features of these two query types first. Be sure to understand the relationship between the term query and analyzers (both index-time and search-time).
Other full-text queries are built on top of these two, and all the features you learned about them can still be used in the others. Match and multi_match queries translate directly into combinations of bool and term queries. Of course, if you want to use the full power of Elasticsearch, you should still read all of the query DSL parts of the documentation.
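As a rough sketch only (the real rewrite also involves dis_max for scoring), a multi_match for “girard 17” over streetname and streetnum behaves like term queries over the analyzed tokens, combined with bool:

{
  "bool": {
    "should": [
      { "term": { "streetname": "girard" } },
      { "term": { "streetname": "17" } },
      { "term": { "streetnum": "girard" } },
      { "term": { "streetnum": "17" } }
    ]
  }
}

Seeing match queries this way makes it clear why the analyzer matters: the term queries run against the tokens the analyzer produced, not the raw input.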
Naming your queries and enabling highlighting helps you understand why each document is returned for your query. Naming queries is most useful when a subquery doesn’t have to match – when it is part of a should group. If you name them, you don’t have to check each one – the response tells you which parts matched. If you also enable highlighting, you get information on which parts of the documents matched. This will help you exclude unwanted matches in the future.
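A sketch of both features together (streetname comes from the earlier examples; the city field is hypothetical):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "streetname": { "query": "girard", "_name": "street_matched" } } },
        { "match": { "city": { "query": "philadelphia", "_name": "city_matched" } } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "streetname": {},
      "city": {}
    }
  }
}

Each hit then carries a matched_queries list (e.g. ["street_matched"]) and a highlight section with the matching fragments.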
In our case, we used the Amazon Elasticsearch service, but there are many other providers. If you haven’t set up an Elasticsearch cluster yet, it will surprise you with bootstrap checks. You would also probably need to develop some tooling for adding and removing nodes from the cluster, and you would have to take care of security yourself. If your main focus is the development speed of your application, I recommend outsourcing the setup.
This one is a tip for speeding up your cluster. There is a great explanation of how to shard your data here, so I will only explain our approach, which worked perfectly. The first step is to estimate how much data you will need available soon. Don’t prepare for a hypothetical enormous growth of your data – just calculate how much you will have in, say, a year, based on the current rate of increase. Then apply the basic rule: a shard should not exceed the maximum Java heap size (so 32 GB), then add a few more shards for a small overallocation.
Another criterion is to always have more shards than nodes in the cluster, so if you want to be prepared to expand your cluster to 20 nodes, use about 30 shards. In our case we had an index of about 250 GB (with no significant growth expected soon), so we decided on 30 shards: each shard would be about 8 GB and would fit in the memory of our nodes. Don’t worry about adding replicas at first – their main purpose is to keep your data safe when a number of your nodes crash, and they are more likely to slow down performance due to additional memory usage.
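For instance, with the numbers above, creating the index might look like this (the index name is hypothetical, and number_of_replicas is shown only to make the setting explicit):

PUT addresses_v1
{
  "settings": {
    "number_of_shards": 30,
    "number_of_replicas": 0
  }
}

Note that number_of_shards cannot easily be changed after index creation (you would have to reindex), which is why the estimate matters.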
A query is faster if it needs to score fewer documents. The query may consist of a combination of subqueries, but if some part of it can exclude documents by itself, then the other subqueries do not even have to check them, so the whole process is faster. In our case, we made sure to use filter, must_not and must in the bool query whenever it was possible and still made sense.
It is tempting to just adjust the boosts of should queries instead, because excluding documents requires giving more thought to each query so that you do not exclude something you didn’t want (false negatives are harder to debug than false positives). The performance improvement is significant, though, which pays off when you run your queries many times.
Also remember to avoid queries that are slow by nature (script, regex) – or at least make sure another kind of filter is applied first.
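A sketch of such a query (the state and deleted_at fields are hypothetical): the filter and must_not clauses cut down the candidate set without scoring, and only the must clause is scored:

{
  "bool": {
    "filter": [
      { "term": { "state": "pa" } }
    ],
    "must_not": [
      { "exists": { "field": "deleted_at" } }
    ],
    "must": [
      { "match": { "streetname": "girard" } }
    ]
  }
}

As a bonus, filter clauses can be cached by Elasticsearch, so repeated queries with the same filters get even cheaper.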
Of course, you need to monitor cluster health in the first place. But if your cluster runs out of memory or sits at 100% CPU all the time, your queries will run slower. It sounds obvious, but I still mention it here because it’s easy to forget when you finally test the speed of your queries.
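At a minimum, it is worth checking these two endpoints regularly (or wiring them into your monitoring):

GET _cluster/health
GET _nodes/stats/jvm,os

The first reports the overall status (green/yellow/red) and unassigned shards; the second exposes JVM heap usage and OS CPU load per node – exactly the memory and CPU problems mentioned above.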