Elasticsearch: 10 Tips to Get Started
Aug 15, 2017
Note: This article was originally published on August 15, 2017 and has been migrated from our previous blog. Some details — tools, libraries, benchmarks, industry context — may be outdated. For our latest perspective, see our recent posts.
We recently finished an innovative, data-driven project based on Elasticsearch. The aim was to find similarities between objects across sets. The sets were static, but they were a decent size (90+ million records) and search had to be fast (nearly instant) – so Elasticsearch was the best choice. During the project, I learned a few non-trivial rules and tweaks I’d like to share. They are primarily about quickly getting your cluster operational and robust, but I also included some hints on how to make it faster.

Both schema migrations and changes in your cluster setup take a lot of time, as they require moving data around (to another kind of index or to another instance).
Modifying the size of the cluster will not change the behavior of your application, while after a schema migration you would need to test whether everything still works the same (or improved as desired).
Of course, the chosen cluster setup still influences performance, but that is easier to evaluate once you’ve finished developing your application.
It’s better to default to the standard text type if you are not sure what kind of indexing to use for a particular field. You will probably need it in that form anyway.
The biggest problem I ran into was indexing a field as an integer or long just because it contained only digits in all documents, without giving it a second thought (I wasn’t sure at the time whether we would need to search on that field). If you need to search for a phrase and part of that phrase may be found in such fields, you won’t be able to use a single multi_match query; you will have to build a more complicated bool query manually.
Moreover, some digits-only fields are better suited for text search simply because of their nature. For example, when searching for a phone number or zip code, you are more likely to search using text fuzzing than a range.
Someone may suggest that keyword is better suited for such fields. But the point about multi_match searching applies to keyword indexing as well. You will probably prefer text simply because it is more convenient to multi_match one large phrase of which one token can be found in the field. Of course, there are cases where you will still want other mappings (range queries for numerical fields, exact phrase matching for keywords), but the point is: use text when in doubt.
For example, here is the query when you use only text fields:
{
  "multi_match": {
    "query": "girard 17",
    "fields": [
      "streetname",
      "streetnum"
    ]
  }
}
and the equivalent when you used keyword/integer fields:
{
  "bool": {
    "should": [
      {
        "match": {
          "streetname": "girard"
        }
      },
      {
        "match": {
          "streetnum": 17
        }
      }
    ]
  }
}
You may still prefer the second form for its flexibility, but the first one is much more readable and easier to debug.

Choose an analyzer for your text fields as soon as possible. In our case, the best choice was the English analyzer, with the standard analyzer as a fallback in case we decided English would not fit. This decision is important early on because you want to get the expected results immediately and avoid being surprised later when some obvious matches don’t appear. For example, with the English analyzer in place, you can find ‘cat’ in a document where only the word ‘cats’ appears, because the analyzer stems words and removes the trailing ‘s’. You can also customize an analyzer by adding more “stop words” (words like “the”) and defining synonyms. For example, if you are indexing addresses, you should probably use an analyzer like:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "street_synonym": {
          "type": "synonym",
          "synonyms": [
            "street, st",
            "avenue, ave",
            …
          ]
        }
      },
      "analyzer": {
        "address": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop",
            "street_synonym"
          ]
        }
      }
    }
  }
}
Remember to test your analyzer before you start indexing your data.
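For example (assuming the analyzer settings above were applied to a hypothetical index named my_addresses), you can inspect the tokens your analyzer produces with the _analyze API:

GET my_addresses/_analyze
{
  "analyzer": "address",
  "text": "17 Girard Avenue"
}

The response lists each token after lowercasing, stop-word removal, and synonym expansion, so you can verify that, say, “Avenue” is handled according to your synonym list before any documents are indexed.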
Sooner or later you will realize that you made some bad decisions in your mappings, or you may want to support a new kind of query. You will need to reindex your data at that point. There are a few things to take into consideration. Reindexing in Elasticsearch requires additional disk space roughly equal to the size of your index, so at the very least you need to be able to increase that space during the process.
Another thing is that your index name will change, so you will either need to change the index used by your application after reindexing is done, or use the aliases feature in Elasticsearch and switch which index the alias points to. The second approach is of course more robust, but you have to remember to always use the alias instead of the index name in your app.
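A sketch of that flow (the index names addresses_v1/addresses_v2 and the alias addresses are hypothetical): first copy the data with the _reindex API, then switch the alias:

POST _reindex
{
  "source": { "index": "addresses_v1" },
  "dest": { "index": "addresses_v2" }
}

POST _aliases
{
  "actions": [
    { "remove": { "index": "addresses_v1", "alias": "addresses" } },
    { "add": { "index": "addresses_v2", "alias": "addresses" } }
  ]
}

Because both actions run in a single _aliases request, the switch is atomic and your application never sees a missing alias.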
You probably already know that there are a number of query types in Elasticsearch, but if your aim is mostly text search, 90% of your needs will be met by just the term query and the bool query (the way of combining subqueries into a bigger query), so it is important to learn all the features of these two query types first. Be sure to understand the relationship between the term query and analyzers (both index-time and search-time).
Other full-text queries are built on top of these two, and all the features you learned about them can still be used in the others. Match and multi_match queries translate directly into combinations of bool and term queries. Of course, if you want to use the full power of Elasticsearch, you should still read all of the query DSL parts of the documentation.
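As a rough sketch only (the real rewrite also involves dis_max for scoring), a multi_match for “girard 17” over streetname and streetnum behaves like term queries over the analyzed tokens, combined with bool:

{
  "bool": {
    "should": [
      { "term": { "streetname": "girard" } },
      { "term": { "streetname": "17" } },
      { "term": { "streetnum": "girard" } },
      { "term": { "streetnum": "17" } }
    ]
  }
}

Seeing match queries this way makes it clear why the analyzer matters: the term queries run against the tokens the analyzer produced, not the raw input.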
Naming your queries and enabling highlighting helps you understand why each document is returned for your query. Naming queries is most useful when a subquery doesn’t have to match – when it is part of a should group. If you name them, you don’t have to check each one – the response tells you which parts matched. If you also enable highlighting, you get information on which parts of the documents matched. This will help you exclude unwanted matches in the future.
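A sketch of both features together (streetname comes from the earlier examples; the city field is hypothetical):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "streetname": { "query": "girard", "_name": "street_matched" } } },
        { "match": { "city": { "query": "philadelphia", "_name": "city_matched" } } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "streetname": {},
      "city": {}
    }
  }
}

Each hit then carries a matched_queries list (e.g. ["street_matched"]) and a highlight section with the matching fragments.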
In our case, we used the Amazon Elasticsearch service, but there are many other providers. If you haven’t set up an Elasticsearch cluster yet, it will surprise you with bootstrap checks. You would also probably need to develop some tooling for adding and removing nodes from the cluster, and you would have to take care of security yourself. If your main focus is the development speed of your application, I recommend outsourcing the setup.
This one is a tip for speeding up your cluster. There is a great explanation of how to shard your data here, so I will only explain our approach, which worked perfectly. The first step is to estimate how much data you will need available soon. Don’t prepare for a hypothetical enormous growth of your data – just calculate how much you will have in, say, a year, based on the current rate of increase. Then apply the basic rule: a shard should not exceed the maximum Java heap size (so 32 GB), then add a few more shards for a small overallocation.
Another criterion is to always have more shards than nodes in the cluster, so if you want to be prepared to expand your cluster to 20 nodes, use about 30 shards. In our case we had an index of about 250 GB (with no significant growth expected soon), so we decided on 30 shards: each shard would be about 8 GB and would fit in the memory of our nodes. Don’t worry about adding replicas at first – their main purpose is to keep your data safe when a number of your nodes crash, and they are more likely to slow down performance due to additional memory usage.
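For instance, with the numbers above, creating the index might look like this (the index name is hypothetical, and number_of_replicas is shown only to make the setting explicit):

PUT addresses_v1
{
  "settings": {
    "number_of_shards": 30,
    "number_of_replicas": 0
  }
}

Note that number_of_shards cannot easily be changed after index creation (you would have to reindex), which is why the estimate matters.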
A query is faster if it needs to score fewer documents. The query may consist of a combination of subqueries, but if some part of it can exclude documents by itself, then the other subqueries do not even have to check them, so the whole process is faster. In our case, we made sure to use filter, must_not and must in the bool query whenever it was possible and still made sense.
It is tempting to just adjust the boosts of should queries instead, because excluding documents requires giving more thought to each query so that you do not exclude something you didn’t want (false negatives are harder to debug than false positives). The performance improvement is significant, though, which pays off when you run your queries many times.
Also remember to avoid queries that are slow by nature (script, regex) – or at least make sure another kind of filter is applied first.
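A sketch of such a query (the state and deleted_at fields are hypothetical): the filter and must_not clauses cut down the candidate set without scoring, and only the must clause is scored:

{
  "bool": {
    "filter": [
      { "term": { "state": "pa" } }
    ],
    "must_not": [
      { "exists": { "field": "deleted_at" } }
    ],
    "must": [
      { "match": { "streetname": "girard" } }
    ]
  }
}

As a bonus, filter clauses can be cached by Elasticsearch, so repeated queries with the same filters get even cheaper.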
Of course, you need to monitor cluster health in the first place. But if your cluster runs out of memory or sits at 100% CPU all the time, your queries will run slower. It sounds obvious, but I still mention it here because it’s easy to forget when you finally test the speed of your queries.
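At a minimum, it is worth checking these two endpoints regularly (or wiring them into your monitoring):

GET _cluster/health
GET _nodes/stats/jvm,os

The first reports the overall status (green/yellow/red) and unassigned shards; the second exposes JVM heap usage and OS CPU load per node – exactly the memory and CPU problems mentioned above.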