Translation of a post by Nick Price
I am currently working on a large logging project that was originally built on AWS Elasticsearch. Having run large-scale standard Elasticsearch clusters for several years, I am stunned by the poor quality of the AWS implementation and cannot understand why they have not fixed it, or at least improved it.
Summary
Elasticsearch stores data in indices that you either create explicitly or that are created automatically when data is first sent to them. The documents in each index are divided into a number of shards, which are then balanced across the nodes of your cluster (as evenly as possible when the shard count does not divide evenly by the number of nodes). There are two main types of shards in Elasticsearch: primary shards and replica shards. Replica shards provide fault tolerance in the event of a node failure, and the number of replica shards can be set separately for each index.
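Shard and replica counts are just index settings. Below is a minimal sketch of setting them explicitly through the plain REST API in Python; the endpoint URL and index name are placeholders, not anything from a real cluster.

```python
# Minimal sketch: create an index with explicit primary and replica shard
# counts via the REST API. ES_URL and the index name are placeholders.
import requests

ES_URL = "http://localhost:9200"

# The primary shard count is fixed at index creation time.
requests.put(
    f"{ES_URL}/logs-example",
    json={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
)

# The replica count can be adjusted per index on a live cluster at any time.
requests.put(
    f"{ES_URL}/logs-example/_settings",
    json={"index": {"number_of_replicas": 2}},
)
```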
How standard Elasticsearch works
Elasticsearch is, well, elastic. It can be finicky at times, but in general you can add nodes to a cluster or remove them. If a node is removed and a suitable number of replicas exists, Elasticsearch redistributes the shards and evens out the load across the remaining nodes. This usually works.
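For illustration, this is roughly how a node is drained on standard Elasticsearch before it is removed; a minimal sketch, assuming a reachable cluster at ES_URL and a placeholder node IP.

```python
# Minimal sketch: drain a node before removing it from a standard
# (self-managed) cluster. ES_URL and the node IP are placeholders.
import requests

ES_URL = "http://localhost:9200"

# Exclude the departing node from allocation; Elasticsearch relocates its
# shards onto the remaining nodes.
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": "10.0.0.42"}},
)

# Watch relocation; once no shards remain on that IP, the node can be shut down.
print(requests.get(
    f"{ES_URL}/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,state,node,ip"},
).text)
```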
Expensive queries can sometimes bring nodes down, but a wide range of settings helps keep things running. With enough shard replicas, losing a node does not affect the cluster as a whole.
Standard Elasticsearch also offers a number of add-ons, including X-Pack, with auditing, granular ACLs, monitoring, and alerting. Most of X-Pack recently became free, probably in response to Splunk's new licensing policy.
How Amazon Elasticsearch works
As usual, Amazon took the open-source parts of Elasticsearch, hard-forked them, and started selling it as its own service, gradually rolling out its own versions of features that have been available in one form or another in upstream Elasticsearch for years.
The Amazon product lacks many things, such as RBAC and auditing, which is especially problematic for us because we ingest logs from different teams and would like to isolate them from one another. At the moment, any user with access to Elasticsearch has full access rights and can accidentally delete someone else's data, change how it is replicated across nodes, or stop ingestion entirely by adding a bad index template.
This is frustrating, but it is not the biggest problem with the service. Shard rebalancing, a central concept of Elasticsearch, simply does not work in the AWS implementation, which negates nearly everything good about Elasticsearch.
Typically, as data is written, one node can fill up faster than the others. This is expected, since there is no guarantee that incoming records are the same size or that shards are always spread evenly across all nodes of the cluster. It is not critical either, because Elasticsearch can rebalance shards between nodes, and if one node really does fill up, the other nodes gladly start receiving data in its place.
Amazon does not support this. Some nodes can fill up (much) faster than others.
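The imbalance itself is easy to see with the _cat APIs; a minimal sketch, again assuming a reachable placeholder endpoint.

```python
# Minimal sketch: check per-node disk usage and shard counts to spot a node
# filling up faster than the others. ES_URL is a placeholder.
import requests

ES_URL = "http://localhost:9200"

# Shard count, disk used, and disk available per data node.
print(requests.get(f"{ES_URL}/_cat/allocation", params={"v": "true"}).text)

# Shard-level view sorted by on-disk size, to find the heaviest indices.
print(requests.get(
    f"{ES_URL}/_cat/shards",
    params={"v": "true", "s": "store:desc", "h": "index,shard,prirep,store,node"},
).text)
```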
Moreover, on Amazon, if even one node in your Elasticsearch cluster runs out of free space, the entire cluster stops receiving data: it stops completely. Amazon's solution is to have users go through the nightmare of periodically changing the number of shards in their index templates, reindexing previously written data into new indices, deleting the old indices, and, if necessary, reindexing the data back into the old structure. This is entirely superfluous work, and beyond the heavy computational cost it requires keeping a raw copy of every ingested record alongside the parsed one, because the raw copy is needed for reindexing. Which, of course, doubles the amount of storage needed for "normal" operation on AWS.
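For reference, the reindexing dance described above amounts to something like the following sketch; the index names, shard counts, and endpoint are placeholders.

```python
# Minimal sketch of the "change the shard count by reindexing" workaround.
# Index names, shard counts, and ES_URL are placeholders.
import requests

ES_URL = "http://localhost:9200"

# 1. Create a destination index with the new shard layout.
requests.put(
    f"{ES_URL}/logs-new",
    json={"settings": {"number_of_shards": 10, "number_of_replicas": 1}},
)

# 2. Copy every document from the old index into the new one. Running it as a
#    background task returns a task ID instead of blocking.
resp = requests.post(
    f"{ES_URL}/_reindex",
    params={"wait_for_completion": "false"},
    json={"source": {"index": "logs-old"}, "dest": {"index": "logs-new"}},
)
print("reindex task:", resp.json().get("task"))

# 3. Once the task completes, delete the old index (only with a backup in hand):
# requests.delete(f"{ES_URL}/logs-old")
```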
"Oops! I didn't reindex the whole cluster often enough and a node filled up! What do I do now?"
You have two options. The first is to delete as much data as it takes to bring the cluster back to life, then start reindexing and hope nothing falls apart. Do you have a backup of what you are about to delete?
The second option is to add more nodes to the cluster or resize existing ones to a larger instance size.
But wait, how do I add nodes or make changes if shards cannot be rebalanced?
Amazon's solution is a blue-green deployment: they spin up an entirely new cluster, copy the full contents of the old cluster into it, then switch over and destroy the old cluster.
Such resize tasks can take days for large clusters; as you can imagine, duplicating several trillion records takes a while. It also puts an insane load on the existing cluster (which is probably already over capacity) and can actually cause it to fail. I have performed such operations on more than 30 clusters in AWS, and only once did I see one complete successfully without intervention.
So, you tried to resize your cluster, and the task did not complete. Now what?
Interacting with Amazon
Your cluster resize task has stalled (on a managed service you probably chose precisely so you would not have to deal with this sort of thing), so you open a ticket with AWS support at the highest priority. Naturally, they will complain about the number or size of your shards and helpfully attach a link to the "best practices" you have already read 500 times. And then you wait for it to be fixed. And wait. And wait. The last time I tried to resize a cluster and the operation got stuck, causing a serious outage, it took SEVEN DAYS to bring everything back online. They restored the cluster itself within a couple of days, but somewhere along the way the nodes Kibana runs on apparently lost contact with the main cluster. AWS support then spent another four days trying to fix that while asking me whether Kibana was working. They did not even know whether they had fixed the problem, and it was left to me to confirm that they had restored connectivity between their own systems. Since then, I do nothing but delete data whenever a node fills up.
Our organization spends an enormous amount on AWS. That buys us periodic meetings with their specialists in various areas to discuss implementation strategies and work through technical issues. We scheduled a session with their Elasticsearch specialist, and I spent most of the meeting explaining the basics of Elasticsearch and describing... the quirks... of their product. The specialist was genuinely shocked that everything collapses when a single node fills up. If the expert they send does not know the fundamentals of their own product, it is not surprising that the support team needs seven days to bring a production cluster back.
Closing thoughts
The logging project I landed in has its share of architectural mistakes and weak design decisions, which we are now working through. And of course I expected AWS Elasticsearch to differ from the original product. But in AWS Elasticsearch, so many fundamental features are disabled or missing that it aggravates almost every problem we face.
For light use and small clusters, AWS Elasticsearch works reasonably well, but for petabyte-scale clusters it has been an endless nightmare.
I'm extremely curious why Amazon's Elasticsearch implementation cannot rebalance shards; this is utterly fundamental Elasticsearch functionality. Even with its limitations relative to upstream Elasticsearch, it would be a perfectly acceptable product for large clusters if it just worked properly. I can't understand why Amazon offers something so broken, or why they haven't remedied the situation in more than two years.
As others have suggested, and it seems plausible, this behavior smells like an AWS implementation built as a giant multi-tenant cluster that tries to provide enough isolation to look like a stand-alone cluster to end users. Even with options such as encryption at rest and encryption in transit, this seems believable. Or perhaps their tooling and configuration are simply the legacy of a much earlier architecture.
And, as a friend of mine remarked, it's quite funny that they still call it "Elastic" when you cannot add or remove nodes from your clusters without spinning up a whole new one and transferring all your data.
Footnote: while writing this, I found a two-year-old post with many of the same complaints: read.acloud.guru/things-you-should-know-before-using-awss-elasticsearch-service-7cd70c9afb4f