A Short Emergency Response Guide for Elasticsearch
Help! Production is on fire!
Our job is first to identify what exactly is on fire, and for that we need to get as much context as possible, as fast as possible.
Step 1: Check _cat/nodes for resource usage
Let's first check which nodes are actually in trouble.
Run:
GET _cat/nodes?v&h=master,ip,name,cpu,ram.max,heap.max,heap.percent,node.role,diskAvail
The output looks like this.
master ip           name                cpu ram.max heap.max heap.percent node.role diskAvail
-      10.47.48.170 instance-0000000009 26  2gb     844mb    60           himrst    48.7gb
-      10.47.48.127 instance-0000000008 16  2gb     844mb    64           himrst    52.7gb
*      10.47.48.118 instance-0000000003 7   2gb     844mb    69           himrst    50gb
-      10.47.48.61  instance-0000000005 0   1gb     268mb    41           lr        1.8gb
If we notice high CPU usage on a group of nodes, check the node.role of those nodes: if you have data tiers (a hot-warm architecture), it might be that only the warm nodes are struggling while the hot nodes are unaffected.
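On larger clusters it can help to sort the output by the column you care about. A quick sketch using the cat APIs' s (sort) parameter; pick whichever columns you need:
GET _cat/nodes?v&h=name,node.role,cpu,heap.percent&s=cpu:desc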
There are more columns you can check, like search.query_current and search.fetch_current, which show how many search operations are currently in the query and fetch phases respectively. GET _cat/nodes?help is your friend.
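For example, a sketch that adds those search columns to the same request (the exact column list is just a suggestion):
GET _cat/nodes?v&h=name,node.role,search.query_current,search.fetch_current,search.query_time,search.fetch_time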
Step 2: Check the hot threads
This API returns a breakdown of the hot threads on each selected node in the cluster; the output is plain text, showing each node's busiest threads.
Let's sample them over a 1-second interval:
GET _nodes/hot_threads?interval=1s&ignore_idle_threads
The output will look something like this (only the important parts are shown):
::: {warm-xxx}{XXXXXXX}{YYYYYYYY}{10.10.10.10}{10.10.10.10:9300}{aws_availability_zone=us-west-2b, data_type=warm, ml.machine_memory=64388997120, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
Hot threads at 2019-12-30T23:22:24.304Z, interval=1s, busiestThreads=3, ignoreIdleThreads=true:
44.0% (440.2ms out of 1000ms) cpu usage by thread 'elasticsearch[warm-xxx][management][T#1]'
... omitted ...
42.7% (413.4ms out of 1000ms) cpu usage by thread 'elasticsearch[warm-xxx][search][T#2]'
... omitted ...
41.8% (408.9ms out of 1000ms) cpu usage by thread 'elasticsearch[warm-xxx][search][T#7]'
... omitted ...
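If you already know which node is in trouble, you can target it directly and ask for more threads. A sketch, where the node name and parameter values are placeholders to adapt:
GET _nodes/instance-0000000009/hot_threads?interval=1s&threads=5&type=cpu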
Step 3: Check Tasks
The Task Management API can give you a lot of information about the operations being executed in the cluster at any given time.
GET _tasks?human&detailed
Tasks can be filtered to include only read (search) operations:
GET _tasks?human&detailed&actions=indices:data/read/*
The output will show us the running_time and, in the case of searches, the content of the specific search request being executed:
"0l67b6iLSzmp7v3TNtkjbQ:138659665" : {
"node" : "0l67b6iLSzmp7v3TNtkjbQ",
"id" : 138659665,
"type" : "transport",
"action" : "indices:data/read/search",
"description" : "indices[tasks], types[], search_type[QUERY_THEN_FETCH], source[{\"size\":0,\"query\":{\"bool\":{\"must\":[{\"terms\":{\"User.Actions.ActionId\":[\"f6583dbd-4079-4efd-80c4-28e3f0606c1f\",\"2f80c480-18a4-4079-4efd-d3bdf9361164\",
...
"start_time" : "2022-06-28T15:27:20.766Z",
"start_time_in_millis" : 1656430040766,
"running_time" : "36.6s",
"running_time_in_nanos" : 36665783627,
The task 0l67b6iLSzmp7v3TNtkjbQ:138659665 is a search task that has been running for 36 seconds; its source is included in the description field, which might help identify the culprit.
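While you decide what to do with it, you can keep an eye on that single task by fetching it directly by its id (a sketch, reusing the task id from above):
GET _tasks/0l67b6iLSzmp7v3TNtkjbQ:138659665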
Step 4: Cancel long-running tasks
If you see queries that are hurting performance for too long and you want to cancel them, you can do so with POST _tasks/<task_id>/_cancel:
POST _tasks/0l67b6iLSzmp7v3TNtkjbQ:138659665/_cancel
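If a whole family of requests is the problem, the same endpoint also accepts the actions filter, so a sketch like the following would cancel every running search task at once (use with care, it will also kill legitimate searches):
POST _tasks/_cancel?actions=indices:data/read/search*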
However, if you find yourself cancelling tasks every day, your team should really rethink its data structure, query design, or cluster size.
Problem solved? Let's get a little bit more proactive by:
1) having a dedicated Monitoring Cluster
- In Elastic Cloud that is as easy as creating a secondary (small) cluster and pointing Logs and Metrics over there.
- In on-premises clusters you need to start an instance of Metricbeat and point it to your production cluster:
The elasticsearch module will fetch monitoring info from your production hosts with the following configuration (metricbeat.yml):
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["https://prod-cluster:9200"]
    scope: cluster
    username: "user"
    password: "secret"
    ssl.enabled: true
    ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
    ssl.verification_mode: "certificate"

output.elasticsearch:
  hosts: ["https://my-monitoring-cluster:9200"]
  username: "metricbeat_writer"
  password: "secret"
2) Enabling Slow Logs
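- As a minimal sketch (the index name and thresholds are placeholders, adapt them to your own latencies), search slow logs can be enabled per index with dynamic settings:
PUT my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
- Anything slower than those thresholds gets written to the slow log, so the next time something is on fire the culprit queries are already recorded.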