Bug 1477742 - How to monitor Elasticsearch for bulk index rejections
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.6.0
Hardware: x86_64 All
Priority: unspecified  Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assigned To: Vikram Goyal
QA Contact: Vikram Goyal
Docs Contact: Vikram Goyal
Whiteboard: aos-scalability-36
Keywords: TestBlocker
Depends On: 1470862
Blocks:
Reported: 2017-08-02 14:58 EDT by Rich Megginson
Modified: 2017-08-02 17:53 EDT
CC List: 12 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1470862
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Rich Megginson 2017-08-02 14:58:08 EDT
I'm not sure where this goes, maybe in the Performance guide if there is a section on logging, but we need to document how customers can monitor Elasticsearch for bulk index rejections.

Below is copied from the doc text of 1470862:

In OCP 3.5 and earlier, the Fluentd image included fluent-plugin-elasticsearch version 1.9.2 or earlier.  Those versions silently drop records sent in a bulk index request when the Elasticsearch bulk queue is full [1].  OCP 3.6 uses version 1.9.5, which added an error log message; that is why we now see the “Error: status=429” message in the Fluentd logs when this occurs [2].

One thing that might help to reduce the frequency of this problem is to increase the Fluentd buffer chunk size, but so far our testing does not give consistent results.  You will need to stop, configure, and restart Fluentd running on all of your nodes.

- edit the daemonset

# oc edit -n logging daemonset logging-fluentd

In the `env:` section, look for BUFFER_SIZE_LIMIT.  If the value is less than 8Mi (8 megabytes), change it to 8Mi; otherwise, use a value of 16Mi or 32Mi.  This roughly increases the size of each bulk index request.  The theory is that Fluentd will then make fewer such requests to Elasticsearch, allowing Elasticsearch to process them more efficiently.

Once the edit is saved, the Fluentd daemonset trigger should cause a restart of all of the Fluentd pods running in the cluster.
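
If you would rather script the change than edit the daemonset by hand, something like the following might work. This is only a sketch: `next_buffer_size` is a hypothetical helper, not part of any OCP tooling, and it assumes the current value is written with a `Mi` suffix. It only encodes the rule described above (below 8Mi go to 8Mi, then try 16Mi, then 32Mi).

```shell
# Hypothetical helper: suggest the next BUFFER_SIZE_LIMIT value to try.
# Assumes the argument looks like "<integer>Mi", for example "8Mi".
next_buffer_size() {
  current_mi=${1%Mi}            # strip the Mi suffix, leaving the number
  if [ "$current_mi" -lt 8 ]; then
    echo 8Mi
  elif [ "$current_mi" -lt 16 ]; then
    echo 16Mi
  else
    echo 32Mi
  fi
}

# Example (not run here; assumes `oc set env` accepts a daemonset in your
# version of the client):
#   oc set env -n logging daemonset/logging-fluentd \
#     BUFFER_SIZE_LIMIT="$(next_buffer_size 8Mi)"
```

Changing the value this way should trigger the same restart of the Fluentd pods as saving the edit.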

## How to monitor Elasticsearch ##

You can monitor the Elasticsearch bulk index thread pool to see how many bulk index requests it processes and rejects.

- get the name of an Elasticsearch pod

# oc get -n logging pods -l component=es

# espod=$name_of_es_pod

- issue the following command

# oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  'https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize'



The output looks like this:

host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
10.128.0.6           2262             0          0           0             50

"completed" is the number of bulk indexing operations that have been completed.  Each bulk index request carries many log records, typically hundreds or thousands.  "queue" is the number of pending requests queued up for the server to process, and "queueSize" is the maximum length of that queue.  Once the queue is full, additional operations are rejected.

Note the number of bulk.rejected operations.  This should roughly correspond to the number of “Error: status=429” messages in your Fluentd pod logs.  A rejected operation means that Fluentd dropped those records, and you might need to increase the chunk size again.
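
To compare against the Fluentd side, you could count the 429 messages in a Fluentd pod's log. A rough sketch follows; `count_429` is a hypothetical helper, and the match string `status=429` is taken from the message quoted above, so the exact log wording in your version may differ.

```shell
# Hypothetical helper: count log lines on stdin that mention status=429.
count_429() {
  # grep -c prints the count; it exits 1 when the count is 0, so ignore that.
  grep -c 'status=429' || true
}

# Example (not run here; $fluentdpod is a placeholder for the name of one
# Fluentd pod in the logging namespace):
#   oc logs -n logging "$fluentdpod" | count_429
```

Each Fluentd pod has its own log, so the counts from all pods would need to be added up before comparing them with bulk.rejected.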

If you have multiple nodes running Elasticsearch, they will each be listed in the curl output.
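
With several Elasticsearch nodes it may be easier to filter the output than to read it by eye. The sketch below is one way to do that; `report_bulk_rejections` is a hypothetical helper that reads the `_cat/thread_pool` output (with the `v` header row) on stdin and prints only the hosts whose bulk.rejected count is non-zero. It assumes the column names used in the curl command above.

```shell
# Hypothetical helper: list hosts with a non-zero bulk.rejected count.
# Column positions are read from the header row, so the order of the
# h= fields does not matter.
report_bulk_rejections() {
  awk '
    NR == 1 {
      # Map each column name in the header to its field number.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["bulk.rejected"]) > 0 {
      printf "%s rejected=%s queue=%s/%s\n",
        $(col["host"]), $(col["bulk.rejected"]),
        $(col["bulk.queue"]), $(col["bulk.queueSize"])
    }
  '
}

# Example (not run here): pipe the curl command shown above into the helper.
#   oc exec -n logging $espod -- curl ... | report_bulk_rejections
```

The counters reported by Elasticsearch are cumulative since the node started, so running this periodically and comparing successive values shows whether rejections are still happening.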

[1] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.2/lib/fluent/plugin/out_elasticsearch.rb#L353

[2] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.5/lib/fluent/plugin/out_elasticsearch.rb#L355
