Bug 1477742 - How to monitor Elasticsearch for bulk index rejections
Summary: How to monitor Elasticsearch for bulk index rejections
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.6.0
Hardware: x86_64
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Vikram Goyal
QA Contact: Vikram Goyal
Docs Contact: Vikram Goyal
URL:
Whiteboard: aos-scalability-36
Depends On: 1470862
Blocks:
 
Reported: 2017-08-02 18:58 UTC by Rich Megginson
Modified: 2019-11-20 18:52 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1470862
Environment:
Last Closed: 2019-11-20 18:52:04 UTC
Target Upstream Version:
Embargoed:



Description Rich Megginson 2017-08-02 18:58:08 UTC
I'm not sure where this goes, maybe in the Performance guide if there is a section on logging, but we need to document how customers can monitor Elasticsearch for bulk index rejections.

Below is copied from the doc text of 1470862:

In OCP 3.5 and earlier, the Fluentd image included fluent-plugin-elasticsearch version 1.9.2 or earlier. Those versions silently drop records sent in a bulk index request when the queue is full [1]. OCP 3.6 uses version 1.9.5, which added an error log message for this case, which is why the “Error: status=429” message now appears in the Fluentd logs when this occurs [2].

Increasing the Fluentd buffer chunk size may reduce how often this problem occurs, although so far our testing has not given consistent results. To change it, you need to stop, reconfigure, and restart Fluentd on all of your nodes.

- edit the daemonset

# oc edit -n logging daemonset logging-fluentd

In the `env:` section, look for BUFFER_SIZE_LIMIT. If the value is less than 8Mi (8 megabytes), change it to 8Mi; otherwise, use 16Mi or 32Mi. This roughly increases the size of each bulk index request; the theory is that fewer, larger requests will be made to Elasticsearch, allowing it to process them more efficiently.
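
For example, after the edit the relevant entry in the container's `env:` section would look roughly like this (a sketch; 16Mi or 32Mi follow the same pattern):

        env:
        - name: BUFFER_SIZE_LIMIT
          value: "8Mi"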

Once the edit is saved, the daemonset's update trigger should restart all of the Fluentd pods running in the cluster.
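
To verify the rollout, you can check the daemonset status and confirm the pods have come back (a sketch; component=fluentd is assumed to be the label the logging stack applies to the Fluentd pods):

# oc get -n logging daemonset logging-fluentd
# oc get -n logging pods -l component=fluentd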

## How to monitor Elasticsearch ##

You can monitor the Elasticsearch bulk index thread pool to see how many bulk index requests it processes and rejects.

- get the name of an Elasticsearch pod

# oc get -n logging pods -l component=es

# espod=$name_of_es_pod
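
If you prefer to capture the pod name in one step, something like this works (a sketch that simply takes the first Elasticsearch pod returned):

# espod=$(oc get -n logging pods -l component=es \
    -o jsonpath='{.items[0].metadata.name}')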

- issue the following command

# oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize"

The output looks like this:

host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
10.128.0.6           2262             0          0           0             50

"completed" means the number of bulk indexing operations that have been completed.  There will be many, hundreds or thousands of, log records per bulk index request.  "queue" is the number of pending requests that have been queued up for the server to process.  Once this queue is full, additional operations will be rejected.

Note the number of bulk.rejected operations. These should correspond roughly to the "error status=429" messages in your Fluentd pod logs. Rejected operations mean that Fluentd dropped those records, and you may need to increase the chunk size again.
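
To cross-check from the Fluentd side, you can grep the Fluentd pod logs for the 429 status (a sketch; component=fluentd is assumed to be the label on the Fluentd pods, and $name_of_fluentd_pod is one of the pods returned):

# oc get -n logging pods -l component=fluentd
# oc logs -n logging $name_of_fluentd_pod | grep "status=429"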

If you have multiple nodes running Elasticsearch, they will each be listed in the curl output.
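
If you want to watch these counters over time rather than sampling them once, a simple loop around the same command works (a sketch; adjust the interval as needed):

# while true; do
    oc exec -n logging $espod -- \
      curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      "https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active"
    sleep 60
  done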

[1] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.2/lib/fluent/plugin/out_elasticsearch.rb#L353

[2] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.5/lib/fluent/plugin/out_elasticsearch.rb#L355

Comment 1 Stephen Cuppett 2019-11-20 18:52:04 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

