In OCP 3.5 and earlier, the Fluentd image included fluent-plugin-elasticsearch version 1.9.2 or earlier. These versions silently drop the records sent in a bulk index request when the Elasticsearch bulk queue is full. OCP 3.6, which uses version 1.9.5, adds an error log message, which is why the “Error: status=429” message now appears in the Fluentd logs when this occurs.
One thing that might reduce the frequency of this problem is to increase the Fluentd buffer chunk size, although our testing so far has not given consistent results. You will need to stop, reconfigure, and restart Fluentd on all of your nodes.
- edit the daemonset
# oc edit -n logging daemonset logging-fluentd
In the `env:` section, look for BUFFER_SIZE_LIMIT. If the value is less than 8Mi (8 megabytes), change it to 8Mi. Otherwise, use a value of 16Mi or 32Mi. This roughly increases the size of each bulk index request. The theory is that larger requests mean fewer requests made to Elasticsearch, which allows Elasticsearch to process them more efficiently.
Once the edit is saved, the Fluentd daemonset trigger should cause a restart of all of the Fluentd pods running in the cluster.
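The sizing rule above can be sketched as a small shell helper. This is only an illustration: the function name `next_buffer_size` is hypothetical, it assumes the value is written with an `Mi` suffix, and stepping from 8Mi to 16Mi to 32Mi is one reading of the rule of thumb. Applying the result with `oc set env` instead of an interactive edit is also an assumption about your `oc` client, so that line is left commented out.

```shell
#!/bin/sh
# Hypothetical helper: suggest the next BUFFER_SIZE_LIMIT value to try.
# Assumes the current value is a whole number followed by "Mi" (e.g. "8Mi").
next_buffer_size() {
    case "$1" in
        *Mi) current_mi=${1%Mi} ;;   # strip the suffix: "8Mi" -> "8"
        *)   echo "unsupported value: $1" >&2; return 1 ;;
    esac
    case "$current_mi" in
        ''|*[!0-9]*) echo "unsupported value: $1" >&2; return 1 ;;
    esac

    if [ "$current_mi" -lt 8 ]; then
        echo 8Mi        # below 8Mi: start at 8Mi
    elif [ "$current_mi" -lt 16 ]; then
        echo 16Mi       # already at 8Mi or more: try 16Mi
    else
        echo 32Mi       # already at 16Mi or more: try 32Mi
    fi
}

# Example (not run here), assuming `oc set env` is available in your client:
# oc set env -n logging daemonset/logging-fluentd \
#     BUFFER_SIZE_LIMIT="$(next_buffer_size 4Mi)"
```

Keeping the rule in a function makes it easy to apply the same step on each round of tuning, since the results of a given size are not yet predictable.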
## How to monitor Elasticsearch ##
You can monitor the Elasticsearch bulk index thread pool to see how many bulk index requests it processes and rejects.
- get the name of an Elasticsearch pod
# oc get -n logging pods -l component=es
- issue the following command
# oc exec -n logging $espod -- \
curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
--key /etc/elasticsearch/secret/admin-key \
The output looks like this:
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
10.128.0.6           2262             0          0           0             50
"completed" is the number of bulk indexing operations that have completed. Each bulk index request typically carries many log records, often hundreds or thousands. "queue" is the number of pending requests queued for the server to process, and "queueSize" is the maximum length of that queue. Once the queue is full, additional operations are rejected.
Note the number of bulk.rejected operations. It should roughly correspond to the number of "Error: status=429" messages in your Fluentd pod logs. Each rejected operation means that Fluentd dropped the records in that request, and you might need to increase the chunk size again.
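To compare the two numbers, one option is to count the matching lines in a Fluentd pod log. The sketch below is an assumption-laden example: `count_429` is a hypothetical helper name, `$fluentdpod` is a placeholder for one of your Fluentd pod names, and it assumes the Fluentd log is available through `oc logs`.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on stdin that mention status=429.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_429() {
    grep -c 'status=429' || true
}

# Example (not run here); $fluentdpod is a placeholder pod name:
# oc logs -n logging "$fluentdpod" | count_429
```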
If you have multiple nodes running Elasticsearch, they will each be listed in the curl output.
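With several Elasticsearch nodes, it can help to reduce the table to the rejection count and queue usage per host, plus a cluster-wide total. The following is a minimal sketch, not part of the original procedure: `summarize_bulk_rejections` is a hypothetical name, and it assumes the table is piped in on stdin with a header row that uses the column names shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: summarize the thread pool table read from stdin.
# Assumes the first line is a header naming the columns as in the sample.
summarize_bulk_rejections() {
    awk '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF > 0 {
            rejected = $col["bulk.rejected"]
            total += rejected
            printf "%s rejected=%d queue=%d/%d\n", $col["host"], rejected, $col["bulk.queue"], $col["bulk.queueSize"]
        }
        END { printf "total rejected=%d\n", total }
    '
}

# Example using the sample output from this section:
printf '%s\n' \
    'host bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize' \
    '10.128.0.6 2262 0 0 0 50' | summarize_bulk_rejections
# prints:
#   10.128.0.6 rejected=0 queue=0/50
#   total rejected=0
```

A host whose queue value stays close to its queueSize is the one most likely to start rejecting requests.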