Bug 1470862

Summary: Fluentd logs filling with "Could not push log to Elasticsearch" errors, status=429 when starting 100-150 fluentd pods
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Logging    Assignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.6.0    CC: aos-bugs, jcantril, mifiedle, pportant, pweil, rmeggins, stwalter, trankin, vlaad, wsun, xtian
Target Milestone: ---    Keywords: TestBlocker
Target Release: 3.6.z   
Hardware: x86_64   
OS: All   
Whiteboard: aos-scalability-36
Fixed In Version:    Doc Type: Release Note
Doc Text:
In OCP 3.5 and earlier, the Fluentd image included fluent-plugin-elasticsearch version 1.9.2 and earlier. That version silently drops records sent in a bulk index request when the Elasticsearch queue is full [1]. OCP 3.6 uses version 1.9.5, which added an error log message for this case, which is why we now see the "Error: status=429" message in the Fluentd logs when this occurs [2].

One thing that might help reduce the frequency of this problem is to increase the Fluentd buffer chunk size, but so far our testing does not give consistent results. You will need to stop, configure, and restart Fluentd running on all of your nodes.

- edit the daemonset:
  # oc edit -n logging daemonset logging-fluentd
  In the `env:` section, look for BUFFER_SIZE_LIMIT. If the value is less than 8Mi (8 megabytes), change it to 8Mi; otherwise, use a value of 16Mi or 32Mi. This roughly increases the size of each bulk index request; the theory is that this decreases the number of such requests made to Elasticsearch, thereby allowing Elasticsearch to process them more efficiently.

Once the edit is saved, the Fluentd daemonset trigger should cause a restart of all of the Fluentd pods running in the cluster.

##  How to monitor Elasticsearch  ##

You can monitor the Elasticsearch bulk index thread pool to see how many bulk index requests it processes and rejects.

- get the name of an Elasticsearch pod:
  # oc get -n logging pods -l component=es
  # espod=$name_of_es_pod
- issue the following command:
  # oc exec -n logging $espod -- \
      curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize

The output looks like this:

host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
10.128.0.6           2262             0          0           0             50

"completed" is the number of bulk indexing operations that have completed; each bulk index request carries many (hundreds or thousands of) log records. "queue" is the number of pending requests that have been queued up for the server to process. Once this queue is full, additional operations are rejected.

Note the number of bulk.rejected operations. These should correspond roughly to the "error status=429" messages in your Fluentd pod logs. Rejected operations mean that Fluentd dropped those records, and you might need to increase the chunk size again. If you have multiple nodes running Elasticsearch, each will be listed in the curl output.

[1] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.2/lib/fluent/plugin/out_elasticsearch.rb#L353
[2] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.5/lib/fluent/plugin/out_elasticsearch.rb#L355
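A minimal non-interactive alternative to the `oc edit` step above, as a sketch; it assumes `oc set env` handles daemonsets in your release, and 16Mi is only an example value (follow the sizing guidance above):

- show the current Fluentd buffer settings
  # oc set env -n logging daemonset/logging-fluentd --list | grep BUFFER
- raise the chunk size; the daemonset trigger should then restart the Fluentd pods with the new value
  # oc set env -n logging daemonset/logging-fluentd BUFFER_SIZE_LIMIT=16Mi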
Story Points: ---
Clone Of:
: 1477742 (view as bug list) Environment:
Last Closed: 2017-10-25 13:02:19 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1477742    
Attachments:
full log from a fluentd pod

Description Mike Fiedler 2017-07-13 20:56:03 UTC
Created attachment 1297867 [details]
full log from a fluentd pod

Description of problem:

While scaling up the number of (non-mux) fluentds in the 3.6 scale cluster, at somewhere between 100 and 150 fluentd pods the fluentd logs start filling with frequent "Could not push log to Elasticsearch" messages (see below for an example; the full pod log is attached).

Environment overview:

3 masters / 3 etcd
3 infra nodes (40 vCPU/110 GB RAM) where all logging components are running
3-node ES cluster (3 ES deployment configs), 1 ES per infra node
logging-mux is NOT in use
BUFFER_QUEUE_LIMIT=128 and BUFFER_SIZE_LIMIT=16m were set at the request of @portante

There are a total of 460 nodes in the cluster
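A quick way to compare how many nodes are labelled for Fluentd with how many fluentd pods are actually running (a sketch; it assumes the default logging-infra-fluentd node label and component=fluentd pod label):

# oc get nodes -l logging-infra-fluentd=true --no-headers | wc -l
# oc get pods -n logging -l component=fluentd --no-headers | wc -l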


Version-Release number of selected component (if applicable): logging-fluentd is 3.6.140-3, elasticsearch and kibana are 3.6.140-1


How reproducible:  Always in this cluster when labelling 150 nodes for fluentd


Steps to Reproduce:
1.  Set up logging with a 3-node Elasticsearch cluster on a cluster with > 150 nodes
2.  Start labelling nodes with logging-infra-fluentd=true (a sketch follows this list)
3.  Watch the logs of the fluentd pods for the "Could not push log to Elasticsearch" errors.
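A sketch of steps 2 and 3; the node names are placeholders and the component=fluentd pod label is assumed:

# for n in node101 node102 node103; do oc label node $n logging-infra-fluentd=true; done
# pod=$(oc get pods -n logging -l component=fluentd -o name | head -1)
# oc logs -n logging -f $pod | grep --line-buffered 'status"=>429'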

Actual results:

2017-07-13 16:12:00 -0400 [error]: Could not push log to Elasticsearch: {"took"=>3, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lEt-yHUkC3cmRhe-", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1a34d37a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327053]]"}}}]}
2017-07-13 16:12:00 -0400 [error]: Could not push log to Elasticsearch: {"took"=>1, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lLjtBrGY7t4hWyhu", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@390823ea on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327053]]"}}}]}
2017-07-13 16:12:10 -0400 [error]: Could not push log to Elasticsearch: {"took"=>4, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGc", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGd", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGe", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGf", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}]}
2017-07-13 16:12:25 -0400 [error]: Could not push log to Elasticsearch: {"took"=>1, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE7", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE8", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE9", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE-", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}]}
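A rough way to quantify these errors on a given node (a sketch; $pod is a fluentd pod name, and bulk.rejected comes from the _cat/thread_pool command shown in the Doc Text above):

# oc logs -n logging $pod | grep -c 'status"=>429'

The count should roughly track the bulk.rejected column on the Elasticsearch side.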


Expected results:


Additional info:

Comment 1 Jeff Cantrill 2017-07-14 20:30:15 UTC
@Peter, I presume this is related to the amount of resources given to Elasticsearch?

Comment 2 Jeff Cantrill 2017-07-14 20:31:09 UTC
@Mike,

Can you provide additional details regarding the deployment of the environment: https://github.com/openshift/origin-aggregated-logging/blob/master/docs/issues.md

Comment 3 Jeff Cantrill 2017-07-14 20:37:22 UTC
@Peter is this related? https://bugzilla.redhat.com/show_bug.cgi?id=1395405

Comment 4 Mike Fiedler 2017-07-14 21:04:15 UTC
OpenShift Cluster Details

Cluster version (e.g. OCP v3.5)
3.6.126.1

Cluster infrastructure (e.g. AWS, GCP)
OSP 11

Number of infrastructure nodes and size (e.g. CPU, RAM)
3 - 40 vCPU, 110GB RAM

Number of compute nodes and size (e.g. CPU, RAM)
450 - 4 vCPU 16 GB RAM

Logging Image versions
3.6.140-3 for fluentd
3.6.140-1 for elastic, kibana, curator

Installer Details

How was logging installed (e.g. Openshift Ansible/Installer, deployer)
openshift-ansible openshift-logging.yaml

Openshift installer version
3.6.126.1

Ansible inventory file
[oo_first_master]
172.16.0.8

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0

openshift_logging_install_logging=true
openshift_logging_use_ops=false
openshift_logging_master_url=https://172.16.0.20:8443
openshift_logging_master_public_url=https://10.12.70.33:8443
openshift_logging_kibana_hostname=kibana.ocp.net
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.6.126.1
openshift_logging_es_cluster_size=3
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=250Gi
openshift_logging_fluentd_use_journal=true
openshift_logging_use_mux=true
openshift_logging_use_mux_client=true

(image later upgraded to versions above)
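For reference, a hypothetical invocation of the logging playbook against an inventory like the one above; the playbook path is assumed from the openshift-ansible 3.6 RPM layout and may differ in your environment:

# ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml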

Ansible installer logs
not saved

Logging Component Details

Customizations:

1. Removed console logging for elasticsearch config
2. Removed cpu limits for all logging components
3. Raised elasticsearch memory limit to 31MB
4. Removed memory limit for fluentd
5. BUFFER_SIZE_LIMIT=16m and BUFFER_QUEUE_LIMIT=128 for fluentd (a sketch of applying these post-install changes follows).
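A sketch of how customizations like these might be applied post-install; the dc name suffix and container index are placeholders, not values taken from this cluster:

- remove the CPU limit from an Elasticsearch deployment config:
  # oc patch -n logging dc/logging-es-<suffix> --type=json -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits/cpu"}]'
- set the fluentd buffer variables on the daemonset:
  # oc set env -n logging daemonset/logging-fluentd BUFFER_SIZE_LIMIT=16m BUFFER_QUEUE_LIMIT=128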

We lost this cluster today, will have to try to reproduce in another env if needed.  @portante has more data as well.

Comment 5 Jeff Cantrill 2017-07-17 15:45:34 UTC
I'm a little confused why none of the items listed in 'customizations' are reflected in the inventory file, since there are variables for all of these items.  Also, can you please clarify the memory limit for Elasticsearch, since '31MB' is below the minimum threshold calculated for ES instances.

Customizations

1. https://github.com/openshift/openshift-ansible/pull/4773
2. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/defaults/main.yml#L9
3. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/defaults/main.yml#L10
4. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/defaults/main.yml#L12
5. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/templates/fluentd.j2#L99-L102

Comment 6 Mike Fiedler 2017-07-17 17:32:34 UTC
The updates were made post-install to the deployment configurations and daemonsets on the advice of @portante.

The ES instances were 40vCPU and 110GB RAM.   Where do you see a smaller calculation?   Is there a bug for very large memory sizes in the calculation?

Comment 7 Jeff Cantrill 2017-07-17 17:40:57 UTC
@Mike,

You will never be able to set the memory larger than 64G, which calculates down to an actual heap of 32G [1].  The max heap is based on the memory sizing guidelines specified by ES.  The number I identified in comment #5 was a reflection of what you stated in comment #4.



[1] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L30

Comment 8 Mike Fiedler 2017-07-17 18:13:37 UTC
Gah, I need to read (and type) better.  31GB, not 31MB.  I don't know many huge Java projects that run well in 31MB.  Apologies: 31 gigabytes, not 31 megabytes.

Comment 9 Peter Portante 2017-07-17 23:30:29 UTC
@jeff, the 31 GB number comes from the first line of this guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#_just_how_far_under_32gb_should_i_set_the_jvm

Comment 10 Jeff Cantrill 2017-07-18 12:01:15 UTC
@Peter, do you wish to give it the max heap or half [1] of 31GB?

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L73

Comment 11 Peter Portante 2017-07-18 23:04:21 UTC
@Jeff, we wanted to give ES the maximum heap size of 31GB.  We don't want to specify 32GB, because that can push us over the limit at which the JVM stops using compressed object pointers.
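For anyone following along, a quick way to see the cutoff (a sketch; run with the same JVM the Elasticsearch image uses, and note the exact threshold can vary slightly by JVM build):

- at 31GB the JVM still reports compressed object pointers enabled
  # java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
- at 32GB it does not, so every object pointer doubles in size
  # java -Xmx32g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops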

Comment 18 Rich Megginson 2017-08-01 15:50:30 UTC
@pportant - please take a look at the Release Note I added to the Doc Text.

Comment 20 Peter Portante 2017-08-01 18:07:04 UTC
@Rich, we have to work on that doc text a bit.  I don't think that is the answer.  Let's review together and re-post.

Comment 21 Rich Megginson 2017-08-01 22:49:24 UTC
Updated Doc Text curl command - still not complete

Comment 22 Rich Megginson 2017-09-08 15:23:23 UTC
Fixed in 3.6:

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=595912
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-82316-20170908145540
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-2

Waiting for 3.7 puddle rebuild in order to fix image for 3.7

Comment 23 Mike Fiedler 2017-09-13 19:47:52 UTC
Verified on 3.6.173.0.32.

We can push 100 messages/sec/node * 200 nodes without issue with Elasticsearch CPU limits disabled (3-node ES cluster with 8 vCPU/instance).  logging-fluentd rarely has to retry, and when it does, it retries successfully.
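That works out to roughly 20,000 messages/sec across the cluster (100 msg/sec/node * 200 nodes).  A sketch of one way to spot-check the aggregate indexing rate on the Elasticsearch side, reusing $espod and the client certificates from the Doc Text above (it counts all indices, so curation or deletions during the window would skew it):

# c1=$(oc exec -n logging $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/_cat/count?h=count')
# sleep 60
# c2=$(oc exec -n logging $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://localhost:9200/_cat/count?h=count')
# echo "indexed docs/sec: $(( (c2 - c1) / 60 ))"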

Comment 24 Rich Megginson 2017-10-23 18:38:09 UTC
With this new fix, when Fluentd gets an error response from Elasticsearch, Fluentd will retry the operation until it succeeds.  However, in some cases it will also retry operations that actually succeeded, which leads to duplicate records.  See https://bugzilla.redhat.com/show_bug.cgi?id=1491401

Comment 26 errata-xmlrpc 2017-10-25 13:02:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049