Bug 1470862
Summary: Fluentd logs filling with "Could not push log to Elasticsearch" errors, status=429 when starting 100-150 fluentd pods
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Logging
Assignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, jcantril, mifiedle, pportant, pweil, rmeggins, stwalter, trankin, vlaad, wsun, xtian
Target Milestone: ---
Keywords: TestBlocker
Target Release: 3.6.z
Hardware: x86_64
OS: All
Whiteboard: aos-scalability-36
Fixed In Version:
Doc Type: Release Note
Doc Text:
In OCP 3.5 and earlier, the Fluentd image included fluent-plugin-elasticsearch version 1.9.2 or earlier. Those versions silently drop records sent in a bulk index request when the queue size is full [1]. OCP 3.6 uses version 1.9.5, which added an error log message; this is why the “Error: status=429” message now appears in the Fluentd logs when this occurs [2].
One thing that might help reduce the frequency of this problem is increasing the Fluentd buffer chunk size, although so far our testing has not given consistent results. You will need to stop, reconfigure, and restart Fluentd on all of your nodes.
- edit the daemonset
# oc edit -n logging daemonset logging-fluentd
In the `env:` section, look for BUFFER_SIZE_LIMIT. If the value is less than 8Mi (8 megabytes), change it to 8Mi; otherwise, use a value of 16Mi or 32Mi. This roughly increases the size of each bulk index request; the theory is that fewer, larger requests will be made to Elasticsearch, allowing it to process them more efficiently.
Once the edit is saved, the daemonset update should trigger a restart of all of the Fluentd pods running in the cluster.
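For illustration only (the surrounding fields in your daemonset will differ, and the value assumes the 8Mi case above), the relevant entry in the pod template might look like this after the edit:
  env:
  - name: BUFFER_SIZE_LIMIT
    value: "8Mi"
If you prefer a non-interactive change, something like `oc set env` should apply the same update and trigger the same rollout, though that was not the procedure used here:
# oc set env -n logging daemonset/logging-fluentd BUFFER_SIZE_LIMIT=8Mi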
## How to monitor Elasticsearch ##
You can monitor the Elasticsearch bulk index thread pool to see how many bulk index requests it processes and rejects.
- get the name of an Elasticsearch pod
# oc get -n logging pods -l component=es
# espod=$name_of_es_pod
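If you want to capture the pod name non-interactively, a one-liner along these lines should work (a sketch; it assumes at least one Elasticsearch pod is running in the logging project):
# espod=$(oc get -n logging pods -l component=es -o jsonpath='{.items[0].metadata.name}')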
- issue the following command
# oc exec -n logging $espod -- \
curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
--key /etc/elasticsearch/secret/admin-key \
https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
The output looks like this:
host bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
10.128.0.6 2262 0 0 0 50
"completed" means the number of bulk indexing operations that have been completed. There will be many, hundreds or thousands of, log records per bulk index request. "queue" is the number of pending requests that have been queued up for the server to process. Once this queue is full, additional operations will be rejected.
Note the number of bulk.rejected operations. These should correspond to "error status=429", roughly, in your Fluentd pod logs. Rejected operations means that Fluentd dropped these records, and you might need to increase the chunk size again.
If you have multiple nodes running Elasticsearch, they will each be listed in the curl output.
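To watch for rejections over time, a simple loop like the following can be used (a sketch only; the 60-second interval and the $espod variable from above are placeholders to adjust as needed):
# while true; do
    oc exec -n logging $espod -- curl -s -k \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        'https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue'
    sleep 60
  done
On the Fluentd side, a rough count of the corresponding rejection errors in one pod's log can be obtained with (the pod name is a placeholder):
# oc logs -n logging $name_of_fluentd_pod | grep -c '"status"=>429'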
[1] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.2/lib/fluent/plugin/out_elasticsearch.rb#L353
[2] https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.5/lib/fluent/plugin/out_elasticsearch.rb#L355
Story Points: ---
Clone Of:
: 1477742 (view as bug list)
Environment:
Last Closed: 2017-10-25 13:02:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1477742
Attachments:
@Peter, I presume this is related to the amount of resources given to Elasticsearch?

@Mike, can you provide additional details regarding the deployment of the environment: https://github.com/openshift/origin-aggregated-logging/blob/master/docs/issues.md

@Peter, is this related? https://bugzilla.redhat.com/show_bug.cgi?id=1395405

OpenShift Cluster Details
Cluster version (e.g. OCP v3.5): 3.6.126.1
Cluster infrastructure (e.g. AWS, GCP): OSP 11
Number of infrastructure nodes and size (e.g. CPU, RAM): 3 - 40 vCPU, 110 GB RAM
Number of compute nodes and size (e.g. CPU, RAM): 450 - 4 vCPU, 16 GB RAM

Logging Image versions
3.6.140-3 for fluentd
3.6.140-1 for elasticsearch, kibana, curator

Installer Details
How was logging installed (e.g. OpenShift Ansible/Installer, deployer): openshift-ansible openshift-logging.yaml
OpenShift installer version: 3.6.126.1

Ansible inventory file:
[oo_first_master]
172.16.0.8
[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0
openshift_logging_install_logging=true
openshift_logging_use_ops=false
openshift_logging_master_url=https://172.16.0.20:8443
openshift_logging_master_public_url=https://10.12.70.33:8443
openshift_logging_kibana_hostname=kibana.ocp.net
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.6.126.1
openshift_logging_es_cluster_size=3
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=250Gi
openshift_logging_fluentd_use_journal=true
openshift_logging_use_mux=true
openshift_logging_use_mux_client=true
(image later upgraded to versions above)

Ansible installer logs: not saved

Logging Component Details
Customizations:
1. Removed console logging for the elasticsearch config
2. Removed CPU limits for all logging components
3. Raised elasticsearch memory limit to 31MB
4. Removed memory limit for fluentd
5. BUFFER_SIZE_LIMIT=16m and BUFFER_QUEUE_LIMIT=128 for fluentd

We lost this cluster today, will have to try to reproduce in another env if needed. @portante has more data as well.

I'm a little confused why the items listed in 'customizations' are not reflected in the inventory file, since there are variables for all of these items. Also, can you please clarify the memory limit for elasticsearch, since '31MB' is below the minimum threshold calculated for ES instances.
Customizations:
1. https://github.com/openshift/openshift-ansible/pull/4773
2. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/defaults/main.yml#L9
3. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/defaults/main.yml#L10
4. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/defaults/main.yml#L12
5. https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/templates/fluentd.j2#L99-L102

The updates were made post-install to the deployment configurations and daemonsets at the advice of @portante. The ES instances were 40 vCPU and 110 GB RAM. Where do you see a smaller calculation? Is there a bug for very large memory sizes in the calculation?

@Mike, you will never be able to set the memory larger than 64G, which calculates down to an actual heap of 32G [1]. Max heap is based on the memory size guidelines specified by ES. The number I identified in comment #5 was a reflection of what you stated in comment #4.
[1] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L30

Gah, need to read (and type) better. 31GB, not 31MB. Don't know many huge Java projects that run well in 31MB. Apologies. 31 gigabytes, not 31 megabytes.

@Jeff, the 31 GB number comes from the first line of this guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#_just_how_far_under_32gb_should_i_set_the_jvm

@Peter, do you wish to give it max heap or half [1] of 31GB?
[1] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L73

@Jeff, we wanted to give ES the heap size of 31GB, the maximum. We don't want to specify 32GB, because that can push us over the large pointer limit.

@pportant - please take a look at the Release Note I added to the Doc Text.

@Rich, we have to work on that doc text a bit. I don't think that is the answer. Let's review together and re-post.

Updated Doc Text curl command - still not complete.

Fixed in 3.6:
koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=595912
repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-82316-20170908145540
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-2

Waiting for the 3.7 puddle rebuild in order to fix the image for 3.7.

Verified on 3.6.173.0.32. Can push 100 messages/sec/node * 200 nodes without issue with Elasticsearch CPU limits disabled (3-node ES cluster with 8 vCPU/instance). logging-fluentd rarely has to retry, and when it does, it does so successfully.

With this new fix, when Fluentd gets an error response from Elasticsearch, Fluentd will retry the operation until it succeeds. However, in some cases it will also retry operations that actually succeeded, which will lead to duplicate records. See https://bugzilla.redhat.com/show_bug.cgi?id=1491401

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:3049
Created attachment 1297867 [details]
full log from a fluentd pod

Description of problem:
Scaling up the number of (non-mux) fluentds in the 3.6 scale cluster, at somewhere between 100-150 fluentd pods, the fluentd logs start filling with frequent "Could not push log to Elasticsearch" messages (see below for an example; full pod log attached).

Environment overview:
3 masters / 3 etcd
3 infra nodes (40 vCPU/110 GB RAM) where all logging components are running
3 node ES cluster (3 ES deployment configs), 1 ES per infra node
logging-mux is NOT in use
Changed BUFFER_QUEUE_LIMIT=128 and BUFFER_SIZE_LIMIT=16m at the request of @portante
There are a total of 460 nodes in the cluster

Version-Release number of selected component (if applicable):
logging-fluentd is 3.6.140-3, elasticsearch and kibana are 3.6.140-1

How reproducible:
Always in this cluster when labelling 150 nodes for fluentd

Steps to Reproduce:
1. Set up logging with a 3-node Elasticsearch cluster on a cluster with > 150 nodes
2. Start labelling nodes with logging-infra-fluentd=true
3. Watch the logs of the fluentd pods for the "Could not push log to Elasticsearch" errors.

Actual results:

2017-07-13 16:12:00 -0400 [error]: Could not push log to Elasticsearch: {"took"=>3, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lEt-yHUkC3cmRhe-", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1a34d37a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327053]]"}}}]}

2017-07-13 16:12:00 -0400 [error]: Could not push log to Elasticsearch: {"took"=>1, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lLjtBrGY7t4hWyhu", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@390823ea on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327053]]"}}}]}

2017-07-13 16:12:10 -0400 [error]: Could not push log to Elasticsearch: {"took"=>4, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGc", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGd", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGe", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lHYN6ybrsXkJhZGf", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@3c670dc8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327119]]"}}}]}

2017-07-13 16:12:25 -0400 [error]: Could not push log to Elasticsearch: {"took"=>1, "errors"=>true, "items"=>[{"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE7", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE8", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE9", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}, {"create"=>{"_index"=>".operations.2017.07.13", "_type"=>"com.redhat.viaq.common", "_id"=>"AV09lRrUBrGY7t4hYBE-", "status"=>429, "error"=>{"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@1e491dd2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@312a2162[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 327191]]"}}}]}

Expected results:

Additional info: