Bug 1629015

| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Scale lab: with 2000 nodes, fluentd stops sending buffers to elasticsearch, buffers wedged | | |
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED DEFERRED | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Priority: | unspecified |
| Version: | 3.11.0 | CC: | aos-bugs, mifiedle, rmeggins |
| Target Milestone: | --- | Target Release: | 3.11.z |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | aos-scalability-311 | Type: | Bug |
| Last Closed: | 2019-02-01 12:01:24 UTC | | |
Description (Mike Fiedler, 2018-09-14 16:16:54 UTC)

Created attachment 1483370 [details]: logs and wedged buffer files

Contains:
- fluentd pod log from a node which gets wedged
- buffer files from /var/lib/fluent on that node
- the journal from that node
- logging-es.log from all three ES servers
Are you using docker or crio? If docker, are you using log-driver json-file or journald? If you don't know, run:

```
sudo docker info | grep -i log
```

Looks like elasticsearch is overloaded for some of the requests. That might explain "a handful of indices are missing messages":

```
[2018-09-14T13:49:52,602][DEBUG][o.e.a.b.TransportShardBulkAction] [logging-es-data-master-40vugi05] [project.cake-a-53.6a3778a6-b784-11e8-b6b6-fa163ead9d57.2018.09.13][0] failed to execute bulk item (create) BulkShardRequest [[project.cake-a-53.6a3778a6-b784-11e8-b6b6-fa163ead9d57.2018.09.13][0]] containing [17] requests
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
    at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_181]
    at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
```

(In reply to Rich Megginson from comment #3)
> Looks like elasticsearch is overloaded for some of the requests. That might explain "a handful of indices are missing messages": [...]

But it is logged at DEBUG level, and we don't see any error handling in fluentd, so perhaps this isn't significant.

However, fluentd does report "buffer flush took longer time than slow_flush_log_threshold", which can indicate elasticsearch performance problems, and I also see these:

```
[2018-09-14T13:53:39,736][WARN ][o.e.m.j.JvmGcMonitorService] [logging-es-data-master-40vugi05] [gc][old][70219][24] duration [27.5s], collections [1]/[27.9s], total [27.5s]/[3.9m], memory [29.4gb]->[29.6gb]/[30.2gb], all_pools {[young] [1.2gb]->[1.4gb]/[1.8gb]}{[survivor] [0b]->[0b]/[232.9mb]}{[old] [28.2gb]->[28.2gb]/[28.2gb]}
[2018-09-14T13:53:39,736][WARN ][o.e.m.j.JvmGcMonitorService] [logging-es-data-master-40vugi05] [gc][70219] overhead, spent [27.5s] collecting in the last [27.9s]
```

These show elasticsearch spending nearly all of its time in garbage collection, which indicates it is under memory pressure, i.e. running out of RAM. It may be that we have reached the threshold at which horizontally scaling Elasticsearch is necessary, since we cannot add more RAM. We should identify the threshold at which it begins to break down, then understand what happens if we add a 4th (or 5th) node, so we can advise customers about limitations and how to resolve them.

It seems fluentd should be logging an error explaining why it cannot flush its buffers. It does not seem to be load related: even with the cluster idle, the buffers remain until fluentd is restarted.

> Are you using docker or crio? If docker, are you using log-driver json-file or journald?
docker.x86_64 2:1.13.1-74.git6e3bb8e.el7
The log driver is json-file
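The garbage-collection pressure flagged above can be confirmed quickly with the cat-nodes API. A minimal sketch, assuming the secured 3.11 logging deployment where requests are run from inside an ES pod with the admin certificates; the certificate paths below are the usual ones for that stack, but verify them in your own pods:

```shell
# Sketch: check per-node heap usage on the logging ES cluster.
# Assumes this runs inside an elasticsearch pod; the admin-cert paths
# are the ones typically mounted by the 3.11 logging stack.
curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert   /etc/elasticsearch/secret/admin-cert \
  --key    /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"
# A heap.percent pinned in the high 90s, as the GC log above suggests,
# means the node is spending its time collecting rather than indexing.
```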
Can you run with fluentd in DEBUG mode?

1) Edit cm/logging-fluentd, changing this:

```
@include configs.d/openshift/system.conf
```

to this:

```
<system>
  log_level debug
</system>
```

2) Run `oc set env ds/logging-fluentd DEBUG=true`

This will cause fluentd to write its output to /var/log/fluentd.log on each node, rather than to the docker/oc logs. Maybe we can get a clue as to why fluentd either gives up or is stuck in some sort of loop.
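The two steps above might be scripted roughly as follows. This is a sketch, not the documented procedure: the project name and the `component=fluentd` label selector are assumptions about a typical 3.11 logging install.

```shell
# Sketch of enabling fluentd debug output on a 3.11 logging cluster.
# Project name and label selector are assumptions; adjust to your install.
oc project logging

# 1) In the logging-fluentd ConfigMap, replace
#        @include configs.d/openshift/system.conf
#    with an inline block that raises the log level:
#        <system>
#          log_level debug
#        </system>
oc edit configmap/logging-fluentd

# 2) Redirect fluentd output to a file on each node; this rolls the pods:
oc set env daemonset/logging-fluentd DEBUG=true

# fluentd now writes to /var/log/fluentd.log on each node rather than to
# the container logs; tail that file on a wedged node to see why flushes stall.
oc get pods -l component=fluentd -o wide
```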
This was not seen in subsequent runs in the scale lab environment. It will be re-tested in 4.0, and a new issue will be opened if it is seen again. Increasing the number of index shards to match the number of nodes in the ES cluster did seem to help, by balancing the workload across ES nodes and preventing "lumpy" loads on individual ES instances. Marking this closed for now; the investigation will continue in 4.0.

(In reply to Mike Fiedler from comment #12)
> This was not seen in subsequent runs in the scale lab environment. [...]

Note that 4.0 uses fluentd 1.x and ruby 2.5, which have much improved performance over the fluentd 0.12 and ruby 2.0 used in 3.11 and earlier.
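The shard-count change described in comment #12 would typically be made with an index template, so that newly created daily indices pick it up. A hedged sketch for the three-node cluster here: the template name `project-shards` is made up for illustration, and the certificate paths assume the request is run from inside an ES pod on the secured logging stack (ES 5.x template syntax, matching the 5.6.10 version in the logs above):

```shell
# Sketch: set number_of_shards = number of ES data nodes (3 here) for
# new project indices via an index template. The template name
# "project-shards" is illustrative; cert paths assume the secured
# 3.11 logging deployment.
curl -s -XPUT \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert   /etc/elasticsearch/secret/admin-cert \
  --key    /etc/elasticsearch/secret/admin-key \
  -H 'Content-Type: application/json' \
  "https://localhost:9200/_template/project-shards" -d '{
    "template": "project.*",
    "order": 10,
    "settings": { "index.number_of_shards": 3 }
  }'
# Existing indices keep their shard count; only indices created after
# the template is installed (e.g. the next day's) get three shards.
```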