Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1629015

Summary: Scale lab: with 2000 nodes, fluentd stops sending buffers to elasticsearch, buffers wedged
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED DEFERRED
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, mifiedle, rmeggins
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-311
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-01 12:01:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs and wedged buffer files none

Description Mike Fiedler 2018-09-14 16:16:54 UTC
Description of problem:

2000 node scalability OCP 3.11 environment, 2000 fluentd
3 ES servers with 40 vCPU/160GB memory each.   
ES memory limit is 61GB

Frequently seeing cases where fluentd stops sending/indexing.  During multiple runs when all 2000 nodes are logging (50 messages/second/node), 2-3 fluentd pods (sometimes more) stop having their logs recorded in ES.  Inspecting /var/lib/fluentd shows buffers backed up.  They are not retry buffers.  The fluentd logs give no indication of why things are backed up and not being sent.

Restarting fluentd on the node clears the buffers and fluentd is operational again.

I did not encounter this when testing on smaller AWS environments or in the 250 node scale cluster set up earlier in 3.11.

Version-Release number of selected component (if applicable): 3.11.0-0.28.0


How reproducible: Frequent in this 2000 node scalability environment.  Occurs on approximately 1 in 3 or 4 test runs.


Steps to Reproduce:
1.  Only seen in scalability env described above so far - someone should check online starter
2.  Create 1000 projects each with 2 replicas of the SVT logtest application (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md)
3.  Start logging from all 2000 pods at a rate of 50 messages/second/node
4.  Compare expected messages for each index with actual
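The comparison in step 4 can be sketched as follows. This is a minimal illustration, not the SVT tooling: the run duration, index names, and actual counts are hypothetical placeholders, and in the real test the actual numbers would come from querying Elasticsearch per index.

```python
# Sketch of step 4: compare expected vs. actual message counts per index.
# Assumptions (not from the bug report): a 600-second run, 2 logtest pods
# per project index, and hypothetical per-index counts as ES would report.

RATE_PER_POD = 50          # messages/second per pod, per the test setup
DURATION_SECONDS = 600     # assumed length of one test run
PODS_PER_PROJECT = 2       # replicas of the logtest app per project

expected_per_index = RATE_PER_POD * DURATION_SECONDS * PODS_PER_PROJECT

# Hypothetical counts for a few project indices.
actual_counts = {
    "project.cake-a-53": 60000,
    "project.cake-a-54": 60000,
    "project.cake-a-55": 41200,   # short: this node's fluentd is wedged
}

def missing_indices(actual, expected):
    """Return indices whose recorded count falls short of the expected total."""
    return {name: expected - count
            for name, count in actual.items() if count < expected}

print(missing_indices(actual_counts, expected_per_index))
# -> {'project.cake-a-55': 18800}
```

Any index that comes up short points back to the fluentd node feeding it, which is then inspected for wedged buffers as described below.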

Actual results:

A handful of indices will be missing messages.   Determine the fluentd node for that index and look at /var/lib/fluentd.  Buffers are present.

Try sending more messages from that node - they will not reach ES

Restart fluentd on that node - buffers are cleared and missing messages show up in ES.


Additional info:

Will attach a tarball with logs and stuck buffers

Comment 1 Mike Fiedler 2018-09-14 16:19:27 UTC
Created attachment 1483370 [details]
logs and wedged buffer files

Contains

fluentd pod log from a node which gets wedged
buffer files from /var/lib/fluent on that node
journal from that node
logging-es.log from all three ES servers

Comment 2 Rich Megginson 2018-09-14 17:26:27 UTC
Are you using docker or crio?  If docker - are you using log-driver json-file or journald?  If you don't know, use

sudo docker info | grep -i log

Comment 3 Rich Megginson 2018-09-14 17:29:05 UTC
Looks like elasticsearch is overloaded for some of the requests.  That might explain "a handful of indices are missing messages":

[2018-09-14T13:49:52,602][DEBUG][o.e.a.b.TransportShardBulkAction] [logging-es-data-master-40vugi05] [project.cake-a-53.6a3778a6-b784-11e8-b6b6-fa163ead9d57.2018.09.13][0] failed to execute bulk item (create) BulkShardRequest [[project.cake-a-53.6a3778a6-b784-11e8-b6b6-fa163ead9d57.2018.09.13][0]] containing [17] requests
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_181]
	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Comment 4 Rich Megginson 2018-09-14 17:35:17 UTC
(In reply to Rich Megginson from comment #3)
> Looks like elasticsearch is overloaded for some of the requests.  That might
> explain "a handful of indices are missing messages":
>
> org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException:
> failed to process cluster event (put-mapping) within 30s
> [stack trace snipped; see comment #3]

But it is logged at DEBUG level, and we don't see any error handling in fluentd, so perhaps this isn't significant.

However, fluentd does report "buffer flush took longer time than slow_flush_log_threshold" which can indicate elasticsearch perf problems, and I also see these:

[2018-09-14T13:53:39,736][WARN ][o.e.m.j.JvmGcMonitorService] [logging-es-data-master-40vugi05] [gc][old][70219][24] duration [27.5s], collections [1]/[27.9s], total [27.5s]/[3.9m], memory [29.4gb]->[29.6gb]/[30.2gb], all_pools {[young] [1.2gb]->[1.4gb]/[1.8gb]}{[survivor] [0b]->[0b]/[232.9mb]}{[old] [28.2gb]->[28.2gb]/[28.2gb]}
[2018-09-14T13:53:39,736][WARN ][o.e.m.j.JvmGcMonitorService] [logging-es-data-master-40vugi05] [gc][70219] overhead, spent [27.5s] collecting in the last [27.9s]

which indicates elasticsearch is spending a significant amount of time in garbage collection, i.e. it is under memory pressure and running out of RAM.

It may be that we have reached the threshold at which horizontally scaling Elasticsearch is necessary, since we cannot add more RAM.
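The GC pressure can be confirmed mechanically by parsing JvmGcMonitorService warnings like the ones quoted above. A minimal sketch; the pause and heap-occupancy thresholds are illustrative choices, not values from this bug:

```python
import re

# Parse ES JvmGcMonitorService old-gen warnings (format as quoted above) and
# flag collections whose pause time and post-GC heap occupancy suggest the
# heap is effectively full.
GC_RE = re.compile(
    r"\[gc\]\[old\]\[\d+\]\[\d+\] duration \[([\d.]+)s\].*"
    r"memory \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]"
)

line = ("[2018-09-14T13:53:39,736][WARN ][o.e.m.j.JvmGcMonitorService] "
        "[logging-es-data-master-40vugi05] [gc][old][70219][24] duration [27.5s], "
        "collections [1]/[27.9s], total [27.5s]/[3.9m], "
        "memory [29.4gb]->[29.6gb]/[30.2gb]")

def gc_pressure(log_line, pause_threshold_s=5.0, heap_threshold=0.9):
    """Return True if an old-gen GC line shows a long pause with a nearly
    full heap, None if the line is not an old-gen GC warning."""
    m = GC_RE.search(log_line)
    if not m:
        return None
    pause, _before, after, limit = map(float, m.groups())
    return pause > pause_threshold_s and (after / limit) > heap_threshold

print(gc_pressure(line))  # -> True: a 27.5s pause with the heap ~98% full
```

Here the old generation stays at its 28.2gb limit before and after collection, which is the "running out of RAM" signature described above.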

Comment 5 Jeff Cantrill 2018-09-14 19:21:10 UTC
We should consider identifying the threshold at which it begins to break down, and then understand what happens if we add a 4th (or 5th) node, so we can advise customers about limitations and how to resolve them.

Comment 6 Mike Fiedler 2018-09-15 18:39:39 UTC
It seems fluentd should be logging an error explaining why it can't flush its buffers.  It does not seem to be load related, as even with the cluster idle the buffers remain until fluentd is restarted.

Comment 7 Mike Fiedler 2018-09-15 18:43:44 UTC
> Are you using docker or crio?  If docker - are you using log-driver json-file or journald? 

docker.x86_64               2:1.13.1-74.git6e3bb8e.el7  

The log driver is json-file

Comment 8 Rich Megginson 2018-09-18 20:14:40 UTC
Can you run with fluentd in DEBUG mode?

1) edit cm/logging-fluentd

change this:

    @include configs.d/openshift/system.conf

to this:

    <system>
      log_level debug
    </system>

2) oc set env ds/logging-fluentd DEBUG=true

This will cause fluentd to write its output to /var/log/fluentd.log on each node, rather than to the docker/oc logs.  Maybe we can get a clue as to why fluentd either gives up, or is stuck in some sort of loop.

Comment 12 Mike Fiedler 2019-02-01 12:01:24 UTC
This was not seen in subsequent runs in the scale lab environment.  This will be re-tested in 4.0 and a new issue will be opened if it is seen again.  Setting the number of index shards equal to the number of nodes in the ES cluster did seem to help, by balancing the workload across ES nodes and preventing "lumpy" loads on individual ES instances.  Marking this closed for now; the investigation will continue in 4.0.
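The shard-count mitigation described above can be sketched as an index template. A hedged example only: the template name, index pattern, and replica count are illustrative, and applying it would be a PUT to the ES `_template` API (the `template` key is the ES 5.x pattern field).

```python
# Sketch of the mitigation: make the number of primary shards for new
# project indices equal to the number of ES data nodes, so each node
# holds one shard of each index and bulk indexing load spreads evenly.
# ES_DATA_NODES matches this environment; the template body is illustrative.

ES_DATA_NODES = 3  # per the 3-server ES cluster described in this bug

index_template = {
    "template": "project.*",              # match the per-project log indices
    "settings": {
        "index.number_of_shards": ES_DATA_NODES,
        "index.number_of_replicas": 0,
    },
}

print(index_template["settings"]["index.number_of_shards"])  # -> 3
```

With fewer shards than nodes, a single index's bulk traffic lands on a subset of the cluster, producing the "lumpy" per-instance load mentioned above.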

Comment 13 Rich Megginson 2019-02-01 12:15:31 UTC
(In reply to Mike Fiedler from comment #12)
> This was not seen in subsequent runs in the scale lab environment. [...]
> will continue the investigation in 4.0.

Note that 4.0 uses fluentd 1.x and ruby 2.5, which have much improved performance over fluentd 0.12 and ruby 2.0 used in 3.11 and earlier.