Bug 1579018
| Summary: | 3.10 scalability: logging-fluentd wedged with buffer files in /var/lib/fluentd and missing messages for 300 nodes/1200 projects |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Logging |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | unspecified |
| Version: | 3.10.0 |
| Target Milestone: | --- |
| Target Release: | 3.10.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | aos-scalability-310 |
| Doc Type: | No Doc Update |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Jeff Cantrill <jcantril> |
| QA Contact: | Mike Fiedler <mifiedle> |
| CC: | aos-bugs, mifiedle, pportant, rmeggins |
| Story Points: | --- |
| Last Closed: | 2018-07-30 19:15:30 UTC |
| Type: | Bug |
| Regression: | --- |
Description (Mike Fiedler, 2018-05-16 19:17:14 UTC)
Created attachment 1437550 [details]: logs and buffer files

- logging-fluentd logs from two stuck nodes
- 3 ES server logs
- tar file of /var/lib/fluentd on a stuck node
---

tl;dr: Not sure if this is a "bug" or just an indication that we've hit the limit of what Elasticsearch can do with the out-of-the-box config, and we need to scale up Elasticsearch: increase the bulk queue size, RAM, CPU, and disk speed.

I don't see anything in the logs to indicate any problems. The slow log threshold indicates that it is taking a long time to send records to Elasticsearch even when they do get through (as opposed to being rejected due to bulk rejections), which is to be expected if Elasticsearch is completely overloaded. I haven't looked at the log buffer files, but I'm assuming that the messages missing from Elasticsearch are in them.

If you look at _cat/indices, do you see new indices for new projects/namespaces eventually being created? Do you see the doc counts going up? If you are monitoring the bulk thread_pool, do you see the bulk completed (bc) rate increasing, meaning that some of the bulk operations are being processed successfully? (Minimal sketches of these checks, and of inspecting the buffer backlog, are included at the end of this report.)

The pbench data for Elasticsearch will be interesting, but I'm assuming it will show that Elasticsearch is maxed out on RAM and/or CPU and/or disk I/O throughput.

---

Some more details, sorry.

ES memory limit is 62GB.
ES storage is on local NVMe.
The ES nodes are running on 40-core systems with 140GB RAM (m4.10xlarge equivalent).

pbench data to follow in a private comment.

---

(In reply to Mike Fiedler from comment #3)
> Some more details, sorry.
>
> ES memory limit is 62GB.
> ES storage is on local NVMe.

Logging inventory:

    # EFK logging stack variables
    openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"

> The ES nodes are running on 40-core systems with 140GB RAM (m4.10xlarge equivalent).
>
> pbench data to follow in a private comment.

---

Re: comment 4. Initially deployed on gluster-block, but throughput problems were found. The PVCs were recreated on local NVMe (see the pbench data in comment 5).

---

On second thought: we don't yet have a 3.10 fluentd image build that has all of the latest retry logic in it. This testing will establish a good baseline, and we can use it to confirm that the new retry logic actually helps with, if not fixes, this issue.

---

The new image shows significant improvement. Successfully logged across 3000 namespaces with no lost messages. If we get the scale cluster back in 3.10, we will try to push it further.

With thousands of namespaces, it is critical to precreate the indices. Quickly creating even 1500 indices is very expensive and drives Elasticsearch CPU usage to 30+ cores (a precreation sketch is included at the end of this report). I will follow up with links to performance data.

---

Verified with logging-fluentd:v3.10.0-0.47.0.1

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
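
---

For reference, a minimal sketch of the Elasticsearch checks suggested in the comments above (new project indices appearing, doc counts rising, bulk thread pool completing rather than only rejecting work). The namespace, pod label, container name, cert paths, and _cat column names are assumptions based on a typical 3.10 logging deployment and the ES 2.x cat API; they are not confirmed anywhere in this report.

```bash
# Minimal sketch, assuming the default "logging" namespace, the component=es pod
# label, the "elasticsearch" container name, and the admin cert paths usually
# mounted in the ES container -- verify all of these on the actual cluster.
# The thread_pool column names assume the ES 2.x _cat API.
ES_POD=$(oc -n logging get pods -l component=es \
           -o jsonpath='{.items[0].metadata.name}')

es_get() {
  # Authenticated GET against the ES REST API from inside the ES container.
  oc -n logging exec -c elasticsearch "$ES_POD" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
            --cert   /etc/elasticsearch/secret/admin-cert \
            --key    /etc/elasticsearch/secret/admin-key \
            "https://localhost:9200$1"
}

# Are indices for new projects/namespaces being created, and are doc counts rising?
es_get '/_cat/indices?v' | sort

# Is the bulk thread pool completing work (completed rising), or only queueing/rejecting?
es_get '/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,bulk.completed'
```

Sampling the thread-pool line a few times apart distinguishes an overloaded-but-progressing cluster (completed keeps climbing) from one that is mostly rejecting bulk requests (rejected keeps climbing).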
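A similarly hedged sketch for sizing up the on-disk fluentd buffer backlog on a wedged node, as discussed in the summary and first comment. The /var/lib/fluentd path matches the attachment description; ssh access and the node name are assumptions/placeholders.

```bash
# Sketch: inspect the fluentd buffer backlog on a node suspected of being wedged.
# Assumes the /var/lib/fluentd buffer path from this report and ssh access to
# the node; the node name below is a hypothetical placeholder.
NODE=node-123.example.com   # substitute the node under investigation

# How many buffer chunks are queued, and how much disk do they occupy?
ssh "$NODE" 'ls /var/lib/fluentd | wc -l && du -sh /var/lib/fluentd'

# The oldest chunks indicate roughly how long output to Elasticsearch has been stalled.
ssh "$NODE" 'ls -ltr /var/lib/fluentd | head'
```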
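Finally, a rough sketch of what precreating project indices ahead of a large-scale run could look like. The project.&lt;name&gt;.&lt;uid&gt;.&lt;date&gt; index naming, the cert paths, creating indices with default settings, and iterating over every project are assumptions for illustration; the report does not describe the exact procedure used in the test.

```bash
# Rough sketch: precreate one index per project before the logging load starts,
# so index-creation cost is not paid during ingest. The project.<name>.<uid>.<date>
# naming follows the 3.x logging common data model; verify it, and filter the
# project list (e.g. skip system namespaces), against the real deployment.
ES_POD=$(oc -n logging get pods -l component=es \
           -o jsonpath='{.items[0].metadata.name}')
TODAY=$(date -u +%Y.%m.%d)

for project in $(oc get projects -o jsonpath='{.items[*].metadata.name}'); do
  uid=$(oc get project "$project" -o jsonpath='{.metadata.uid}')
  # PUT on the index URL creates it with default settings.
  oc -n logging exec -c elasticsearch "$ES_POD" -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
                  --cert   /etc/elasticsearch/secret/admin-cert \
                  --key    /etc/elasticsearch/secret/admin-key \
                  "https://localhost:9200/project.${project}.${uid}.${TODAY}"
done
```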