Bug 1445053
| Summary: | Fluentd logger is unable to keep up with a high volume of logs from containers running on the node. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | Logging | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.6.0 | CC: | aos-bugs, a, jcantril, juzhao, mifiedle, nhosoi, pportant, rhowe, rmeggins, vlaad, wsun, xiazhao |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | Release Note |
| Doc Text: | When running logging in a large environment, the default values for fluentd memory, CPU, and buffer/chunk sizes will not be sufficient. You will need to edit the daemonset configuration to increase these limits. Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1466005 for the process to follow. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1466005 (view as bug list) | Environment: | |
| Last Closed: | 2017-11-28 21:53:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1466005 | | |
Description
Ryan Howe
2017-04-24 19:57:50 UTC
Would you be able to post your Fluentd configuration for both the nodes and the external fluentd server?

The default memory and CPU for fluentd are not suitable for very high-load environments. You can edit the CPU and memory limits in the logging-fluentd daemonset to increase them. Does that help?

It looks like this bug is for the documentation/tuning guide.
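As a concrete illustration of the daemonset edit suggested above, here is a minimal sketch. It assumes the logging stack runs in the `logging` project; the limit values are illustrative, not tested recommendations:

```sh
# Edit the daemonset interactively and raise resources.limits
oc edit daemonset/logging-fluentd -n logging

# Or set the limits non-interactively
oc set resources daemonset/logging-fluentd -n logging \
    --limits=cpu=1000m,memory=1Gi

# Depending on the daemonset update strategy, existing pods may need
# to be recreated before the new limits take effect
oc delete pod -n logging -l component=fluentd
```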
> I have created the following bugs around our fluentd to get more documentation and configure fluentd to better handle high volume of logs coming from containers.
> https://bugzilla.redhat.com/show_bug.cgi?id=1445053
Adding [DOCS] to the subject.
In OCP 3.6, the CPU and memory limits are configurable with environment variables:
FLUENTD_CPU_LIMIT: 100m (by default)
FLUENTD_MEMORY_LIMIT: 512Mi (by default)
and in the ansible inventory:
openshift_logging_fluentd_cpu_limit: 100m (by default)
openshift_logging_fluentd_memory_limit: 512Mi (by default)
Note: the mux (server-side) fluentd has different variables and default values:
MUX_CPU_LIMIT: 500m (by default)
MUX_MEMORY_LIMIT: 2Gi (by default)
openshift_logging_mux_cpu_limit: 500m (by default)
openshift_logging_mux_memory_limit: 2Gi (by default)
An inventory sketch using these variables follows this list.
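A minimal ansible inventory sketch raising these limits; the specific numbers are illustrative assumptions for a higher-volume environment, not tested recommendations:

```ini
# [OSEv3:vars] section of the openshift-ansible inventory
[OSEv3:vars]

# fluentd collector limits (defaults: 100m / 512Mi)
openshift_logging_fluentd_cpu_limit=1000m
openshift_logging_fluentd_memory_limit=1Gi

# mux (server-side) fluentd limits (defaults: 500m / 2Gi)
openshift_logging_mux_cpu_limit=1000m
openshift_logging_mux_memory_limit=4Gi
```

Re-running the logging playbook after changing the inventory applies the new limits to the daemonsets.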
@nhosoi Could you please help confirm whether this bug is the same as this trello card? If yes, then QE has already verified it and it passed, according to the test result: https://trello.com/c/rRDMkhhF/512-3-change-fluentd-buffer-queue-and-chunk-limits-to-more-reasonable-valuesloggingepic-ois-agl-perf

Hi Xia,

(In reply to Xia Zhao from comment #7)
> @nhosoi
>
> Could you please help confirm if this bug is same with this trello card? If
> yes, then QE had verified it and passed, according to the test result:
> https://trello.com/c/rRDMkhhF/512-3-change-fluentd-buffer-queue-and-chunk-limits-to-more-reasonable-valuesloggingepic-ois-agl-perf

Yes, you are right. The trello card you mentioned is the same as this bug. In addition, this card is also related to this bug: https://trello.com/c/pHnZPpoE/435-2-tune-buffer-chunk-limit-loggingepic-ois-agl-perfcda

Thanks!

@Noriko This defect is already verified on non-HA environments; should we also verify it on a large-scale or performance environment?

(In reply to Junqi Zhao from comment #12)
> @Noriko
>
> This defect is already verified on non HA environments, should we verify it
> on a large scale or performance environment?

Hi Junqi,

The effort conducted by Mike has already started, and a performance issue has been reported at the larger scale. You may want to keep an eye on this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1469859

This bug is targeted for 3.6.0, but verification is blocked by bug 1469859, which is currently targeted for 3.6.1.

This is related to https://bugzilla.redhat.com/show_bug.cgi?id=1470862 and may even be a duplicate of it, as the symptoms of 1470862 are that logs are very slow to show up in Kibana and that logs go missing.

I don't think this should depend on 1469859, as that bz is about mux, not fluentd sending directly to elasticsearch.

Verified on logging 3.6.173.0.32. Replicated the environment this bz was originally opened against: 80 pods running on 3 fluentd nodes, logging at 4.5 million lines per hour. logging-fluentd easily kept up, with no message loss and no lag.

ES was a 3-node ES cluster with the CPU and memory limits removed. logging-fluentd had a memory limit of 1Gi and a 1-CPU limit (1000 millicores). ES storage was AWS EBS io1.

We repeated the test at 6.5 million lines per hour and again logging-fluentd kept up. journald itself was nearly at capacity, though; it was using almost a full core.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
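For reference, the buffer/chunk sizes mentioned in the Doc Text are also tunable on the collector. A minimal sketch, assuming the BUFFER_SIZE_LIMIT and BUFFER_QUEUE_LIMIT environment variable names used by the aggregated-logging fluentd image (an assumption; verify the names against your image and bug 1466005 before relying on them), with illustrative values:

```sh
# Raise fluentd's buffer chunk size and queue length (illustrative values,
# variable names assumed from the aggregated-logging fluentd image)
oc set env daemonset/logging-fluentd -n logging \
    BUFFER_SIZE_LIMIT=16m \
    BUFFER_QUEUE_LIMIT=128

# Recreate the pods so they pick up the new environment
oc delete pod -n logging -l component=fluentd
```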