Bug 1469859
| Field | Value |
|---|---|
| Summary | Starting 30 fluentd pods with mux service enabled and 3 logging-mux pods running pegs logging-mux pod CPU |
| Product | OpenShift Container Platform |
| Reporter | Mike Fiedler <mifiedle> |
| Component | Logging |
| Assignee | Rich Megginson <rmeggins> |
| Status | CLOSED ERRATA |
| QA Contact | Mike Fiedler <mifiedle> |
| Severity | urgent |
| Docs Contact | |
| Priority | unspecified |
| Version | 3.6.0 |
| CC | aos-bugs, bmcelvee, jcantril, nhosoi, pportant, rmeggins, sradco, vlaad, wabouham, xtian |
| Target Milestone | --- |
| Keywords | TestBlocker |
| Target Release | 3.6.z |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | aos-scalability-36 |
| Fixed In Version | |
| Doc Type | Technology Preview |
| Doc Text | Package(s) providing the Technology Preview: OpenShift logging. Description of the Technology Preview: Using Fluentd as a component for aggregating logs from Fluentd node agents, called "mux", is a Technology Preview for 3.6.0. |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2017-10-25 13:02:19 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1482002, 1498999 |
| Bug Blocks | 1413147, 1534588 |
| Attachments | |
Description
Mike Fiedler, 2017-07-12 02:36:09 UTC
The logging-mux deployment config and logging-fluentd daemonset were both updated to use the v3.6.140-3 container images.

This appears to be a bug in fluentd, or one of its plugins, where something triggers a spin loop that drives the process to 100% CPU continuously, no matter what load or connection rate is placed on it. No fluentd logs indicate any problems; further debugging is needed. This looks like the same thing Shirly is seeing in the RHV lab.

The 232 MB in #c0 is the docker image size, not a runtime value. For CPU, fluentd is limited to 1 core on both the mux and the collector, since fluentd can't use more than a core without the multiprocess plugin. So the only actual "unlimited" resource would be memory.

It seems we can move most of the log processing from the mux side to the individual fluentd collectors to reduce the CPU load. The mux would then only receive, enhance with kube metadata, and send to ES. That might be a simple change we can make easily and verify rather quickly.

Fixed by https://github.com/openshift/openshift-ansible/commit/a6ed38676ad48a35911d449682cf2120016335c8 and https://github.com/openshift/origin-aggregated-logging/commit/d0c4f0817d192f9bd503cd83be2aa9bf22c5aee0

When deploying logging to use mux, configure ansible like this (see the inventory sketch at the end of this report):

openshift_logging_use_mux=True
openshift_logging_mux_client_mode=maximal

This will configure the fluentd running on each node to do as much of the processing as possible, except for the k8s metadata processing, which will be done by mux. This should greatly offload the CPU usage from mux onto the fluentd running on each node. These fixes should already be in the latest openshift-ansible and logging image 3.6.1 candidates.

Verified on 3.7.0-0.126.4. Single logging-mux pod scalability now depends on message rates. When the scale limits of logging-mux are reached, it exhibits high CPU utilization but no longer throws errors or suffers from message loss.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
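For reference, a minimal openshift-ansible inventory sketch that sets the mux variables described above. Only the two openshift_logging_* variables come from this report; the [OSEv3:vars] group name and the surrounding inventory layout are assumptions based on a typical openshift-ansible installation.

```ini
# Hypothetical inventory fragment (sketch only); merge into your existing inventory.
[OSEv3:vars]
# Deploy the mux service (Fluentd log aggregator, Technology Preview in 3.6).
openshift_logging_use_mux=True
# "maximal" client mode: per-node fluentd collectors do as much processing as
# possible, leaving only the k8s metadata enrichment (and the send to ES) to mux.
openshift_logging_mux_client_mode=maximal
```

Re-running the logging playbook with these variables set should apply the maximal client mode behavior described in the comments above.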