Created attachment 1296662 [details]
logging-mux logs during high cpu episode

Description of problem:

On the scalability cluster with 3 logging-mux pods running, starting 30 fluentd pods by labeling nodes with a 0.5 second delay pegs the logging-mux pod CPU at 100% of 1 core, and it never recovers until the fluentd pods are stopped.

Illustration of the issue:
http://pbench.perf.lab.eng.bos.redhat.com/results/EC2::svt-j-1/20170711_3mux_30_fluentd_idle_a/tools-default/svt_infra_4:172.16.0.5/pidstat/cpu_usage.html

The logging-mux fluentd process jumps from 0% utilization to a full core and never abates.

Environment overview:

3 masters / 3 etcd
3 infra nodes (40 vCPU / 110 GB RAM) where all logging components are running
3-node ES cluster (3 ES deployment configs), 1 ES per infra node
3 logging-mux replicas, 1 logging-mux per infra node
30 fluentd pods running (10 per logging-mux) on compute nodes with 4 vCPU and 8 GB memory. There are 460 nodes in the cluster in total, but only 30 are running fluentd in this scenario.
ES deployment config memory limit raised to 31 GB and CPU limit removed
logging-mux memory and CPU limits removed
logging-fluentd memory and CPU limits removed
no logging activity apart from ops logging on the logging-fluentd nodes
logging-mux and logging-fluentd are running the v3.6.140-3 fluentd image:

<brew-pulp internal hostname>:8888/openshift3/logging-fluentd   v3.6.140-3   40f006cee71d   2 hours ago   232.2 MB

Version-Release number of selected component (if applicable):
3.6.140-3 fluentd image

How reproducible:
Always, when starting the fluentd pods by labeling nodes with a 0.5 second sleep (see the labeling sketch below).

Additional info:

pidstat for the logging-mux fluentd when the logging-fluentd pods start:

09:35:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
09:38:32 PM     0    120500    0.10    0.00    0.00    0.10    33  fluentd
09:38:42 PM     0    120500    0.00    0.00    0.00    0.00    33  fluentd
09:38:52 PM     0    120500   77.50    1.40    0.00   78.90     4  fluentd
09:39:02 PM     0    120500   97.70    2.20    0.00   99.90     0  fluentd
09:39:12 PM     0    120500   98.00    1.70    0.00   99.70     4  fluentd
09:39:22 PM     0    120500   98.30    1.70    0.00  100.00    34  fluentd
09:39:32 PM     0    120500   96.10    3.80    0.00   99.90     6  fluentd
09:39:42 PM     0    120500   94.30    5.30    0.00   99.60     3  fluentd
09:39:52 PM     0    120500   95.10    4.40    0.00   99.50    24  fluentd
09:40:02 PM     0    120500   96.80    3.00    0.00   99.80     8  fluentd
09:40:12 PM     0    120500   97.50    1.90    0.00   99.40     2  fluentd
09:40:22 PM     0    120500   93.40    5.40    0.00   98.80    17  fluentd
09:40:32 PM     0    120500   95.00    4.70    0.00   99.70    26  fluentd
09:40:42 PM     0    120500   97.90    1.50    0.00   99.40     2  fluentd

logging-fluentd logs show repeated slow flush messages (logs attached):

2017-07-11 21:51:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=21.516913934 slow_flush_log_threshold=20.0
2017-07-11 21:52:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=20.114825274 slow_flush_log_threshold=20.0
2017-07-11 21:52:38 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=20.525696254 slow_flush_log_threshold=20.0
2017-07-11 21:53:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=20.196766113 slow_flush_log_threshold=20.0
2017-07-11 21:54:02 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=20.765578436 slow_flush_log_threshold=20.0
2017-07-11 22:01:35 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ab62b4" elapsed_time=20.002420649 slow_flush_log_threshold=20.0
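For reference, a minimal sketch of the node-labeling loop used to start the 30 fluentd pods with a 0.5 second delay. The node selection and the logging-infra-fluentd label key are assumptions based on the default fluentd daemonset node selector, not taken verbatim from this run:

  # Label 30 nodes so the logging-fluentd daemonset schedules a pod on each,
  # sleeping 0.5 s between labels (label key assumed from the default deployment)
  for node in $(oc get nodes -o name | head -30); do
    oc label "$node" logging-infra-fluentd=true --overwrite
    sleep 0.5
  done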
The logging-mux deployment config and logging-fluentd daemonset were both updated to use the v3.6.140-3 container images.
This appears to be a bug in fluentd, or one of its plugins, where something triggers a spin loop that drives the process to 100% of a CPU core continuously, no matter what load or connection rate is placed on it. Nothing in the fluentd logs indicates a problem. Further debugging is needed.
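For whoever picks up the debugging, a minimal sketch of one way to narrow down the spin loop. The "logging" namespace, the component=mux pod label, and the availability of top inside the image are assumptions, not something confirmed in this report:

  # Find a mux pod (label selector assumed from the usual logging-mux deployment)
  MUX_POD=$(oc get pods -n logging -l component=mux \
      -o jsonpath='{.items[0].metadata.name}')

  # Show per-thread CPU usage inside the pod to see whether a single Ruby
  # thread is pegged (ps -eLo tid,pcpu,comm is an alternative if top is missing)
  oc exec -n logging "$MUX_POD" -- top -b -H -n 1

  # On the infra node itself, strace the busiest thread ID for a few seconds
  # to see whether it is stuck in a syscall loop or spinning purely in userspace:
  #   strace -c -f -p <tid>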
This looks like the same thing Shirly is seeing in the RHV lab
The 232 MB in #c0 is the docker image size - not a runtime value. For CPU, fluentd is limited to 1 core on both the mux and the collector - fluentd can't use more than a core without the multiprocess plugin. So, the only actual "unlimited" resource would be memory.
It seems we can move most of the log processing from the mux side to the individual fluentd collectors to reduce the CPU load. The mux would then only receive the records, enhance them with kube metadata, and send them to ES. That should be a simple change to make and fairly quick to verify.
Fixed by
https://github.com/openshift/openshift-ansible/commit/a6ed38676ad48a35911d449682cf2120016335c8
and
https://github.com/openshift/origin-aggregated-logging/commit/d0c4f0817d192f9bd503cd83be2aa9bf22c5aee0

When deploying logging to use mux, configure ansible like this:

openshift_logging_use_mux=True
openshift_logging_mux_client_mode=maximal

This configures the fluentd running on each node to do as much of the processing as possible, except for the k8s metadata processing, which is done by mux. This should greatly offload CPU usage from mux onto the fluentd running on each node.

These fixes should already be in the latest openshift-ansible and logging image 3.6.1 candidates.
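For reference, a minimal inventory sketch of the settings above. The two variables are the ones named in the fix; the [OSEv3:vars] group name follows the usual openshift-ansible inventory layout and is an assumption here:

  [OSEv3:vars]
  # Deploy the shared mux service
  openshift_logging_use_mux=True
  # Have each node's fluentd do as much processing as possible before
  # forwarding; mux only adds the k8s metadata and sends to ES
  openshift_logging_mux_client_mode=maximal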
Verified on 3.7.0-0.126.4. Single logging-mux pod scalability now depends on message rates. When the scale limits of logging-mux are reached, it exhibits high CPU utilization but no longer throws errors or loses messages.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049