Bug 1445053
| Summary: | Fluentd logger is unable to keep up with a high volume of logs from containers running on the node. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | Logging | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.6.0 | CC: | aos-bugs, a, jcantril, juzhao, mifiedle, nhosoi, pportant, rhowe, rmeggins, vlaad, wsun, xiazhao |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | Release Note |
| Doc Text: | When running logging in a large environment, the default values for fluentd memory, CPU, and buffer/chunk sizes will not be sufficient. You will need to edit the daemonset configuration to increase these limits. Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1466005 for the process to follow. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1466005 (view as bug list) | Environment: | |
| Last Closed: | 2017-11-28 21:53:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1466005 | | |
Description
Ryan Howe
2017-04-24 19:57:50 UTC
Would you be able to post your Fluentd configuration for both the nodes and the external fluentd server?

The default memory and CPU for fluentd are not suitable for very high-load environments. You can edit the CPU and memory limits in the logging-fluentd daemonset to increase them. Does that help?

It looks like this bug is for the documentation/tuning guide.
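As a concrete illustration of the daemonset edit suggested above, here is a minimal sketch. It assumes the logging stack runs in the `logging` project; the limit values are illustrative, not tested recommendations:

```sh
# Edit the daemonset interactively and raise resources.limits
oc edit daemonset/logging-fluentd -n logging

# Or set the limits non-interactively
oc set resources daemonset/logging-fluentd -n logging \
    --limits=cpu=1000m,memory=1Gi

# Depending on the daemonset update strategy, existing pods may need
# to be recreated before the new limits take effect
oc delete pod -n logging -l component=fluentd
```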
> I have created the following bugs around our fluentd to get more documentation and configure fluentd to better handle high volume of logs coming from containers.
> https://bugzilla.redhat.com/show_bug.cgi?id=1445053
Adding [DOCS] to the subject.
In OCP 3.6, the CPU and memory limits are configurable with environment variables:
FLUENTD_CPU_LIMIT: 100m (by default)
FLUENTD_MEMORY_LIMIT: 512Mi (by default)
and in the ansible inventory:
openshift_logging_fluentd_cpu_limit: 100m (by default)
openshift_logging_fluentd_memory_limit: 512Mi (by default)
Note: the mux (server-side) fluentd has different variables and default values:
MUX_CPU_LIMIT: 500m (by default)
MUX_MEMORY_LIMIT: 2Gi (by default)
openshift_logging_mux_cpu_limit: 500m (by default)
openshift_logging_mux_memory_limit: 2Gi (by default)
An inventory sketch using these variables follows this list.
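A minimal ansible inventory sketch raising these limits; the specific numbers are illustrative assumptions for a higher-volume environment, not tested recommendations:

```ini
# [OSEv3:vars] section of the openshift-ansible inventory
[OSEv3:vars]

# fluentd collector limits (defaults: 100m / 512Mi)
openshift_logging_fluentd_cpu_limit=1000m
openshift_logging_fluentd_memory_limit=1Gi

# mux (server-side) fluentd limits (defaults: 500m / 2Gi)
openshift_logging_mux_cpu_limit=1000m
openshift_logging_mux_memory_limit=4Gi
```

Re-running the logging playbook after changing the inventory applies the new limits to the daemonsets.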
@nhosoi Could you please help confirm whether this bug is the same as this trello card? If yes, then QE has already verified it and it passed, according to the test result: https://trello.com/c/rRDMkhhF/512-3-change-fluentd-buffer-queue-and-chunk-limits-to-more-reasonable-valuesloggingepic-ois-agl-perf

Hi Xia,

(In reply to Xia Zhao from comment #7)
> @nhosoi
>
> Could you please help confirm if this bug is same with this trello card? If
> yes, then QE had verified it and passed, according to the test result:
> https://trello.com/c/rRDMkhhF/512-3-change-fluentd-buffer-queue-and-chunk-limits-to-more-reasonable-valuesloggingepic-ois-agl-perf

Yes, you are right. The trello card you mentioned is the same as this bug. In addition, this card is also related to this bug: https://trello.com/c/pHnZPpoE/435-2-tune-buffer-chunk-limit-loggingepic-ois-agl-perfcda

Thanks!

@Noriko This defect is already verified on non-HA environments; should we also verify it on a large-scale or performance environment?

(In reply to Junqi Zhao from comment #12)
> @Noriko
>
> This defect is already verified on non HA environments, should we verify it
> on a large scale or performance environment?

Hi Junqi,

The effort conducted by Mike has already started, and a performance issue has been reported at the larger scale. You may want to keep an eye on this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1469859

This bug is targeted for 3.6.0, but verification is blocked by bug 1469859, which is currently targeted for 3.6.1.

This is related to https://bugzilla.redhat.com/show_bug.cgi?id=1470862 and may even be a duplicate of it, as the symptoms of 1470862 are that logs are very slow to show up in Kibana and that logs go missing.

I don't think this should depend on 1469859, as that bz is about mux, not fluentd sending directly to elasticsearch.

Verified on logging 3.6.173.0.32. Replicated the environment this bz was originally opened against: 80 pods running on 3 fluentd nodes, logging at 4.5 million lines per hour. logging-fluentd easily kept up, with no message loss and no lag.

ES was a 3-node ES cluster with the CPU and memory limits removed. logging-fluentd had a memory limit of 1Gi and a 1-CPU limit (1000 millicores). ES storage was AWS EBS io1.

We repeated the test at 6.5 million lines per hour and again logging-fluentd kept up. journald itself was nearly at capacity, though; it was using almost a full core.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
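For reference, the buffer/chunk sizes mentioned in the Doc Text are also tunable on the collector. A minimal sketch, assuming the BUFFER_SIZE_LIMIT and BUFFER_QUEUE_LIMIT environment variable names used by the aggregated-logging fluentd image (an assumption; verify the names against your image and bug 1466005 before relying on them), with illustrative values:

```sh
# Raise fluentd's buffer chunk size and queue length (illustrative values,
# variable names assumed from the aggregated-logging fluentd image)
oc set env daemonset/logging-fluentd -n logging \
    BUFFER_SIZE_LIMIT=16m \
    BUFFER_QUEUE_LIMIT=128

# Recreate the pods so they pick up the new environment
oc delete pod -n logging -l component=fluentd
```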