Bug 1489976
| Summary: | logging-fluentd unable to process more than 1000 messages per second | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Walid A. <wabouham> |
| Component: | Logging | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Walid A. <wabouham> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.1 | CC: | aos-bugs, jcantril, mifiedle, pportant, rmeggins, wabouham |
| Target Milestone: | --- | | |
| Target Release: | 3.6.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-37 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The fluentd processing pipeline that formats journald records (system and container logs) into the ViaQ data model format was using dozens of embedded Ruby evaluations per record.<br>Consequence: Record processing was very slow, with excessive CPU usage.<br>Fix: The processing and formatting were moved into dedicated Ruby code in the ViaQ filter plugin.<br>Result: Record processing is much faster, with less CPU usage. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-25 13:06:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Walid A.
2017-09-08 21:04:59 UTC
We have a few changes in the 3.6.2 candidate images which should greatly reduce the fluentd CPU usage, primarily https://github.com/openshift/origin-aggregated-logging/commit/8f9434cca3c9a71fde7f89270699dd050009ae79, which is this code: https://github.com/ViaQ/fluent-plugin-viaq_data_model/tree/master/lib/fluent/plugin

koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=596094

repositories:
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-57008-20170908214752
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31
- brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-3

Update: redeployed the logging stack with v3.6.173.0.32 and saw a marked improvement in the sustainable logging rate from a single pod, almost double, at around 750 messages per second, with the same journald settings (persistent storage, RateLimitIntervalSec=0) and fluentd daemonset configuration as before (limits of 1 core and 1Gi memory, 16MB BUFFER_SIZE_LIMIT). Fluentd CPU utilization went down from a full core to 68% of a core, while systemd-journald CPU utilization remained at a full core when the pod was logging at 750 messages per second. pbench data is added in the next private comment.

I'm not sure if there is any additional journald tuning that can be done, but we should investigate that.

If you want to measure the raw fluentd performance reading from the journal, you could disable fluentd and let a large journal accumulate (a command-level sketch is appended at the end of this report):
- unlabel nodes
- wait for fluentd pods to stop
- remove /var/log/journal.pos on each node
- set JOURNAL_READ_FROM_HEAD=true in ds/logging-fluentd
- wait for journald records to accumulate
- label nodes
- wait for fluentd pods to start

Then you can see how long it takes fluentd to get to the tail of the journal.

If you want to measure the raw fluentd performance in a very controlled environment, controlling exactly the contents of the journal, you can create your own journal file, something like this: https://github.com/richm/origin-aggregated-logging/blob/new-perf-test/test/performance/fluentd-to-es.sh#L329

That is, create a text file in journalctl -o export format, then feed it into /usr/lib/systemd/systemd-journal-remote to create a binary journal file, then mount that journal into the fluentd pod using a hostPath:

    oc set volume daemonset/logging-fluentd --add -t hostPath --name testjournal -m /journal --path $datadir

then tell fluentd to use that journal:

    oc set env daemonset/logging-fluentd JOURNAL_SOURCE=/journal/journal JOURNAL_READ_FROM_HEAD=true

(A fuller sketch of this procedure is also appended below.)

Fixed in fluentd image 3.6.173.0.32.

Setting to verified: as stated in comment 4, with fluentd image 3.6.173.0.32 we were able to sustain 750 log messages per second (1024-byte message size) logging from a single pod per node/fluentd. Details and pbench data are captured in a Google doc; please see the next private comment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
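
For reference, a minimal shell sketch of the "accumulate a large journal, then read from the head" measurement described above. The logging-infra-fluentd node label and the /var/log/journal.pos path are assumptions based on a standard 3.x aggregated-logging deployment; adjust both for your cluster.

```
# Sketch only: node label and position-file path are assumptions.

# 1. Unlabel the nodes so the logging-fluentd pods stop.
oc label nodes --all logging-infra-fluentd=false --overwrite

# 2. After the fluentd pods terminate, remove the position file on each node
#    so fluentd forgets where it left off in the journal (run on the node).
rm -f /var/log/journal.pos

# 3. Tell fluentd to read the journal from the head.
oc set env daemonset/logging-fluentd JOURNAL_READ_FROM_HEAD=true

# 4. Let journald records accumulate, then re-label the nodes and time how
#    long it takes fluentd to reach the tail of the journal.
oc label nodes --all logging-infra-fluentd=true --overwrite
```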
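And a minimal sketch of the controlled-journal test. The $datadir location, the export file name, and the record count are illustrative; only the two oc commands at the end are taken from the comment above.

```
# Sketch only: paths and record count are illustrative.
datadir=/var/tmp/testjournal
mkdir -p "$datadir/journal"

# 1. Produce a text file in journalctl export format (here by dumping
#    existing records; a generated file in the same format works too).
journalctl -o export -n 100000 > /var/tmp/records.export

# 2. Convert the export-format text into a binary journal file.
/usr/lib/systemd/systemd-journal-remote -o "$datadir/journal/test.journal" /var/tmp/records.export

# 3. Mount the host directory into the fluentd pod via a hostPath volume.
oc set volume daemonset/logging-fluentd --add -t hostPath --name testjournal -m /journal --path "$datadir"

# 4. Point fluentd at the mounted journal and read it from the head.
oc set env daemonset/logging-fluentd JOURNAL_SOURCE=/journal/journal JOURNAL_READ_FROM_HEAD=true
```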