Description of problem:
Sustainable single-node logging throughput on OpenShift is limited to approximately 400 messages per second with journald as the log driver. At this rate logging-fluentd, with CPU limits removed, uses a full core (the maximum it can use without the multiprocess plugin) and journald uses 65% of a core. Beyond this rate, as we approach 500 messages per second, indexing of messages starts falling behind significantly, and the fluentd message backlog (trailing the journal) grows by at least 10,000 log messages per minute. This bug is being opened to track any documentation or tuning recommendations needed for this limitation.

Links:
Inventory used to deploy logging: https://gist.github.com/wabouhamad/40e85c4848ceb09dd25bb4374327b632
Deployment config for elasticsearch: https://gist.github.com/wabouhamad/aabb0776eabf236372a84a4d4b45bd05
Daemonset config for logging-fluentd: https://gist.github.com/wabouhamad/0edb511fce77e2a631b25a326ecdbff5
Ansible playbook to tune journald: https://gist.github.com/wabouhamad/e0c2cb8937ecbf15870b34f3d669a8f3
Logtest cluster-loader config file: https://gist.github.com/wabouhamad/bb64bec3b752e791617ec59822b8ab51

Version-Release number of selected component (if applicable):
Logging version 3.6.173.0.27

How reproducible:
Always when logging at this rate

Steps to Reproduce:
Cluster details: all instances are m4.xlarge on AWS. 1 master, 1 infra node where ES is running, and 1 compute node where logging-fluentd and an application pod are running.
1. Deploy logging with the inventory linked above, adjusting hostnames accordingly.
2. Edit the elasticsearch deployment configuration to increase the CPU limit (see https://bugzilla.redhat.com/show_bug.cgi?id=1488509).
3. Tune journald on the cluster compute nodes using the playbook linked above.
4. Start a pod that generates messages to stdout at 100, 200, 400 and 500 messages per second (tool used by QE: https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md).
5. Use a monitoring tool such as pbench (https://github.com/distributed-system-analysis/pbench/blob/master/README.md) to monitor CPU, memory, etc. on the system where the pod is running.
6. Monitor the number of messages across multiple components:
   - Logged by the logging pod, from the master node:
     oc logs -n <logging_pod_project> <logging_pod_name> | tail -1
   - Logged by the pod into the journal (on the compute node):
     journalctl -u docker | grep <unique_string_from_logged_message> | tail -1
   - Indexed in elasticsearch under the pod's project:
     export POD=<logging-es-pod-name>
     oc exec $POD -n logging -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
   - Any messages lost: check the total number of messages collected from the steps above and compare with the message sequence number.
7. During the test, monitor the following log message rates on the compute node running the logging pod:
   - Number of messages fluentd is behind journald (ensure this number is not growing over time):
     journalctl -c $( cat /var/log/journal.pos ) | wc -l
   - Journal logging rate per minute:
     journalctl -S "$( date +'%Y-%m-%d %H:%M:%S' --date="60 seconds ago" )" | wc -l

Actual results:
Around 400 sustained messages per second, logging-fluentd runs out of CPU capacity and is no longer able to keep up with incoming messages. Eventually the fluentd file buffer cache is exhausted. See private comment for internal link to pbench data.

Expected results:
Unknown - we need to set end user expectations in this area. Fluentd claims to be able to process 18K messages per second (https://www.fluentd.org/faqs), and internal discussions (see private link below) indicate user expectations of rates higher than 400 messages per second.

Additional info:
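For convenience during step 7, the two per-node checks can be wrapped in a small watch loop. This is only an illustrative sketch; it assumes the default /var/log/journal.pos position file used by logging-fluentd and is meant to be run on the compute node during the test:

# print the fluentd backlog and the journal ingest rate once a minute
while true; do
    # journal entries fluentd has not read yet (should not grow over time)
    echo "backlog:  $( journalctl -c $( cat /var/log/journal.pos ) | wc -l )"
    # journal entries written in the last 60 seconds
    echo "rate/min: $( journalctl -S "$( date +'%Y-%m-%d %H:%M:%S' --date='60 seconds ago' )" | wc -l )"
    sleep 60
done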
We have a few changes in the 3.6.2 candidate images which should greatly reduce the fluentd CPU usage, primarily https://github.com/openshift/origin-aggregated-logging/commit/8f9434cca3c9a71fde7f89270699dd050009ae79, which is this code: https://github.com/ViaQ/fluent-plugin-viaq_data_model/tree/master/lib/fluent/plugin

koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=596094

repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-57008-20170908214752
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-3
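To try one of these candidate images on a test cluster, one possible sequence is sketched below. It is not the official procedure: the brew registry is internal (and may need to be configured as an insecure registry on the node), and the container and pod label names shown are assumptions that should be checked against your daemonset with oc get ds/logging-fluentd -o yaml.

# optional: pre-pull the candidate image on the compute node
docker pull brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-3

# point the daemonset at the candidate image ("fluentd-elasticsearch" is an assumed container name)
oc set image daemonset/logging-fluentd \
  fluentd-elasticsearch=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-3

# with the OnDelete update strategy the running pods keep the old image, so delete them
# (or unlabel/relabel the nodes); "component=fluentd" is the assumed pod label
oc delete pods -n logging -l component=fluentd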
Update: redeployed the logging stack with v3.6.173.0.32 and saw a marked improvement in the sustainable logging rate from a single pod: almost double, around 750 messages per second, with the same journald configuration (storage persistent, RateLimitIntervalSec=0) and fluentd daemonset configuration as before (limits: 1 core, 1Gi memory, 16mb BUFFER_SIZE_LIMIT). Fluentd CPU utilization went down from a full core to 68% of a core, while systemd-journald CPU utilization remained at a full core when the pod was logging at 750 messages per second. pbench data is added in the next private comment.
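For reference, here is a minimal sketch of the journald settings referred to above (persistent storage, rate limiting disabled). The Ansible playbook linked in the description is what was actually used for these tests, so treat this only as an illustration:

# /etc/systemd/journald.conf on the compute node; note that older systemd releases
# spell the rate-limit option RateLimitInterval rather than RateLimitIntervalSec
[Journal]
Storage=persistent
RateLimitIntervalSec=0

# restart journald to apply the change
systemctl restart systemd-journald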
I'm not sure if there is any additional journald tuning that can be done, but we should investigate that.

If you want to measure the raw fluentd performance reading from the journal, you could disable fluentd and let a large journal accumulate:
- unlabel nodes
- wait for fluentd pods to stop
- remove /var/log/journal.pos on each node
- set JOURNAL_READ_FROM_HEAD=true in ds/logging-fluentd
- wait for journald records to accumulate
- label nodes
- wait for fluentd pods to start

Then you can see how long it takes fluentd to get to the tail of the journal. A rough command sequence is sketched below.
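A minimal sketch of that sequence, assuming the daemonset uses the default logging-infra-fluentd node selector (verify the label your daemonset actually uses):

# stop fluentd on the node by flipping the node selector label
oc label node <compute-node> logging-infra-fluentd=false --overwrite
oc get pods -n logging -w    # wait until the fluentd pod on that node is gone

# on the compute node: forget the previous read position so fluentd starts from the head
rm -f /var/log/journal.pos

# tell fluentd to read the journal from the beginning
oc set env daemonset/logging-fluentd JOURNAL_READ_FROM_HEAD=true

# ... let the test pod log for a while so a large journal accumulates ...

# restart fluentd and time how long it takes to reach the tail of the journal
oc label node <compute-node> logging-infra-fluentd=true --overwrite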
If you want to measure the raw fluentd performance in a very tightly controlled environment, controlling exactly the contents of the journal, then you can create your own journal file, something like this: https://github.com/richm/origin-aggregated-logging/blob/new-perf-test/test/performance/fluentd-to-es.sh#L329

That is, create a text file in journalctl -o export format, then feed that into /usr/lib/systemd/systemd-journal-remote to create a binary journal file, then mount that journal into the fluentd pod using a hostPath:

oc set volume daemonset/logging-fluentd --add -t hostPath --name testjournal -m /journal --path $datadir

then tell fluentd to use that journal:

oc set env daemonset/logging-fluentd JOURNAL_SOURCE=/journal/journal JOURNAL_READ_FROM_HEAD=true

A sketch of the export/convert step follows.
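A minimal sketch of generating such a journal, assuming it is seeded from the node's existing journal rather than hand-written export records ($datadir and the record count are placeholders):

# capture records in journal export format (or generate your own export-format file)
journalctl -o export -n 100000 > /tmp/test.export

# convert the export file into a binary journal under $datadir/journal, which the
# hostPath volume above mounts at /journal (so JOURNAL_SOURCE=/journal/journal matches)
mkdir -p $datadir/journal
/usr/lib/systemd/systemd-journal-remote -o $datadir/journal/test.journal /tmp/test.export

# sanity-check that the generated journal is readable
journalctl --file $datadir/journal/test.journal -n 5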
Fixed in fluentd image 3.6.173.0.32.
Setting to VERIFIED. As stated in comment 4, with fluentd image 3.6.173.0.32 we were able to sustain 750 log messages per second (1024-byte message size) from a single pod per node/fluentd. Details and pbench data are captured in a Google doc; please see the next private comment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3049