Description of problem:
When fluentd is reading from journald and the output buffer queue is full, fluent-plugin-systemd will start dropping log records. What it should do instead is back off, stop reading from the journal, and wait for the queue to drain before reading and submitting more records.

I have filed an upstream issue for this: https://github.com/reevoo/fluent-plugin-systemd/issues/37

In the meantime, the workaround is to add `buffer_queue_full_action block` to all of our output plugins, including the secure_forward configuration (a sketch follows below).

Version-Release number of selected component (if applicable):

How reproducible:
Difficult to reproduce - you have to have a very loaded fluentd reading at a high rate from the journal. Using JOURNAL_READ_FROM_HEAD=true can help to reproduce. You will see one of these errors in the fluentd log:

Exception emitting record: BufferQueueLimitError queue size exceeds limit

or

Exception emitting record: BufferOverflowError buffer space has too many data

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
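For illustration, a minimal sketch of what the workaround looks like in an elasticsearch output <match> block; the host, port, and buffer path here are placeholders, not the values shipped in the logging-fluentd image:

  <match **>
    @type elasticsearch
    # host, port, and buffer_path below are placeholders
    host logging-es
    port 9200
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es
    buffer_queue_limit 32
    buffer_chunk_limit 8m
    # the important part: block the input instead of raising an error
    # when the output queue is full (the default action is "exception")
    buffer_queue_full_action block
  </match>

The same setting has to be added to every buffered output, including the secure_forward one.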
https://github.com/openshift/origin-aggregated-logging/pull/538
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/329021457bc55b1f55d3488ccfa2fcf258e9da47
Bug 1473788 - fluentd drops records when using journal and output queue is full
https://bugzilla.redhat.com/show_bug.cgi?id=1473788

Workaround for now is to make sure to set `buffer_queue_full_action block` when reading from the journal.
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=575981
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-43694-20170723174502
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.167
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.167-2
Hi Rich,

Could you please help review the test steps below and let me know if there are any additional tests you would suggest? Thanks!

Test steps:
1. Deployed logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file.
2. To increase fluentd's journal read load, created 30 projects on OpenShift with 38 pods running in total; each pod sent out logs continuously:

# for i in {1..30}; do
>   oc new-project project${i}
>   oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
>   sleep 30
> done

# oc get po --all-namespaces | grep -i running | wc -l
38

3. Waited about 1 hour until every index could be found in ES, then searched the fluentd logs for "BufferQueueLimitError" - no findings (a sketch of checking this across all fluentd pods follows this comment).

--Attached the fluentd log here.

Test env:
# openshift version
openshift v3.6.171
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images tested with:
logging-fluentd   v3.6   ff6b9ae7d3e1   5 hours ago   232.2 MB
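As a rough sketch of how step 3 can be checked across all the fluentd pods (the "logging" namespace and the component=fluentd label are assumptions based on a default openshift_logging deployment):

  for pod in $(oc get pods -n logging -l component=fluentd -o name); do
    # count occurrences of the error in each collector's log; expect 0 after the fix
    oc logs -n logging $pod | grep -c BufferQueueLimitError
  done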
Created attachment 1305848 [details] the fluentd log found in env with bug fix
Yes, this looks good.

2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

This is good. This means the setting is there.

I'm not sure what all of those KubeClient messages are - I don't think they should be filling up the log like that, but it is not related to this bug. Please file a bug about that.
Thanks for the confirmation, Rich. I will file another bug about the KubeClient messages issue, and am setting this bz to VERIFIED according to comment #4.
@Rich, FYI. The related KubeClient messages bz was created here: https://bugzilla.redhat.com/show_bug.cgi?id=1476731 Thanks, Xia
I updated the title because we confirmed this issue whilst using json in bug 1486473.
Hi, can we set `buffer_queue_full_action block` in the logging-fluentd configmap as a workaround?
(In reply to Steven Walter from comment #10)
> Hi, We can set buffer_queue_full_action block in the logging-fluentd
> configmap as a workaround?

Yes, you can, but you also have to copy the entire output-operations.conf, output-applications.conf, output-es-config.conf, and output-es-ops-config.conf from the image into the configmap.
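A rough sketch of one way to do that (the /etc/fluent/configs.d/openshift path inside the image and the component=fluentd label are assumptions; adjust to the actual deployment):

  # grab a running fluentd pod in the logging project
  pod=$(oc get pods -n logging -l component=fluentd -o jsonpath='{.items[0].metadata.name}')

  # copy the output configs out of the running image
  for f in output-operations.conf output-applications.conf output-es-config.conf output-es-ops-config.conf; do
    oc exec -n logging $pod -- cat /etc/fluent/configs.d/openshift/$f > $f
  done

  # add "buffer_queue_full_action block" to the buffer settings in each file,
  # then put the edited files into the logging-fluentd configmap, for example via:
  oc edit configmap logging-fluentd -n logging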
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049