Description of problem:
When fluentd is reading from journald and the output buffer queue is full, fluent-plugin-systemd will start dropping log records. What it should do instead is back off, stop reading from the journal, and wait for the queue to drain before reading and submitting more records.

I have filed an upstream issue for this: https://github.com/reevoo/fluent-plugin-systemd/issues/37

In the meantime, the workaround is to add `buffer_queue_full_action block` to all of our output plugins, including the secure_forward configuration (a sketch follows below).

Version-Release number of selected component (if applicable):

How reproducible:
Difficult to reproduce - you have to have a very loaded fluentd reading at a high rate from the journal. Using JOURNAL_READ_FROM_HEAD=true can help to reproduce. You will see one of these errors in the fluentd log:

Exception emitting record: BufferQueueLimitError queue size exceeds limit

or

Exception emitting record: BufferOverflowError buffer space has too many data

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
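For illustration, a minimal sketch of what the workaround looks like in an elasticsearch output <match> block; the host, port, and buffer path here are placeholders, not the values shipped in the logging-fluentd image:

  <match **>
    @type elasticsearch
    # host, port, and buffer_path below are placeholders
    host logging-es
    port 9200
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es
    buffer_queue_limit 32
    buffer_chunk_limit 8m
    # the important part: block the input instead of raising an error
    # when the output queue is full (the default action is "exception")
    buffer_queue_full_action block
  </match>

The same setting has to be added to every buffered output, including the secure_forward one.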
https://github.com/openshift/origin-aggregated-logging/pull/538
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/329021457bc55b1f55d3488ccfa2fcf258e9da47
Bug 1473788 - fluentd drops records when using journal and output queue is full
https://bugzilla.redhat.com/show_bug.cgi?id=1473788

Workaround for now is to make sure to set `buffer_queue_full_action block` when reading from the journal.
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=575981
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-43694-20170723174502
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.167
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.167-2
Hi Rich,

Could you please help review the test steps below and let me know if there are any additional tests you would suggest? Thanks!

Test steps:
1. Deployed logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file.
2. To increase fluentd's journal read load, created 30 projects on OpenShift with 38 pods running in total; each pod sent out logs continuously:

# for i in {1..30}; do
>   oc new-project project${i}
>   oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
>   sleep 30
> done

# oc get po --all-namespaces | grep -i running | wc -l
38

3. Waited about 1 hour until every index could be found in ES, then searched the fluentd logs for "BufferQueueLimitError" - no findings (a sketch of checking this across all fluentd pods follows this comment).

--Attached the fluentd log here.

Test env:
# openshift version
openshift v3.6.171
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images tested with:
logging-fluentd   v3.6   ff6b9ae7d3e1   5 hours ago   232.2 MB
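As a rough sketch of how step 3 can be checked across all the fluentd pods (the "logging" namespace and the component=fluentd label are assumptions based on a default openshift_logging deployment):

  for pod in $(oc get pods -n logging -l component=fluentd -o name); do
    # count occurrences of the error in each collector's log; expect 0 after the fix
    oc logs -n logging $pod | grep -c BufferQueueLimitError
  done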
Created attachment 1305848 [details] the fluentd log found in env with bug fix
Yes, this looks good.

2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

This is good. This means the setting is there.

I'm not sure what all of those KubeClient messages are - I don't think they should be filling up the log like that, but it is not related to this bug. Please file a bug about that.
Thanks for the confirmation, Rich. I will file another bug about the KubeClient messages issue, and am setting this bz to VERIFIED according to comment #4.
@Rich, FYI. The related KubeClient messages bz was created here: https://bugzilla.redhat.com/show_bug.cgi?id=1476731 Thanks, Xia
I updated the title because we confirmed this issue whilst using json in bug 1486473.
Hi, can we set `buffer_queue_full_action block` in the logging-fluentd configmap as a workaround?
(In reply to Steven Walter from comment #10)
> Hi, We can set buffer_queue_full_action block in the logging-fluentd
> configmap as a workaround?

Yes, you can, but you also have to copy the entire output-operations.conf, output-applications.conf, output-es-config.conf, and output-es-ops-config.conf from the image into the configmap.
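A rough sketch of one way to do that (the /etc/fluent/configs.d/openshift path inside the image and the component=fluentd label are assumptions; adjust to the actual deployment):

  # grab a running fluentd pod in the logging project
  pod=$(oc get pods -n logging -l component=fluentd -o jsonpath='{.items[0].metadata.name}')

  # copy the output configs out of the running image
  for f in output-operations.conf output-applications.conf output-es-config.conf output-es-ops-config.conf; do
    oc exec -n logging $pod -- cat /etc/fluent/configs.d/openshift/$f > $f
  done

  # add "buffer_queue_full_action block" to the buffer settings in each file,
  # then put the edited files into the logging-fluentd configmap, for example via:
  oc edit configmap logging-fluentd -n logging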
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049