1460749 – Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.4.1
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	3.7.0
Assignee:	Noriko Hosoi
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1491947
Blocks:	1477513 1477515 1483114

Reported:	2017-06-12 15:08 UTC by Peter Portante
Modified:	2021-06-10 12:26 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Messages are read into fluentd's memory buffer and are lost if the pod is restarted, because fluentd considers them read even though they have not been pushed to storage. Consequence: Any message that fluentd has read but not yet stored is lost. Fix: Replace the memory buffer with a file-based buffer. Result: File-buffered messages are pushed to storage once fluentd restarts.
Clone Of:
Clones:	1477513 1477515 1483114
Environment:
Last Closed:	2017-11-28 21:56:55 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:3188	0	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update	2017-11-29 02:34:54 UTC

Description Peter Portante 2017-06-12 15:08:41 UTC

Using the `memory` buffer for fluentd means that a restart of the fluentd pod will cause a loss of whatever logs are in the buffer's queue if fluentd is unable to communicate with Elasticsearch.  If we use the `file` buffer, the queue is persisted to disk.
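
The difference can be illustrated with a minimal Python sketch. This is not fluentd code: the `MemoryBuffer` and `FileBuffer` classes and the `simulate_restart` helper are hypothetical names made up for illustration, and the sketch only assumes that a "restart" means the old process state is gone while files on disk remain.

```python
import json
import os
import tempfile


class MemoryBuffer:
    """Queue held only in process memory; it is gone once the process exits."""

    def __init__(self):
        self.queue = []

    def append(self, record):
        self.queue.append(record)

    def pending(self):
        return list(self.queue)


class FileBuffer:
    """Queue appended to a file on disk, so it can be re-read after a restart."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Write and sync each record before treating it as buffered.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def simulate_restart(make_buffer):
    """Buffer two records while the sink is down, then build a fresh buffer
    object (standing in for the restarted pod) and report what it still holds."""
    buf = make_buffer()
    buf.append({"msg": "a"})
    buf.append({"msg": "b"})
    del buf  # the pod is terminated before anything was flushed
    return make_buffer().pending()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "buffer.log")
        print(simulate_restart(MemoryBuffer))              # []
        print(simulate_restart(lambda: FileBuffer(path)))  # [{'msg': 'a'}, {'msg': 'b'}]
```

With the memory-backed queue, nothing is left to send after the restart. With the file-backed queue, both records are still pending and can be pushed once the destination is reachable again.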

We'll likely want to give the user an option via ansible to associate a small PV for that on-disk queue.

Comment 1 Rich Megginson 2017-07-05 15:58:57 UTC

Moving this to urgent as this is a blocker for 3.6, and it is critical for mux since there is no on-disk source of logs to recover from.

Comment 4 Rich Megginson 2017-08-01 14:52:00 UTC

Noriko, did file buffer get in for 3.6?  If so, please mark this bug as MODIFIED and include the PRs for openshift-ansible and origin-aggregated-logging, for the release-3.6 branch.

Comment 5 Noriko Hosoi 2017-08-01 16:59:49 UTC

(In reply to Rich Megginson from comment #4)
> Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> MODIFIED and include the PRs for openshift-ansible and
> origin-aggregated-logging, for the release-3.6 branch.

No, the merge has not happened yet on either the master or the release-3.6 branch...

https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
https://github.com/openshift/origin-aggregated-logging/pull/559 -- release-3.6

I noticed the pull requests have no flags like these.
  component/fluentd priority/P0 release/3.[67]
Should I have set them?  If so, could you tell me how?

Comment 6 Rich Megginson 2017-08-01 17:09:49 UTC

(In reply to Noriko Hosoi from comment #5)
> (In reply to Rich Megginson from comment #4)
> > Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> > MODIFIED and include the PRs for openshift-ansible and
> > origin-aggregated-logging, for the release-3.6 branch.
> 
> No, the merge has not happened yet on either the master or the release-3.6 branch...
> 
> https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
> https://github.com/openshift/origin-aggregated-logging/pull/559 --
> release-3.6
> 
> I noticed the pull requests have no flags like these.
>   component/fluentd priority/P0 release/3.[67]
> Should I have set them?  If so, could you tell me how?

The flags aren't really necessary. They are just helpful when looking at the list of PRs, to know at a glance what each PR is about.

Once the 3.6 branch opens for 3.6.1 PRs, we'll get this merged.

Comment 7 Xia Zhao 2017-08-21 06:54:55 UTC

The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002

Comment 8 Xia Zhao 2017-08-24 06:23:03 UTC

Reassigning to @juzhao, as he is the Trello card owner.

Comment 9 Junqi Zhao 2017-08-28 01:44:46 UTC

Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1483845

Comment 10 Junqi Zhao 2017-09-01 10:09:51 UTC

Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1487573

Comment 11 Junqi Zhao 2017-09-18 08:41:51 UTC

Verification steps:
1. Test with mux by setting the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create a project to generate logs.

3. Stop the fluentd pods, and note the last log entry for the project in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs that follow the entry noted in step 3; no logs are missing.

6. Repeat steps 3 to 5, and make sure no log is missing.
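
The "no log is missing" check in steps 5 and 6 could be automated roughly as follows. This is only a sketch under an assumption that is not part of the steps above: that the test project writes log lines carrying a consecutive integer sequence number, and that those numbers have already been extracted from the records found in Kibana or Elasticsearch. The function name `find_missing` is hypothetical.

```python
def find_missing(sequence_numbers):
    """Return, in ascending order, every sequence number that is absent
    between the smallest and the largest number actually seen."""
    seen = set(sequence_numbers)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


if __name__ == "__main__":
    # Records 4 and 5 were written while fluentd was stopped and never arrived.
    print(find_missing([1, 2, 3, 6, 7]))     # [4, 5]
    # Nothing lost: every number between the first and last record is present.
    print(find_missing([1, 2, 3, 4, 5]))     # []
```

An empty result after each stop/restart cycle corresponds to the expected outcome of step 5. Note that the check cannot detect records lost before the first or after the last sequence number seen.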

Test env
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Images:
logging-curator-v3.7.0-0.126.4.0
logging-elasticsearch-v3.7.0-0.126.4.0
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.7.0-0.126.4.0
logging-auth-proxy-v3.7.0-0.126.4.0

Comment 14 errata-xmlrpc 2017-11-28 21:56:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
