Bug 1460749 - Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Summary: Data loss of logs can occur if fluentd pod is terminated/restarted when Elast...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Noriko Hosoi
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1491947
Blocks: 1477513 1477515 1483114
Reported: 2017-06-12 15:08 UTC by Peter Portante
Modified: 2017-11-28 21:56 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's memory buffer and are lost if the pod is restarted, because fluentd considers them read even though they have not been pushed to storage.
Consequence: Any message that fluentd has read but not yet stored is lost.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
Clone Of:
Clones: 1477513 1477515 1483114
Environment:
Last Closed: 2017-11-28 21:56:55 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Peter Portante 2017-06-12 15:08:41 UTC
Using the `memory` buffer for fluentd means that a restart of the fluentd pod will cause a loss of whatever logs are in the buffer's queue if fluentd is unable to communicate with Elasticsearch.  If we use the `file` buffer, the queue is persisted to disk.

We'll likely want to give the user an option via ansible to associate a small PV for that on-disk queue.
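
The difference between the two buffer types described above can be illustrated with a small sketch. This is not fluentd's implementation; `MemoryBuffer`, `FileBuffer`, `simulate_restart`, and the `buffer.jsonl` file name are hypothetical names chosen for the illustration. It only assumes that a restart discards in-process state while files on disk survive.

```python
import json
import os
import tempfile
from collections import deque


class MemoryBuffer:
    """Hypothetical queue held only in process memory."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, record):
        self.queue.append(record)

    def pending(self):
        return list(self.queue)


class FileBuffer:
    """Hypothetical queue persisted as one JSON record per line on disk."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the record reaches the disk

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def simulate_restart():
    """Queue records while the backend is unreachable, then 'restart'."""
    records = [{"seq": i, "message": f"log line {i}"} for i in range(3)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffer.jsonl")
        mem, disk = MemoryBuffer(), FileBuffer(path)
        for record in records:
            mem.enqueue(record)
            disk.enqueue(record)
        # A restart creates fresh objects; only the file on disk remains.
        mem, disk = MemoryBuffer(), FileBuffer(path)
        return mem.pending(), disk.pending()
```

After the simulated restart, the memory buffer has nothing left to send, while the file buffer still holds every queued record and could flush them once the backend is reachable again.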

Comment 1 Rich Megginson 2017-07-05 15:58:57 UTC
Moving this to urgent as this is a blocker for 3.6, and it is critical for mux since there is no on-disk source of logs to recover from.

Comment 4 Rich Megginson 2017-08-01 14:52:00 UTC
Noriko, did file buffer get in for 3.6?  If so, please mark this bug as MODIFIED and include the PRs for openshift-ansible and origin-aggregated-logging, for the release-3.6 branch.

Comment 5 Noriko Hosoi 2017-08-01 16:59:49 UTC
(In reply to Rich Megginson from comment #4)
> Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> MODIFIED and include the PRs for openshift-ansible and
> origin-aggregated-logging, for the release-3.6 branch.

No, the merge has not happened yet on either the master or the release-3.6 branch...

https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
https://github.com/openshift/origin-aggregated-logging/pull/559 -- release-3.6

I noticed the pull requests have no flags like these:
  component/fluentd priority/P0 release/3.[67]
Should I have set them?  If so, could you tell me how?

Comment 6 Rich Megginson 2017-08-01 17:09:49 UTC
(In reply to Noriko Hosoi from comment #5)
> (In reply to Rich Megginson from comment #4)
> > Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> > MODIFIED and include the PRs for openshift-ansible and
> > origin-aggregated-logging, for the release-3.6 branch.
> 
> No, the merge has not happened yet on either the master or the release-3.6 branch...
> 
> https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
> https://github.com/openshift/origin-aggregated-logging/pull/559 --
> release-3.6
> 
> I noticed the pull requests have no flags like these:
>   component/fluentd priority/P0 release/3.[67]
> Should I have set them?  If so, could you tell me how?

The flags aren't really necessary; they are just helpful when looking at the list of PRs to know at a glance what each PR is about.

Once the 3.6 branch opens for 3.6.1 PRs, we'll get this merged.

Comment 7 Xia Zhao 2017-08-21 06:54:55 UTC
The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002

Comment 8 Xia Zhao 2017-08-24 06:23:03 UTC
Reassigning to @juzhao, as he is the Trello card owner.

Comment 9 Junqi Zhao 2017-08-28 01:44:46 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1483845

Comment 10 Junqi Zhao 2017-09-01 10:09:51 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1487573

Comment 11 Junqi Zhao 2017-09-18 08:41:51 UTC
Verification steps:
1. Test with mux by setting the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create a project to populate logs.

3. Stop the fluentd pods, and note the last project logs shown in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs that follow those noted in step 3; no logs should be missing.

6. Repeat steps 3 to 5 and make sure no logs are missing.
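
The "no logs are missing" check in steps 5 and 6 can be made mechanical if the test project writes numbered log lines. The sketch below assumes exactly that: the sequence numbers and the `find_missing` helper are hypothetical and not part of the original verification, and how the numbers are pulled out of Kibana or Elasticsearch is left open.

```python
def find_missing(seen, first, last):
    """Return the sequence numbers in [first, last] that are absent from `seen`."""
    seen_set = set(seen)
    return [n for n in range(first, last + 1) if n not in seen_set]


# Example: line 4 was emitted while fluentd was stopped and never arrived.
missing = find_missing([1, 2, 3, 5, 6], first=1, last=6)
print(missing)  # [4]
```

An empty result across the stop/restart window would correspond to the expected outcome of step 5.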

Test env
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Images:
logging-curator-v3.7.0-0.126.4.0
logging-elasticsearch-v3.7.0-0.126.4.0
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.7.0-0.126.4.0
logging-auth-proxy-v3.7.0-0.126.4.0

Comment 14 errata-xmlrpc 2017-11-28 21:56:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

