Bug 1460749

Summary: Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Product: OpenShift Container Platform Reporter: Peter Portante <pportant>
Component: Logging Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.4.1 CC: aos-bugs, jcantril, nhosoi, pportant, pweil, rmeggins, rromerom
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's memory buffer and are lost if the pod is restarted, because fluentd considers them read even though they have not been pushed to storage.
Consequence: Any message already read by fluentd but not yet stored is lost.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
Story Points: ---
Clone Of:
: 1477513 1477515 1483114 Environment:
Last Closed: 2017-11-28 21:56:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1491947    
Bug Blocks: 1477513, 1477515, 1483114    

Description Peter Portante 2017-06-12 15:08:41 UTC
Using the `memory` buffer for fluentd means that a restart of the fluentd pod will cause a loss of whatever logs are in the buffer's queue if fluentd is unable to communicate with Elasticsearch.  If we use the `file` buffer, the queue is persisted to disk.

We'll likely want to give the user an option via ansible to associate a small PV for that on-disk queue.
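
A minimal sketch of the difference described above, written in Python rather than as fluentd configuration. The class names `MemoryBuffer` and `FileBuffer` and the helper `simulate_restart` are hypothetical and do not mirror fluentd's plugin internals; the sketch only illustrates why a queue kept on disk survives a process restart while an in-memory queue does not.

```python
import json
import os
import tempfile


class MemoryBuffer:
    """Queue held only in process memory; its contents vanish on restart."""

    def __init__(self):
        self.queue = []

    def enqueue(self, record):
        self.queue.append(record)

    def pending(self):
        return list(self.queue)


class FileBuffer:
    """Queue appended to a file; a new process can reload it from disk."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, record):
        # Append one JSON record per line and force it to disk.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


def simulate_restart(buffer_factory, records):
    """Queue records while the sink is unreachable, then 'restart'.

    Returns whatever a freshly created buffer can still see.
    """
    first = buffer_factory()
    for record in records:
        first.enqueue(record)
    del first  # stands in for the pod being terminated
    restarted = buffer_factory()
    return restarted.pending()


if __name__ == "__main__":
    records = [{"seq": i, "msg": f"line {i}"} for i in range(3)]
    print("memory buffer after restart:", simulate_restart(MemoryBuffer, records))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffer.log")
        print("file buffer after restart:",
              simulate_restart(lambda: FileBuffer(path), records))
```

The file-backed variant pays for durability with a disk write per record, which is why the size and location of the backing storage matter.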

Comment 1 Rich Megginson 2017-07-05 15:58:57 UTC
Moving this to urgent as this is a blocker for 3.6, and it is critical for mux since there is no on-disk source of logs to recover from.

Comment 4 Rich Megginson 2017-08-01 14:52:00 UTC
Noriko, did file buffer get in for 3.6?  If so, please mark this bug as MODIFIED and include the PRs for openshift-ansible and origin-aggregated-logging, for the release-3.6 branch.

Comment 5 Noriko Hosoi 2017-08-01 16:59:49 UTC
(In reply to Rich Megginson from comment #4)
> Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> MODIFIED and include the PRs for openshift-ansible and
> origin-aggregated-logging, for the release-3.6 branch.

No, the merge has not happened on either the master or the release-3.6 branch.

https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
https://github.com/openshift/origin-aggregated-logging/pull/559 -- release-3.6

I noticed the pull requests have no flags like these.
  component/fluentd priority/P0 release/3.[67]
Should I have set them?  If so, could you tell me how?

Comment 6 Rich Megginson 2017-08-01 17:09:49 UTC
(In reply to Noriko Hosoi from comment #5)
> (In reply to Rich Megginson from comment #4)
> > Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> > MODIFIED and include the PRs for openshift-ansible and
> > origin-aggregated-logging, for the release-3.6 branch.
> 
> No, the merge has not happened on either the master or the release-3.6 branch.
> 
> https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
> https://github.com/openshift/origin-aggregated-logging/pull/559 --
> release-3.6
> 
> I noticed the pull requests have no flags like these.
>   component/fluentd priority/P0 release/3.[67]
> Should I have set them?  If so, could you tell me how?

The flags aren't really necessary; they are just helpful when looking at the list of PRs, to know at a glance what each PR is about.

Once the 3.6 branch opens for 3.6.1 PRs, we'll get this merged.

Comment 7 Xia Zhao 2017-08-21 06:54:55 UTC
The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002

Comment 8 Xia Zhao 2017-08-24 06:23:03 UTC
Reassigning to @juzhao, as he is the Trello card owner.

Comment 9 Junqi Zhao 2017-08-28 01:44:46 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1483845

Comment 10 Junqi Zhao 2017-09-01 10:09:51 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1487573

Comment 11 Junqi Zhao 2017-09-18 08:41:51 UTC
Verification steps:
1. Test with mux: set the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create one project to populate logs.

3. Stop the fluentd pods, and note the last log entries for the project shown in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs that follow the entries noted in step 3; no logs are missing.

6. Repeat steps 3 to 5 and make sure no log is missing.
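
The check in steps 3 to 5 amounts to confirming that the log sequence has no gaps across the restart. A minimal sketch of that comparison, assuming the test project emits log lines carrying a monotonically increasing sequence number (an assumption; the steps above do not say how the logs were generated). The function name `find_missing` is hypothetical.

```python
def find_missing(seen_ids, last_before_stop, last_after_restart):
    """Return sequence numbers that never reached storage.

    seen_ids: sequence numbers found in Kibana/Elasticsearch.
    last_before_stop: last number noted before fluentd was stopped (step 3).
    last_after_restart: a number observed after fluentd restarted (step 5).
    """
    seen = set(seen_ids)
    return [n for n in range(last_before_stop + 1, last_after_restart + 1)
            if n not in seen]


if __name__ == "__main__":
    # Hypothetical data: entry 4 was dropped while fluentd was down.
    print(find_missing([1, 2, 3, 5, 6], last_before_stop=2, last_after_restart=6))
```

An empty result corresponds to the "no logs are missing" outcome in step 5.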

Test env
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Images:
logging-curator-v3.7.0-0.126.4.0
logging-elasticsearch-v3.7.0-0.126.4.0
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.7.0-0.126.4.0
logging-auth-proxy-v3.7.0-0.126.4.0

Comment 14 errata-xmlrpc 2017-11-28 21:56:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188