Bug 1460749 - Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Noriko Hosoi
QA Contact: Junqi Zhao
Depends On: 1491947
Blocks: 1477513 1477515 1483114
Reported: 2017-06-12 11:08 EDT by Peter Portante
Modified: 2017-11-28 16:56 EST
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's memory buffer and are considered read even though they have not yet been pushed to storage.
Consequence: Any message already read by fluentd but not yet stored is lost if the pod is restarted.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 16:56:55 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2017:3188
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-28 21:34:54 EST

Description Peter Portante 2017-06-12 11:08:41 EDT
Using the `memory` buffer for fluentd means that a restart of the fluentd pod will cause a loss of whatever logs are in the buffer's queue if fluentd is unable to communicate with Elasticsearch.  If we use the `file` buffer, the queue is persisted to disk.
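
For illustration, here is a minimal sketch of the kind of output stanza involved, switched from the memory buffer to a file buffer. The path and limits are assumptions for the example, not the values shipped in the logging-fluentd image:

  # Illustrative fluentd (v0.12-style) Elasticsearch output -- values below are assumptions
  <match **>
    @type elasticsearch
    host logging-es
    port 9200
    # buffer_type memory loses queued chunks when the pod stops;
    # buffer_type file persists them to disk instead
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es   # hypothetical path
    buffer_chunk_limit 8m
    buffer_queue_limit 32
    flush_interval 5s
    retry_wait 1s
  </match>

A file buffer only helps across restarts if buffer_path sits on storage that outlives the pod (a hostPath mount or a PV), which is what the next point is about.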

We'll likely want to give the user an option via ansible to associate a small PV for that on-disk queue.
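
As a rough sketch of what that option could look like for a single collector such as mux, the playbooks might create a small PVC and mount it at the buffer path. Everything below (names, size) is hypothetical and not an existing openshift-ansible variable or template:

  # Hypothetical PVC backing the on-disk fluentd buffer (name/size are illustrative only)
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: logging-fluentd-buffer
    namespace: logging
  spec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi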
Comment 1 Rich Megginson 2017-07-05 11:58:57 EDT
Moving this to urgent as this is a blocker for 3.6, and it is critical for mux since there is no on-disk source of logs to recover from.
Comment 4 Rich Megginson 2017-08-01 10:52:00 EDT
Noriko, did file buffer get in for 3.6?  If so, please mark this bug as MODIFIED and include the PRs for openshift-ansible and origin-aggregated-logging, for the release-3.6 branch.
Comment 5 Noriko Hosoi 2017-08-01 12:59:49 EDT
(In reply to Rich Megginson from comment #4)
> Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> MODIFIED and include the PRs for openshift-ansible and
> origin-aggregated-logging, for the release-3.6 branch.

No, the merge has not happened to either the master or the release-3.6 branch yet...

https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
https://github.com/openshift/origin-aggregated-logging/pull/559 -- release-3.6

I noticed the pull requests have no flags like these:
  component/fluentd priority/P0 release/3.[67]
Should I have set them?  If so, could you tell me how?
Comment 6 Rich Megginson 2017-08-01 13:09:49 EDT
(In reply to Noriko Hosoi from comment #5)
> (In reply to Rich Megginson from comment #4)
> > Noriko, did file buffer get in for 3.6?  If so, please mark this bug as
> > MODIFIED and include the PRs for openshift-ansible and
> > origin-aggregated-logging, for the release-3.6 branch.
> 
> No, the merge has not happened to either the master or the release-3.6 branch yet...
> 
> https://github.com/openshift/origin-aggregated-logging/pull/556 -- master
> https://github.com/openshift/origin-aggregated-logging/pull/559 --
> release-3.6
> 
> I noticed the pull requests have no flags like these:
>   component/fluentd priority/P0 release/3.[67]
> Should I have set them?  If so, could you tell me how?

The flags aren't really necessary; they are just helpful when looking at the list of PRs, to know at a glance what each PR is about.

Once the 3.6 branch opens for 3.6.1 PRs, we'll get this merged.
Comment 7 Xia Zhao 2017-08-21 02:54:55 EDT
The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002
Comment 8 Xia Zhao 2017-08-24 02:23:03 EDT
Reassigning to @juzhao as he is the Trello card owner.
Comment 9 Junqi Zhao 2017-08-27 21:44:46 EDT
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1483845
Comment 10 Junqi Zhao 2017-09-01 06:09:51 EDT
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1487573
Comment 11 Junqi Zhao 2017-09-18 04:41:51 EDT
Verification steps:
1. Use mux to test; set the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create one project to populate logs.

3. Stop the fluentd pods (for example with the commands sketched after these steps), and note down the last project log entries shown in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs generated after step 3; no logs are missing.

6. Repeat steps 3 to 5 and make sure no logs are missing.
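
The stop/restart in steps 3 and 4 can be done by flipping the fluentd daemonset's node selector label, assuming the default logging-infra-fluentd selector used by the logging deployer (adjust the label and node selection to your environment):

  # Stop the fluentd pods on all nodes (assumes the default node selector)
  oc label nodes --all logging-infra-fluentd=false --overwrite

  # ...generate more logs in the test project while fluentd is down...

  # Restart the fluentd pods
  oc label nodes --all logging-infra-fluentd=true --overwrite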

Test env
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Images:
logging-curator-v3.7.0-0.126.4.0
logging-elasticsearch-v3.7.0-0.126.4.0
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.7.0-0.126.4.0
logging-auth-proxy-v3.7.0-0.126.4.0
Comment 14 errata-xmlrpc 2017-11-28 16:56:55 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
