Bug 1483114 - Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assigned To: Noriko Hosoi
QA Contact: Junqi Zhao
Keywords: UpcomingRelease
Depends On: 1460749
Blocks: 1477513 1477515
Reported: 2017-08-18 14:46 EDT by Jeff Cantrill
Modified: 2017-10-25 09:04 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's memory buffer and are lost if the pod is restarted, because fluentd considers them read even though they have not been pushed to storage.
Consequence: Any message already read by fluentd but not yet stored is lost.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
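The difference between the two buffer types can be illustrated with a minimal Python sketch. It is a toy model and not fluentd's implementation: the `MemoryBuffer` and `FileBuffer` classes, the `simulate_restart` helper, and the `buffer.log` file name are all hypothetical names introduced for illustration.

```python
import json
import os
import tempfile


class MemoryBuffer:
    """Holds records only in process memory (hypothetical model)."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def pending(self):
        return list(self.records)


class FileBuffer:
    """Appends each record to a file so it survives a restart (hypothetical model)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


def simulate_restart(make_buffer, records):
    """Buffer records while storage is down, then build a fresh buffer as a restart would."""
    first = make_buffer()
    for record in records:
        first.append(record)
    del first  # the process is terminated; in-memory state is gone
    restarted = make_buffer()
    return restarted.pending()


records = [{"seq": i, "message": f"log line {i}"} for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    buffer_path = os.path.join(tmp, "buffer.log")
    lost = simulate_restart(MemoryBuffer, records)
    kept = simulate_restart(lambda: FileBuffer(buffer_path), records)

print(len(lost), len(kept))
```

In this model the memory buffer has nothing left to send after the restart, while the file buffer still holds every record that was read before the restart.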
Story Points: ---
Clone Of: 1460749
Environment:
Last Closed: 2017-10-25 09:04:36 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Github origin-aggregated-logging/pull/559 None None None 2017-08-18 14:47 EDT
Red Hat Product Errata RHBA-2017:3049 normal SHIPPED_LIVE OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update 2017-10-25 11:57:15 EDT

Comment 1 Jeff Cantrill 2017-08-18 14:48:06 EDT
fixed in PR origin-aggregated-logging/pull/559
Comment 2 Rich Megginson 2017-08-23 16:52:25 EDT
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=587924
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-23619-20170823203852
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5-10
Comment 3 Xia Zhao 2017-08-24 02:23:19 EDT
Reassigning to @juzhao, as he is the Trello card owner.
Comment 4 Junqi Zhao 2017-08-29 04:47:59 EDT
@Noriko

I have two questions:
1. Do we still have openshift_logging_use_mux_client parameter?
From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0

There is openshift_logging_use_mux_client: False in roles/openshift_logging_fluentd/defaults/main.yml

But it can no longer be found.

ansible playbooks version
openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch

I don't think it is necessary to test with mux. Do you agree?

2. Since we can only use the file buffer now, I think the only way to verify this defect is:
1) stop fluentd pods for a while and restart them later.
2) verify fluentd is still able to communicate with Elasticsearch.
3) verify that the logs stored in file/PV can be retrieved by Kibana and that no log is missing.

Do you have a better way to verify it?
Comment 5 Rich Megginson 2017-08-29 09:15:58 EDT
(In reply to Junqi Zhao from comment #4)
> @Noriko
> 
> I have two questions:
> 1. Do we still have openshift_logging_use_mux_client parameter?
> From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
> https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0
> 
> There is openshift_logging_use_mux_client: False in
> roles/openshift_logging_fluentd/defaults/main.yml
> 
> But it can no longer be found.

Right.  In 3.6.1 we got rid of that setting because there are now multiple mux client modes:

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_logging#mux---secure_forward-listener-service

"openshift_logging_mux_client_mode: Values - minimal, maximal. Default is unset. Setting this value will cause the Fluentd node agent to send logs to mux rather than directly to Elasticsearch. The value maximal means that Fluentd will do as much processing as possible at the node before sending the records to mux. This is the current recommended way to use mux due to current scaling issues. The value minimal means that Fluentd will do no processing at all, and send the raw logs to mux for processing. We do not currently recommend using this mode, and ansible will warn you about this."

When testing mux, use `openshift_logging_mux_client_mode=maximal`
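As a rough illustration of the setting described above, the following Python sketch checks an inventory fragment for a valid `openshift_logging_mux_client_mode`. It is not part of openshift-ansible: the helper names `parse_inventory_vars` and `check_mux_client_mode` are hypothetical, and the `[OSEv3:vars]` group header is only assumed for the example.

```python
VALID_MUX_CLIENT_MODES = {"minimal", "maximal"}


def parse_inventory_vars(text):
    """Collect simple key=value settings from INI-style inventory text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, group headers, and lines without a setting.
        if not line or line.startswith(("#", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_mux_client_mode(settings):
    """Return a (level, message) pair describing the configured mode."""
    mode = settings.get("openshift_logging_mux_client_mode")
    if mode is None:
        return ("info", "unset: Fluentd sends logs directly to Elasticsearch")
    if mode not in VALID_MUX_CLIENT_MODES:
        return ("error", f"unknown mode {mode!r}; expected minimal or maximal")
    if mode == "minimal":
        return ("warning", "minimal: raw logs are sent to mux; not currently recommended")
    return ("ok", "maximal: Fluentd processes records at the node before sending to mux")


# Assumed inventory fragment for illustration only.
inventory = """
[OSEv3:vars]
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal
"""

level, message = check_mux_client_mode(parse_inventory_vars(inventory))
print(level, message)
```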

> 
> ansible playbooks version
> openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch
> 
> I don't think it is necessary to test with mux. Do you agree?

We would prefer testing with mux, as mux has no persistence at all and is the most vulnerable to data loss.

> 
> 2. Since we can only use the file buffer now, I think the only way to verify
> this defect is:
> 1) stop fluentd pods for a while and restart them later.
> 2) verify fluentd is still able to communicate with Elasticsearch.
> 3) verify that the logs stored in file/PV can be retrieved by Kibana and that
> no log is missing.
> 
> Do you have a better way to verify it?
Comment 6 Junqi Zhao 2017-09-01 04:43:26 EDT
Verification steps:
1. Test with mux by setting the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create a project to generate logs.

3. Stop the fluentd pods, and note the last project logs shown in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs that follow those noted in step 3 and confirm that no logs are missing.
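
The check in step 5 could be automated along the following lines. This is a minimal sketch that assumes the test project writes a monotonically increasing sequence number into each log line, which the steps above do not state; `find_missing` is a hypothetical helper and the sample values are made up.

```python
def find_missing(sequence_numbers):
    """Return the sequence numbers absent between the smallest and largest seen."""
    seen = set(sequence_numbers)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Made-up sample data: numbers seen in Kibana before the stop and after the restart.
before_stop = [1, 2, 3]
after_restart_complete = [4, 5, 6, 7]
after_restart_with_gap = [6, 7]

complete = find_missing(before_stop + after_restart_complete)
with_gap = find_missing(before_stop + after_restart_with_gap)
print(complete, with_gap)
```

An empty result means every number between the first and last record was found. Any numbers returned are the records that never reached Elasticsearch.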

Test env
# openshift version
openshift v3.6.173.0.21
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images:
logging-curator-v3.6.173.0.21-15
logging-elasticsearch-v3.6.173.0.21-15
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.6.173.0.21-15
logging-auth-proxy-v3.6.173.0.21-15
Comment 7 Junqi Zhao 2017-09-01 04:44:52 EDT
(In reply to Junqi Zhao from comment #6)
> Verification steps:
> 1. Use mux to test, set the following parameters in inventory file
> openshift_logging_use_mux=true
> openshift_logging_mux_client_mode=maximal
> 
> 2. Create a project to generate logs.
> 
> 3. Stop fluentd pods, and note down the last project logs in kibana
> 
> 4. Wait for a while, and restart fluentd pods.
> 
> 5. Check the logs that follow those noted in step 3 and confirm that no logs are missing.

Add step 6:
  6. Repeat steps 3 to 5 and make sure no log is missing.
> Test env
> # openshift version
> openshift v3.6.173.0.21
> kubernetes v1.6.1+5115d708d7
> etcd 3.2.1
> 
> Images:
> logging-curator-v3.6.173.0.21-15
> logging-elasticsearch-v3.6.173.0.21-15
> logging-fluentd-v3.6.173.0.28-1
> logging-kibana-v3.6.173.0.21-15
> logging-auth-proxy-v3.6.173.0.21-15
Comment 8 Junqi Zhao 2017-09-01 05:32:26 EDT
@nhosoi
One more question: in the fluentd pod log, the "'exclude1' parameter is deprecated" warning has already been reported in another defect. Is the following warning message also expected?
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

We didn't see this before.

# oc logs logging-fluentd-hnphl
umounts of dead containers will fail. Ignoring...
umount: /var/lib/docker/containers/*/shm: mountpoint not found
2017-09-01 03:28:26 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-09-01 03:28:28 -0400 [warn]: 'exclude1' parameter is deprecated: Use <exclude> section
2017-09-01 03:28:28 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
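
The behavior the 'block' warning describes, where input stops until the full buffer has room again, can be pictured with a bounded queue. The Python sketch below is only an analogy built on the standard `queue` module, not fluentd's code; the queue size and chunk names are made up.

```python
import queue

# A bounded queue standing in for a buffer that holds at most two chunks.
buffer_queue = queue.Queue(maxsize=2)
buffer_queue.put("chunk-1")
buffer_queue.put("chunk-2")

# With the buffer full, a blocking put cannot proceed. The timeout lets the
# example observe the stall instead of waiting forever.
try:
    buffer_queue.put("chunk-3", timeout=0.1)
    blocked = False
except queue.Full:
    blocked = True

# Once the output side drains one chunk, the input can continue.
flushed = buffer_queue.get()
buffer_queue.put("chunk-3", timeout=0.1)

print(blocked, flushed, buffer_queue.qsize())
```

Blocking the input trades throughput for safety: nothing is dropped while the buffer is full, but new records are not read until space is freed.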
Comment 9 Noriko Hosoi 2017-09-01 10:56:07 EDT
Hi Junqi,

Regarding exclude1, PR #629 has been submitted and reviewed.
I will look into the 'block' action warning next.
Thanks.
Comment 11 errata-xmlrpc 2017-10-25 09:04:36 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
