Bug 1483114

Summary: Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.1
Target Release: 3.6.z
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jeff Cantrill <jcantril>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Junqi Zhao <juzhao>
CC: aos-bugs, jcantril, nhosoi, pportant, pweil, rmeggins, rromerom, xiazhao
Whiteboard: UpcomingRelease
Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's in-memory buffer; fluentd considers them read even though they have not yet been pushed to storage, so they are lost if the pod is restarted.
Consequence: Any message already read by fluentd but not yet stored is lost.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
Clone Of: 1460749
Last Closed: 2017-10-25 13:04:36 UTC
Type: Bug
Bug Depends On: 1460749    
Bug Blocks: 1477513, 1477515    

Comment 1 Jeff Cantrill 2017-08-18 18:48:06 UTC
Fixed in PR origin-aggregated-logging/pull/559.
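
For reference, a minimal sketch of what the change amounts to in a fluentd 0.12 elasticsearch <match> block; the buffer path and limits below are illustrative examples, not values taken from the PR:

  <match **>
    @type elasticsearch
    # Before the fix the output buffered records in memory (buffer_type memory),
    # so any chunk not yet flushed to Elasticsearch was lost on pod restart.
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es   # survives restarts when backed by a hostPath volume
    buffer_chunk_limit 8m
    buffer_queue_limit 32
    flush_interval 5s
    retry_wait 1s
  </match>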

Comment 2 Rich Megginson 2017-08-23 20:52:25 UTC
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=587924
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-23619-20170823203852
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5-10

Comment 3 Xia Zhao 2017-08-24 06:23:19 UTC
Reassigning to @juzhao as he is the Trello card owner.

Comment 4 Junqi Zhao 2017-08-29 08:47:59 UTC
@Noriko

I have two questions:
1. Do we still have the openshift_logging_use_mux_client parameter?
From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0

There is openshift_logging_use_mux_client: False in roles/openshift_logging_fluentd/defaults/main.yml

But it cannot be found now.

ansible playbooks version:
openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch

I don't think it is necessary to test with mux, do you agree?

2. Since we can only use the file buffer now, I think the only way to verify this defect is:
1) Stop the fluentd pods for a while and restart them later.
2) Verify fluentd is still able to communicate with Elasticsearch.
3) Verify the logs stored in the file/PV can be retrieved by Kibana and no logs are missing.

Do you have a better way to verify it?

Comment 5 Rich Megginson 2017-08-29 13:15:58 UTC
(In reply to Junqi Zhao from comment #4)
> @Noriko
> 
> I have two questions:
> 1. Do we still have openshift_logging_use_mux_client parameter?
> From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
> https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0
> 
> There is openshift_logging_use_mux_client: False in
> roles/openshift_logging_fluentd/defaults/main.yml
> 
> But it can not be found now.

Right.  In 3.6.1 we got rid of that setting because there are now multiple mux client modes:

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_logging#mux---secure_forward-listener-service

"openshift_logging_mux_client_mode: Values - minimal, maximal. Default is unset. Setting this value will cause the Fluentd node agent to send logs to mux rather than directly to Elasticsearch. The value maximal means that Fluentd will do as much processing as possible at the node before sending the records to mux. This is the current recommended way to use mux due to current scaling issues. The value minimal means that Fluentd will do no processing at all, and send the raw logs to mux for processing. We do not currently recommend using this mode, and ansible will warn you about this.
"

When testing mux, use `openshift_logging_mux_client_mode=maximal`
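
For example, these settings go in the [OSEv3:vars] section of the Ansible inventory (section name per the usual openshift-ansible convention; openshift_logging_use_mux is the switch referenced later in this bug):

  [OSEv3:vars]
  openshift_logging_use_mux=true
  openshift_logging_mux_client_mode=maximal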

> 
> ansible playbooks version:
> openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch
> 
> I don't think it is necessary to test with mux, do you agree?

We would prefer testing with mux, as mux has no persistence at all and is the most vulnerable to data loss.

> 
> 2. Since we can only use the file buffer now, I think the only way to verify
> this defect is:
> 1) Stop the fluentd pods for a while and restart them later.
> 2) Verify fluentd is still able to communicate with Elasticsearch.
> 3) Verify the logs stored in the file/PV can be retrieved by Kibana and no
> logs are missing.
> 
> Do you have a better way to verify it?

Comment 6 Junqi Zhao 2017-09-01 08:43:26 UTC
Verification steps:
1. Use mux to test; set the following parameters in the inventory file:
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create one project to populate logs.

3. Stop the fluentd pods, and note down the last project logs in Kibana.

4. Wait for a while, then restart the fluentd pods.

5. Check the logs that arrived after step 3; no logs are missing.
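
A sketch of shell commands that could drive steps 3 and 4; it assumes the default "logging" namespace and the logging-infra-fluentd node label / component=fluentd selector used by the 3.x logging deployment (check the actual label and selector on the cluster first):

  # step 3: stop the fluentd pods by removing the label the daemonset selects on
  oc label nodes --all logging-infra-fluentd=false --overwrite
  oc get pods -n logging -l component=fluentd   # wait until no fluentd pods remain

  # keep the test project generating logs while fluentd is down

  # step 4: bring fluentd back
  oc label nodes --all logging-infra-fluentd=true --overwrite
  oc get pods -n logging -l component=fluentd   # wait until the pods are Running again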

Test env
# openshift version
openshift v3.6.173.0.21
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images:
logging-curator-v3.6.173.0.21-15
logging-elasticsearch-v3.6.173.0.21-15
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.6.173.0.21-15
logging-auth-proxy-v3.6.173.0.21-15

Comment 7 Junqi Zhao 2017-09-01 08:44:52 UTC
(In reply to Junqi Zhao from comment #6)
> Verification steps:
> 1. Use mux to test, set the following parameters in inventory file
> openshift_logging_use_mux=true
> openshift_logging_mux_client_mode=maximal
> 
> 2. Creat one project to populate logs.
> 
> 3. Stop fluentd pods, and note down the last project logs in kibana
> 
> 4. Wait for a while, and restart fluentd pods.
> 
> 5. Check the subsequent logs after step 3, no logs is missing.

Add step 6:
  6. Repeat steps 3 to 5 and make sure no logs are missing.
> Test env
> # openshift version
> openshift v3.6.173.0.21
> kubernetes v1.6.1+5115d708d7
> etcd 3.2.1
> 
> Images:
> logging-curator-v3.6.173.0.21-15
> logging-elasticsearch-v3.6.173.0.21-15
> logging-fluentd-v3.6.173.0.28-1
> logging-kibana-v3.6.173.0.21-15
> logging-auth-proxy-v3.6.173.0.21-15

Comment 8 Junqi Zhao 2017-09-01 09:32:26 UTC
@nhosoi
One more question: the "'exclude1' parameter is deprecated" warning in the fluentd pod log was already reported in another defect. I want to ask whether this new warn message is expected:
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

We didn't see this before.

# oc logs logging-fluentd-hnphl
umounts of dead containers will fail. Ignoring...
umount: /var/lib/docker/containers/*/shm: mountpoint not found
2017-09-01 03:28:26 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-09-01 03:28:28 -0400 [warn]: 'exclude1' parameter is deprecated: Use <exclude> section
2017-09-01 03:28:28 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
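
For context, that warning is what fluentd 0.12 prints for a buffered output configured with buffer_queue_full_action block. Assuming the shipped elasticsearch output config now sets that option (an assumption, not confirmed in this bug), the relevant fragment would look roughly like:

  <match **>
    @type elasticsearch
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es   # illustrative path
    # 'block' pauses the input side instead of dropping chunks when the file
    # buffer queue fills up, trading throughput for no data loss.
    buffer_queue_full_action block
  </match>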

Comment 9 Noriko Hosoi 2017-09-01 14:56:07 UTC
Hi Junqi,

Regarding exclude1, PR #629 has been submitted and reviewed.
Looking into the 'block' action warning next.
Thanks.

Comment 11 errata-xmlrpc 2017-10-25 13:04:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049