Bug 1483114 - Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Summary: Data loss of logs can occur if fluentd pod is terminated/restarted when Elasticsearch is unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assignee: Noriko Hosoi
QA Contact: Junqi Zhao
URL:
Whiteboard: UpcomingRelease
Depends On: 1460749
Blocks: 1477513 1477515
 
Reported: 2017-08-18 18:46 UTC by Jeff Cantrill
Modified: 2021-06-10 12:51 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Messages are read into fluentd's memory buffer and are lost if the pod is restarted, because fluentd considers them read even though they have not yet been pushed to storage.
Consequence: Any message already read by fluentd but not yet stored is lost.
Fix: Replace the memory buffer with a file-based buffer.
Result: File-buffered messages are pushed to storage once fluentd restarts.
Clone Of: 1460749
Environment:
Last Closed: 2017-10-25 13:04:36 UTC
Target Upstream Version:
Embargoed:




Links
GitHub: origin-aggregated-logging/pull/559 (last updated 2020-09-18 03:30:49 UTC)
Red Hat Product Errata: RHBA-2017:3049, SHIPPED_LIVE: OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update (last updated 2017-10-25 15:57:15 UTC)

Comment 1 Jeff Cantrill 2017-08-18 18:48:06 UTC
Fixed in PR origin-aggregated-logging/pull/559.
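
For reference, the fix replaces fluentd's in-memory output buffer with a file-based one. A minimal sketch of what that looks like in a fluentd 0.12 Elasticsearch match block; the buffer path and tuning values below are illustrative, not the exact settings from the PR:

  <match **>
    @type elasticsearch
    # before the fix: buffer_type memory, so queued chunks died with the pod
    buffer_type file
    # chunks are persisted here and replayed after fluentd restarts
    buffer_path /var/lib/fluentd/buffer-output-es
    buffer_chunk_limit 8m
    buffer_queue_limit 32
    flush_interval 5s
    retry_wait 1s
  </match>

With buffer_type file, any chunk that was queued but not yet flushed to Elasticsearch survives a pod termination and is pushed once the pod comes back.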

Comment 2 Rich Megginson 2017-08-23 20:52:25 UTC
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=587924
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-23619-20170823203852
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5-10

Comment 3 Xia Zhao 2017-08-24 06:23:19 UTC
Reassigning to @juzhao as he is the Trello card owner.

Comment 4 Junqi Zhao 2017-08-29 08:47:59 UTC
@Noriko

I have two questions:
1. Do we still have openshift_logging_use_mux_client parameter?
From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0

There is openshift_logging_use_mux_client: False in roles/openshift_logging_fluentd/defaults/main.yml

But it cannot be found now.

Ansible playbooks version:
openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch

I don't think it is necessary to test with mux; do you agree?

2. Since we can only use the file buffer now, I think the only way to verify this defect is:
1) stop the fluentd pods for a while and restart them later.
2) verify fluentd is still able to communicate with Elasticsearch.
3) verify the logs stored in the file/PV can be retrieved by Kibana, and no logs are missing.

Do you have a better way to verify it?

Comment 5 Rich Megginson 2017-08-29 13:15:58 UTC
(In reply to Junqi Zhao from comment #4)
> @Noriko
> 
> I have two questions:
> 1. Do we still have openshift_logging_use_mux_client parameter?
> From PR https://github.com/openshift/openshift-ansible/pull/4554/ for
> https://bugzilla.redhat.com/show_bug.cgi?id=1464024#c0
> 
> There is openshift_logging_use_mux_client: False in
> roles/openshift_logging_fluentd/defaults/main.yml
> 
> But it cannot be found now.

Right.  In 3.6.1 we got rid of that setting because there are now multiple mux client modes:

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_logging#mux---secure_forward-listener-service

"openshift_logging_mux_client_mode: Values - minimal, maximal. Default is unset. Setting this value will cause the Fluentd node agent to send logs to mux rather than directly to Elasticsearch. The value maximal means that Fluentd will do as much processing as possible at the node before sending the records to mux. This is the current recommended way to use mux due to current scaling issues. The value minimal means that Fluentd will do no processing at all, and send the raw logs to mux for processing. We do not currently recommend using this mode, and ansible will warn you about this.
"

When testing mux, use `openshift_logging_mux_client_mode=maximal`
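
For completeness, a minimal inventory sketch for deploying mux this way, assuming the standard openshift-ansible logging variables (hosts omitted):

  [OSEv3:vars]
  openshift_logging_install_logging=true
  openshift_logging_use_mux=true
  openshift_logging_mux_client_mode=maximal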

> 
> Ansible playbooks version:
> openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch
> 
> I don't think it is necessary to test with mux; do you agree?

We would prefer testing with mux, as mux has no persistence at all, and is the most vulnerable for data loss.

> 
> 2. Since we can only use the file buffer now, I think the only way to verify
> this defect is:
> 1) stop the fluentd pods for a while and restart them later.
> 2) verify fluentd is still able to communicate with Elasticsearch.
> 3) verify the logs stored in the file/PV can be retrieved by Kibana, and no
> logs are missing.
> 
> Do you have a better way to verify it?

Comment 6 Junqi Zhao 2017-09-01 08:43:26 UTC
Verification steps:
1. Use mux to test, set the following parameters in inventory file
openshift_logging_use_mux=true
openshift_logging_mux_client_mode=maximal

2. Create one project to populate logs.

3. Stop the fluentd pods, and note down the last project logs in Kibana.

4. Wait for a while, and restart the fluentd pods.

5. Check the logs produced after step 3; no logs should be missing.
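
For reference, one way to perform steps 3 and 4 on OCP 3.6, assuming the default logging-infra-fluentd node selector on the fluentd daemonset (node names are placeholders):

  # step 3: stop fluentd on a node; the daemonset deschedules the pod
  oc label node <node-name> logging-infra-fluentd=false --overwrite
  # ... wait while the test project keeps producing logs ...
  # step 4: restart fluentd on the node
  oc label node <node-name> logging-infra-fluentd=true --overwrite
  # confirm the fluentd pod is running again
  oc get pods -n logging -l component=fluentd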

Test env
# openshift version
openshift v3.6.173.0.21
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images:
logging-curator-v3.6.173.0.21-15
logging-elasticsearch-v3.6.173.0.21-15
logging-fluentd-v3.6.173.0.28-1
logging-kibana-v3.6.173.0.21-15
logging-auth-proxy-v3.6.173.0.21-15

Comment 7 Junqi Zhao 2017-09-01 08:44:52 UTC
(In reply to Junqi Zhao from comment #6)
> Verification steps:
> 1. Use mux to test, set the following parameters in inventory file
> openshift_logging_use_mux=true
> openshift_logging_mux_client_mode=maximal
> 
> 2. Create one project to populate logs.
> 
> 3. Stop the fluentd pods, and note down the last project logs in Kibana.
> 
> 4. Wait for a while, and restart the fluentd pods.
> 
> 5. Check the logs produced after step 3; no logs should be missing.

Add step 6:
  6. Repeat steps 3 to 5, and make sure no logs are missing.
> Test env
> # openshift version
> openshift v3.6.173.0.21
> kubernetes v1.6.1+5115d708d7
> etcd 3.2.1
> 
> Images:
> logging-curator-v3.6.173.0.21-15
> logging-elasticsearch-v3.6.173.0.21-15
> logging-fluentd-v3.6.173.0.28-1
> logging-kibana-v3.6.173.0.21-15
> logging-auth-proxy-v3.6.173.0.21-15

Comment 8 Junqi Zhao 2017-09-01 09:32:26 UTC
@nhosoi
One more question: the "'exclude1' parameter is deprecated" warning in the fluentd pod log has already been reported in another defect. I want to ask whether the following warning message is expected:
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

We didn't see this before.

# oc logs logging-fluentd-hnphl
umounts of dead containers will fail. Ignoring...
umount: /var/lib/docker/containers/*/shm: mountpoint not found
2017-09-01 03:28:26 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-09-01 03:28:28 -0400 [warn]: 'exclude1' parameter is deprecated: Use <exclude> section
2017-09-01 03:28:28 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-09-01 03:28:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
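
For context, fluentd 0.12 buffered outputs emit that warning at startup whenever the buffer-full action is set to block, which fits the file-buffer change from this fix. A minimal sketch of a match block that would produce it; the values are illustrative:

  <match **>
    @type elasticsearch
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es
    # 'block' pauses input instead of dropping data when the buffer
    # queue is full; this setting is what triggers the [warn] above
    buffer_queue_full_action block
  </match>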

Comment 9 Noriko Hosoi 2017-09-01 14:56:07 UTC
Hi Junqi,

Regarding 'exclude1', PR #629 has been submitted and reviewed.
Looking into the 'block' action warning next.
Thanks.

Comment 11 errata-xmlrpc 2017-10-25 13:04:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

