Bug 1707465

Summary: Fluentd wedged nodes on 4.1 cluster
Product: OpenShift Container Platform
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: 4.1.0
Reporter: Alex Krzos <akrzos>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
Docs Contact:
CC: aos-bugs, ewolinet, pweil, rmeggins
Flags: akrzos: needinfo-
Whiteboard: aos-scalability-41
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:34 UTC
Type: Bug

Description Alex Krzos 2019-05-07 15:10:51 UTC
Description of problem:
We ran scalability tests on 4.1 HTB4, then upgraded to a newer build.  After the testing we scaled the cluster down from 250 nodes to 3 worker nodes.  At that point fluentd left many buffer files behind on the 3 remaining worker nodes, which drove them into disk pressure, evicted all pods, and left the nodes useless even though they still reported Ready.

Fluentd should not be able to render a node completely useless, and one that cannot even be repaired with the oc debug node command.  Is there a way to limit how much space fluentd can use for buffering, so it cannot leave a node stuck in the disk-pressure condition?
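For reference, upstream fluentd (v1.x) can cap the on-disk size of a file buffer per output with total_limit_size and an overflow_action inside the <buffer> section of the output configuration.  Whether the 4.1 logging operator exposes these knobs is exactly the open question; the stanza below is only a sketch of the upstream syntax with illustrative values, not the configuration the operator actually generates:

  <match **>
    @type elasticsearch
    # ... connection settings elided ...
    <buffer>
      @type file
      path /var/lib/fluentd/buffer-output-es-config
      chunk_limit_size 8m                  # size of each chunk file on disk
      total_limit_size 2g                  # hard cap for the whole buffer directory
      overflow_action drop_oldest_chunk    # shed load instead of filling the disk
    </buffer>
  </match>

With a cap like that in place, fluentd sheds or blocks new chunks once the limit is reached instead of growing the buffer directory to tens of gigabytes.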

Version-Release number of selected component (if applicable):
RHCOS/OCP: 4.1 beta4/4.1.0-0.nightly-2019-04-22-005054
Installer: v4.1.0-201904211700-dirty
Upgraded version: 4.1.0-0.nightly-2019-04-22-192604


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


Example data on a worker node:

[root@ip-10-0-161-121 ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  100G  0 disk 
├─nvme0n1p1 259:1    0    1M  0 part 
├─nvme0n1p2 259:2    0    1G  0 part /boot
└─nvme0n1p3 259:3    0   99G  0 part /sysroot
[root@ip-10-0-161-121 ~]# du -sh /sysroot/ostree/deploy/rhcos/var/lib/fluentd/
80G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd/
[root@ip-10-0-161-121 ~]# du -sh /sysroot/ostree/deploy/rhcos/var/lib/fluentd/*
65G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd/buffer-output-es-config
16G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd/es-retry

[root@ip-10-0-161-121 fluentd]# ls -l buffer-output-es-config/ | head -n 5
total 67161264
-rw-r--r--. 1 root root 3962146 May  7 10:22 buffer.b5884996697dd1535a3b1462b4f5745ab.log
-rw-r--r--. 1 root root      76 May  7 10:22 buffer.b5884996697dd1535a3b1462b4f5745ab.log.meta
-rw-r--r--. 1 root root 7973223 May  6 02:07 buffer.q5882e8e753f385a0cdbd7afd5fd7dc15.log
-rw-r--r--. 1 root root      76 May  6 02:07 buffer.q5882e8e753f385a0cdbd7afd5fd7dc15.log.meta
[root@ip-10-0-161-121 fluentd]# ls -l es-retry/ | head -n 5                                                                                                                                                                                      
total 16232340
-rw-r--r--. 1 root root 7970550 May  6 16:06 buffer.q5883a47297799630496fb8bee6a0cd11.log
-rw-r--r--. 1 root root      74 May  6 16:06 buffer.q5883a47297799630496fb8bee6a0cd11.log.meta
-rw-r--r--. 1 root root 7968085 May  6 16:06 buffer.q5883a472c363a9b6fdb1f6debb31f36a.log
-rw-r--r--. 1 root root      74 May  6 16:06 buffer.q5883a472c363a9b6fdb1f6debb31f36a.log.meta

The node will show events:

Events:
  Type     Reason                Age                    From                                                 Message
  ----     ------                ----                   ----                                                 -------
  Warning  FreeDiskSpaceFailed   179m                   kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 5472906444 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed   139m                   kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990909644 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed   89m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990602444 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet  84m (x2602 over 18h)   kubelet, ip-10-0-161-121.us-west-2.compute.internal  Attempting to reclaim ephemeral-storage
  Warning  FreeDiskSpaceFailed   74m (x5 over 99m)      kubelet, ip-10-0-161-121.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4990508236 bytes, but freed 0 bytes
  Warning  ImageGCFailed         59m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4992920780 bytes, but freed 0 bytes
  Warning  ImageGCFailed         49m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990418124 bytes, but freed 0 bytes
  Warning  ImageGCFailed         29m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4998675660 bytes, but freed 0 bytes
  Warning  ImageGCFailed         8m59s                  kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4998773964 bytes, but freed 0 bytes
  Warning  ImageGCFailed         3m59s (x208 over 17h)  kubelet, ip-10-0-161-121.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5001112780 bytes, but freed 0 bytes
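The matching node condition can be confirmed directly (standard oc output formatting, nothing specific to this bug):

  # Does the node report DiskPressure=True?
  oc get node ip-10-0-161-121.us-west-2.compute.internal \
      -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'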

Lastly, you cannot access the node via oc debug node/$node:
[root@ip-10-0-11-245 ~]# oc debug node/ip-10-0-161-121.us-west-2.compute.internal
Starting pod/ip-10-0-161-121us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Error from server (BadRequest): container "container-00" in pod "ip-10-0-161-121us-west-2computeinternal-debug" is not available


It looks like SSH is the only way to un-wedge a node in this condition.
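For anyone hitting the same state, a rough sketch of the manual recovery over SSH, assuming the buffered (and by now stale) logs can be discarded; the core user and the /var/lib/fluentd path are just the default RHCOS/logging layout:

  ssh core@ip-10-0-161-121.us-west-2.compute.internal

  # confirm what is consuming the disk, then drop the stale buffer chunks
  sudo du -sh /var/lib/fluentd/*
  sudo rm -rf /var/lib/fluentd/buffer-output-es-config /var/lib/fluentd/es-retry

Once space is freed the kubelet should clear the disk-pressure condition on its own, and restarting (or deleting) the collector pod on that node gives it a clean buffer directory.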

Comment 1 Jeff Cantrill 2019-05-07 15:33:02 UTC
(In reply to Alex Krzos from comment #0)
> Description of problem:

> We ran scalability tests on 4.1 HTB4, then upgraded to a newer build.  After
> the testing we scaled the cluster down from 250 nodes to 3 worker nodes.  At
> this point fluentd ended up leaving many buffer files on the 3 worker nodes
> which ended up evicting all pods and rendering the worker nodes useless
> however in Ready state.

Please provide more information here:

* Was fluentd still running on the nodes that had issues?
* Is Elasticsearch still running and functional on some 'infra' node?

Fluentd writes buffer files when it is unable to push to ES, or unable to push fast enough.  It works through these files and only cleans them up as it processes them.
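For reference, the quickest way to answer both questions, assuming the default openshift-logging namespace and daemonset name:

  # Are the collector pods on the affected workers still Running, or evicted?
  oc -n openshift-logging get pods -o wide | grep fluentd
  oc -n openshift-logging get ds fluentd        # desired vs. ready collector count

  # Is Elasticsearch up on the infra nodes?
  oc -n openshift-logging get pods -o wide | grep elasticsearch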

Comment 4 Jeff Cantrill 2019-05-08 01:52:22 UTC
Moved back to 4.1 as bz merged into master

Comment 6 Anping Li 2019-05-08 15:50:34 UTC
A toleration for node.kubernetes.io/disk-pressure has been added to fluentd when it is deployed with openshift/ose-cluster-logging-operator:v4.1.0-201905071513.  But I don't think that can control the amount of space the fluentd buffer occupies; it just provides a way to keep fluentd from wedging the node.  To control the disk usage, it would be better to provide standalone storage for the fluentd buffer.
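For the record, the new toleration can be verified on the daemonset (the name fluentd is the default one the operator creates):

  oc -n openshift-logging get ds fluentd \
      -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'

It should now list an entry for node.kubernetes.io/disk-pressure.  As noted above, that only keeps the collector scheduled when a node comes under disk pressure; it does not bound the size of the buffer itself.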

Comment 7 Jeff Cantrill 2019-05-08 18:29:54 UTC
(In reply to Anping Li from comment #6)
> A toleration for node.kubernetes.io/disk-pressure has been added to fluentd
> when it is deployed with
> openshift/ose-cluster-logging-operator:v4.1.0-201905071513. But I don't
> think that can control the amount of space the fluentd buffer occupies; it
> just provides a way to keep fluentd from wedging the node. To control the
> disk usage, it would be better to provide standalone storage for the
> fluentd buffer.

Correct.  Ideally there would be something like a "Stateful DaemonSet".  My theory is that a chain of events likely led to this issue: the cluster was scaling down, which increased pod density and therefore the log traffic on the remaining worker nodes.  The collector was probably both unable to keep up with that traffic and unable to push to ES for whatever reason.  Fluentd started buffering logs, filling up the disk, and was eventually evicted because the disk was getting full.

Comment 9 errata-xmlrpc 2019-06-04 10:48:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758