Description of problem:
We ran scalability tests on 4.1 HTB4, then upgraded to a newer build. After the testing we scaled the cluster down from 250 nodes to 3 worker nodes. At that point fluentd left many buffer files behind on the 3 worker nodes, which caused all pods to be evicted and left the worker nodes unusable even though they still reported Ready. Fluentd should not be able to render a node completely unusable, and one that cannot even be repaired with the oc debug node command. Is there a way to limit the amount of disk space fluentd can use for buffering, so that it cannot push a node into a stuck disk-pressure condition?

Version-Release number of selected component (if applicable):
RHCOS/OCP: 4.1 beta4/4.1.0-0.nightly-2019-04-22-005054
Installer: v4.1.0-201904211700-dirty
Upgraded version: 4.1.0-0.nightly-2019-04-22-192604

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Example data on a worker node:

[root@ip-10-0-161-121 ~]# lsblk
NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 100G  0 disk
├─nvme0n1p1 259:1    0   1M  0 part
├─nvme0n1p2 259:2    0   1G  0 part /boot
└─nvme0n1p3 259:3    0  99G  0 part /sysroot

[root@ip-10-0-161-121 ~]# du -sh /sysroot/ostree/deploy/rhcos/var/lib/fluentd/
80G  /sysroot/ostree/deploy/rhcos/var/lib/fluentd/

[root@ip-10-0-161-121 ~]# du -sh /sysroot/ostree/deploy/rhcos/var/lib/fluentd/*
65G  /sysroot/ostree/deploy/rhcos/var/lib/fluentd/buffer-output-es-config
16G  /sysroot/ostree/deploy/rhcos/var/lib/fluentd/es-retry

[root@ip-10-0-161-121 fluentd]# ls -l buffer-output-es-config/ | head -n 5
total 67161264
-rw-r--r--. 1 root root 3962146 May  7 10:22 buffer.b5884996697dd1535a3b1462b4f5745ab.log
-rw-r--r--. 1 root root      76 May  7 10:22 buffer.b5884996697dd1535a3b1462b4f5745ab.log.meta
-rw-r--r--. 1 root root 7973223 May  6 02:07 buffer.q5882e8e753f385a0cdbd7afd5fd7dc15.log
-rw-r--r--. 1 root root      76 May  6 02:07 buffer.q5882e8e753f385a0cdbd7afd5fd7dc15.log.meta

[root@ip-10-0-161-121 fluentd]# ls -l es-retry/ | head -n 5
total 16232340
-rw-r--r--. 1 root root 7970550 May  6 16:06 buffer.q5883a47297799630496fb8bee6a0cd11.log
-rw-r--r--. 1 root root      74 May  6 16:06 buffer.q5883a47297799630496fb8bee6a0cd11.log.meta
-rw-r--r--. 1 root root 7968085 May  6 16:06 buffer.q5883a472c363a9b6fdb1f6debb31f36a.log
-rw-r--r--. 1 root root      74 May  6 16:06 buffer.q5883a472c363a9b6fdb1f6debb31f36a.log.meta

The node shows these events:

Events:
  Type     Reason                Age                    From                                                 Message
  ----     ------                ----                   ----                                                 -------
  Warning  FreeDiskSpaceFailed   179m                   kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 5472906444 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed   139m                   kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990909644 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed   89m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990602444 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet  84m (x2602 over 18h)   kubelet, ip-10-0-161-121.us-west-2.compute.internal  Attempting to reclaim ephemeral-storage
  Warning  FreeDiskSpaceFailed   74m (x5 over 99m)      kubelet, ip-10-0-161-121.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4990508236 bytes, but freed 0 bytes
  Warning  ImageGCFailed         59m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4992920780 bytes, but freed 0 bytes
  Warning  ImageGCFailed         49m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4990418124 bytes, but freed 0 bytes
  Warning  ImageGCFailed         29m                    kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4998675660 bytes, but freed 0 bytes
  Warning  ImageGCFailed         8m59s                  kubelet, ip-10-0-161-121.us-west-2.compute.internal  failed to garbage collect required amount of images. Wanted to free 4998773964 bytes, but freed 0 bytes
  Warning  ImageGCFailed         3m59s (x208 over 17h)  kubelet, ip-10-0-161-121.us-west-2.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 5001112780 bytes, but freed 0 bytes

Lastly, you cannot access the node via oc debug node/$node:

[root@ip-10-0-11-245 ~]# oc debug node/ip-10-0-161-121.us-west-2.compute.internal
Starting pod/ip-10-0-161-121us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Error from server (BadRequest): container "container-00" in pod "ip-10-0-161-121us-west-2computeinternal-debug" is not available

Looks like ssh is the only option to un-wedge this condition.
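For reference, a minimal sketch of that ssh-based recovery path, assuming the node is still reachable over SSH as the core user and that the buffers live under /var/lib/fluentd on the running system, as the du output above suggests. Removing buffer chunks permanently discards any logs that have not yet been shipped to Elasticsearch, so this is strictly a last resort:

# from a host that has the cluster SSH key; core is the default RHCOS user
ssh core@ip-10-0-161-121.us-west-2.compute.internal

# confirm what is consuming the disk (same directories as the du output above)
sudo du -sh /var/lib/fluentd/*

# last resort: discards any log chunks not yet shipped to Elasticsearch
sudo rm -rf /var/lib/fluentd/buffer-output-es-config/* /var/lib/fluentd/es-retry/*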
(In reply to Alex Krzos from comment #0)
> Description of problem:
> We ran scalability tests on 4.1 HTB4, then upgraded to a newer build. After
> the testing we scaled the cluster down from 250 nodes to 3 worker nodes. At
> that point fluentd left many buffer files behind on the 3 worker nodes,
> which caused all pods to be evicted and left the worker nodes unusable even
> though they still reported Ready.

Please provide more information here:

* Was fluentd still running on the nodes that had issues?
* Is Elasticsearch still running and functional on some 'infra' node?

Fluentd buffers chunks to disk when it is unable to push to ES, or cannot push fast enough. It churns through these buffer files and cleans them up only as it processes them.
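In case it is useful, one way to gather that information, sketched under the assumption that logging runs in the default openshift-logging namespace and that the operator labels the collector and store pods with component=fluentd and component=elasticsearch (adjust the selectors if the deployment differs):

# are the fluentd collectors and the Elasticsearch pods running, and where?
oc get pods -n openshift-logging -o wide -l component=fluentd
oc get pods -n openshift-logging -o wide -l component=elasticsearch

# is the collector reporting connection or backpressure errors against Elasticsearch?
oc logs <fluentd-pod-name> -n openshift-logging | tail -n 50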
Moved back to 4.1 as bz merged into master
The toleration for node.kubernetes.io/disk-pressure has been added to the fluentd pods when deployed with openshift/ose-cluster-logging-operator:v4.1.0-201905071513. However, I don't think that can control the amount of space the fluentd buffers occupy; it only lets fluentd tolerate the disk-pressure condition rather than preventing it. To limit the disk usage, it would be better to provide standalone storage for the fluentd buffer.
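To double-check that the toleration is really present on the deployed collector, something like the following should work, assuming the daemonset is named fluentd and lives in the openshift-logging namespace:

oc get ds fluentd -n openshift-logging \
  -o jsonpath='{.spec.template.spec.tolerations}'

# the output should include the node.kubernetes.io/disk-pressure key; as noted
# above, this only affects scheduling under disk pressure, it does not cap the
# amount of disk the buffers can consume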
(In reply to Anping Li from comment #6)
> The toleration for node.kubernetes.io/disk-pressure has been added to the
> fluentd pods when deployed with
> openshift/ose-cluster-logging-operator:v4.1.0-201905071513. However, I don't
> think that can control the amount of space the fluentd buffers occupy; it
> only lets fluentd tolerate the disk-pressure condition rather than
> preventing it. To limit the disk usage, it would be better to provide
> standalone storage for the fluentd buffer.

Correct. Ideally there would be something like a "Stateful DaemonSet".

My theory is that a chain of events led to this issue: the cluster was scaling down, which increased pod density and therefore log traffic on the remaining worker nodes. The collector was probably both unable to keep up with the traffic and unable to push to ES for whatever reason. Fluentd started to buffer logs, filling up the disk, and was eventually evicted because the disk was full.
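If it helps the investigation, upstream fluentd v1.x exposes buffer limits such as total_limit_size, chunk_limit_size and overflow_action in its <buffer> sections, so it may be worth checking which limits (if any) the generated configuration applies. A sketch, assuming the collector image keeps its configuration under /etc/fluent as earlier OpenShift logging images did:

# dump the <buffer> sections of the generated configuration from a running collector
oc exec <fluentd-pod-name> -n openshift-logging -- grep -r -A 10 '<buffer' /etc/fluent/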
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758