Also seen with 4.8.5. Renders cluster unusable after 2 days on Power.

+++ This bug was initially created as a clone of Bug #1997062 +++

Issue observed with the following builds:
4.9.0-0.nightly-ppc64le-2021-08-17-145337
4.9.0-0.nightly-ppc64le-2021-08-19-120135

On bastion:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.3 (pvr 004e 0203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

[core@master-0 ~]$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.3 (pvr 004e 0203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

No workload was deployed on the cluster.

# oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
master-1   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
master-2   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
worker-0   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96
worker-1   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96

--- Additional comment from Alisha on 2021-08-24 13:10:32 UTC ---

[root@master-0 ~]# df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
devtmpfs     7.9G     0   7.9G    0%  /dev
tmpfs        8.0G  256K   8.0G    1%  /dev/shm
tmpfs        8.0G  7.9G   151M   99%  /run
tmpfs        8.0G     0   8.0G    0%  /sys/fs/cgroup
/dev/sda4    120G   17G   104G   14%  /sysroot
tmpfs        8.0G   64K   8.0G    1%  /tmp
/dev/sdb3    364M  233M   109M   69%  /boot
overlay      8.0G  7.9G   151M   99%  /etc/NetworkManager/systemConnectionsMerged

Note that the /run tmpfs (8.0G) is at 99% usage.

--- Additional comment from Alisha on 2021-08-24 13:25:17 UTC ---

Platform is ppc64le.
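As a quick triage step for a full /run like the one above, one could rank its immediate subdirectories by size (on affected nodes this points at /run/crio/exec-pid-dir). This is a minimal sketch of my own, not part of the original report; the helper name `top_dirs` is hypothetical, and GNU du is assumed:

```shell
#!/bin/sh
# Hypothetical helper (not from this report): print the five largest
# immediate subdirectories of a directory, largest first, in KiB.
# -x keeps du on one filesystem, which matters for a tmpfs like /run.
top_dirs() {
  du -x --max-depth=1 -k "$1" 2>/dev/null | sort -rn | head -5
}

# On an affected node one would run: top_dirs /run
```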
OS info:

On bastion:

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

CoreOS nodes:

[core@master-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.9

--- Additional comment from Manoj Kumar on 2021-08-29 22:40:16 UTC ---

I did some more digging and found a tool to snoop on exec():
https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py

With a compiled version of the tool, I was able to correlate the new processes to the contents of /run/crio/exec-pid-dir:

[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# ./execsnoop
In file included from <built-in>:2:
In file included from /virtual/include/bcc/bpf.h:12:
In file included from include/linux/types.h:6:
In file included from include/uapi/linux/types.h:14:
In file included from include/uapi/linux/posix_types.h:5:
In file included from include/linux/stddef.h:5:
In file included from include/uapi/linux/stddef.h:2:
In file included from include/linux/compiler_types.h:74:
include/linux/compiler-clang.h:25:9: warning: '__no_sanitize_address' macro redefined [-Wmacro-redefined]
#define __no_sanitize_address
        ^
include/linux/compiler-gcc.h:213:9: note: previous definition is here
#define __no_sanitize_address __attribute__((no_sanitize_address))
        ^
1 warning generated.
PCOMM      PID      PPID     RET  ARGS
ldd        3733034  2553     0    /usr/bin/ldd /usr/bin/crio
ld64.so.2  3733035  3733034  0    /lib64/ld64.so.2 --verify /usr/bin/crio
ld64.so.2  3733038  3733037  0    /lib64/ld64.so.2 /usr/bin/crio
sh         3733039  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733039  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733040  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733040  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733041  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733041  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733042  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733042  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733043  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733043  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733044  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733044  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
md5sum     3733046  3733045  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733047  3733045  0    /usr/bin/awk {print $1}
md5sum     3733049  3733048  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733050  3733048  0    /usr/bin/awk {print $1}
sleep      3733051  3709     0    /usr/bin/sleep 1
md5sum     3733053  3733052  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733054  3733052  0    /usr/bin/awk {print $1}
md5sum     3733056  3733055  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733057  3733055  0    /usr/bin/awk {print $1}
sleep      3733058  3709     0    /usr/bin/sleep 1
runc       3733059  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe36d95a00c-82c2-4fbe-b015-b2ab89cf3303 --process /tmp/exec-process-074617211 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe        3733068  3733059  0    /proc/self/exe init
test       3733070  3733059  0    /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum     3733077  3733076  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733078  3733076  0    /usr/bin/awk {print $1}
md5sum     3733080  3733079  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733081  3733079  0    /usr/bin/awk {print $1}
sleep      3733082  3709     0    /usr/bin/sleep 1
runc       3733083  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace50931519341ee0-e912-4070-a4f8-14d9196f1352 --process /tmp/exec-process-197278878 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe        3733093  3733083  0    /proc/self/exe init
bash       3733095  3733083  0    /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl    3733101  3733095  0    /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep       3733102  3733095  0    /usr/bin/grep "health":true
awk        3733112  3733110  0    /usr/bin/awk {print $1}
md5sum     3733111  3733110  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum     3733114  3733113  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733115  3733113  0    /usr/bin/awk {print $1}
sleep      3733116  3709     0    /usr/bin/sleep 1
runc       3733117  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664df311a28-07fb-4c11-9bb9-9ccdf005adc2 --process /tmp/exec-process-392734080 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe        3733126  3733117  0    /proc/self/exe init
sh         3733131  3733117  0    /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep       3733138  3733131  0    /usr/bin/grep "health":"true"
curl       3733137  3733131  0    /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
md5sum     3733141  3733140  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733142  3733140  0    /usr/bin/awk {print $1}
md5sum     3733144  3733143  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733145  3733143  0    /usr/bin/awk {print $1}
sleep      3733146  3709     0    /usr/bin/sleep 1
md5sum     3733148  3733147  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733149  3733147  0    /usr/bin/awk {print $1}
md5sum     3733151  3733150  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733152  3733150  0    /usr/bin/awk {print $1}
sleep      3733153  3709     0    /usr/bin/sleep 1
md5sum     3733155  3733154  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733156  3733154  0    /usr/bin/awk {print $1}
md5sum     3733158  3733157  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733159  3733157  0    /usr/bin/awk {print $1}
sleep      3733160  3709     0    /usr/bin/sleep 1
runc       3733161  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe37b92324c-0714-46c1-b517-3ab9786ca397 --process /tmp/exec-process-050607721 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe        3733169  3733161  0    /proc/self/exe init
test       3733173  3733161  0    /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum     3733180  3733179  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733181  3733179  0    /usr/bin/awk {print $1}
md5sum     3733183  3733182  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733184  3733182  0    /usr/bin/awk {print $1}
sleep      3733185  3709     0    /usr/bin/sleep 1
runc       3733186  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace5093158e423623-9b1d-45d4-b3f3-63eea92c797b --process /tmp/exec-process-598901766 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe        3733195  3733186  0    /proc/self/exe init
bash       3733197  3733186  0    /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl    3733203  3733197  0    /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep       3733204  3733197  0    /usr/bin/grep "health":true
awk        3733213  3733211  0    /usr/bin/awk {print $1}
md5sum     3733212  3733211  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum     3733215  3733214  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733216  3733214  0    /usr/bin/awk {print $1}
sleep      3733217  3709     0    /usr/bin/sleep 1
runc       3733218  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664fd437fe6-f5b3-4f39-86b7-efb35ddcaa25 --process /tmp/exec-process-131250605 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe        3733226  3733218  0    /proc/self/exe init
sh         3733230  3733218  0    /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep       3733238  3733230  0    /usr/bin/grep "health":"true"
curl       3733237  3733230  0    /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
^CTraceback (most recent call last):
  File "execsnoop.py", line 305, in <module>
  File "bcc/__init__.py", line 1445, in perf_buffer_poll
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "execsnoop.py", line 307, in <module>
NameError: name 'exit' is not defined
[3732956] Failed to execute script 'execsnoop' due to unhandled exception!

[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# for i in `ls -t /run/crio/exec-pid-dir | head`; do cat /run/crio/exec-pid-dir/$i; echo ' '; done
3733230
3733197
3733173
3733131
3733095
3733070
3733017
3732984
3732958
3732896

--- Additional comment from Manoj Kumar on 2021-08-30 13:08:10 UTC ---

This is being reported with 4.8.5 as well, i.e. customers who upgrade to the most recent release can potentially hit it.

--- Additional comment from Manoj Kumar on 2021-08-30 16:48:09 UTC ---

@prashanth found that this issue was introduced by https://github.com/cri-o/cri-o/pull/5136

It is fixed/reverted by:
https://github.com/cri-o/cri-o/pull/5245
https://github.com/cri-o/cri-o/pull/5262
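The loop above shows each leftover file in /run/crio/exec-pid-dir holding the PID of an exec'd process that has long since exited. A minimal sketch of that correlation (my own; the function name `stale_pidfiles` and the generic directory argument are not from the report):

```shell
#!/bin/sh
# Hypothetical helper (not from this report): for each pid-file in the
# given directory, report the files whose recorded PID is no longer
# alive -- on an affected node, that is essentially all of them.
stale_pidfiles() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    pid=$(cat "$f")
    # kill -0 probes for process existence without sending a signal
    kill -0 "$pid" 2>/dev/null || echo "$f (pid $pid exited)"
  done
}

# On an affected node one would run: stale_pidfiles /run/crio/exec-pid-dir
```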
Setting "Blocker+" for the 4.8.z release per request from the multi-arch Power team, as the bug is urgent.
We need new 4.8 release images with the latest machine-os-content (https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.8-ppc64le&release=48.84.202108301604-0#48.84.202108301604-0), which contains the latest cri-o version (1.21.2-15.rhaos4.8.gitcdc4f56.el8) with the fix for this issue.
New images are available, and 4.8.10 is updated with the cri-o fix. Moving this to ON_QA so it can be attached to the corresponding advisory.
Well, I don't have permissions to do it...
@dmistry Can you please assign to someone for verification on power?
@vvinnako or @amosingh Can you verify this?
4.8.10 seems to resolve this problem.

[root@test-hq58m-master-0 exec-pid-dir]# pwd
/run/crio/exec-pid-dir
[root@test-hq58m-master-0 exec-pid-dir]# ls | wc
      0       0       0

[root@mihawklp106 cri-o]# oc version
Client Version: 4.8.0-rc.0
Server Version: 4.8.10
Kubernetes Version: v1.21.1+9807387
[core@test-hq58m-master-0 ~]$ crio --version
crio version 1.21.2-15.rhaos4.8.gitcdc4f56.el8
Version:    1.21.2-15.rhaos4.8.gitcdc4f56.el8
GoVersion:  go1.16.6
Compiler:   gc
Platform:   linux/ppc64le
Linkmode:   dynamic
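The verification above (an empty exec-pid-dir on the fixed build) could be wrapped in a small check. This is a sketch of my own, assuming the directory path from this report; `check_pidfile_leak` is a hypothetical name, not an official tool:

```shell
#!/bin/sh
# Hypothetical check (not from this report): a node is considered clean
# when the exec pid-file directory has no leftover entries.
check_pidfile_leak() {
  dir="$1"
  n=$(ls -1 "$dir" 2>/dev/null | wc -l)
  n=$((n + 0))  # normalize any whitespace in wc output
  if [ "$n" -eq 0 ]; then
    echo "OK: no leftover pid-files in $dir"
  else
    echo "LEAK: $n leftover pid-files in $dir"
  fi
}

# On a node one would run: check_pidfile_leak /run/crio/exec-pid-dir
```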
@manokuma The bug title is about 4.8 builds, but the build IDs in the description are 4.9 builds. Which build should we look at?
@amosingh 4.8. I have already verified this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.10 packages update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3300
*** Bug 1992927 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days