Also seen with 4.8.5. Renders cluster unusable after 2 days on Power.

+++ This bug was initially created as a clone of Bug #1997062 +++

Issue observed with the following builds:
4.9.0-0.nightly-ppc64le-2021-08-17-145337
4.9.0-0.nightly-ppc64le-2021-08-19-120135

On bastion:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.3 (pvr 004e 0203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

[core@master-0 ~]$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.3 (pvr 004e 0203)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7

No workload was deployed on the cluster.

# oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
master-1   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
master-2   Ready    master   6d17h   v1.22.0-rc.0+3dfed96
worker-0   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96
worker-1   Ready    worker   6d17h   v1.22.0-rc.0+3dfed96

--- Additional comment from Alisha on 2021-08-24 13:10:32 UTC ---

[root@master-0 ~]# df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
devtmpfs     7.9G     0   7.9G    0%  /dev
tmpfs        8.0G  256K   8.0G    1%  /dev/shm
tmpfs        8.0G  7.9G   151M   99%  /run
tmpfs        8.0G     0   8.0G    0%  /sys/fs/cgroup
/dev/sda4    120G   17G   104G   14%  /sysroot
tmpfs        8.0G   64K   8.0G    1%  /tmp
/dev/sdb3    364M  233M   109M   69%  /boot
overlay      8.0G  7.9G   151M   99%  /etc/NetworkManager/systemConnectionsMerged

Note that the /run tmpfs (8.0G) is at 99% usage.

--- Additional comment from Alisha on 2021-08-24 13:25:17 UTC ---

Platform is ppc64le.
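As a quick triage step for a full /run like the one above, one could rank its immediate subdirectories by size (on affected nodes this points at /run/crio/exec-pid-dir). This is a minimal sketch of my own, not part of the original report; the helper name `top_dirs` is hypothetical, and GNU du is assumed:

```shell
#!/bin/sh
# Hypothetical helper (not from this report): print the five largest
# immediate subdirectories of a directory, largest first, in KiB.
# -x keeps du on one filesystem, which matters for a tmpfs like /run.
top_dirs() {
  du -x --max-depth=1 -k "$1" 2>/dev/null | sort -rn | head -5
}

# On an affected node one would run: top_dirs /run
```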
OS info:

On bastion:

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

CoreOS nodes:

[core@master-0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.9

--- Additional comment from Manoj Kumar on 2021-08-29 22:40:16 UTC ---

I did some more digging and found a tool to snoop on exec():
https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py

With a compiled version of the tool, I was able to correlate the new processes to the contents of /run/crio/exec-pid-dir:

[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# ./execsnoop
In file included from <built-in>:2:
In file included from /virtual/include/bcc/bpf.h:12:
In file included from include/linux/types.h:6:
In file included from include/uapi/linux/types.h:14:
In file included from include/uapi/linux/posix_types.h:5:
In file included from include/linux/stddef.h:5:
In file included from include/uapi/linux/stddef.h:2:
In file included from include/linux/compiler_types.h:74:
include/linux/compiler-clang.h:25:9: warning: '__no_sanitize_address' macro redefined [-Wmacro-redefined]
#define __no_sanitize_address
        ^
include/linux/compiler-gcc.h:213:9: note: previous definition is here
#define __no_sanitize_address __attribute__((no_sanitize_address))
        ^
1 warning generated.
PCOMM      PID      PPID     RET  ARGS
ldd        3733034  2553     0    /usr/bin/ldd /usr/bin/crio
ld64.so.2  3733035  3733034  0    /lib64/ld64.so.2 --verify /usr/bin/crio
ld64.so.2  3733038  3733037  0    /lib64/ld64.so.2 /usr/bin/crio
sh         3733039  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733039  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733040  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733040  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733041  5068     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733041  5068     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733042  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733042  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733043  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733043  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
sh         3733044  5313     0    /usr/bin/awk -F = '/partition_id/ { print $2 }' /proc/ppc64/lparcfg
awk        3733044  5313     0    /usr/bin/awk -F = /partition_id/ { print $2 } /proc/ppc64/lparcfg
md5sum     3733046  3733045  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733047  3733045  0    /usr/bin/awk {print $1}
md5sum     3733049  3733048  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733050  3733048  0    /usr/bin/awk {print $1}
sleep      3733051  3709     0    /usr/bin/sleep 1
md5sum     3733053  3733052  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733054  3733052  0    /usr/bin/awk {print $1}
md5sum     3733056  3733055  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733057  3733055  0    /usr/bin/awk {print $1}
sleep      3733058  3709     0    /usr/bin/sleep 1
runc       3733059  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe36d95a00c-82c2-4fbe-b015-b2ab89cf3303 --process /tmp/exec-process-074617211 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe        3733068  3733059  0    /proc/self/exe init
test       3733070  3733059  0    /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum     3733077  3733076  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733078  3733076  0    /usr/bin/awk {print $1}
md5sum     3733080  3733079  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733081  3733079  0    /usr/bin/awk {print $1}
sleep      3733082  3709     0    /usr/bin/sleep 1
runc       3733083  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace50931519341ee0-e912-4070-a4f8-14d9196f1352 --process /tmp/exec-process-197278878 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe        3733093  3733083  0    /proc/self/exe init
bash       3733095  3733083  0    /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl    3733101  3733095  0    /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep       3733102  3733095  0    /usr/bin/grep "health":true
awk        3733112  3733110  0    /usr/bin/awk {print $1}
md5sum     3733111  3733110  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum     3733114  3733113  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733115  3733113  0    /usr/bin/awk {print $1}
sleep      3733116  3709     0    /usr/bin/sleep 1
runc       3733117  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664df311a28-07fb-4c11-9bb9-9ccdf005adc2 --process /tmp/exec-process-392734080 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe        3733126  3733117  0    /proc/self/exe init
sh         3733131  3733117  0    /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep       3733138  3733131  0    /usr/bin/grep "health":"true"
curl       3733137  3733131  0    /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
md5sum     3733141  3733140  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733142  3733140  0    /usr/bin/awk {print $1}
md5sum     3733144  3733143  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733145  3733143  0    /usr/bin/awk {print $1}
sleep      3733146  3709     0    /usr/bin/sleep 1
md5sum     3733148  3733147  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733149  3733147  0    /usr/bin/awk {print $1}
md5sum     3733151  3733150  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733152  3733150  0    /usr/bin/awk {print $1}
sleep      3733153  3709     0    /usr/bin/sleep 1
md5sum     3733155  3733154  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733156  3733154  0    /usr/bin/awk {print $1}
md5sum     3733158  3733157  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733159  3733157  0    /usr/bin/awk {print $1}
sleep      3733160  3709     0    /usr/bin/sleep 1
runc       3733161  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe37b92324c-0714-46c1-b517-3ab9786ca397 --process /tmp/exec-process-050607721 1b2a759d965ccb9678cbae7b0dafde9537667d062555291284a1f5d6a2312fe3
exe        3733169  3733161  0    /proc/self/exe init
test       3733173  3733161  0    /usr/bin/test -f /etc/cni/net.d/80-openshift-network.conf
md5sum     3733180  3733179  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
awk        3733181  3733179  0    /usr/bin/awk {print $1}
md5sum     3733183  3733182  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733184  3733182  0    /usr/bin/awk {print $1}
sleep      3733185  3709     0    /usr/bin/sleep 1
runc       3733186  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace5093158e423623-9b1d-45d4-b3f3-63eea92c797b --process /tmp/exec-process-598901766 afb03a1f9dbb802fc1d1440388430462cfb60231b060aa3ebcbc249ace509315
exe        3733195  3733186  0    /proc/self/exe init
bash       3733197  3733186  0    /bin/bash -c set -xe\n\n# Unix sockets are used for health checks to ensure that the pod is reporting readiness of the etcd process\n# in this c
etcdctl    3733203  3733197  0    /usr/bin/etcdctl --command-timeout=2s --dial-timeout=2s --endpoints=unixs://193.168.200.231:0 endpoint health -w json
grep       3733204  3733197  0    /usr/bin/grep "health":true
awk        3733213  3733211  0    /usr/bin/awk {print $1}
md5sum     3733212  3733211  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/token
md5sum     3733215  3733214  0    /usr/bin/md5sum /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
awk        3733216  3733214  0    /usr/bin/awk {print $1}
sleep      3733217  3709     0    /usr/bin/sleep 1
runc       3733218  2553     0    /usr/bin/runc --root /run/runc exec --pid-file /var/run/crio/exec-pid-dir/612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664fd437fe6-f5b3-4f39-86b7-efb35ddcaa25 --process /tmp/exec-process-131250605 612b6a23ee3b71174d42a96117c527fef4e4d3f5800a1257216af2fb249f4664
exe        3733226  3733218  0    /proc/self/exe init
sh         3733230  3733218  0    /bin/sh -c declare -r health_endpoint="https://localhost:2379/health"\ndeclare -r cert="/var/run/secrets/etcd-client/tls.crt"\ndeclare -r key
grep       3733238  3733230  0    /usr/bin/grep "health":"true"
curl       3733237  3733230  0    /usr/bin/curl --max-time 2 --silent --cert /var/run/secrets/etcd-client/tls.crt --key /var/run/secrets/etcd-client/tls.key --cacert /var/run/configmaps/etcd-ca/ca-bundle.crt https://localhost:2379/health
^CTraceback (most recent call last):
  File "execsnoop.py", line 305, in <module>
  File "bcc/__init__.py", line 1445, in perf_buffer_poll
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "execsnoop.py", line 307, in <module>
NameError: name 'exit' is not defined
[3732956] Failed to execute script 'execsnoop' due to unhandled exception!

[root@rdr-cicd-e6b7-mon01-master-1 execsnoop]# for i in `ls -t /run/crio/exec-pid-dir | head`; do cat /run/crio/exec-pid-dir/$i; echo ' '; done
3733230
3733197
3733173
3733131
3733095
3733070
3733017
3732984
3732958
3732896

--- Additional comment from Manoj Kumar on 2021-08-30 13:08:10 UTC ---

This is being reported with 4.8.5 as well, i.e. customers who upgrade to the most recent release can potentially hit it.

--- Additional comment from Manoj Kumar on 2021-08-30 16:48:09 UTC ---

@prashanth found that this issue was introduced by https://github.com/cri-o/cri-o/pull/5136

It is fixed/reverted by:
https://github.com/cri-o/cri-o/pull/5245
https://github.com/cri-o/cri-o/pull/5262
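The loop above shows each leftover file in /run/crio/exec-pid-dir holding the PID of an exec'd process that has long since exited. A minimal sketch of that correlation (my own; the function name `stale_pidfiles` and the generic directory argument are not from the report):

```shell
#!/bin/sh
# Hypothetical helper (not from this report): for each pid-file in the
# given directory, report the files whose recorded PID is no longer
# alive -- on an affected node, that is essentially all of them.
stale_pidfiles() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    pid=$(cat "$f")
    # kill -0 probes for process existence without sending a signal
    kill -0 "$pid" 2>/dev/null || echo "$f (pid $pid exited)"
  done
}

# On an affected node one would run: stale_pidfiles /run/crio/exec-pid-dir
```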
Setting "Blocker+" for the 4.8.z release per request from the multi-arch Power team, as the bug is urgent.
We need new 4.8 release images with the latest machine-os-content (https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.8-ppc64le&release=48.84.202108301604-0#48.84.202108301604-0), which contains the latest cri-o version (1.21.2-15.rhaos4.8.gitcdc4f56.el8) with the fix for this issue.
New images are available, and 4.8.10 is updated with the cri-o fix. Moving this to ON_QA so it can be attached to the corresponding advisory.
Well, I don't have permissions to do it...
@dmistry Can you please assign to someone for verification on power?
@vvinnako or @amosingh Can you verify this?
4.8.10 seems to resolve this problem.

[root@test-hq58m-master-0 exec-pid-dir]# pwd
/run/crio/exec-pid-dir
[root@test-hq58m-master-0 exec-pid-dir]# ls | wc
      0       0       0

[root@mihawklp106 cri-o]# oc version
Client Version: 4.8.0-rc.0
Server Version: 4.8.10
Kubernetes Version: v1.21.1+9807387
[core@test-hq58m-master-0 ~]$ crio --version
crio version 1.21.2-15.rhaos4.8.gitcdc4f56.el8
Version:    1.21.2-15.rhaos4.8.gitcdc4f56.el8
GoVersion:  go1.16.6
Compiler:   gc
Platform:   linux/ppc64le
Linkmode:   dynamic
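The verification above (an empty exec-pid-dir on the fixed build) could be wrapped in a small check. This is a sketch of my own, assuming the directory path from this report; `check_pidfile_leak` is a hypothetical name, not an official tool:

```shell
#!/bin/sh
# Hypothetical check (not from this report): a node is considered clean
# when the exec pid-file directory has no leftover entries.
check_pidfile_leak() {
  dir="$1"
  n=$(ls -1 "$dir" 2>/dev/null | wc -l)
  n=$((n + 0))  # normalize any whitespace in wc output
  if [ "$n" -eq 0 ]; then
    echo "OK: no leftover pid-files in $dir"
  else
    echo "LEAK: $n leftover pid-files in $dir"
  fi
}

# On a node one would run: check_pidfile_leak /run/crio/exec-pid-dir
```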
@manokuma The bug title is about 4.8 builds, but the build IDs in the description are 4.9 builds. Which build should we look at?
@amosingh 4.8. I have already verified this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.10 packages update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3300
*** Bug 1992927 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days