In an OCP cluster on KVM, the kube-apiserver pod enters the CrashLoopBackOff state, with logs indicating a "SIGILL: illegal instruction" error.

Description of problem:
---------------------------
In the OCP cluster installed on KVM, the kube-apiserver-master-0 pod is observed in the CrashLoopBackOff state. The pod logs indicate "SIGILL: illegal instruction", as below:
--------------------------------------------------------------------------------------------------------
[root@bastion ~]# oc logs pod/kube-apiserver-master-0.m13lp36ocp.lnxne.boe
Copying system trust bundle
Waiting for port :6443 to be released.
I0212 13:24:12.609371 1 loader.go:379] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
Copying termination logs to "/var/log/kube-apiserver/termination.log"
I0212 13:24:12.611234 1 main.go:124] Touching termination lock file "/var/log/kube-apiserver/.terminating"
I0212 13:24:12.611460 1 main.go:182] Launching sub-process "/usr/bin/hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=192.168.79.21 -v=2"
SIGILL: illegal instruction
PC=0xac5f24 m=0 sigcode=1
instruction bytes: 0x30 0x78 0x63 0x30 0x30 0x30 0x31 0x39 0x36 0x30 0x30 0x30 0x7d 0x3a 0x20 0x72

goroutine 1 [running, locked to thread]:
k8s.io/kubernetes/vendor/k8s.io/api/extensions/v1beta1.init()
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/api/extensions/v1beta1/generated.pb.go:15312 +0x4 fp=0xc0000fde40 sp=0xc0000fde40 pc=0xac5f24
runtime.doInit(0x70e8000)
        /usr/lib/golang/src/runtime/proc.go:5646 +0xac fp=0xc0000fde68 sp=0xc0000fde40 pc=0x5acac
runtime.doInit(0x70f83e0)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fde90 sp=0xc0000fde68 pc=0x5ac62
runtime.doInit(0x70e6c00)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdeb8 sp=0xc0000fde90 pc=0x5ac62
runtime.doInit(0x70e6d40)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdee0 sp=0xc0000fdeb8 pc=0x5ac62
runtime.doInit(0x70e2640)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf08 sp=0xc0000fdee0 pc=0x5ac62
runtime.doInit(0x70cb040)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf30 sp=0xc0000fdf08 pc=0x5ac62
runtime.doInit(0x70c63c0)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf58 sp=0xc0000fdf30 pc=0x5ac62
runtime.doInit(0x70d1da0)
        /usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf80 sp=0xc0000fdf58 pc=0x5ac62
runtime.main()
        /usr/lib/golang/src/runtime/proc.go:191 +0x1c2 fp=0xc0000fdfd8 sp=0xc0000fdf80 pc=0x4c9a2
runtime.goexit()
        /usr/lib/golang/src/runtime/asm_s390x.s:779 +0x2 fp=0xc0000fdfd8 sp=0xc0000fdfd8 pc=0x831b2

goroutine 6 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).flushDaemon(0x71a2080)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1169 +0x90
created by k8s.io/kubernetes/vendor/k8s.io/klog/v2.init.0
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:417 +0xf8

r0   0x0
r1   0x70e8000
r2   0x80
r3   0xac5f20
r4   0x0
r5   0x0
r6   0xc0002138f0
r7   0x2
r8   0x0
r9   0x8
r10  0x60
r11  0x1
r12  0x70e8080
r13  0xc000000180
r14  0x5acac
r15  0xc0000fde40
pc   0xac5f24
link 0x5acac
--------------------------------------------------------------------------------------------------------

I have made available the logs of oc adm must-gather as well as the logs of dmesg and dbginfo.sh from node master-0 at
https://ibm.box.com/s/1yqr1odc187bxfra74dzbfmbux7t6tuh for your reference (please ping me your email ID so that I can give you access to the shared folder). I have also collected the dmesg and dbginfo.sh logs for the other cluster nodes; please let me know if those would be of interest, so that I can make them available to you as well.

Version-Release number of selected component (if applicable):
-----------------------------------------------------------------
[root@bastion ~]# oc version
Client Version: 4.7.0-fc.2
Server Version: 4.7.0-fc.2
Kubernetes Version: v1.20.0+394a5a3

How reproducible:
------------------
At times. The CrashLoopBackOff state is observed most often with the openshift-marketplace pod, which shows a high restart count, e.g.:
# oc get pods -A | grep CrashLoopBackOff
openshift-marketplace    redhat-operators-ht2sc    0/1    CrashLoopBackOff    385    47h

Steps to Reproduce:
-----------------------
1. Install an OCP 4.7.0-fc.2 cluster in a KVM environment.
2. Run a CPU stressor on a worker node for more than 24 hours. The stress-ng tool was used in this exercise, e.g. stress-ng --cpu 0 --cpu-load 100 (a sketch of one possible way to deploy such stressor pods is shown at the end of this report).
[root@bastion ~]# oc get pods
NAME                                                 READY   STATUS    AGE
cpu-stress-test-3p-resource-limit-7c7f56d7f6-jcq2g   1/1     Running   24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-kqsdn   1/1     Running   24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-l697j   1/1     Running   22h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-pcmck   0/1     Pending   22h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-rd4pk   1/1     Running   24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-swtmd   1/1     Running   22h
3. Watch the status of all the pods in the cluster. At first, redhat-operators-ht2sc goes into CrashLoopBackOff with a high restart count:
[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-marketplace    redhat-operators-ht2sc    0/1    CrashLoopBackOff    385    47h
Other pods, such as the csi-snapshot-controller in openshift-cluster-storage-operator, are at times also noticed to crash at this point:
[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-cluster-storage-operator   csi-snapshot-controller-5d45bf95b5-2qf8g   0/1   CrashLoopBackOff   14    23d
openshift-marketplace                redhat-operators-ht2sc                     0/1   CrashLoopBackOff   385   47h
4. Watch the kube-apiserver pods on the master nodes restart; the restart of kube-apiserver does not succeed and the pods move to the CrashLoopBackOff state, e.g.:
[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-cluster-storage-operator   csi-snapshot-controller-5d45bf95b5-2qf8g       0/1   CrashLoopBackOff   14    23d
openshift-kube-apiserver             kube-apiserver-master-0.m13lp36ocp.lnxne.boe   3/5   CrashLoopBackOff   20    45m
openshift-marketplace                redhat-operators-ht2sc                         0/1   CrashLoopBackOff   385   47h
5. Delete the stressor pods to let the cluster recover; the openshift-marketplace and openshift-cluster-storage-operator pods return to a healthy state. kube-apiserver-master-2.m13lp36ocp.lnxne.boe is observed to recover slowly, but kube-apiserver-master-0.m13lp36ocp.lnxne.boe remains in the CrashLoopBackOff state.

Actual results:
------------------
Please refer to the section above.

Expected results:
--------------------
None of the pods should be observed in the CrashLoopBackOff state.

Additional info:
---------------------
On ssh-ing to the master nodes, the below error message was noticed on all of the master nodes:
[root@bastion ~]# ssh core.lnxne.boe
---
Last login: Fri Feb 12 10:16:58 2021 from 192.168.79.1
[systemd] Failed Units: 1
  multipathd.socket
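For reference, below is a minimal sketch of one possible way to create the cpu-stress-test-3p-resource-limit stressor deployment mentioned in step 2 using the oc CLI. The container image name (quay.io/example/stress-ng) and the cpu=3/memory=512Mi limits are assumptions for illustration only, not the exact values used in the original test; any image that ships the stress-ng binary would do.

# Create a deployment whose pods run stress-ng at 100% load on every
# available CPU (the image name below is a placeholder, not the one actually used):
oc create deployment cpu-stress-test-3p-resource-limit \
    --image=quay.io/example/stress-ng \
    -- stress-ng --cpu 0 --cpu-load 100

# Scale out to six stressor pods, matching the pod listing in step 2:
oc scale deployment/cpu-stress-test-3p-resource-limit --replicas=6

# Cap each stressor pod at roughly 3 CPUs (a guess based on the "3p" in the
# pod name) so it competes with, rather than completely starves, other pods:
oc set resources deployment/cpu-stress-test-3p-resource-limit \
    --limits=cpu=3,memory=512Mi

Removing the stressor afterwards, as in step 5, would then simply be: oc delete deployment cpu-stress-test-3p-resource-limit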
I don't think a Go program can cause a "SIGILL: illegal instruction" on a working compiler toolchain. Moving to the golang team.
Hi @Derek, I have shared the must-gather, dmesg, and dbginfo.sh logs with you as per the needinfo flag. Please let me know if you need any other information. Thanks!
Hi team, I have not been able to reproduce the bug on my side so far.
This behavior has not been reproduced since the initial report, so we are closing the bug due to insufficient data. Please feel free to file another bug if the issue arises again.