Bug 1928192 - In an OCP cluster on KVM, the kube-apiserver pod enters a CrashLoopBackOff state with logs indicating a "SIGILL: illegal instruction" error.
Summary: In an OCP cluster on KVM, the kube-apiserver pod enters a CrashLoopBackOff state with logs indicating a "SIGILL: illegal instruction" error.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: DevTools
Classification: Red Hat
Component: golang
Version: unspecified
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2019.3
Assignee: Derek Parker
QA Contact: Martin Cermak
URL:
Whiteboard:
Depends On:
Blocks: ocp-47-z-tracker
 
Reported: 2021-02-12 15:43 UTC by Lakshmi Ravichandran
Modified: 2021-05-10 15:29 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-10 15:29:16 UTC
Target Upstream Version:
Embargoed:



Description Lakshmi Ravichandran 2021-02-12 15:43:07 UTC
In an OCP cluster on KVM, the kube-apiserver pod enters a CrashLoopBackOff state with logs indicating a "SIGILL: illegal instruction" error.

Description of problem:
---------------------------
In the OCP cluster installed on KVM, the kube-apiserver-master-0 pod is observed in a CrashLoopBackOff state.
The pod logs indicate "SIGILL: illegal instruction", as shown below:
--------------------------------------------------------------------------------------------------------
[root@bastion ~]# oc logs pod/kube-apiserver-master-0.m13lp36ocp.lnxne.boe
Copying system trust bundle
Waiting for port :6443 to be released.
I0212 13:24:12.609371       1 loader.go:379] Config loaded from file:  /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
Copying termination logs to "/var/log/kube-apiserver/termination.log"
I0212 13:24:12.611234       1 main.go:124] Touching termination lock file "/var/log/kube-apiserver/.terminating"
I0212 13:24:12.611460       1 main.go:182] Launching sub-process "/usr/bin/hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=192.168.79.21 -v=2"
SIGILL: illegal instruction
PC=0xac5f24 m=0 sigcode=1
instruction bytes: 0x30 0x78 0x63 0x30 0x30 0x30 0x31 0x39 0x36 0x30 0x30 0x30 0x7d 0x3a 0x20 0x72

goroutine 1 [running, locked to thread]:
k8s.io/kubernetes/vendor/k8s.io/api/extensions/v1beta1.init()
	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/api/extensions/v1beta1/generated.pb.go:15312 +0x4 fp=0xc0000fde40 sp=0xc0000fde40 pc=0xac5f24
runtime.doInit(0x70e8000)
	/usr/lib/golang/src/runtime/proc.go:5646 +0xac fp=0xc0000fde68 sp=0xc0000fde40 pc=0x5acac
runtime.doInit(0x70f83e0)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fde90 sp=0xc0000fde68 pc=0x5ac62
runtime.doInit(0x70e6c00)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdeb8 sp=0xc0000fde90 pc=0x5ac62
runtime.doInit(0x70e6d40)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdee0 sp=0xc0000fdeb8 pc=0x5ac62
runtime.doInit(0x70e2640)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf08 sp=0xc0000fdee0 pc=0x5ac62
runtime.doInit(0x70cb040)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf30 sp=0xc0000fdf08 pc=0x5ac62
runtime.doInit(0x70c63c0)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf58 sp=0xc0000fdf30 pc=0x5ac62
runtime.doInit(0x70d1da0)
	/usr/lib/golang/src/runtime/proc.go:5641 +0x62 fp=0xc0000fdf80 sp=0xc0000fdf58 pc=0x5ac62
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:191 +0x1c2 fp=0xc0000fdfd8 sp=0xc0000fdf80 pc=0x4c9a2
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_s390x.s:779 +0x2 fp=0xc0000fdfd8 sp=0xc0000fdfd8 pc=0x831b2

goroutine 6 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).flushDaemon(0x71a2080)
	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1169 +0x90
created by k8s.io/kubernetes/vendor/k8s.io/klog/v2.init.0
	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:417 +0xf8

r0   0x0	r1   0x70e8000
r2   0x80	r3   0xac5f20
r4   0x0	r5   0x0
r6   0xc0002138f0	r7   0x2
r8   0x0	r9   0x8
r10  0x60	r11  0x1
r12  0x70e8080	r13  0xc000000180
r14  0x5acac	r15  0xc0000fde40
pc   0xac5f24	link 0x5acac
--------------------------------------------------------------------------------------------------------
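
For reference, the instruction bytes reported in the dump above can be re-examined offline; below is a minimal sketch (plain bash/coreutils, not part of the original report) that re-emits those 16 bytes and dumps them as hex plus printable characters:

# not from the report: re-emit the bytes captured at the faulting PC and dump them
printf '\x30\x78\x63\x30\x30\x30\x31\x39\x36\x30\x30\x30\x7d\x3a\x20\x72' | od -A x -t x1z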

I have made the logs of oc adm must-gather, as well as the dmesg and dbginfo.sh logs from node master-0, available at https://ibm.box.com/s/1yqr1odc187bxfra74dzbfmbux7t6tuh for your reference (please ping me your email ID so that I can give you access to the shared folder).
I have also collected the dmesg and dbginfo.sh logs for the other cluster nodes.
Please let me know if those would be of interest, so that I can make them available to you as well.
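
For reference, the must-gather archive mentioned above is the kind of output produced by a command along these lines (the destination directory name is just an example, not taken from this report):

# collect cluster diagnostics into a local directory
oc adm must-gather --dest-dir=./must-gather-m13lp36ocp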

Version-Release number of selected component (if applicable):
-----------------------------------------------------------------
[root@bastion ~]# oc version
Client Version: 4.7.0-fc.2
Server Version: 4.7.0-fc.2
Kubernetes Version: v1.20.0+394a5a3

How reproducible:
------------------
Intermittently.

Most often, a pod in the openshift-marketplace namespace is observed in CrashLoopBackOff with a high restart count, e.g.:
# oc get pods -A | grep CrashLoopBackOff
openshift-marketplace   redhat-operators-ht2sc        0/1     CrashLoopBackOff   385        47h


Steps to Reproduce:
-----------------------
1. Install an OCP 4.7.0-fc.2 cluster in a KVM environment.

2. Run a CPU stressor on a worker node for more than 24 hours. The stress-ng tool was used in this exercise, e.g.: stress-ng --cpu 0 --cpu-load 100 (a rough single-pod equivalent is sketched after the pod listing below).
[root@bastion ~]# oc get pods
NAME                                                 READY   STATUS      AGE
cpu-stress-test-3p-resource-limit-7c7f56d7f6-jcq2g   1/1     Running     24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-kqsdn   1/1     Running     24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-l697j   1/1     Running     22h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-pcmck   0/1     Pending     22h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-rd4pk   1/1     Running     24h
cpu-stress-test-3p-resource-limit-7c7f56d7f6-swtmd   1/1     Running     22h
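
The stressor pods listed above come from a pre-existing deployment; as a rough single-pod equivalent (the image reference, pod name, and namespace are placeholders, not taken from this report), something like the following could be used:

# not from the report: launch one ad-hoc stress-ng pod at 100% load on all CPUs
# <stress-ng-image> and <test-namespace> are placeholders
oc run cpu-stress --image=<stress-ng-image> --restart=Never -n <test-namespace> --command -- \
  stress-ng --cpu 0 --cpu-load 100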

3. Watch the status of all the pods in the cluster (a simple polling command is sketched after the listing below).
At first, redhat-operators-ht2sc is noticed to go into CrashLoopBackOff with a high restart count.

[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-marketplace   redhat-operators-ht2sc        0/1     CrashLoopBackOff   385        47h
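
One simple way to keep polling for crash-looping pods during the run (illustrative; the 30-second interval is arbitrary):

# refresh the list of crash-looping pods every 30 seconds
watch -n 30 'oc get pods -A | grep "CrashLoopBackOff"'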

At times, other pods, such as the csi-snapshot-controller pod in the openshift-cluster-storage-operator namespace, are also noticed to crash at this point:

[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-cluster-storage-operator  csi-snapshot-controller-5d45bf95b5-2qf8g                  0/1     CrashLoopBackOff   14         23d
openshift-marketplace               redhat-operators-ht2sc                                    0/1     CrashLoopBackOff   385        47h

4. Watch the kube-apiserver pods on the master nodes restart; the restarts do not succeed and the pods move to a CrashLoopBackOff state (inspection commands are sketched after the listing below).

For example:
[root@bastion ~]# oc get pods -A | grep "CrashLoopBackOff"
openshift-cluster-storage-operator   csi-snapshot-controller-5d45bf95b5-2qf8g                  0/1     CrashLoopBackOff   14         23d
openshift-kube-apiserver             kube-apiserver-master-0.m13lp36ocp.lnxne.boe              3/5     CrashLoopBackOff   20         45m
openshift-marketplace                redhat-operators-ht2sc                                    0/1     CrashLoopBackOff   385        47h
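
To follow the kube-apiserver restarts more closely, the static pods can be watched and described in their own namespace, for example:

# watch the kube-apiserver static pods and inspect the crashing one
oc get pods -n openshift-kube-apiserver -w
oc describe pod kube-apiserver-master-0.m13lp36ocp.lnxne.boe -n openshift-kube-apiserver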

5. Delete the stressor pods to improve the cluster state; the openshift-marketplace and openshift-cluster-storage-operator pods are then observed to recover.
kube-apiserver-master-2.m13lp36ocp.lnxne.boe is observed to recover slowly, but kube-apiserver-master-0.m13lp36ocp.lnxne.boe remains in the CrashLoopBackOff state.
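
Removing the stressor workload in step 5 amounts to deleting the deployment implied by the pod name prefix above; assuming it lives in the current project, roughly:

# assumption: the deployment name matches the pod name prefix shown earlier
oc delete deployment cpu-stress-test-3p-resource-limit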

Actual results:
------------------
Please refer to the section above.

Expected results:
--------------------
No pod should end up in a CrashLoopBackOff state.

Additional info:
---------------------
When SSH-ing to the master nodes, the following error message was noticed on all of them:

[root@bastion ~]# ssh core.lnxne.boe
---
Last login: Fri Feb 12 10:16:58 2021 from 192.168.79.1
[systemd]
Failed Units: 1
  multipathd.socket
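
The failed unit can be inspected directly on the node, for example (run on the master node; illustrative only):

# check why multipathd.socket failed and look at its recent journal entries
sudo systemctl status multipathd.socket
sudo journalctl -u multipathd.socket --no-pager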

Comment 1 Stefan Schimanski 2021-02-12 16:11:14 UTC
I don't think a golang program can cause "SIGILL: illegal instruction" with a working compiler toolchain. Moving to the golang team.
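
As an aside (not something requested in this bug, purely an illustrative sanity check), a SIGILL on s390x under KVM can also be cross-checked against the CPU facilities that the guest actually exposes, e.g.:

# on the affected node: list the s390x CPU facilities visible inside the guest
grep -m1 features /proc/cpuinfo
lscpu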

Comment 4 Lakshmi Ravichandran 2021-02-15 21:10:18 UTC
Hi @Derek, I have shared with you the must-gather, dmesg, and dbginfo.sh logs as per the needinfo flag. Please let me know if you need any other information. Thanks!

Comment 7 Lakshmi Ravichandran 2021-04-06 15:09:53 UTC
Hi team, I have not been able to reproduce the bug so far.

Comment 9 David Benoit 2021-05-10 15:29:16 UTC
This behavior could not be reproduced since the initial report, so we are closing the bug due to insufficient data. Please feel free to file another bug if the issue arises again.

