Bug 1808779 - openshift-apiserver not available after cross-days following https://github.com/redhat-cop/openshift-lab-origin/blob/master/OpenShift4/Stopping_and_Resuming_OCP4_Clusters.adoc
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
 
Reported: 2020-03-01 06:06 UTC by Xingxing Xia
Modified: 2020-03-02 12:13 UTC
CC: 4 users

Last Closed: 2020-03-02 12:13:42 UTC



Comment 1 Xingxing Xia 2020-03-01 06:11:29 UTC
(continuing the "Steps to Reproduce" above)
Here, on Sunday morning, more than 38 hours after cluster creation, I restarted all nodes and waited about 5 minutes for them to come up. openshift-apiserver then showed the x509 "unknown authority" issue below.
Per the blog doc, NotReady nodes are expected at this point; the pending CSRs need to be approved.
[xxia 2020-03-01 11:29:49 CST my]$ oc get no
NAME                                              STATUS     ROLES    AGE   VERSION
ip-10-0-129-237.ap-northeast-2.compute.internal   NotReady   worker   38h   v1.16.2
ip-10-0-137-139.ap-northeast-2.compute.internal   NotReady   master   38h   v1.16.2
ip-10-0-145-118.ap-northeast-2.compute.internal   NotReady   worker   38h   v1.16.2
ip-10-0-159-182.ap-northeast-2.compute.internal   NotReady   master   38h   v1.16.2
ip-10-0-172-23.ap-northeast-2.compute.internal    NotReady   master   38h   v1.16.2
[xxia 2020-03-01 11:30:14 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-cst6c   6m20s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-lj76x   6m20s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-s6pfw   5m54s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tcmrl   6m20s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-xmn2k   6m20s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[xxia 2020-03-01 11:30:48 CST my]$ oc get csr -oname | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-cst6c approved
certificatesigningrequest.certificates.k8s.io/csr-lj76x approved
certificatesigningrequest.certificates.k8s.io/csr-s6pfw approved
certificatesigningrequest.certificates.k8s.io/csr-tcmrl approved
certificatesigningrequest.certificates.k8s.io/csr-xmn2k approved
[xxia 2020-03-01 11:31:40 CST my]$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-6kdlv   19s     system:node:ip-10-0-145-118.ap-northeast-2.compute.internal                 Approved,Issued
csr-7rm8j   19s     system:node:ip-10-0-159-182.ap-northeast-2.compute.internal                 Approved,Issued
csr-bj5xc   29s     system:node:ip-10-0-129-237.ap-northeast-2.compute.internal                 Approved,Issued
csr-cst6c   7m46s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lfflk   25s     system:node:ip-10-0-172-23.ap-northeast-2.compute.internal                  Approved,Issued
csr-lj76x   7m46s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qk2zm   25s     system:node:ip-10-0-137-139.ap-northeast-2.compute.internal                 Approved,Issued
csr-s6pfw   7m20s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-tcmrl   7m46s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xmn2k   7m46s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
[xxia 2020-03-01 11:33:00 CST my]$ oc get no # Now see Ready
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-129-237.ap-northeast-2.compute.internal   Ready    worker   38h   v1.16.2
ip-10-0-137-139.ap-northeast-2.compute.internal   Ready    master   39h   v1.16.2
ip-10-0-145-118.ap-northeast-2.compute.internal   Ready    worker   38h   v1.16.2
ip-10-0-159-182.ap-northeast-2.compute.internal   Ready    master   39h   v1.16.2
ip-10-0-172-23.ap-northeast-2.compute.internal    Ready    master   39h   v1.16.2
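The recovery steps above (approve the node-bootstrapper CSRs, then the node client CSRs they unlock) can be scripted. A minimal sketch, assuming a logged-in `oc` CLI; the helper names `pending_csrs` and `approve_pending_csrs` are illustrative, not from the report:

```shell
#!/usr/bin/env bash
# Print the names of Pending CSRs from `oc get csr` output (read on stdin).
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Approve Pending CSRs in rounds: the node client CSRs only appear after
# the node-bootstrapper CSRs are approved, so a single pass is not enough.
approve_pending_csrs() {
  for _ in 1 2 3; do
    local names
    names=$(oc get csr | pending_csrs)
    [ -z "$names" ] && break
    echo "$names" | xargs oc adm certificate approve
    sleep 30
  done
}
```

Usage would be `approve_pending_csrs`, then `oc get no` until all nodes report Ready, as in the transcript above.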
[xxia 2020-03-01 11:34:11 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
ingress                                    4.3.3     False       True          True       2m
kube-apiserver                             4.3.3     True        True          False      38h
monitoring                                 4.3.3     False       True          True       117s
openshift-apiserver                        4.3.3     False       False         False      46s
operator-lifecycle-manager-packageserver   4.3.3     False       True          False      2m11s
# Repeat `oc get co` until the output stabilizes; openshift-apiserver is still abnormal:
[xxia 2020-03-01 11:58:19 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
ingress                                    4.3.3     False       True          True       26m
monitoring                                 4.3.3     False       False         True       26m
openshift-apiserver                        4.3.3     False       False         False      25m
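For readers unfamiliar with the `grep -v "True.*False.*False"` idiom used throughout: a healthy clusteroperator row reads Available=True, Progressing=False, Degraded=False, so inverting the match keeps only unhealthy operators. A self-contained illustration (the `unhealthy_cos` wrapper name is ours, not from the report):

```shell
# Keep only unhealthy clusteroperator rows: a healthy row matches
# "True ... False ... False" (AVAILABLE, PROGRESSING, DEGRADED columns).
unhealthy_cos() {
  grep -v "True.*False.*False"
}
```

E.g. `oc get co --no-headers | unhealthy_cos` prints nothing once the cluster has settled.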
[xxia 2020-03-01 11:58:55 CST my]$ ogpoas  # user alias, apparently `oc get po -n openshift-apiserver`
NAME              READY   STATUS    RESTARTS   AGE
apiserver-gv9vt   1/1     Running   1          39h
apiserver-k82mx   1/1     Running   1          39h
apiserver-vmntg   1/1     Running   1          39h
[xxia 2020-03-01 12:00:45 CST my]$ oc logs apiserver-gv9vt -n openshift-apiserver > day1-apiserver-gv9vt.log
# check day1-apiserver-gv9vt.log, found many "E0301 04:00:45.693452       1 authentication.go:104] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority"
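Counting the authentication failures in the saved log can be done with a one-line helper (the file name is the one saved above; the function name is illustrative):

```shell
# Count x509 authentication failures in a saved apiserver log file.
count_x509_errors() {
  grep -c 'x509: certificate signed by unknown authority' "$1"
}
```

E.g. `count_x509_errors day1-apiserver-gv9vt.log`.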
[xxia 2020-03-01 12:04:39 CST my]$ ogpkas  # user alias, apparently `oc get po -n openshift-kube-apiserver --show-labels`
NAME                                                             READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-137-139.ap-northeast-2.compute.internal   3/3     Running   0          26m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-10-0-159-182.ap-northeast-2.compute.internal   3/3     Running   0          30m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-10-0-172-23.ap-northeast-2.compute.internal    3/3     Running   0          28m   apiserver=true,app=openshift-kube-apiserver,revision=8
[xxia 2020-03-01 12:05:51 CST my]$ oc logs -c kube-apiserver-8 kube-apiserver-ip-10-0-137-139.ap-northeast-2.compute.internal -n openshift-kube-apiserver > day1-pod-kube-apiserver-ip-10-0-137-139.log
# Gather logs for below clusteroperators
[xxia 2020-03-01 12:09:44 CST my]$ oc adm inspect co openshift-apiserver kube-apiserver kube-controller-manager
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-apiserver-operator...
Gathering data for ns/openshift-apiserver...
E0301 12:21:05.940291   13875 portforward.go:385] error copying from local connection to remote stream: EOF
Gathering data for ns/openshift-kube-apiserver-operator...
E0301 12:27:36.491744   13875 portforward.go:385] error copying from local connection to remote stream: tls: use of closed connection
Gathering data for ns/openshift-kube-apiserver...
E0301 12:31:03.394630   13875 portforward.go:385] error copying from local connection to remote stream: tls: use of closed connection
E0301 12:34:12.981036   13875 portforward.go:385] error copying from local connection to remote stream: EOF
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Wrote inspect data to inspect.local.5068952091736834751.
error: errors ocurred while gathering data:
    unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request ...
[xxia 2020-03-01 12:42:45 CST my]$ du -sh inspect.local.5068952091736834751
250M    inspect.local.5068952091736834751
(to be continued in next comment)

Comment 2 Xingxing Xia 2020-03-01 06:13:55 UTC
(continuing the "Steps to Reproduce" above)
Now try deleting the OAS pods as a workaround:
[xxia 2020-03-01 12:43:12 CST my]$ oc delete po apiserver-gv9vt apiserver-k82mx apiserver-vmntg -n openshift-apiserver
[xxia 2020-03-01 12:43:58 CST my]$ ogpoas
NAME              READY   STATUS    RESTARTS   AGE
apiserver-dkpq6   1/1     Running   0          30s
apiserver-wrx77   1/1     Running   0          36s
apiserver-xtsgw   1/1     Running   0          31s
[xxia 2020-03-01 12:44:14 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
console                                    4.3.3     False       True          False      26s
monitoring                                 4.3.3     False       False         True       71m
# Repeat `oc get co` till the output is stable:
[xxia 2020-03-01 12:47:05 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
console                                    4.3.3     False       False         False      3m28s
[xxia 2020-03-01 12:48:11 CST my]$ ogpconsole
NAME                        READY   STATUS             RESTARTS   AGE
console-6b9b547966-pd9bh    1/1     Running            17         40h
console-6b9b547966-z574b    0/1     CrashLoopBackOff   16         40h
downloads-94dc6d79f-5kppj   1/1     Running            1          40h
downloads-94dc6d79f-kgq7r   1/1     Running            1          40h
# Repeat until the console pod status is stable:
[xxia 2020-03-01 12:50:10 CST my]$ ogpconsole
NAME                        READY   STATUS    RESTARTS   AGE
console-6b9b547966-pd9bh    1/1     Running   17         40h
console-6b9b547966-z574b    1/1     Running   17         40h
downloads-94dc6d79f-5kppj   1/1     Running   1          40h
downloads-94dc6d79f-kgq7r   1/1     Running   1          40h

# After deleting the OAS pods above, openshift-apiserver is back to normal
[xxia 2020-03-01 12:50:13 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
[xxia 2020-03-01 12:50:18 CST my]$ oc get route # a command served by OAS works again
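The "repeat until stable" polling above can be wrapped in a small check. A sketch (the helper name `co_settled` is illustrative) that reads `oc get co --no-headers` output on stdin and succeeds only when every operator is healthy:

```shell
# Succeed (exit 0) only when no unhealthy clusteroperator rows remain,
# i.e. every row matches Available=True, Progressing=False, Degraded=False.
co_settled() {
  ! grep -qv "True.*False.*False"
}
```

Usage: `oc get co --no-headers | co_settled && echo settled`, optionally inside a sleep loop.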

Comment 3 Xingxing Xia 2020-03-01 06:19:44 UTC
The blog doc above requires operations spanning multiple days, so this issue needed quite a few comments for reproduction and troubleshooting. In short: comment 0 covers the first-day operations, comment 1 covers the operations after the first day that hit and debug the issue, and comment 2 is the workaround for the bug.

Comment 5 Stephen Cuppett 2020-03-02 12:13:42 UTC
Presently, stopping and starting an OpenShift cluster is not supported. Work to describe a procedure or provide additional code support for a known, supported procedure is targeted for 4.5 here: https://issues.redhat.com/browse/MSTR-931

