(continuing for above "Steps to Reproduce") Here on Sun morning after cluster creation passed > 38 hours, re-start all nodes, wait 5 mins for the nodes up. Found openshift-apiserver has below x509 unknown authority issue Per the blog doc, NotReady is expected, need approve CSR. [xxia 2020-03-01 11:29:49 CST my]$ oc get no NAME STATUS ROLES AGE VERSION ip-10-0-129-237.ap-northeast-2.compute.internal NotReady worker 38h v1.16.2 ip-10-0-137-139.ap-northeast-2.compute.internal NotReady master 38h v1.16.2 ip-10-0-145-118.ap-northeast-2.compute.internal NotReady worker 38h v1.16.2 ip-10-0-159-182.ap-northeast-2.compute.internal NotReady master 38h v1.16.2 ip-10-0-172-23.ap-northeast-2.compute.internal NotReady master 38h v1.16.2 [xxia 2020-03-01 11:30:14 CST my]$ oc get csr NAME AGE REQUESTOR CONDITION csr-cst6c 6m20s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending csr-lj76x 6m20s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending csr-s6pfw 5m54s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending csr-tcmrl 6m20s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending csr-xmn2k 6m20s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending [xxia 2020-03-01 11:30:48 CST my]$ oc get csr -oname | xargs oc adm certificate approve certificatesigningrequest.certificates.k8s.io/csr-cst6c approved certificatesigningrequest.certificates.k8s.io/csr-lj76x approved certificatesigningrequest.certificates.k8s.io/csr-s6pfw approved certificatesigningrequest.certificates.k8s.io/csr-tcmrl approved certificatesigningrequest.certificates.k8s.io/csr-xmn2k approved [xxia 2020-03-01 11:31:40 CST my]$ oc get csr NAME AGE REQUESTOR CONDITION csr-6kdlv 19s system:node:ip-10-0-145-118.ap-northeast-2.compute.internal Approved,Issued csr-7rm8j 19s system:node:ip-10-0-159-182.ap-northeast-2.compute.internal Approved,Issued csr-bj5xc 29s 
system:node:ip-10-0-129-237.ap-northeast-2.compute.internal Approved,Issued csr-cst6c 7m46s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued csr-lfflk 25s system:node:ip-10-0-172-23.ap-northeast-2.compute.internal Approved,Issued csr-lj76x 7m46s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued csr-qk2zm 25s system:node:ip-10-0-137-139.ap-northeast-2.compute.internal Approved,Issued csr-s6pfw 7m20s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued csr-tcmrl 7m46s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued csr-xmn2k 7m46s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued [xxia 2020-03-01 11:33:00 CST my]$ oc get no # Now see Ready NAME STATUS ROLES AGE VERSION ip-10-0-129-237.ap-northeast-2.compute.internal Ready worker 38h v1.16.2 ip-10-0-137-139.ap-northeast-2.compute.internal Ready master 39h v1.16.2 ip-10-0-145-118.ap-northeast-2.compute.internal Ready worker 38h v1.16.2 ip-10-0-159-182.ap-northeast-2.compute.internal Ready master 39h v1.16.2 ip-10-0-172-23.ap-northeast-2.compute.internal Ready master 39h v1.16.2 [xxia 2020-03-01 11:34:11 CST my]$ oc get co --no-headers | grep -v "True.*False.*False" ingress 4.3.3 False True True 2m kube-apiserver 4.3.3 True True False 38h monitoring 4.3.3 False True True 117s openshift-apiserver 4.3.3 False False False 46s operator-lifecycle-manager-packageserver 4.3.3 False True False 2m11s # Repeat `oc get co` till the output becomes stable: openshift-apiserver is seen abnormal: [xxia 2020-03-01 11:58:19 CST my]$ oc get co --no-headers | grep -v "True.*False.*False" ingress 4.3.3 False True True 26m monitoring 4.3.3 False False True 26m openshift-apiserver 4.3.3 False False False 25m [xxia 2020-03-01 11:58:55 CST my]$ ogpoas NAME READY STATUS RESTARTS AGE apiserver-gv9vt 1/1 Running 1 39h apiserver-k82mx 1/1 Running 1 39h 
apiserver-vmntg   1/1     Running   1          39h

[xxia 2020-03-01 12:00:45 CST my]$ oc logs apiserver-gv9vt -n openshift-apiserver > day1-apiserver-gv9vt.log
# Checked day1-apiserver-gv9vt.log; found many occurrences of:
# E0301 04:00:45.693452 1 authentication.go:104] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

[xxia 2020-03-01 12:04:39 CST my]$ ogpkas
NAME                                                             READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-137-139.ap-northeast-2.compute.internal   3/3     Running   0          26m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-10-0-159-182.ap-northeast-2.compute.internal   3/3     Running   0          30m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-10-0-172-23.ap-northeast-2.compute.internal    3/3     Running   0          28m   apiserver=true,app=openshift-kube-apiserver,revision=8

[xxia 2020-03-01 12:05:51 CST my]$ oc logs -c kube-apiserver-8 kube-apiserver-ip-10-0-137-139.ap-northeast-2.compute.internal -n openshift-kube-apiserver > day1-pod-kube-apiserver-ip-10-0-137-139.log

# Gather logs for the clusteroperators below
[xxia 2020-03-01 12:09:44 CST my]$ oc adm inspect co openshift-apiserver kube-apiserver kube-controller-manager
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-apiserver-operator...
Gathering data for ns/openshift-apiserver...
E0301 12:21:05.940291 13875 portforward.go:385] error copying from local connection to remote stream: EOF
Gathering data for ns/openshift-kube-apiserver-operator...
E0301 12:27:36.491744 13875 portforward.go:385] error copying from local connection to remote stream: tls: use of closed connection
Gathering data for ns/openshift-kube-apiserver...
E0301 12:31:03.394630 13875 portforward.go:385] error copying from local connection to remote stream: tls: use of closed connection
E0301 12:34:12.981036 13875 portforward.go:385] error copying from local connection to remote stream: EOF
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Wrote inspect data to inspect.local.5068952091736834751.
error: errors ocurred while gathering data: unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request
...

[xxia 2020-03-01 12:42:45 CST my]$ du -sh inspect.local.5068952091736834751
250M    inspect.local.5068952091736834751

(to be continued in next comment)
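For repeated runs of the reproduction above, the manual CSR-approval step can be narrowed so only Pending CSRs are approved. This is a sketch, not part of the original steps; `pending_csrs` is a hypothetical helper name, and the live usage assumes a logged-in cluster-admin `oc` session:

```shell
# Print the NAME column of CSRs whose CONDITION is Pending,
# given `oc get csr` table output on stdin.
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Live usage (requires cluster-admin; not run here):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

Note the node-client CSRs (`system:node:...`) only appear after the `node-bootstrapper` CSRs are approved, so the live command may need to be repeated, as the transcript above shows.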
(continuing for above "Steps to Reproduce")

Now try deleting the OAS pods as a workaround:

[xxia 2020-03-01 12:43:12 CST my]$ oc delete po apiserver-gv9vt apiserver-k82mx apiserver-vmntg -n openshift-apiserver

[xxia 2020-03-01 12:43:58 CST my]$ ogpoas
NAME              READY   STATUS    RESTARTS   AGE
apiserver-dkpq6   1/1     Running   0          30s
apiserver-wrx77   1/1     Running   0          36s
apiserver-xtsgw   1/1     Running   0          31s

[xxia 2020-03-01 12:44:14 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
console      4.3.3   False   True    False   26s
monitoring   4.3.3   False   False   True    71m

# Repeat `oc get co` till the output is stable:
[xxia 2020-03-01 12:47:05 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
console   4.3.3     False       False         False      3m28s

[xxia 2020-03-01 12:48:11 CST my]$ ogpconsole
NAME                        READY   STATUS             RESTARTS   AGE
console-6b9b547966-pd9bh    1/1     Running            17         40h
console-6b9b547966-z574b    0/1     CrashLoopBackOff   16         40h
downloads-94dc6d79f-5kppj   1/1     Running            1          40h
downloads-94dc6d79f-kgq7r   1/1     Running            1          40h

# Repeat till console pods status is stable:
[xxia 2020-03-01 12:50:10 CST my]$ ogpconsole
NAME                        READY   STATUS    RESTARTS   AGE
console-6b9b547966-pd9bh    1/1     Running   17         40h
console-6b9b547966-z574b    1/1     Running   17         40h
downloads-94dc6d79f-5kppj   1/1     Running   1          40h
downloads-94dc6d79f-kgq7r   1/1     Running   1          40h

# After deleting the OAS pods above, openshift-apiserver is back to normal
[xxia 2020-03-01 12:50:13 CST my]$ oc get co --no-headers | grep -v "True.*False.*False"
[xxia 2020-03-01 12:50:18 CST my]$ oc get route   # OAS resource commands work well
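The `grep -v "True.*False.*False"` filter used throughout keeps only clusteroperators whose AVAILABLE/PROGRESSING/DEGRADED columns deviate from the healthy True/False/False pattern, so an empty result means the cluster has settled. A minimal sketch of that check (`abnormal_cos` is a hypothetical helper name, not an oc subcommand):

```shell
# Given `oc get co --no-headers` output on stdin, print only the
# operators that are not in the healthy Available=True,
# Progressing=False, Degraded=False state. Empty output = settled.
abnormal_cos() {
    grep -v "True.*False.*False" || true   # grep exits 1 on no match
}

# Live usage, repeated until no output, as in the workaround above:
#   oc get co --no-headers | abnormal_cos
```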
The blog doc above requires operations spanning multiple days, so reproducing and troubleshooting this issue takes quite a few comments. In short: comment 0 covers the operations within the first day; comment 1 covers the operations after the first day, which hit and debug the issue; comment 2 is the workaround for the bug.
Presently, stopping and starting an OpenShift cluster is not supported. Work to document a known, supported procedure, or to add code support for one, is targeted for 4.5 here: https://issues.redhat.com/browse/MSTR-931