Description of problem:
Failed to recover from expired certificates; all nodes end up "NotReady".

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-11-182924

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc to do the certificate recovery: https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html
2. After doing the recovery, check the node status.

Actual results:
2. At first the two masters on which the recovery was not run were "NotReady"; after about 1 hour all the master and worker nodes went to "NotReady" status.

Expected results:
2. All the master and worker nodes work well.

Additional info:
Couldn't run `oc adm must-gather` because all the nodes are NotReady. No Pending CSRs found.

Logs from node/master:
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.332453 20486 reflector.go:123] object-"openshift-sdn"/"sdn-config": Failed to list *v1.ConfigMap: Unauthorized
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.352320 20486 reflector.go:123] object-"openshift-apiserver"/"client-ca": Failed to list *v1.ConfigMap: Unauthorized
Nov 12 09:06:44 yinzho-tjwqc-w-a-xfxq8.c.openshift-qe.internal hyperkube[702557]: E1112 09:06:44.074302 702557 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Unauthorized

Logs from one master with a different error:
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584 11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.070497 11728 kubelet.go:2275] node "yinzho-tjwqc-m-2.c.openshift-qe.internal" not found
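To help rule out kubelet client/serving certificates that never rotated after the recovery, a minimal check you could run on one of the NotReady nodes (a sketch only; the paths below are the usual kubelet PKI locations on an OpenShift 4.x node and may differ on your layout):

  # On an affected master/worker, via SSH or console:
  sudo openssl x509 -noout -enddate -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem
  sudo openssl x509 -noout -enddate -subject -in /var/lib/kubelet/pki/kubelet-server-current.pem
  # Compare against the node clock to spot skew:
  date -u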
> Version-Release number of selected component (if applicable):
> 4.3.0-0.nightly-2019-11-11-182924

> https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

These don't go together; the kubelet recovery procedure has changed over time. Use https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html or the latest version of those docs.
This one looks weird though:

Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584 11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid

If you hit this, please use openssl to dump the cert served at that URL and attach it to the BZ (redact the modulus and other private fields). Also check the time on those nodes and on your machine to make sure it is synced, and record the time in the BZ.
(like openssl s_client -connect api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443 | openssl x509 -noout -text)

Does running the recovery a second time help?
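An expanded form of that check, as a sketch only (the hostname is the api-int address from the log above; -servername and -dates are just convenience additions to the command already suggested):

  # Dump validity dates and issuer of the cert served on the internal API endpoint
  openssl s_client -connect api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443 \
          -servername api-int.yinzhou.qe.gcp.devcluster.openshift.com </dev/null 2>/dev/null \
    | openssl x509 -noout -dates -issuer -subject
  # Record the current node/client time alongside it
  date -u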
After running the recovery a second time, two masters are still NotReady:

[root@yinzho-2bsks-m-0 ~]# oc get node
NAME                                             STATUS     ROLES    AGE     VERSION
yinzho-2bsks-m-0.c.openshift-qe.internal         Ready      master   5h7m    v1.16.2
yinzho-2bsks-m-1.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-m-2.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-w-a-vhglm.c.openshift-qe.internal   Ready      worker   4h55m   v1.16.2
yinzho-2bsks-w-b-b7rdw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2
yinzho-2bsks-w-c-wqrjw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2
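For reference, a sketch of follow-up checks on the still-NotReady masters (node/CSR names are placeholders; as noted above, no Pending CSRs were seen in this cluster):

  # From the recovered master, look for kubelet CSRs that may need manual approval
  oc get csr
  # Approve any that show Pending (replace <csr_name> with the actual name)
  oc adm certificate approve <csr_name>
  # On a NotReady master, inspect the kubelet for the current error
  journalctl -u kubelet --since "10 minutes ago" | tail -n 50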
Bug 1797897 is not seen now, but we hit another issue: bug 1802944.
Hit bug 1797897 again, reported new bug 1806930.
I am purging the BZ deps, otherwise we can't merge fixes: https://github.com/openshift/origin/pull/24630#issuecomment-594486876
Removing the https://bugzilla.redhat.com/show_bug.cgi?id=1811062 dependency, as the BZ merge bot can't handle a second dependency targeting the same release as this BZ rather than 4.5 like the first one: https://github.com/openshift/cluster-kube-scheduler-operator/pull/217#issuecomment-597568806

The dependency was added so we would merge only after the origin change landed in 4.4, and that is merged now.
Verified in 4.4.0-0.nightly-2020-03-11-212258 using the steps in bug 1810008#c4. (BTW, hit an issue already tracked in bug 1812593.)
Also hit another issue; no time to analyze it today. Will check tomorrow.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581