Bug 1771410

Summary: [UPI] Failed to recover from expired certificates with all nodes "NotReady"
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.3.0
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Reporter: zhou ying <yinzhou>
Assignee: Tomáš Nožička <tnozicka>
QA Contact: Xingxing Xia <xxia>
CC: aos-bugs, deads, eparis, jokerman, mfojtik, mfuruta, rphillips, sttts, tnozicka
Status: CLOSED ERRATA
Type: Bug
Doc Type: No Doc Update
Clones: 1805398, 1810008 (view as bug list)
Bug Depends On: 1810008
Last Closed: 2020-05-04 11:15:07 UTC

Description zhou ying 2019-11-12 10:12:08 UTC
Description of problem:
Failed to recover from expired certificates with all nodes "NotReady"

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-11-182924

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc to do the certificate recovery:
https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

2. After doing the recovery, check the node status (a check sketch follows below).
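
For step 2, a minimal check sketch, assuming a kubeconfig that can still authenticate against the recovered API server:

    oc get nodes    # all masters and workers should eventually report Ready
    oc get csr      # new node CSRs may appear here during the recovery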

Actual results:
2. First, the two masters on which the recovery was not run were "NotReady"; after about 1 hour, all the master and worker nodes went to "NotReady" status.

Expected results:
2. All the master and worker nodes are Ready and working.

Additional info:
Couldn't run `oc adm must-gather` because all the nodes are NotReady.
No pending CSRs found; a sketch of how any would be approved follows below.
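
If pending CSRs do show up during the recovery, approving them is part of the documented flow; a sketch (batching with xargs is just one convenient form, not taken from the docs):

    oc get csr -o name | xargs oc adm certificate approve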

Logs from node/master:
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.332453   20486 reflector.go:123] object-"openshift-sdn"/"sdn-config": Failed to list *v1.ConfigMap: Unauthorized
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.352320   20486 reflector.go:123] object-"openshift-apiserver"/"client-ca": Failed to list *v1.ConfigMap: Unauthorized

Nov 12 09:06:44 yinzho-tjwqc-w-a-xfxq8.c.openshift-qe.internal hyperkube[702557]: E1112 09:06:44.074302  702557 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Unauthorized



Logs from one master with a different error:
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584   11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.070497   11728 kubelet.go:2275] node "yinzho-tjwqc-m-2.c.openshift-qe.internal" not found
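
These kubelet entries can be collected on the node itself; a sketch, assuming SSH access to the node (on RHCOS the kubelet runs as a systemd unit):

    journalctl -b -u kubelet | grep -E 'Unauthorized|x509'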

Comment 4 Tomáš Nožička 2019-11-14 13:23:36 UTC
> Version-Release number of selected component (if applicable):
> 4.3.0-0.nightly-2019-11-11-182924

> https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

These don't go together; the kubelet recovery procedure has changed over time. Use
https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html
or the latest version of those docs.

Comment 5 Tomáš Nožička 2019-11-14 13:29:59 UTC
This one looks weird, though:

Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584   11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid

If you hit this, please use openssl to dump the cert at that URL and attach it to the BZ (redact the modulus and other private fields). Also check the time on those nodes and on your machine to make sure they are synced, and record the time in the BZ.

(like openssl s_client -connect api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443 | openssl x509 -noout -text)
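
A variant that exits on its own and prints only the fields asked for above; the </dev/null redirect and the -dates/-subject/-issuer flags are standard openssl usage, added here as a suggestion rather than quoted from the comment:

    openssl s_client -connect api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443 </dev/null 2>/dev/null \
        | openssl x509 -noout -dates -subject -issuer
    date -u    # record the node's clock next to the cert dates to check for skew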

Does running the recovery a second time help?

Comment 9 zhou ying 2019-11-19 06:52:11 UTC
After running the recovery a second time, the two masters are still NotReady:

[root@yinzho-2bsks-m-0 ~]# oc get node
NAME                                             STATUS     ROLES    AGE     VERSION
yinzho-2bsks-m-0.c.openshift-qe.internal         Ready      master   5h7m    v1.16.2
yinzho-2bsks-m-1.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-m-2.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-w-a-vhglm.c.openshift-qe.internal   Ready      worker   4h55m   v1.16.2
yinzho-2bsks-w-b-b7rdw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2
yinzho-2bsks-w-c-wqrjw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2
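
One way to check whether an expired kubelet client certificate is what keeps these two masters NotReady; a sketch, assuming the default RHCOS kubelet PKI path (an assumption, verify on the node):

    openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem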

Comment 23 Xingxing Xia 2020-02-14 08:20:17 UTC
Bug 1797897 is not seen now but hit another issue: bug 1802944

Comment 31 Xingxing Xia 2020-02-25 10:28:02 UTC
Hit bug 1797897 again, reported new bug 1806930.

Comment 33 Tomáš Nožička 2020-03-04 12:22:50 UTC
I am purging the BZ deps; otherwise we can't merge fixes: https://github.com/openshift/origin/pull/24630#issuecomment-594486876

Comment 36 Tomáš Nožička 2020-03-11 11:06:12 UTC
Removing the https://bugzilla.redhat.com/show_bug.cgi?id=1811062 dependency, as the BZ merge bot can't handle a second dependency that targets the same release as this BZ rather than 4.5 like the first one: https://github.com/openshift/cluster-kube-scheduler-operator/pull/217#issuecomment-597568806

The dependency was added so that we merge only after the origin change lands in 4.4, and that change is merged now.

Comment 38 Xingxing Xia 2020-03-12 11:30:10 UTC
Verified in 4.4.0-0.nightly-2020-03-11-212258 using the steps in bug 1810008#c4. (BTW, hit an issue already tracked in bug 1812593.)

Comment 39 Xingxing Xia 2020-03-12 11:38:04 UTC
Also hit another issue; no time to analyze it today. Will check the next day.

Comment 41 errata-xmlrpc 2020-05-04 11:15:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581