Bug 1771410 - [UPI] Failed to recover from expired certificates with all nodes "NotReady"
Summary: [UPI] Failed to recover from expired certificates with all nodes "NotReady"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Tomáš Nožička
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1810008
Blocks:
 
Reported: 2019-11-12 10:12 UTC by zhou ying
Modified: 2020-05-04 11:15 UTC (History)
9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1805398 1810008 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:15:07 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 775 0 None closed [release-4.4] Bug 1771410: Disable delegated auth for recovery 2021-02-21 09:26:07 UTC
Github openshift cluster-kube-apiserver-operator pull 786 0 None closed [release-4.4] Bug 1771410: Use the new tls-server-name option in kubeconfig 2021-02-21 09:26:07 UTC
Github openshift cluster-kube-controller-manager-operator pull 358 0 None closed [release-4.4] Bug 1771410: Use the new tls-server-name option in kubeconfig 2021-02-21 09:26:07 UTC
Github openshift cluster-kube-controller-manager-operator pull 362 0 None closed [release-4.4] Bug 1771410: Reload client certs 2021-02-21 09:26:07 UTC
Github openshift cluster-kube-scheduler-operator pull 217 0 None closed [release-4.4] Bug 1771410: Add cert syncer 4.4 2021-02-21 09:26:07 UTC
Github openshift library-go pull 716 0 None closed [release-4.4] Bug 1771410: Allow disabling serving in ControllerBuilder 2021-02-21 09:26:08 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:15:34 UTC

Description zhou ying 2019-11-12 10:12:08 UTC
Description of problem:
Failed to recover from expired certificates with all nodes "NotReady"

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-11-182924

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc to do the certificate recovery:
https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

2. After doing the recovery, check the node status.

Actual results:
2. At first, the two masters on which the recovery was not run were "NotReady"; after about 1 hour, all the master and worker nodes became "NotReady".

Expected results:
2. All the master and worker nodes are Ready.

Additional info:
Couldn't run `oc adm must-gather` because all the nodes are NotReady.
No pending CSRs found.
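As a hedged illustration of the pending-CSR check mentioned above (the CSR names, node names, and sample output below are made up for the sketch, not taken from this cluster):

```shell
# Illustrative sample of `oc get csr --no-headers` output -- the names
# here are hypothetical stand-ins.
sample='csr-aaaaa   2m    kubernetes.io/kube-apiserver-client-kubelet   system:node:m-1   Pending
csr-bbbbb   10m   kubernetes.io/kube-apiserver-client-kubelet   system:node:m-0   Approved,Issued'

# Keep only the names of CSRs whose condition (last column) is Pending.
pending=$(printf '%s\n' "$sample" | awk '$NF == "Pending" {print $1}')
echo "$pending"

# On a live cluster you would feed in the real output and approve each:
#   for csr in $pending; do oc adm certificate approve "$csr"; done
```

During cert recovery the kubelets re-request client certificates, so NotReady nodes often trace back to unapproved CSRs; in this report, however, none were pending.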

Logs from node/master:
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.332453   20486 reflector.go:123] object-"openshift-sdn"/"sdn-config": Failed to list *v1.ConfigMap: Unauthorized
Nov 12 09:17:22 yinzho-tjwqc-m-0.c.openshift-qe.internal hyperkube[20486]: E1112 09:17:22.352320   20486 reflector.go:123] object-"openshift-apiserver"/"client-ca": Failed to list *v1.ConfigMap: Unauthorized

Nov 12 09:06:44 yinzho-tjwqc-w-a-xfxq8.c.openshift-qe.internal hyperkube[702557]: E1112 09:06:44.074302  702557 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Unauthorized



Logs from one master with a different error:
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584   11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.070497   11728 kubelet.go:2275] node "yinzho-tjwqc-m-2.c.openshift-qe.internal" not found

Comment 4 Tomáš Nožička 2019-11-14 13:23:36 UTC
> Version-Release number of selected component (if applicable):
> 4.3.0-0.nightly-2019-11-11-182924

> https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

These don't go together; the kubelet procedure has changed over time. Use
https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html
or the latest version of those docs.

Comment 5 Tomáš Nožička 2019-11-14 13:29:59 UTC
This one looks weird though:

Nov 12 09:27:03 yinzho-tjwqc-m-2.c.openshift-qe.internal hyperkube[11728]: E1112 09:27:03.011584   11728 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Get https://api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dyinzho-tjwqc-m-2.c.openshift-qe.internal&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid

If you hit this, please use openssl to dump the cert at that URL and attach it to the BZ (redact the modulus and other private fields). Also check the time on those nodes and on your machine to make sure it is synced, and record the times in the BZ.

(like openssl s_client -connect api-int.yinzhou.qe.gcp.devcluster.openshift.com:6443 | openssl x509 -noout -text)
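A minimal sketch of the suggested check, assuming GNU `date` (as on RHEL). Since the real api-int endpoint isn't reachable here, it uses a locally generated throwaway certificate in place of the one the cluster serves:

```shell
# Generate a throwaway self-signed cert as a stand-in for the cert served
# on api-int:6443; against a live cluster you would fetch the real one with
#   openssl s_client -connect <api-int-host>:6443 </dev/null 2>/dev/null | openssl x509
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/bz1771410.key \
  -out /tmp/bz1771410.crt -days 1 -subj "/CN=api-int.example.test" 2>/dev/null

# Dump just the validity window -- the field worth attaching to the BZ.
openssl x509 -in /tmp/bz1771410.crt -noout -dates

# Compare notAfter against the local clock: a skewed node clock makes a
# valid cert look "expired or not yet valid".
not_after=$(openssl x509 -in /tmp/bz1771410.crt -noout -enddate | cut -d= -f2)
end_epoch=$(date -d "$not_after" +%s)
now_epoch=$(date +%s)
if [ "$end_epoch" -gt "$now_epoch" ]; then
  echo "cert still valid"
else
  echo "cert expired or clock skewed"
fi
```

The same `-enddate` comparison run on each node, alongside `date`, distinguishes a genuinely expired certificate from a node whose clock has drifted.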

Does running the recovery a second time help?

Comment 9 zhou ying 2019-11-19 06:52:11 UTC
After running the recovery a second time, the two masters are still NotReady:

[root@yinzho-2bsks-m-0 ~]# oc get node
NAME                                             STATUS     ROLES    AGE     VERSION
yinzho-2bsks-m-0.c.openshift-qe.internal         Ready      master   5h7m    v1.16.2
yinzho-2bsks-m-1.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-m-2.c.openshift-qe.internal         NotReady   master   5h7m    v1.16.2
yinzho-2bsks-w-a-vhglm.c.openshift-qe.internal   Ready      worker   4h55m   v1.16.2
yinzho-2bsks-w-b-b7rdw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2
yinzho-2bsks-w-c-wqrjw.c.openshift-qe.internal   Ready      worker   4h58m   v1.16.2

Comment 23 Xingxing Xia 2020-02-14 08:20:17 UTC
Bug 1797897 is not seen now, but we hit another issue: bug 1802944.

Comment 31 Xingxing Xia 2020-02-25 10:28:02 UTC
Hit bug 1797897 again, reported new bug 1806930.

Comment 33 Tomáš Nožička 2020-03-04 12:22:50 UTC
I am purging the BZ deps, or we can't merge fixes: https://github.com/openshift/origin/pull/24630#issuecomment-594486876

Comment 36 Tomáš Nožička 2020-03-11 11:06:12 UTC
Removing the https://bugzilla.redhat.com/show_bug.cgi?id=1811062 dependency, as the BZ merge bot can't handle a second dependency targeting the same release as this BZ (rather than 4.5, like the first one): https://github.com/openshift/cluster-kube-scheduler-operator/pull/217#issuecomment-597568806

The dependency was added so we would merge only after the origin change landed in 4.4, and that change is merged now.

Comment 38 Xingxing Xia 2020-03-12 11:30:10 UTC
Verified in 4.4.0-0.nightly-2020-03-11-212258 using the steps in bug 1810008#c4. (BTW, hit an issue already tracked in bug 1812593.)

Comment 39 Xingxing Xia 2020-03-12 11:38:04 UTC
Also hit another issue; no time to analyze it today. Will check the next day.

Comment 41 errata-xmlrpc 2020-05-04 11:15:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

