Bug 1738857 - All the masters and nodes become "NotReady" after restarting the kubelet during certificate recovery
Summary: All the masters and nodes become "NotReady" after restarting the kubelet during ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1749271
 
Reported: 2019-08-08 09:55 UTC by zhou ying
Modified: 2019-10-16 06:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1749271
Environment:
Last Closed: 2019-10-16 06:35:15 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift machine-config-operator pull 1062 (closed): Bug 1738857: use internal URI for recovery kubeconfig (last updated 2021-01-27 11:51:12 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:35:25 UTC)

Description zhou ying 2019-08-08 09:55:14 UTC
Description of problem:
Following the certificate recovery doc, after restarting the kubelet on the masters and nodes, all of the masters/nodes become "NotReady".


Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-07-000431

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html to perform certificate recovery.
2. Recover the kubelet on all masters (see the sketch below).
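
For reference, a condensed sketch of what step 2 amounts to on each master, pieced together from the commands discussed in the comments below (the /etc/kubernetes/kubeconfig destination is an assumption; the authoritative sequence is in the linked 4.1 doc):

  # Sketch only, assuming the recovery apiserver from the 4.1 doc is already
  # serving. Generate a fresh kubeconfig against it (script referenced in
  # comments 1 and 7):
  recover-kubeconfig.sh > kubeconfig

  # Install it where the kubelet reads it (path is an assumption based on
  # the /etc/kubernetes layout used elsewhere in this bug):
  cp kubeconfig /etc/kubernetes/kubeconfig

  # Export the kubelet client CA (command from comment 1; comment 3 notes
  # the docs should target /etc/kubernetes/kubelet-ca.crt instead):
  oc get configmap kube-apiserver-to-kubelet-client-ca \
      -n openshift-kube-apiserver-operator \
      --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

  # Restart the kubelet so it picks up the new kubeconfig and CA:
  systemctl restart kubelet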


Actual results:
2. All the nodes go into "NotReady" status, and the node logs show:
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.801371   45340 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-138-26.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.835579   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.935808   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
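
The "system:anonymous" errors mean the kubelet could not present a valid client certificate, so the apiserver treated it as unauthenticated. A quick check on an affected node (a diagnostic sketch; the path is the kubelet's default pki directory and the filename is the usual rotation symlink, so treat both as assumptions):

  # Inspect the kubelet's current client certificate and its validity window.
  openssl x509 -noout -subject -dates \
      -in /var/lib/kubelet/pki/kubelet-client-current.pem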


Expected results:
2. The kubelet restart should succeed and all nodes should return to "Ready" status.

Comment 1 Tomáš Nožička 2019-08-09 08:10:16 UTC
Ryan, getting new kubelet certs is done by running the `recover-kubeconfig.sh` script - would you mind taking a look at why it didn't work?

Also, I think you have made `/etc/kubernetes/ca.crt` managed now (+1), so we probably don't need the hack:

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

The point is that QA should likely be using updated steps, not the 4.1 ones, if we have those.

Comment 2 Ryan Phillips 2019-08-13 00:00:28 UTC
I'll take a look at the documentation steps.

Comment 3 Ryan Phillips 2019-08-13 15:16:55 UTC
Looks like the docs need to be updated to use /etc/kubernetes/kubelet-ca.crt. However, I'm still not getting a clean recovery. Still researching...
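
If it is just the target path that changes, the export from comment 1 would presumably become the following (same configmap, new destination; a sketch only, the final wording is in the docs PR below):

  # Sketch: the comment-1 export redirected to the managed path.
  oc get configmap kube-apiserver-to-kubelet-client-ca \
      -n openshift-kube-apiserver-operator \
      --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/kubelet-ca.crt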

Comment 4 Ryan Phillips 2019-08-13 18:36:36 UTC
The CSR is requested but not auto-approved on my machine. `oc get csr` lists the pending CSR and `oc adm certificate approve [csr-name]` will approve it. My cluster recovered after these two tweaks.
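
In command form, that manual approval path looks roughly like this (both commands are from this comment; the go-template filter for picking out Pending CSRs is an added assumption, matching the usual OpenShift idiom):

  # List all CSRs, then approve the ones that are still Pending
  # (a CSR with an empty status has not been approved or denied yet).
  oc get csr
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
      | xargs --no-run-if-empty oc adm certificate approve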

I talked to Andrea about getting the doc updated for the correct kbuelet-ca.crt.

Comment 5 Ryan Phillips 2019-08-13 18:37:04 UTC
typo, kubelet-ca.crt

Comment 6 Ryan Phillips 2019-08-13 18:50:23 UTC
Docs PR: https://github.com/openshift/openshift-docs/pull/16229

Comment 7 Ryan Phillips 2019-08-14 18:06:42 UTC
Recovery script PR to use the internal URI endpoint: https://github.com/openshift/machine-config-operator/pull/1062
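
For context, the two endpoints in question can be read off the cluster's infrastructure resource (a hedged illustration of what "internal URI" refers to; the field names are from the OpenShift Infrastructure status and assumed available in 4.2):

  # External URL vs. the internal URI the PR above switches
  # recover-kubeconfig.sh to.
  oc get infrastructure cluster -o jsonpath='{.status.apiServerURL}'
  oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}'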

Comment 8 Ryan Phillips 2019-08-14 18:29:39 UTC
@zhou For this BZ, a documentation patch has been generated and an MCO PR created to tweak the endpoint used in `recover-kubeconfig.sh`. Depending on how the kubelet recovery is being tested, the CSR may or may not get signed automatically. Step 12 of the recovery doc covers the CSR approval process. Once the MCO patch merges, this BZ should be able to move to MODIFIED.

Comment 10 zhou ying 2019-08-16 05:35:11 UTC
Confirmed with the latest payload, 4.2.0-0.nightly-2019-08-15-205330: the issue has been fixed.

Comment 11 errata-xmlrpc 2019-10-16 06:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

