Bug 1738857

Summary: All the masters and nodes become "NotReady" after restarting the kubelet during certificate recovery
Product: OpenShift Container Platform
Component: Node
Version: 4.2.0
Target Release: 4.2.0
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: zhou ying <yinzhou>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, jokerman, mfojtik, sjenning
Last Closed: 2019-10-16 06:35:15 UTC
Type: Bug
Bug Blocks: 1749271    

Description zhou ying 2019-08-08 09:55:14 UTC
Description of problem:
Following the certificate recovery doc, after restarting the kubelet on the masters and nodes, all the masters/nodes become "NotReady".


Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-07-000431

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html to perform the certificate recovery.
2. Recover the kubelet on all masters.


Actual results:
2. All the nodes go into "NotReady" status, and the node logs show: 
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.801371   45340 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-138-26.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.835579   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.935808   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
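
For reference, the NotReady state and the anonymous-user rejection can be confirmed directly (a sketch, assuming `oc` access via the recovered kubeconfig and SSH access to a node):

  # oc get nodes                            # every node reports NotReady
  # journalctl -u kubelet | grep forbidden  # shows the system:anonymous rejections above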


Expected results:
2. Should succeed.

Comment 1 Tomáš Nožička 2019-08-09 08:10:16 UTC
Ryan, getting new kubelet certs is done by running the `recover-kubeconfig.sh` script - would you mind taking a look at why it didn't work?

Also, I think you have made `/etc/kubernetes/ca.crt` managed now (+1), so we probably don't need the hack

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

The point is that QA should likely be using updated steps, not the 4.1 ones, if we have those?
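
As a quick check of whether the hack is still needed (a sketch, assuming the same configmap and key as in the command above), diff the on-disk CA against the configmap; no output means the file already matches:

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' | diff /etc/kubernetes/ca.crt -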

Comment 2 Ryan Phillips 2019-08-13 00:00:28 UTC
I'll take a look at the documentation steps.

Comment 3 Ryan Phillips 2019-08-13 15:16:55 UTC
Looks like the docs need to be updated to use /etc/kubernetes/kubelet-ca.crt. However, I'm still not getting a clean recovery. Still researching...

Comment 4 Ryan Phillips 2019-08-13 18:36:36 UTC
The CSR is requested but not auto-approved on my machine. `oc get csr` lists the pending CSR and `oc adm certificate approve [csr-name]` will approve it. My cluster recovered after these two tweaks.
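
For reference, a minimal sketch of that approval flow; the batch one-liner is a convenience variant (it approves everything still Pending, so only use it when all pending CSRs are expected):

  # oc get csr
  # oc adm certificate approve <csr-name>

  # batch variant: approve every CSR that has no status yet, i.e. is still Pending
  # oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve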

I talked to Andrea about getting the doc updated for the correct kbuelet-ca.crt.

Comment 5 Ryan Phillips 2019-08-13 18:37:04 UTC
typo, kubelet-ca.crt

Comment 6 Ryan Phillips 2019-08-13 18:50:23 UTC
Docs PR: https://github.com/openshift/openshift-docs/pull/16229

Comment 7 Ryan Phillips 2019-08-14 18:06:42 UTC
Recovery script PR to use the internal URI endpoint: https://github.com/openshift/machine-config-operator/pull/1062

Comment 8 Ryan Phillips 2019-08-14 18:29:39 UTC
@zhou For this BZ, a documentation patch has been generated and one MCO PR created to tweak the endpoint used in `recover-kubeconfig.sh`. Depending on how the kubelet recovery is being tested, the CSR may or may not get signed automatically. Step 12 of the recovery doc covers the CSR approval process. Once the MCO patch merges, this bug should be able to move to MODIFIED.
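
A quick way to verify the recovery end to end (a sketch, assuming cluster-admin credentials on the recovered kubeconfig):

  # oc get csr               # nothing should remain Pending after step 12
  # oc get nodes             # all masters/workers should return to Ready
  # oc get clusteroperators  # no operator should remain Degraded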

Comment 10 zhou ying 2019-08-16 05:35:11 UTC
Confirmed with the latest payload 4.2.0-0.nightly-2019-08-15-205330: the issue has been fixed.

Comment 11 errata-xmlrpc 2019-10-16 06:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922