Bug 1738857

Summary: All the masters and nodes become "NotReady" after restarting the kubelet during certificate recovery
Product: OpenShift Container Platform
Component: Node
Version: 4.2.0
Target Release: 4.2.0
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: zhou ying <yinzhou>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, jokerman, mfojtik, sjenning
Last Closed: 2019-10-16 06:35:15 UTC
Type: Bug
Bug Blocks: 1749271    

Description zhou ying 2019-08-08 09:55:14 UTC
Description of problem:
Following the certificate recovery doc, after restarting the kubelet on the masters and nodes, all the masters/nodes become "NotReady".


Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-07-000431

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html to perform the certificate recovery.
2. Recover the kubelet on all masters.


Actual results:
2. All the nodes go into "NotReady" status, and the node logs show: 
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.801371   45340 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-138-26.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.835579   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.935808   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
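
For reference, the NotReady state and the anonymous-user rejection can be confirmed directly (a sketch, assuming `oc` access via the recovered kubeconfig and SSH access to a node):

  # oc get nodes                            # every node reports NotReady
  # journalctl -u kubelet | grep forbidden  # shows the system:anonymous rejections above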


Expected results:
2. Should succeed.

Comment 1 Tomáš Nožička 2019-08-09 08:10:16 UTC
Ryan, getting new kubelet certs is done by running the `recover-kubeconfig.sh` script - would you mind taking a look at why it didn't work?

Also, I think you have made `/etc/kubernetes/ca.crt` managed now (+1), so we probably don't need the hack

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

The point is that QA should likely be using updated steps, not the 4.1 ones, if we have those?
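
As a quick check of whether the hack is still needed (a sketch, assuming the same configmap and key as in the command above), diff the on-disk CA against the configmap; no output means the file already matches:

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' | diff /etc/kubernetes/ca.crt -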

Comment 2 Ryan Phillips 2019-08-13 00:00:28 UTC
I'll take a look at the documentation steps.

Comment 3 Ryan Phillips 2019-08-13 15:16:55 UTC
Looks like the docs need to be updated to use /etc/kubernetes/kubelet-ca.crt. However, I'm still not getting a clean recovery. Still researching...

Comment 4 Ryan Phillips 2019-08-13 18:36:36 UTC
The CSR is requested but not auto-approved on my machine. `oc get csr` lists the pending CSR and `oc adm certificate approve [csr-name]` will approve it. My cluster recovered after these two tweaks.
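
For reference, a minimal sketch of that approval flow; the batch one-liner is a convenience variant (it approves everything still Pending, so only use it when all pending CSRs are expected):

  # oc get csr
  # oc adm certificate approve <csr-name>

  # batch variant: approve every CSR that has no status yet, i.e. is still Pending
  # oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve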

I talked to Andrea about getting the doc updated for the correct kbuelet-ca.crt.

Comment 5 Ryan Phillips 2019-08-13 18:37:04 UTC
typo, kubelet-ca.crt

Comment 6 Ryan Phillips 2019-08-13 18:50:23 UTC
Docs PR: https://github.com/openshift/openshift-docs/pull/16229

Comment 7 Ryan Phillips 2019-08-14 18:06:42 UTC
Recovery script PR to use the internal URI endpoint: https://github.com/openshift/machine-config-operator/pull/1062

Comment 8 Ryan Phillips 2019-08-14 18:29:39 UTC
@zhou For this BZ, a documentation patch has been generated and one MCO PR created to tweak the endpoint used in `recover-kubeconfig.sh`. Depending on how the kubelet recovery is being tested, the CSR may or may not get signed automatically. Step 12 of the recovery doc covers the CSR approval process. Once the MCO patch merges, this bug should be able to move to MODIFIED.
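
A quick way to verify the recovery end to end (a sketch, assuming cluster-admin credentials on the recovered kubeconfig):

  # oc get csr               # nothing should remain Pending after step 12
  # oc get nodes             # all masters/workers should return to Ready
  # oc get clusteroperators  # no operator should remain Degraded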

Comment 10 zhou ying 2019-08-16 05:35:11 UTC
Confirmed with the latest payload 4.2.0-0.nightly-2019-08-15-205330: the issue has been fixed.

Comment 11 errata-xmlrpc 2019-10-16 06:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922