Bug 1738857 - All the masters and nodes become "NotReady" after restarting the kubelet during certificate recovery
Summary: All the masters and nodes become "NotReady" after restarting the kubelet during ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1749271
 
Reported: 2019-08-08 09:55 UTC by zhou ying
Modified: 2019-10-16 06:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1749271
Environment:
Last Closed: 2019-10-16 06:35:15 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift machine-config-operator pull 1062 (closed): Bug 1738857: use internal URI for recovery kubeconfig (last updated 2021-01-27 11:51:12 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:35:25 UTC)

Description zhou ying 2019-08-08 09:55:14 UTC
Description of problem:
Following the certificate recovery doc, after restarting the kubelet on the masters and nodes, all of the masters/nodes become "NotReady".


Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-07-000431

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html to perform certificate recovery.
2. Recover the kubelet on all masters (see the sketch below).
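
For reference, a condensed sketch of what step 2 amounts to on each master, pieced together from the commands discussed in the comments below (the /etc/kubernetes/kubeconfig destination is an assumption; the authoritative sequence is in the linked 4.1 doc):

  # Sketch only, assuming the recovery apiserver from the 4.1 doc is already
  # serving. Generate a fresh kubeconfig against it (script referenced in
  # comments 1 and 7):
  recover-kubeconfig.sh > kubeconfig

  # Install it where the kubelet reads it (path is an assumption based on
  # the /etc/kubernetes layout used elsewhere in this bug):
  cp kubeconfig /etc/kubernetes/kubeconfig

  # Export the kubelet client CA (command from comment 1; comment 3 notes
  # the docs should target /etc/kubernetes/kubelet-ca.crt instead):
  oc get configmap kube-apiserver-to-kubelet-client-ca \
      -n openshift-kube-apiserver-operator \
      --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

  # Restart the kubelet so it picks up the new kubeconfig and CA:
  systemctl restart kubelet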


Actual results:
2. All the nodes go into "NotReady" status, and the node logs show:
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.801371   45340 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-138-26.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.835579   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.935808   45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
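
The "system:anonymous" errors mean the kubelet could not present a valid client certificate, so the apiserver treated it as unauthenticated. A quick check on an affected node (a diagnostic sketch; the path is the kubelet's default pki directory and the filename is the usual rotation symlink, so treat both as assumptions):

  # Inspect the kubelet's current client certificate and its validity window.
  openssl x509 -noout -subject -dates \
      -in /var/lib/kubelet/pki/kubelet-client-current.pem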


Expected results:
2. The kubelet restart should succeed and all nodes should return to "Ready" status.

Comment 1 Tomáš Nožička 2019-08-09 08:10:16 UTC
Ryan, getting new kubelet certs is done by running the `recover-kubeconfig.sh` script - would you mind taking a look at why it didn't work?

Also, I think you have made `/etc/kubernetes/ca.crt` managed now (+1), so we probably don't need the hack:

  # oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

The point is that QA should likely be using updated steps, not the 4.1 ones, if we have those.

Comment 2 Ryan Phillips 2019-08-13 00:00:28 UTC
I'll take a look at the documentation steps.

Comment 3 Ryan Phillips 2019-08-13 15:16:55 UTC
Looks like the docs need to be updated to use /etc/kubernetes/kubelet-ca.crt. However, I'm still not getting a clean recovery. Still researching...
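
If it is just the target path that changes, the export from comment 1 would presumably become the following (same configmap, new destination; a sketch only, the final wording is in the docs PR below):

  # Sketch: the comment-1 export redirected to the managed path.
  oc get configmap kube-apiserver-to-kubelet-client-ca \
      -n openshift-kube-apiserver-operator \
      --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/kubelet-ca.crt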

Comment 4 Ryan Phillips 2019-08-13 18:36:36 UTC
The CSR is requested but not auto-approved on my machine. `oc get csr` lists the pending CSR and `oc adm certificate approve [csr-name]` will approve it. My cluster recovered after these two tweaks.
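
In command form, that manual approval path looks roughly like this (both commands are from this comment; the go-template filter for picking out Pending CSRs is an added assumption, matching the usual OpenShift idiom):

  # List all CSRs, then approve the ones that are still Pending
  # (a CSR with an empty status has not been approved or denied yet).
  oc get csr
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
      | xargs --no-run-if-empty oc adm certificate approve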

I talked to Andrea about getting the doc updated for the correct kbuelet-ca.crt.

Comment 5 Ryan Phillips 2019-08-13 18:37:04 UTC
typo, kubelet-ca.crt

Comment 6 Ryan Phillips 2019-08-13 18:50:23 UTC
Docs PR: https://github.com/openshift/openshift-docs/pull/16229

Comment 7 Ryan Phillips 2019-08-14 18:06:42 UTC
Recovery script PR to use the internal URI endpoint: https://github.com/openshift/machine-config-operator/pull/1062
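
For context, the two endpoints in question can be read off the cluster's infrastructure resource (a hedged illustration of what "internal URI" refers to; the field names are from the OpenShift Infrastructure status and assumed available in 4.2):

  # External URL vs. the internal URI the PR above switches
  # recover-kubeconfig.sh to.
  oc get infrastructure cluster -o jsonpath='{.status.apiServerURL}'
  oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}'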

Comment 8 Ryan Phillips 2019-08-14 18:29:39 UTC
@zhou For this BZ, a documentation patch has been generated and an MCO PR created to tweak the endpoint used in `recover-kubeconfig.sh`. Depending on how the kubelet recovery is being tested, the CSR may or may not get signed automatically. Step 12 of the recovery doc covers the CSR approval process. Once the MCO patch merges, this BZ should be able to move to MODIFIED.

Comment 10 zhou ying 2019-08-16 05:35:11 UTC
Confirmed with the latest payload, 4.2.0-0.nightly-2019-08-15-205330: the issue has been fixed.

Comment 11 errata-xmlrpc 2019-10-16 06:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

