Description of problem:
After following the certificate recovery doc and restarting the kubelet on the masters and nodes, all masters and nodes become "NotReady".

Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-07-000431

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html to do certificate recovery.
2. Recover the kubelet on all masters.

Actual results:
2. All the nodes go to "NotReady" status, and the node logs show:
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.801371 45340 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: nodes "ip-10-0-138-26.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.835579 45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found
Aug 08 09:24:11 ip-10-0-138-26 hyperkube[45340]: E0808 09:24:11.935808 45340 kubelet.go:2254] node "ip-10-0-138-26.us-east-2.compute.internal" not found

Expected results:
2. Should succeed.
Ryan, new kubelet certs are obtained by running the `recover-kubeconfig.sh` script. Would you mind taking a look at why that didn't happen here? Also, I think you have made `/etc/kubernetes/ca.crt` managed now (+1), so we probably don't need this hack anymore:

# oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/ca.crt

The point is that QA should probably be following updated steps rather than the 4.1 doc, if we have them.
I'll take a look at the documentation steps.
Looks like the docs need to be updated to use /etc/kubernetes/kubelet-ca.crt. However, I'm still not getting a clean recovery. Still researching...
The CSR is requested but not auto-approved on my machine. `oc get csr` lists the pending CSR and `oc adm certificate approve [csr-name]` approves it. My cluster recovered after these two tweaks. I talked to Andrea about getting the doc updated for the correct kbuelet-ca.crt.
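For anyone hitting the same pending-CSR state, here is a minimal sketch of the manual approval described above. The helper name `pending_csrs` is hypothetical (not part of `oc`), and it assumes the table printed by `oc get csr` has NAME as the first column and CONDITION as the last, with the literal value "Pending" for unapproved requests. Treat it as a convenience, not as the documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: read the table printed by `oc get csr` on stdin and
# print the NAME of every row whose last column (assumed to be CONDITION)
# is exactly "Pending". The header row is skipped naturally because its
# last field is "CONDITION", not "Pending".
pending_csrs() {
    awk '$NF == "Pending" { print $1 }'
}

# Intended usage on a machine with cluster-admin access (left commented out
# so that sourcing this file never touches a cluster):
#   oc get csr | pending_csrs
#   oc get csr | pending_csrs | xargs -r -n 1 oc adm certificate approve
```

Running the first commented pipeline on its own and reviewing the names before approving is the safer order, since approving a CSR grants the requester a signed certificate. The `-r` flag assumes GNU xargs; it just avoids calling `oc` when nothing is pending.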
Typo in my previous comment: the file is kubelet-ca.crt.
Docs PR: https://github.com/openshift/openshift-docs/pull/16229
Recovery script PR to use the internal URI endpoint: https://github.com/openshift/machine-config-operator/pull/1062
@zhou For this BZ, a documentation patch has been generated and one MCO PR has been created to change the endpoint used in `recover-kubeconfig.sh`. Depending on how the kubelet recovery is tested, the CSR may or may not get signed automatically. Step 12 of the recovery doc goes over the CSR approval process. Once the MCO patch merges, this BZ should be able to move to MODIFIED.
Confirmed with the latest payload, 4.2.0-0.nightly-2019-08-15-205330: the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922