Description of problem:
After running the recovery process for the control plane certs [https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html], the cluster is still inaccessible.

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
Consistently

Steps to Reproduce:
1. Install cluster
2. Allow control plane certs to expire
3. Attempt to recover control plane certs via recovery procedure

Actual results:
The cluster is not operational after recovering the control plane certs.

Expected results:
The cluster should be operational after recovering the control plane certs.

Additional info:
The customer has had control plane certs expire on numerous clusters and has been unable to successfully recover the certs. The API logs show numerous errors related to not being able to reach endpoints on the cluster network [ref 0]. I am not sure whether this could be related to the recovery API server not shutting down: it was noted that the recovery API server failed to shut down after the execution of step 13, and we actually had to remove the static pod definition manually on this cluster. However, this did not seem to get things working.

The kubelet is failing to register due to errors connecting to the API server. A check of the exposed cert [ref 1] shows a cert issued by the kube-apiserver-service-network-signer CA (subject CN=172.30.0.1) being returned, which is causing the kubelet to fail to authenticate the API server. A quick check of etcd was performed and all members seemed healthy. I will attach the API server logs in a separate attachment.

References:

[0] -
2019-10-25T18:55:29.564343631+00:00 stderr F E1025 18:55:29.564289 1 available_controller.go:407] v1.route.openshift.io failed with: failing or missing response from https://10.x.x.242:8443: Get https://10.x.x.242:8443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2019-10-25T18:55:29.747316419+00:00 stderr F E1025 18:55:29.747271 1 available_controller.go:407] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.x.x.12:6443: Get https://10.x.x.12:6443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

[1] -
curl -kv https://api.x.x.x.com:6443
* Rebuilt URL to: https://api.x.x.x.com:6443/
*   Trying 9.x.x.81...
* TCP_NODELAY set
* Connected to api.x.x.x.com (9.x.x.81) port 6443 (#0)
...
* Server certificate:
*  subject: CN=172.30.0.1
*  start date: Oct 16 06:35:54 2019 GMT
*  expire date: Nov 15 06:35:55 2019 GMT
*  issuer: OU=openshift; CN=kube-apiserver-service-network-signer
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x562feb907630)
> GET / HTTP/2
> Host: api.x.x.x.com:6443
> User-Agent: curl/7.61.1
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 2000)!
< HTTP/2 403
< cache-control: no-cache, private
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 233
< date: Fri, 25 Oct 2019 19:09:52 GMT
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
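One more data point that may help: a rough sketch of how the recovery API server cleanup can be double-checked on each master (commands are illustrative only; the exact manifest file name for the recovery apiserver depends on how it was created, so treat that as a placeholder):

  ls /etc/kubernetes/manifests/             # the recovery apiserver manifest should no longer be present after step 13
  sudo crictl ps --name kube-apiserver      # only the regular kube-apiserver containers should be listed
  sudo crictl ps -a | grep -i recovery      # nothing from the recovery apiserver should still show up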
> The kubelet is failing to register due to errors connecting to the API server. A check of the exposed cert [ref 1] shows a cert issued by the kube-apiserver-service-network-signer CA (subject CN=172.30.0.1) being returned, which is causing the kubelet to fail to authenticate the API server.

AFAIK the kubelet doesn't use the service network but the internal load balancer and its cert. Sending to the Node team to investigate why the kubelet can't connect. They'll likely need info about the install (architecture, cloud, ...), kubelet logs, and a check of the appropriate certs.
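To speed that up, a sketch of the kubelet-side data that would help, assuming a standard RHCOS master (the kubeconfig path below is the usual location and may differ on this install):

  sudo journalctl -u kubelet | grep -iE 'x509|certificate|Unable to register' | tail -n 20   # recent TLS/registration errors
  sudo grep -E 'server:|certificate-authority' /var/lib/kubelet/kubeconfig                   # which endpoint and CA the kubelet actually uses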
The kubelet should be talking to the API server on https://api-int.x.x.x.com:6443/ (note the -int). As Tomas mentioned, we need more information.
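A quick way to see which serving cert the internal endpoint is actually returning (hostnames are placeholders for the redacted cluster domain; -servername forces SNI so the two checks are comparable):

  echo | openssl s_client -connect api-int.x.x.x.com:6443 -servername api-int.x.x.x.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
  echo | openssl s_client -connect api.x.x.x.com:6443 -servername api.x.x.x.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates

If api-int also comes back with the CN=172.30.0.1 service-network cert instead of one valid for the api-int hostname, that would line up with the kubelet's verification failure.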
Confirmed with payload 4.2.0-0.nightly-2019-12-11-171302 on UPI on bare-metal AWS; we were able to recover the cluster successfully.
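For anyone reproducing the verification, roughly the kind of post-recovery checks involved (a sketch, not the exact steps; approving pending CSRs is also part of the documented procedure):

  oc get csr                               # look for Pending node CSRs left over from the recovery
  oc adm certificate approve <csr_name>    # approve each pending CSR
  oc get nodes                             # all nodes should report Ready
  oc get clusteroperators                  # all operators Available and not Degraded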
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4181