Bug 1765759 - Control Plane Certs Expired, Unable to Recover Certs
Summary: Control Plane Certs Expired, Unable to Recover Certs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-25 21:12 UTC by rvanderp
Modified: 2020-02-24 05:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-20 00:46:48 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1263 None closed Bug 1765759: templates/*etcd-member.yaml: give etcd-metrics container privilege 2020-07-08 08:40:45 UTC
Red Hat Product Errata RHBA-2019:4181 None None None 2019-12-20 00:46:57 UTC

Description rvanderp 2019-10-25 21:12:21 UTC
Description of problem:
After running the recovery process for the control plane certs [https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html], the cluster is still inaccessible.

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
Consistently

Steps to Reproduce:
1. Install cluster
2. Allow control plane certs to expire
3. Attempt to recover control plane certs via recovery procedure

Actual results:
The cluster is not operational after recovering the control plane certs

Expected results:
The cluster should be operational after recovering the control plane certs

Additional info:
The customer has had control plane certs expire on numerous clusters, and they have been unable to successfully recover the certs.

The API logs show numerous errors related to not being able to reach endpoints on the cluster network [ref 0].  I am not sure if this could be related to the recovery API server not shutting down.  It was noted that the recovery API server failed to shut down after the execution of step 13.  In fact, we had to remove the static pod definition manually on this cluster.  However, this did not seem to get things working.
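
For context, removing the static pod definition here means pulling its manifest out of the kubelet's static pod directory; a minimal sketch, assuming the usual /etc/kubernetes/manifests path on an OpenShift 4 master and a placeholder manifest filename:

# Static pods are defined by manifests under the kubelet's static pod path
# (/etc/kubernetes/manifests on OpenShift 4 masters). Moving a manifest out
# of that directory causes the kubelet to tear the pod down.
# "recovery-apiserver.yaml" is a placeholder; the actual filename will differ.
sudo mv /etc/kubernetes/manifests/recovery-apiserver.yaml /root/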

The kubelet is failing to register due to errors connecting to the API server.  A check of the exposed cert [ref 1] shows a cert issued by CN kube-apiserver-service-network-signer being returned, which prevents the kubelet from authenticating the API server.
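
As a quick cross-check, the issuer of the cert served on the external API endpoint can be pulled directly with openssl (the hostname below is a placeholder for the cluster's real API DNS name):

# Print subject, issuer, and validity dates of the cert presented on :6443.
echo | openssl s_client -connect api.x.x.x.com:6443 -servername api.x.x.x.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates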

A quick check of etcd was performed and all members seemed healthy.  I will attach the API server logs separately.
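
For reference, a way to re-run that etcd health check; a sketch assuming etcdctl v3 is available on a master node, with placeholder cert paths and endpoint that vary by environment:

# Check cluster membership and per-endpoint health with etcdctl v3.
export ETCDCTL_API=3
etcdctl --endpoints=https://<master-ip>:2379 \
  --cacert=/path/to/etcd-ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  member list
etcdctl --endpoints=https://<master-ip>:2379 \
  --cacert=/path/to/etcd-ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  endpoint health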

References:
[0] - 
2019-10-25T18:55:29.564343631+00:00 stderr F E1025 18:55:29.564289       1 available_controller.go:407] v1.route.openshift.io failed with: failing or missing response from https://10.x.x.242:8443: Get https://10.x.x.242:8443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2019-10-25T18:55:29.747316419+00:00 stderr F E1025 18:55:29.747271       1 available_controller.go:407] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.x.x.12:6443: Get https://10.x.x.12:6443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)


[1] - 
curl -kv https://api.x.x.x.com:6443
* Rebuilt URL to: https://api.x.x.x.com:6443/
*   Trying 9.x.x.81...
* TCP_NODELAY set
* Connected to api.x.x.x.com (9.x.x.81) port 6443 (#0)
...
* Server certificate:
*  subject: CN=172.30.0.1
*  start date: Oct 16 06:35:54 2019 GMT
*  expire date: Nov 15 06:35:55 2019 GMT
*  issuer: OU=openshift; CN=kube-apiserver-service-network-signer
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x562feb907630)
> GET / HTTP/2
> Host: api.x.x.x.com:6443
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 2000)!
< HTTP/2 403 
< cache-control: no-cache, private
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 233
< date: Fri, 25 Oct 2019 19:09:52 GMT
< 
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}

Comment 2 Tomáš Nožička 2019-10-29 09:50:55 UTC
> The kubelet is failing to register due to errors connecting to the API server. A check of the exposed cert[ref 1] shows the cert for the CN kube-apiserver-service-network-signer being returned which is causing the kubelet to not authenticate the API server.  

AFAIK the kubelet doesn't use the service network but the internal load balancer and its cert.

Sending to the node team to investigate why the kubelet can't connect.

They'll likely need info about the install (architecture, cloud, ...), kubelet logs, and a check of the appropriate certs.

Comment 3 Ryan Phillips 2019-10-29 13:16:40 UTC
The kubelet should be talking to the API server on https://api-int.x.x.x.com:6443/ (note the -int). As Tomas mentioned, we need more information.
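
To confirm which cert the kubelet actually sees, the same kind of check can be run against the internal endpoint (hostname is a placeholder):

# Compare the cert served on the internal API endpoint with the one in [ref 1].
echo | openssl s_client -connect api-int.x.x.x.com:6443 -servername api-int.x.x.x.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates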

Comment 15 zhou ying 2019-12-13 07:30:21 UTC
Confirmed with payload 4.2.0-0.nightly-2019-12-11-171302 using UPI on bare metal AWS; the recovery succeeded.

Comment 17 errata-xmlrpc 2019-12-20 00:46:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4181

