Bug 1737611 - master node certs expire and CSRs are not approved on Baremetal or any UPI
Summary: master node certs expire and CSRs are not approved on Baremetal or any UPI
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: 4.2.0
Assignee: Joel Speed
QA Contact: sunzhaohua
Duplicates: 1738568 1743719 1743908 1747183
Depends On:
Blocks: 1731242
Reported: 2019-08-05 20:32 UTC by Doug Hellmann
Modified: 2023-03-24 15:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-10-16 06:34:48 UTC
Target Upstream Version:
brad.ison: needinfo-

System | ID | Private | Priority | Status | Summary | Last Updated
Github | https://github.com/openshift-metal3 dev-scripts issues 260 | 0 | None | None | None | 2020-09-30 00:42:03 UTC
Github | openshift cluster-machine-approver pull 38 | 0 | 'None' | closed | Bug 1737611: Simplified approval flow for renewing serving certs | 2021-02-17 17:50:48 UTC
Red Hat Knowledge Base (Solution) | 4360181 | 0 | None | None | None | 2021-10-22 14:12:02 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:35:05 UTC

Description Doug Hellmann 2019-08-05 20:32:34 UTC
Description of problem:

Certificates for baremetal master nodes expire and the CSRs are not automatically approved, so clusters stop working after a period of time.

See https://github.com/openshift-metal3/dev-scripts/issues/260 for additional information

Version-Release number of selected component (if applicable):

How reproducible:


Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Doug Hellmann 2019-08-05 20:34:29 UTC
Clayton asked for this bug to be opened and assigned to him.

Comment 2 Clayton Coleman 2019-08-05 21:03:54 UTC
To summarize a discussion with David and Derek (will write this up in a doc shortly), it should be possible to renew a server certificate for a node from the machine approver if:

1. The machine approver can contact the node on its advertised port and address over TLS
2. The current server cert from the node:
   a. Has the same private key as the new CSR
   b. Has the same IP/SAN list
   c. Has the same CA or the CA is known to be previously trusted
   d. Is not expired

That proves that the node previously satisfied the requirements for having a cert and renewal.  The only downside is revocation, which could be accomplished by a generation indicator on the node or machine object if necessary (in general this is a problem with node revocation already for clients, so there may be overlap).
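
The checks listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the machine approver's actual code: the `Cert` and `Csr` classes are hypothetical simplified stand-ins for parsed X.509 objects, the "same private key" check is modelled by comparing public key fingerprints, and check 1 (reaching the node over TLS) is assumed to have already happened in order to obtain `current`. The `require_same_key` flag is also hypothetical; it is there because later comments in this bug question whether that check can be kept.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Cert:
    """Hypothetical, simplified view of the node's current serving cert."""
    public_key: str       # fingerprint of the public key
    sans: frozenset       # DNS names and IP addresses in the cert
    issuer: str           # identifier of the signing CA
    not_after: datetime   # expiry time


@dataclass(frozen=True)
class Csr:
    """Hypothetical, simplified view of the new serving CSR."""
    public_key: str
    sans: frozenset


def may_renew(current: Cert, csr: Csr, trusted_cas: set, now: datetime,
              require_same_key: bool = True) -> bool:
    """Return True only if every check from the list above passes."""
    # Same key pair: equal public keys imply the same private key.
    if require_same_key and current.public_key != csr.public_key:
        return False
    # Same IP/SAN list.
    if current.sans != csr.sans:
        return False
    # Issued by a CA that is (or was previously) trusted.
    if current.issuer not in trusted_cas:
        return False
    # The current cert must not have expired yet.
    if now >= current.not_after:
        return False
    return True
```

In this sketch an expired current cert always fails, which matches the point that the existing cert is what proves the node previously qualified.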

Comment 3 Doug Hellmann 2019-08-06 13:22:25 UTC
This also seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1736800

Comment 4 Michal Fojtik 2019-08-08 13:21:09 UTC
(In reply to Doug Hellmann from comment #3)
> This also seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1736800

It is not. 1736800 is about us losing one carry patch two weeks ago in the master branch (4.2). This problem doesn't exist in 4.1.

Comment 5 Brad Ison 2019-08-14 14:53:34 UTC
I'm trying to get up to speed on this since it's marked as urgent for the 4.2 release, but ownership of this component only recently transferred to the OpenShift Cloud / Cluster Infrastructure team and there's a lot of info in the linked issues. Can someone help me understand?

AFAICT, the cluster-machine-approver already handles refreshing serving certs. It does however depend on the machine-api being present and the existence of a Machine object linked to the Node via NodeRef. It looks like there were some issues with that for bare-metal because neither the ProviderID nor IP addresses were set for masters, which prevented the nodelink-controller from linking Machine and Node. That in turn means the cluster-machine-approver could not approve the CSR. There is apparently a workaround for this now.

So, what's the current status?

- Does this affect other platforms?
- Can cluster-machine-approver *not* depend on machine-api?
- If this flow is changing, is this actually expected for the 4.2 release?

Comment 6 Mrunal Patel 2019-08-14 15:10:46 UTC
This issue tracks the problem for bare metal and has a good summary of where we are https://github.com/openshift-metal3/dev-scripts/issues/260

Comment 7 Johnny Liu 2019-08-15 03:18:23 UTC
From QE side, what we are seeing is:
After installation completes and some hours pass, new pending CSRs appear, and the user has to approve them again and again.
Worse, if the user does not notice those pending CSRs, or forgets to approve them for a *long* time (about 24 hours), the whole cluster becomes NotReady. Manually approving the CSRs at that point is too late and has no effect, so the cluster never recovers.

This issue happened on all UPI installs. There is no cluster-machine-approver in a UPI install, and I do not think machine-api is supported in UPI installs.

Comment 8 Brad Ison 2019-08-15 15:50:37 UTC
Ah, okay, so this is a problem on any UPI cluster because we can't rely on the machine-api. I'm trying to gather information so we can document some of this. Here's my read of the way things are configured currently:

*** During the initial bootstrap of the cluster, we run an `approve-csr.sh` script that loops and approves all pending CSRs.

*** After the cluster is bootstrapped, it looks like both the cluster-machine-approver AND the kube-controller-manager are configured to approve CSRs. The kube-controller-manager will only approve client certificates, and it looks to me like the RBAC configuration would only allow it to approve renewals -- not new nodes.

Does this mean approval of client renewals is a race between kube-controller-manager and cluster-machine-approver?

Is the only reason client renewals work on UPI because kube-controller-manager handles them?

*** The signer is configured with a validity of 30 days, and certificates will be renewed between 21 and 27 days due to the built-in jitter.

I've seen multiple mentions of things breaking after 24 hours. I'm not sure where that comes from if the validity is set to 30 days. Does anyone know?
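
As a rough check of the 21 to 27 day figure, the renewal window can be computed as a fraction of the certificate validity. The 0.7 and 0.9 fractions below are an assumption chosen because they reproduce the numbers quoted above for a 30-day certificate; `renewal_window` is a hypothetical helper, not a function from client-go.

```python
from datetime import timedelta


def renewal_window(validity: timedelta, low: float = 0.7, high: float = 0.9):
    """Return the (earliest, latest) certificate age at which renewal starts.

    The default fractions are assumptions that match the 21-27 day
    window quoted for a 30-day validity.
    """
    return validity * low, validity * high


earliest, latest = renewal_window(timedelta(days=30))
print(earliest.days, latest.days)  # 21 27
```

The same helper can be applied to a shorter validity to see how early a renewal request would be expected in that case.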

*** This isn't directly related to the issue, but cluster-machine-approver has no ClusterOperator. Is it meant to be a fully fledged SLO?

Is there anyone I can talk to that knows about this?

Comment 11 Brad Ison 2019-08-23 15:20:12 UTC
I have a WIP PR up for this, but on my test cluster, I get new key material for every request. So the "Has the same private key as the new CSR" check will not work.

Comment 12 Brad Ison 2019-08-26 12:15:30 UTC
Just to confirm, the certificate manager in client-go definitely creates a new private key on each renewal:


Do we think the suggested algorithm is still valid without the check for the same key?

Comment 13 Brad Ison 2019-08-27 10:38:12 UTC
*** Bug 1738568 has been marked as a duplicate of this bug. ***

Comment 14 Brad Ison 2019-08-27 10:43:44 UTC
*** Bug 1743719 has been marked as a duplicate of this bug. ***

Comment 15 Brad Ison 2019-08-28 09:51:26 UTC
*** Bug 1743908 has been marked as a duplicate of this bug. ***

Comment 16 Clayton Coleman 2019-08-30 12:59:30 UTC
Let’s dig in on that, I don’t think we have to rotate the private key every time but if we do we might want the CSR to reference the old private key somehow.  Let me refresh my memory of the rotation code.

Comment 17 Clayton Coleman 2019-08-30 19:45:04 UTC
Actually, if you verify that the requester is the node, then this should be ok.  If you can pretend to be a node successfully then you can get a new serving cert.

So just make sure the approver is checking that the renewer is coming from the same node.
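
A minimal sketch of such a requester check, assuming the approver has the CSR's username, groups, and the subject common name from the embedded request. The function and parameter names are hypothetical; the `system:node:<name>` username and `system:nodes` group follow the Kubernetes node identity convention.

```python
NODE_USER_PREFIX = "system:node:"
NODE_GROUP = "system:nodes"


def requester_is_node(username: str, groups: list, csr_common_name: str) -> bool:
    """Return True if the CSR was submitted by the node it is issued for."""
    # The requester must authenticate as a node, not e.g. a service account.
    if not username.startswith(NODE_USER_PREFIX):
        return False
    # Reject an empty node name after the prefix.
    if len(username) == len(NODE_USER_PREFIX):
        return False
    # Node identities are members of the system:nodes group.
    if NODE_GROUP not in groups:
        return False
    # The cert being requested must be for the same node identity.
    return csr_common_name == username
```

With this check, a request submitted by one node for a cert naming a different node is rejected, as is a request from the node-bootstrapper service account.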

Comment 18 Alberto 2019-09-04 10:25:31 UTC
*** Bug 1747183 has been marked as a duplicate of this bug. ***

Comment 19 Alberto 2019-09-05 16:14:02 UTC
I'm putting together here some info to help to clarify and serve as a quick reference:

Approvers workflow
There are two approvers in the cluster:
    1 - The kube controller manager approver.
    2 - The machine approver.
There is no possible race between them for denial: to avoid conflicts with other approvers, neither approver explicitly denies a CSR that does not meet its criteria.

-Kubelet TLS bootstrapping is configured for requesting both client and serving CSRs.
-Kube controller manager approver has no permissions to approve client/serving CSRs for new nodes.
-So it’s up to the machine approver to approve client/serving CSRs for new nodes. If they don't meet the criteria, manual approval is needed.

-Kubelet is configured for rotating both client and serving certificates.
-Kube controller manager approver approves kubelet renewal client CSRs via system-bootstrap-node-renewal ClusterRoleBinding
-Kube controller manager approver does not support approval of kubelet renewal serving CSRs.
-So it’s up to the machine approver to approve renewal serving CSRs. If they don't meet the criteria, manual approval is needed.
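
The division of responsibility described above can be summarised as a small lookup. This is only a sketch of the workflow as written in this comment; the `Usage` enum and `responsible_approver` function are hypothetical names, not part of either approver.

```python
from enum import Enum


class Usage(Enum):
    CLIENT = "client"
    SERVING = "serving"


def responsible_approver(usage: Usage, is_renewal: bool) -> str:
    """Return which approver is expected to handle a kubelet CSR."""
    # Only client renewals are handled by the kube controller manager.
    if is_renewal and usage is Usage.CLIENT:
        return "kube-controller-manager"
    # New-node CSRs (client and serving) and serving renewals fall to the
    # machine approver; if its criteria are not met, the CSR stays pending
    # until someone approves it manually.
    return "machine-approver"


for usage in Usage:
    for is_renewal in (False, True):
        kind = "renewal" if is_renewal else "new node"
        print(f"{usage.value:8} {kind:9} -> {responsible_approver(usage, is_renewal)}")
```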

wip PR for decoupling kubelet renewal serving CSRs from the machine API https://github.com/openshift/cluster-machine-approver/pull/38

Comment 21 Brad Ison 2019-09-13 10:14:27 UTC
The PR implementing the check based on existing certificates has merged. I think there are two areas it could use special attention in further testing:

- Testing on different UPI clusters and platforms.
- Testing that things work correctly as the CSR signing CA is rotated.

Comment 22 Alberto 2019-09-13 15:00:34 UTC
Doug Hellmann orthogonally to the renewal approval workflow, can you elaborate why do master machines in baremetal have no IP information in status?

Comment 23 Doug Hellmann 2019-09-13 15:37:04 UTC
(In reply to Alberto from comment #22)
> Doug Hellmann orthogonally to the renewal approval workflow, can you
> elaborate why do master machines in baremetal have no IP information in
> status?

IP addresses are part of the information collected through inspection. They will be available in 4.3.

Comment 24 Jianwei Hou 2019-09-19 06:42:20 UTC
I've set up 3 UPI deployments with 4.2.0-0.nightly-2019-09-17-232025: 2 on AWS and 1 on GCP.

After the clusters run for 24+ hours, the nodes eventually end up NotReady.

Running for 21 hours
oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-55-236.us-east-2.compute.internal   Ready    master   21h   v1.14.6+7f575e5af
ip-10-0-55-78.us-east-2.compute.internal    Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-56-66.us-east-2.compute.internal    Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-58-2.us-east-2.compute.internal     Ready    master   21h   v1.14.6+7f575e5af
ip-10-0-65-182.us-east-2.compute.internal   Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-72-48.us-east-2.compute.internal    Ready    master   21h   v1.14.6+7f575e5af

Running for 24 hours
oc get nodes
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-55-236.us-east-2.compute.internal   NotReady   master   24h   v1.14.6+7f575e5af
ip-10-0-55-78.us-east-2.compute.internal    NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-56-66.us-east-2.compute.internal    NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-58-2.us-east-2.compute.internal     NotReady   master   24h   v1.14.6+7f575e5af
ip-10-0-65-182.us-east-2.compute.internal   NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-72-48.us-east-2.compute.internal    NotReady   master   24h   v1.14.6+7f575e5af

oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-29d5w   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-44dcw   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-86qns   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9nnqj   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bd4sv   65m   system:node:ip-10-0-55-236.us-east-2.compute.internal                       Pending
csr-bl5hb   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-crt8b   65m   system:node:ip-10-0-55-78.us-east-2.compute.internal                        Pending
csr-fc9qg   65m   system:node:ip-10-0-65-182.us-east-2.compute.internal                       Pending
csr-htxhd   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jc4wj   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-khtdq   65m   system:node:ip-10-0-58-2.us-east-2.compute.internal                         Pending
csr-mdgfn   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-mrqh9   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-msvzh   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ndb8t   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r6njl   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-v6dcf   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wnnq2   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-z5blb   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zvrtd   65m   system:node:ip-10-0-56-66.us-east-2.compute.internal                        Pending
csr-zzsbp   65m   system:node:ip-10-0-72-48.us-east-2.compute.internal                        Pending

oc logs machine-approver-7bf6885dff-4xhml -n openshift-cluster-machine-approver
Error from server: Get
: remote error: tls: internal error

oc get nodes
NAME                                             STATUS     ROLES    AGE   VERSION
qe-jho-5k9cj-m-0.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-m-1.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-m-2.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-a-dqq6v.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-b-bd5jg.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f

oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2b8gc   47m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-4vh2v   10m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-698jl   2s      system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-7276f   12m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-72r2h   78m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9k8p5   87s     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-9qhg8   13m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-9zm2j   3m12s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-bpdqh   125m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bxdgh   7m7s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-c4trp   13m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-cggbt   3h7m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cnfrq   89s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ct9g9   16m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cxhn7   13m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-dwggz   111s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-f8vh6   107s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-f92pf   8m31s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-fpbcc   5m8s    system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-fvr2r   32m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-gblwm   16s     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-hclmx   8m7s    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hfrlf   12m     system:node:qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal                  Approved,Issued
csr-jbflz   171m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jstnl   4m21s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-khnzb   4m28s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-kwgkg   13m     system:node:qe-jho-5k9cj-w-b-bd5jg.c.openshift-qe.internal                  Approved,Issued
csr-m4vpj   10m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-mhvxg   5m42s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-n5fcg   10m     system:node:qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal                  Approved,Issued
csr-pf4zq   6m10s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-phbpw   63m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r8shh   94m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r9cdd   3h22m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rff4d   9m34s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-rfprk   10m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-rzv9j   8m33s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-s8kpx   7m21s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-sd9v2   4m33s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-sdjrv   109m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-sghk2   12m     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-svhjx   3m13s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tb5qr   12m     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-v8vrd   3m      system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-vthj9   140m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-w9g4h   11m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-wkrxh   156m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wrtw2   7m45s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-z6v9b   83s     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued

Comment 26 Alberto 2019-09-19 07:06:24 UTC
This seems to be expected behaviour. All the master nodes get approved and the control plane is up, which is what was reported in this bug.
Kubelet renewal client CSRs depend on your Kube controller manager approver setup, e.g. via the system-bootstrap-node-renewal ClusterRoleBinding.

Can we also try to test that things work correctly as the CSR signing CA is rotated?

Comment 27 Jianwei Hou 2019-09-20 09:38:25 UTC
Tested on BM UPI with 4.2.0-0.nightly-2019-09-18-211009

Machine-approver auto approves kubelet renewal serving CSRs as expected.

After bootstrap, there is a CA that is valid for 24 hours
oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject=CN = kube-csr-signer_@1568878080
notBefore=Sep 19 07:27:59 2019 GMT
notAfter=Sep 20 07:12:55 2019 GMT

It is then rotated automatically.
oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject=CN = kube-csr-signer_@1568946479
notBefore=Sep 20 02:27:58 2019 GMT
notAfter=Oct 20 02:27:59 2019 GMT
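
To make the two validity periods explicit, the `notBefore`/`notAfter` values printed above can be subtracted. This is a small sketch using only the dates shown in the output; the `validity` helper and the format string are illustrative and assume the `openssl x509 -dates` layout seen here.

```python
from datetime import datetime

# Layout of the dates printed by `openssl x509 -noout -dates` above.
DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def validity(not_before: str, not_after: str):
    """Return the validity period between two openssl-style dates."""
    start = datetime.strptime(not_before, DATE_FORMAT)
    end = datetime.strptime(not_after, DATE_FORMAT)
    return end - start


bootstrap_ca = validity("Sep 19 07:27:59 2019 GMT", "Sep 20 07:12:55 2019 GMT")
rotated_ca = validity("Sep 20 02:27:58 2019 GMT", "Oct 20 02:27:59 2019 GMT")

print(bootstrap_ca)  # 23:44:56
print(rotated_ca)    # 30 days, 0:00:01
```

So the initial signer is valid for just under 24 hours and its replacement for 30 days.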

I0920 06:20:59.713350       1 main.go:139] CSR csr-zpqpc added
I0920 06:20:59.713376       1 main.go:142] CSR csr-zpqpc is already approved
I0920 06:21:41.110223       1 main.go:139] CSR csr-kd26q added
I0920 06:21:41.178172       1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-compute-2 (
I0920 06:21:41.181390       1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-compute-2
I0920 06:21:41.195188       1 main.go:189] CSR csr-kd26q approved
I0920 06:22:28.350969       1 main.go:139] CSR csr-2pzsv added
I0920 06:22:28.362088       1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-control-plane-1 (
I0920 06:22:28.364434       1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-control-plane-1
I0920 06:22:28.393366       1 main.go:189] CSR csr-2pzsv approved
I0920 06:23:30.570854       1 main.go:139] CSR csr-w22p7 added
I0920 06:23:30.584008       1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-control-plane-0 (
I0920 06:23:30.586600       1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-control-plane-0
I0920 06:23:30.597992       1 main.go:189] CSR csr-w22p7 approved
I0920 06:28:32.864448       1 main.go:139] CSR csr-8ntv5 added

oc get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
qe-yapei-uos2-6dbch-compute-0         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-compute-1         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-compute-2         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-0   Ready    master   26h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-1   Ready    master   26h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-2   Ready    master   26h   v1.14.6+147115512

The CSR auto-approval works well. I think this bug can be verified.

Comment 28 errata-xmlrpc 2019-10-16 06:34:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

