Description of problem:
Certificates for baremetal master nodes expire and the CSRs are not automatically approved, so clusters stop working after a period of time. See https://github.com/openshift-metal3/dev-scripts/issues/260 for additional information.

Version-Release number of selected component (if applicable):

How reproducible:
Very

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Clayton asked for this bug to be opened and assigned to him.
To summarize a discussion with David and Derek (will write this up in a doc shortly), it should be possible to renew a server certificate for a node from the machine approver if:

1. The machine approver can contact the node on its advertised port and address over TLS
2. The current server cert from the node:
   - Has the same private key as the new CSR
   - Has the same IP/SAN list
   - Has the same CA or the CA is known to be previously trusted
   - Is not expired

That proves that the node previously satisfied the requirements for having a cert and renewal. The only downside is revocation, which could be accomplished by a generation indicator on the node or machine object if necessary (in general this is a problem with node revocation already for clients, so there may be overlap).
This also seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1736800
(In reply to Doug Hellmann from comment #3)
> This also seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1736800

It is not. 1736800 is about us losing one carry patch two weeks ago in the master branch (4.2). This problem doesn't exist in 4.1.
I'm trying to get up to speed on this since it's marked as urgent for the 4.2 release, but ownership of this component only recently transferred to the OpenShift Cloud / Cluster Infrastructure team and there's a lot of info in the linked issues. Can someone help me understand?

AFAICT, the cluster-machine-approver already handles refreshing serving certs. It does however depend on the machine-api being present and the existence of a Machine object linked to the Node via NodeRef. It looks like there were some issues with that for bare-metal because neither the ProviderID nor IP addresses were set for masters, which prevented the nodelink-controller from linking Machine and Node. That in turn means the cluster-machine-approver could not approve the CSR. There is apparently a workaround for this now.

So, what's the current status?
- Does this affect other platforms?
- Can cluster-machine-approver *not* depend on machine-api?
- If this flow is changing, is this actually expected for the 4.2 release?
This issue tracks the problem for bare metal and has a good summary of where we are https://github.com/openshift-metal3/dev-scripts/issues/260
From the QE side, what we are seeing is: after installation completes, wait a few hours and new pending CSRs appear, which the user has to approve again and again. Worse, if the user does not notice those pending CSRs, or forgets to approve them for a *long* time (about 24 hours), the whole cluster becomes NotReady. Even if the user then manually approves the CSRs to repair it, it is too late and has no effect; the cluster never recovers. This issue happens on all UPI installs. There is no cluster-machine-approver in a UPI install, and I do not think the machine-api is supported in UPI installs.
Ah, okay, so this is a problem on any UPI cluster because we can't rely on the machine-api. I'm trying to gather information so we can document some of this. Here's my read of the way things are configured currently:

- During the initial bootstrap of the cluster, we run an `approve-csr.sh` script that loops and approves all pending CSRs.

- After the cluster is bootstrapped, it looks like both the cluster-machine-approver AND the kube-controller-manager are configured to approve CSRs. The kube-controller-manager will only approve client certificates, and it looks to me like the RBAC configuration would only allow it to approve renewals, not new nodes. Does this mean approval of client renewals is a race between kube-controller-manager and cluster-machine-approver? Is the only reason client renewals work on UPI because kube-controller-manager handles them?

- The signer is configured with a validity of 30 days, and certificates will be renewed between 21 and 27 days due to the built-in jitter. I've seen multiple mentions of things breaking after 24 hours. I'm not sure where that comes in if the validity is set at 30 days. Anyone know?

- This isn't directly related to the issue, but cluster-machine-approver has no ClusterOperator. Is it meant to be a fully fledged SLO?

Is there anyone I can talk to who knows about this?
I have a WIP PR up for this, but on my test cluster, I get new key material for every request. So the "Has the same private key as the new CSR" check will not work.
Just to confirm, the certificate manager in client-go definitely creates a new private key on each renewal: https://github.com/kubernetes/client-go/blob/kubernetes-1.14.0/util/certificate/certificate_manager.go#L540-L545 Do we think the suggested algorithm is still valid without the check for the same key?
*** Bug 1738568 has been marked as a duplicate of this bug. ***
*** Bug 1743719 has been marked as a duplicate of this bug. ***
*** Bug 1743908 has been marked as a duplicate of this bug. ***
Let’s dig in on that, I don’t think we have to rotate the private key every time but if we do we might want the CSR to reference the old private key somehow. Let me refresh my memory of the rotation code.
Actually, if you verify that the requester is the node, then this should be ok. If you can pretend to be a node successfully then you can get a new serving cert. So just make sure the approver is checking that the renewer is coming from the same node.
*** Bug 1747183 has been marked as a duplicate of this bug. ***
I'm putting together some info here to help clarify and serve as a quick reference.

Approvers workflow

There are two approvers in the cluster:
1. The kube-controller-manager approver.
2. The machine approver.

There is no possible race between them for denial: to prevent conflicts with other approvers, the approvers don't explicitly deny CSRs that do not meet their criteria.

Bootstrapping
- Kubelet TLS bootstrapping is configured to request both client and serving CSRs.
- The kube-controller-manager approver has no permissions to approve client/serving CSRs for new nodes.
- So it's up to the machine approver to approve client/serving CSRs for new nodes. If they don't meet its criteria, manual approval is needed.

Rotation
- Kubelet is configured to rotate both client and serving certificates.
- The kube-controller-manager approver approves kubelet client renewal CSRs via the system-bootstrap-node-renewal ClusterRoleBinding.
- The kube-controller-manager approver does not support approval of kubelet serving renewal CSRs.
- So it's up to the machine approver to approve serving renewal CSRs. If they don't meet its criteria, manual approval is needed.

WIP PR for decoupling kubelet serving renewal CSRs from the machine API: https://github.com/openshift/cluster-machine-approver/pull/38
The PR implementing the check based on existing certificates has merged. I think there are two areas it could use special attention in further testing: - Testing on different UPI clusters and platforms. - Testing that things work correctly as the CSR signing CA is rotated.
Doug Hellmann, orthogonally to the renewal approval workflow, can you elaborate on why master machines in baremetal have no IP information in status?
(In reply to Alberto from comment #22)
> Doug Hellmann, orthogonally to the renewal approval workflow, can you
> elaborate on why master machines in baremetal have no IP information in
> status?

IP addresses are part of the information collected through inspection. They will be available in 4.3.
I've set up 3 UPI deployments with 4.2.0-0.nightly-2019-09-17-232025: 2 on AWS, 1 on GCP. After keeping the clusters running for 24+ hours, the nodes eventually end up NotReady.

AWS, running for 21 hours:

```
oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-55-236.us-east-2.compute.internal   Ready    master   21h   v1.14.6+7f575e5af
ip-10-0-55-78.us-east-2.compute.internal    Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-56-66.us-east-2.compute.internal    Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-58-2.us-east-2.compute.internal     Ready    master   21h   v1.14.6+7f575e5af
ip-10-0-65-182.us-east-2.compute.internal   Ready    worker   21h   v1.14.6+7f575e5af
ip-10-0-72-48.us-east-2.compute.internal    Ready    master   21h   v1.14.6+7f575e5af
```

Running for 24 hours:

```
oc get nodes
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-55-236.us-east-2.compute.internal   NotReady   master   24h   v1.14.6+7f575e5af
ip-10-0-55-78.us-east-2.compute.internal    NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-56-66.us-east-2.compute.internal    NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-58-2.us-east-2.compute.internal     NotReady   master   24h   v1.14.6+7f575e5af
ip-10-0-65-182.us-east-2.compute.internal   NotReady   worker   24h   v1.14.6+7f575e5af
ip-10-0-72-48.us-east-2.compute.internal    NotReady   master   24h   v1.14.6+7f575e5af
```

```
oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-29d5w   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-44dcw   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-86qns   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9nnqj   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bd4sv   65m   system:node:ip-10-0-55-236.us-east-2.compute.internal                       Pending
csr-bl5hb   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-crt8b   65m   system:node:ip-10-0-55-78.us-east-2.compute.internal                        Pending
csr-fc9qg   65m   system:node:ip-10-0-65-182.us-east-2.compute.internal                       Pending
csr-htxhd   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jc4wj   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-khtdq   65m   system:node:ip-10-0-58-2.us-east-2.compute.internal                         Pending
csr-mdgfn   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-mrqh9   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-msvzh   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ndb8t   60m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r6njl   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-v6dcf   30m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wnnq2   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-z5blb   45m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-zvrtd   65m   system:node:ip-10-0-56-66.us-east-2.compute.internal                        Pending
csr-zzsbp   65m   system:node:ip-10-0-72-48.us-east-2.compute.internal                        Pending
```

```
oc logs machine-approver-7bf6885dff-4xhml -n openshift-cluster-machine-approver
Error from server: Get https://10.0.72.48:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-7bf6885dff-4xhml/machine-approver-controller : remote error: tls: internal error
```

GCP:

```
oc get nodes
NAME                                             STATUS     ROLES    AGE   VERSION
qe-jho-5k9cj-m-0.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-m-1.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-m-2.c.openshift-qe.internal         Ready      master   27h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-a-dqq6v.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-b-bd5jg.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f
qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal   NotReady   worker   26h   v1.14.6+a7496a10f
```

```
oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2b8gc   47m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-4vh2v   10m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-698jl   2s      system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-7276f   12m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-72r2h   78m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9k8p5   87s     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-9qhg8   13m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-9zm2j   3m12s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-bpdqh   125m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bxdgh   7m7s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-c4trp   13m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-cggbt   3h7m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cnfrq   89s     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ct9g9   16m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-cxhn7   13m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-dwggz   111s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-f8vh6   107s    system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-f92pf   8m31s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-fpbcc   5m8s    system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-fvr2r   32m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-gblwm   16s     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-hclmx   8m7s    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hfrlf   12m     system:node:qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal                  Approved,Issued
csr-jbflz   171m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-jstnl   4m21s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-khnzb   4m28s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-kwgkg   13m     system:node:qe-jho-5k9cj-w-b-bd5jg.c.openshift-qe.internal                  Approved,Issued
csr-m4vpj   10m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-mhvxg   5m42s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-n5fcg   10m     system:node:qe-jho-5k9cj-w-c-kc7jv.c.openshift-qe.internal                  Approved,Issued
csr-pf4zq   6m10s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-phbpw   63m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r8shh   94m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-r9cdd   3h22m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rff4d   9m34s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-rfprk   10m     system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-rzv9j   8m33s   system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-s8kpx   7m21s   system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-sd9v2   4m33s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-sdjrv   109m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-sghk2   12m     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-svhjx   3m13s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tb5qr   12m     system:node:qe-jho-5k9cj-m-2.c.openshift-qe.internal                        Approved,Issued
csr-v8vrd   3m      system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-vthj9   140m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-w9g4h   11m     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
csr-wkrxh   156m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wrtw2   7m45s   system:node:qe-jho-5k9cj-m-0.c.openshift-qe.internal                        Approved,Issued
csr-z6v9b   83s     system:node:qe-jho-5k9cj-m-1.c.openshift-qe.internal                        Approved,Issued
```
This seems like expected behaviour. All the master nodes get approved and the control plane is up, which is what was reported in this bug. Kubelet client renewal CSRs depend on your kube-controller-manager approver setup, e.g. via the system-bootstrap-node-renewal ClusterRoleBinding. Can we also try to test that things work correctly as the CSR signing CA is rotated?
Tested on BM UPI with 4.2.0-0.nightly-2019-09-18-211009. The machine-approver auto-approves kubelet serving renewal CSRs as expected.

After bootstrap, there is a CA valid for 24 hours:

```
oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject=CN = kube-csr-signer_@1568878080
notBefore=Sep 19 07:27:59 2019 GMT
notAfter=Sep 20 07:12:55 2019 GMT
```

It is then rotated automatically:

```
oc -n openshift-config-managed get configmap csr-controller-ca -o json | jq -r '.data["ca-bundle.crt"]' | openssl x509 -noout -subject -dates
subject=CN = kube-csr-signer_@1568946479
notBefore=Sep 20 02:27:58 2019 GMT
notAfter=Oct 20 02:27:59 2019 GMT
```

Machine-approver logs:

```
I0920 06:20:59.713350  1 main.go:139] CSR csr-zpqpc added
I0920 06:20:59.713376  1 main.go:142] CSR csr-zpqpc is already approved
I0920 06:21:41.110223  1 main.go:139] CSR csr-kd26q added
I0920 06:21:41.178172  1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-compute-2 (10.0.151.246:10250)
I0920 06:21:41.181390  1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-compute-2
I0920 06:21:41.195188  1 main.go:189] CSR csr-kd26q approved
I0920 06:22:28.350969  1 main.go:139] CSR csr-2pzsv added
I0920 06:22:28.362088  1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-control-plane-1 (10.0.150.172:10250)
I0920 06:22:28.364434  1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-control-plane-1
I0920 06:22:28.393366  1 main.go:189] CSR csr-2pzsv approved
I0920 06:23:30.570854  1 main.go:139] CSR csr-w22p7 added
I0920 06:23:30.584008  1 csr_check.go:403] retrieving serving cert from qe-yapei-uos2-6dbch-control-plane-0 (10.0.149.59:10250)
I0920 06:23:30.586600  1 csr_check.go:158] authorizing serving cert renewal for qe-yapei-uos2-6dbch-control-plane-0
I0920 06:23:30.597992  1 main.go:189] CSR csr-w22p7 approved
I0920 06:28:32.864448  1 main.go:139] CSR csr-8ntv5 added
```

```
oc get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
qe-yapei-uos2-6dbch-compute-0         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-compute-1         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-compute-2         Ready    worker   25h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-0   Ready    master   26h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-1   Ready    master   26h   v1.14.6+147115512
qe-yapei-uos2-6dbch-control-plane-2   Ready    master   26h   v1.14.6+147115512
```

The CSR auto-approval works well. I think this bug can be verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922