When a client CSR is approved, the kubelet joins the cluster and a node is registered. When this happens, the node link controller updates the machine object with the node name.
Sometimes, the CSR approver attempts to approve the kubelet's serving certificate prior to the node link controller completing the machine-to-node link. This causes the CSR approver to ignore the unapproved CSR.
We should make sure we requeue this CSR instead of ignoring it forever.
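To make the intended behavior concrete, here is a minimal sketch of "requeue instead of ignore". It is not the actual cluster-machine-approver code: the names `errNoTargetMachine`, `result`, `authorizeServingCert`, and `reconcileCSR`, and the 30 second delay, are all assumptions made for illustration. The machine-to-node link is modeled as a plain map.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoTargetMachine is a hypothetical sentinel for the
// "No target machine for node" condition.
var errNoTargetMachine = errors.New("no target machine for node")

// result is a hypothetical stand-in for a controller's reconcile result.
type result struct {
	Approved     bool
	RequeueAfter time.Duration // zero means "do not requeue"
}

// authorizeServingCert is a simplified, hypothetical check: the serving
// cert CSR is only authorized once some machine is linked to the node.
func authorizeServingCert(nodeName string, linkedNodes map[string]bool) error {
	if !linkedNodes[nodeName] {
		return fmt.Errorf("serving cert: %w %q", errNoTargetMachine, nodeName)
	}
	return nil
}

// reconcileCSR approves the CSR when the link exists. When the link is
// missing it asks to be called again later instead of dropping the CSR.
func reconcileCSR(nodeName string, linkedNodes map[string]bool) result {
	err := authorizeServingCert(nodeName, linkedNodes)
	switch {
	case err == nil:
		return result{Approved: true}
	case errors.Is(err, errNoTargetMachine):
		// The node link controller may simply not have run yet: retry.
		return result{RequeueAfter: 30 * time.Second} // delay is an assumption
	default:
		return result{}
	}
}

func main() {
	linked := map[string]bool{}
	node := "ip-10-0-242-145.us-east-2.compute.internal"

	// First pass: the machine has no node link yet, so we expect a requeue.
	fmt.Printf("first pass:  %+v\n", reconcileCSR(node, linked))

	// The node link controller catches up; the retry can now approve.
	linked[node] = true
	fmt.Printf("second pass: %+v\n", reconcileCSR(node, linked))
}
```

The point of the sketch is only that a missing link is treated as a transient condition that schedules a retry, rather than as a final "not authorized".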
Hi @Michael, can you help with steps to reproduce this condition, perhaps using logs to confirm it?
This will be a tricky one as it doesn't always happen. Look at some recent CI runs, and search the cluster-machine-approver logs like this one (the original test in this BZ): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/198/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws/1382831777525010432/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-6644869bdb-d62gk_machine-approver-controller.log
Here, we can see the following log lines:
E0415 23:29:07.305130 1 csr_check.go:196] csr-rxkvd: Serving Cert: No target machine for node "ip-10-0-242-145.us-east-2.compute.internal"
I0415 23:29:07.305136 1 controller.go:172] csr-rxkvd: CSR not authorized
csr-rxkvd (randomly generated name) has the message "No target machine for node". Subsequently, we see "CSR not authorized", and then we never see csr-rxkvd show up again in the logs.
If this is behaving correctly, then in a newer run we might still see "No target machine for node" for a given CSR (this is the non-deterministic bit, due to the race). We should then see "CSR not authorized", followed by the same csr-xxxx showing up again, and eventually it should be approved.
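If it helps with checking a downloaded log, the sequence described above can be traced with a small helper. This is only a sketch: `csrTrace` and `traceCSR` are hypothetical names, and the matched strings are assumptions taken from the log lines in this BZ (`<csr>: CSR not authorized`, `Reconciling CSR: <csr>`, `CSR <csr> approved.`), so they may need adjusting for other versions of the approver.

```go
package main

import (
	"fmt"
	"strings"
)

// csrTrace summarizes what the approver log says about one CSR.
type csrTrace struct {
	NotAuthorized      int  // number of "CSR not authorized" lines
	RetriedAfterDenial bool // reconciled again after a denial
	Approved           bool // an approval line was seen
}

// traceCSR scans the log text line by line for the given CSR name.
func traceCSR(logText, csr string) csrTrace {
	var t csrTrace
	for _, line := range strings.Split(logText, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.Contains(line, csr+": CSR not authorized"):
			t.NotAuthorized++
		case strings.HasSuffix(line, "Reconciling CSR: "+csr):
			// Only count reconciles that happen after a denial as retries.
			if t.NotAuthorized > 0 {
				t.RetriedAfterDenial = true
			}
		case strings.Contains(line, "CSR "+csr+" approved"):
			t.Approved = true
		}
	}
	return t
}

func main() {
	// Two lines from the failing run in this BZ.
	sample := `E0415 23:29:07.305130 1 csr_check.go:196] csr-rxkvd: Serving Cert: No target machine for node "ip-10-0-242-145.us-east-2.compute.internal"
I0415 23:29:07.305136 1 controller.go:172] csr-rxkvd: CSR not authorized`

	t := traceCSR(sample, "csr-rxkvd")
	if t.NotAuthorized > 0 && !t.Approved {
		fmt.Println("csr-rxkvd was denied and never approved in this log")
	}
	fmt.Printf("%+v\n", t)
}
```

A CSR that hit the race and was handled correctly should come back with `RetriedAfterDenial` and `Approved` both set.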
Thanks Michael. I could see them in an Azure serial run; the details are below. Moved to VERIFIED based on them.
Validated on 4.8.0-0.nightly-2021-05-12-072240
For the certificate csr-z7z44, from the logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log
I0512 07:57:59.359043 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310 1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325 1 controller.go:172] csr-z7z44: CSR not authorized
I0512 07:59:21.282281 1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797 1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589 1 controller.go:179] CSR csr-z7z44 approved.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.