CI artifacts: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/198/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws/1382831777525010432/artifacts/e2e-aws/gather-extra/artifacts/

When a client CSR is approved, the kubelet joins the cluster and a node is registered. When this happens, the node link controller updates the machine object with the node name. Sometimes the CSR approver attempts to approve the kubelet's serving certificate before the node link controller has completed the machine-to-node link. In that case the CSR approver ignores the CSR and leaves it unapproved. We should requeue the CSR instead of ignoring it forever.
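To illustrate the requested behaviour, here is a minimal, dependency-free Go sketch of "requeue instead of ignore". It is not the actual cluster-machine-approver code: the names `errNoTargetMachine`, `machineLinks`, `authorizeServingCSR`, `reconcileCSR`, the `result` type (modelled loosely on a controller reconcile result) and the 10-second retry delay are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoTargetMachine is a hypothetical sentinel error for the race described
// above: the node exists, but no machine has been linked to it yet.
var errNoTargetMachine = errors.New("no target machine for node")

// result is a stand-in for a controller reconcile result (assumption:
// a non-zero RequeueAfter means "try this CSR again later").
type result struct {
	RequeueAfter time.Duration
}

// machineLinks maps node name -> machine name, as the node link controller
// would eventually populate it. Hypothetical simplification.
type machineLinks map[string]string

// authorizeServingCSR succeeds only once the node has a linked machine.
func authorizeServingCSR(links machineLinks, nodeName string) error {
	if _, ok := links[nodeName]; !ok {
		return fmt.Errorf("%w %q", errNoTargetMachine, nodeName)
	}
	return nil
}

// reconcileCSR reports whether the CSR was approved. When the machine link is
// missing it asks for a retry instead of dropping the CSR for good.
func reconcileCSR(links machineLinks, nodeName string) (result, bool) {
	err := authorizeServingCSR(links, nodeName)
	if err == nil {
		return result{}, true // authorized: approve, no retry needed
	}
	if errors.Is(err, errNoTargetMachine) {
		// Likely a transient race with the node link controller: requeue.
		return result{RequeueAfter: 10 * time.Second}, false
	}
	return result{}, false // any other failure: not authorized, no retry
}

func main() {
	links := machineLinks{}
	node := "ip-10-0-242-145.us-east-2.compute.internal"

	// First pass: the link is missing, so the CSR is requeued.
	res, approved := reconcileCSR(links, node)
	fmt.Println("approved:", approved, "requeue:", res.RequeueAfter > 0)

	// The node link controller catches up (machine name is made up).
	links[node] = "example-worker-machine"

	// Second pass: the same CSR is now approved.
	res, approved = reconcileCSR(links, node)
	fmt.Println("approved:", approved, "requeue:", res.RequeueAfter > 0)
}
```

The point of the sketch is only that the "no target machine" case is treated as retryable; how the real controller schedules the retry is not shown in this report.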
Hi @Michael, can you help with steps to reproduce this condition, perhaps using the logs to confirm it?
Hi Milind,

This will be a tricky one as it doesn't always happen. Look at some recent CI runs and search the cluster-machine-approver logs, like this one (the original test in this BZ):

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/198/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws/1382831777525010432/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-6644869bdb-d62gk_machine-approver-controller.log

Here, we can see the following log lines:

E0415 23:29:07.305130 1 csr_check.go:196] csr-rxkvd: Serving Cert: No target machine for node "ip-10-0-242-145.us-east-2.compute.internal"
I0415 23:29:07.305136 1 controller.go:172] csr-rxkvd: CSR not authorized

csr-rxkvd (a randomly generated name) has the message "No target machine for node". Subsequently, we see "CSR not authorized", and then csr-rxkvd never shows up in the logs again.

If this is behaving correctly, a newer run might still show "No target machine for node" for a given CSR (this is the non-deterministic bit, due to the race). We should then see "CSR not authorized", followed by the same csr-xxxx being reconciled again, and eventually it should be approved.
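The log check described above could be automated roughly as follows. This is a hedged Go sketch: `stuckCSRs` and the two regular expressions are hypothetical helpers written for this comment, and the patterns assume the "CSR not authorized" and "CSR csr-xxxx approved" line formats seen in the logs quoted in this BZ. The sample input mixes lines from two different runs purely for demonstration.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	// Matches e.g. "controller.go:172] csr-rxkvd: CSR not authorized".
	notAuthorizedRE = regexp.MustCompile(`\] (csr-[a-z0-9]+): CSR not authorized`)
	// Matches e.g. "controller.go:179] CSR csr-z7z44 approved.".
	approvedRE = regexp.MustCompile(`\] CSR (csr-[a-z0-9]+) approved`)
)

// stuckCSRs returns, sorted, the CSR names that were reported as
// "not authorized" and never approved later in the same log.
func stuckCSRs(log string) []string {
	pending := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := notAuthorizedRE.FindStringSubmatch(line); m != nil {
			pending[m[1]] = true
		} else if m := approvedRE.FindStringSubmatch(line); m != nil {
			delete(pending, m[1])
		}
	}
	stuck := make([]string, 0, len(pending))
	for name := range pending {
		stuck = append(stuck, name)
	}
	sort.Strings(stuck)
	return stuck
}

func main() {
	// Sample lines copied from the logs quoted in this BZ (two different runs).
	log := `E0415 23:29:07.305130 1 csr_check.go:196] csr-rxkvd: Serving Cert: No target machine for node "ip-10-0-242-145.us-east-2.compute.internal"
I0415 23:29:07.305136 1 controller.go:172] csr-rxkvd: CSR not authorized
I0512 07:57:59.361325 1 controller.go:172] csr-z7z44: CSR not authorized
I0512 07:59:21.297589 1 controller.go:179] CSR csr-z7z44 approved.
`
	fmt.Println(stuckCSRs(log))
}
```

A CSR listed by this sketch was rejected and never approved within the captured log; a CSR that is rejected and later approved is the expected pattern once the requeue works.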
Thanks Michael, I could see them in an Azure serial run. Details are below; moved to VERIFIED based on them.

Validated on: 4.8.0-0.nightly-2021-05-12-072240

For the certificate csr-z7z44, from the logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log

...
I0512 07:57:59.359043 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310 1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325 1 controller.go:172] csr-z7z44: CSR not authorized
...
I0512 07:59:21.282281 1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797 1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851 1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244 1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589 1 controller.go:179] CSR csr-z7z44 approved
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438