Bug 1952282 - CSR approver races with nodelink controller and does not requeue
Summary: CSR approver races with nodelink controller and does not requeue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.8.0
Assignee: Michael Gugino
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1949288
TreeView+ depends on / blocked
 
Reported: 2021-04-21 23:55 UTC by Michael Gugino
Modified: 2023-01-20 07:30 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:02:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-machine-approver pull 118 0 None closed Bug 1952282: Ensure serving certs requeued 2021-05-10 12:23:02 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:02:54 UTC

Description Michael Gugino 2021-04-21 23:55:54 UTC
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/198/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws/1382831777525010432/artifacts/e2e-aws/gather-extra/artifacts/

When a client CSR is approved, the kubelet joins the cluster and a node is registered.  When this happens, the node link controller updates the machine object with the node name.

Sometimes, the CSR approver attempts to approve the kubelet's serving certificate prior to the node link controller completing the machine-to-node link.  This causes the CSR approver to ignore the unapproved CSR.

We should make sure we requeue this CSR instead of ignoring forever.

Comment 2 Milind Yadav 2021-05-07 10:25:02 UTC
Hi @Michael , Can you help with the steps to try to reproduce condition , may be using logs to confirm etc .?

Comment 3 Michael Gugino 2021-05-10 12:27:55 UTC
Hi Milind,

This will be a tricky one as it doesn't always happen.  Look at some recent CI runs, and search the cluster-machine-approver logs like this one (the original test in this BZ): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/198/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws/1382831777525010432/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-6644869bdb-d62gk_machine-approver-controller.log

Here, we can see the following log lines:

E0415 23:29:07.305130       1 csr_check.go:196] csr-rxkvd: Serving Cert: No target machine for node "ip-10-0-242-145.us-east-2.compute.internal"
I0415 23:29:07.305136       1 controller.go:172] csr-rxkvd: CSR not authorized

csr-rxkvd (randomly generated name) has the message "No target machine for node".  Subsequently, we see "CSR not authorized", and then we never see csr-rxkvd show up again in the logs.

If this is behaving correctly, we might see the "No target machine for node" (this is the non-deterministic bit due to race) for a given CSR in a newer run.  We should see "CSR not authorized" and then we should see the same csr-xxxx again, eventually it should be approved.

Comment 4 Milind Yadav 2021-05-12 12:56:12 UTC
Thanks Michael , I could see them in Azure serial run below are the details moved to VERIFIED based on them 

Validated on - Validated on - 4.8.0-0.nightly-2021-05-12-072240


for the certificate - csr-z7z44 from the logs :https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1392380279766650880/artifacts/e2e-azure-serial/pods/openshift-cluster-machine-approver_machine-approver-6458fd8964-wnv7q_machine-approver-controller.log
.
.
I0512 07:57:59.359043       1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:57:59.361271       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:57:59.361300       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
E0512 07:57:59.361310       1 csr_check.go:196] csr-z7z44: Serving Cert: No target machine for node "ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw"
I0512 07:57:59.361325       1 controller.go:172] csr-z7z44: CSR not authorized
..
.
.
.
I0512 07:59:21.282281       1 controller.go:114] Reconciling CSR: csr-z7z44
I0512 07:59:21.282797       1 csr_check.go:150] csr-z7z44: CSR does not appear to be client csr
I0512 07:59:21.282851       1 csr_check.go:442] retrieving serving cert from ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw (10.0.32.4:10250)
I0512 07:59:21.285221       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I0512 07:59:21.285244       1 csr_check.go:191] Falling back to machine-api authorization for ci-op-khktrhkg-ce9aa-bvlxm-worker-centralus1-g59bw
I0512 07:59:21.297589       1 controller.go:179] CSR csr-z7z44 approved.
.
.
.

Comment 7 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.