Bug 1942120 - windows node fails to approve CSRs when several nodes are being added to vSphere cluster
Summary: windows node fails to approve CSRs when several nodes are being added to vSph...
Keywords:
Status: CLOSED DUPLICATE of bug 1860774
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Windows
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-23 17:16 UTC by milti leonard
Modified: 2024-06-14 00:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-22 14:38:39 UTC
Target Upstream Version:
Embargoed:


Description milti leonard 2021-03-23 17:16:40 UTC
Description of problem:
When adding multiple Windows nodes to an OCP cluster, the CSRs for nodes added after the first one are not approved automatically.

Version-Release number of selected component (if applicable):
windows-machine-config-operator.v2.0.0

How reproducible:
unsure

Steps to Reproduce:
1. create vSphere cluster
2. add multiple WMCO nodes to it

Actual results:
The CSRs for the first node are approved automatically; those for successive nodes are not.

Expected results:
All nodes get added with auto-approved CSRs.

Additional info:

Comment 1 milti leonard 2021-03-23 17:19:13 UTC
attached case#02893295, must-gather already attached; requested the WMCO inspection file bundle

Comment 2 milti leonard 2021-03-23 18:10:01 UTC
The customer has attached the WMCO namespace inspection, which should include logs and configuration YAMLs; it is available now in supportshell.

Comment 3 Aravindh Puthiyaparambil 2021-03-23 20:02:36 UTC
Looking at the WMCO logs, I see that it prepared the Windows nodes successfully. The CSR approval is not done by WMCO but by cluster-machine-approver and I see the following in its logs: 

2021-03-22T15:00:42.148817811Z I0322 15:00:42.148004       1 main.go:147] CSR csr-7wvgl added
2021-03-22T15:00:42.189493822Z I0322 15:00:42.189443       1 csr_check.go:419] retrieving serving cert from esp01-win-8sdb4 (10.10.30.73:10250)
2021-03-22T15:00:42.191126441Z I0322 15:00:42.191089       1 csr_check.go:163] Found existing serving cert for esp01-win-8sdb4
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191242       1 csr_check.go:172] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191259       1 csr_check.go:173] Current SAN Values: [esp01-win-8sdb4 10.10.30.73], CSR SAN Values: [esp01-win-8sdb4 10.10.30.73 10.9.2.106]
2021-03-22T15:00:42.191275980Z I0322 15:00:42.191267       1 csr_check.go:183] Falling back to machine-api authorization for esp01-win-8sdb4
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282       1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191288       1 main.go:218] Error syncing csr csr-7wvgl: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73

I am going to assign this to the cloud team to figure out what the message "CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses" implies. I am not sure why this is not an issue for the first Windows Machine in the MachineSet and only affects the subsequent ones.
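
For reference, the rejection in the log above reads as a membership check: every IP address in the CSR's SAN list has to appear among the addresses recorded for the Machine. The Python sketch below reproduces that comparison with the values from the log. It is only an illustration under that assumption, not the cluster-machine-approver source, and the function name `unauthorized_ips` is hypothetical.

```python
import ipaddress


def unauthorized_ips(csr_sans, machine_addresses):
    """Return the CSR IP SANs that are not among the Machine's addresses.

    Hypothetical helper: non-IP entries (hostnames) are skipped, since the
    log message in question concerns IP addresses only.
    """
    known = set()
    for addr in machine_addresses:
        try:
            known.add(ipaddress.ip_address(addr))
        except ValueError:
            continue  # not an IP (e.g. an InternalDNS hostname)

    missing = []
    for san in csr_sans:
        try:
            ip = ipaddress.ip_address(san)
        except ValueError:
            continue  # DNS SAN, not checked here
        if ip not in known:
            missing.append(san)
    return missing


# Values copied from the cluster-machine-approver log lines above.
csr_sans = ["esp01-win-8sdb4", "10.10.30.73", "10.9.2.106"]
machine_addresses = ["fe80::cdcd:7083:fc20:b944", "10.10.30.73"]

for ip in unauthorized_ips(csr_sans, machine_addresses):
    print(f"IP address '{ip}' not in machine addresses")
```

With these inputs the only address flagged is the one from the second network, which matches the "not authorized" line in the log.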

Comment 4 Michael Gugino 2021-03-23 21:27:40 UTC
This looks like another instance of stale machine object cache in the cluster-machine-approver code.

cluster-machine-approver log message:

2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282       1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73



Yet, we have the following on the machine in question:

addresses:
- address: fe80::cdcd:7083:fc20:b944
  type: InternalIP
- address: 10.10.30.73
  type: InternalIP
- address: fe80::4085:5291:d793:4dff
  type: InternalIP
- address: 10.9.2.106
  type: InternalIP
- address: esp01-win-8sdb4
  type: InternalDNS


I've seen one other instance of this in CI: https://bugzilla.redhat.com/show_bug.cgi?id=1940899

In 4.7, we're using a typed client to list machines on each CSR reconcile. There's no reason for the list of machines to be stale, so it appears the API is sending us stale data.

Comment 5 Michael Gugino 2021-03-24 00:54:25 UTC
I reviewed the logs from the must gather.  The IP 10.9.2.106 was added to the host by something other than vSphere, thus the machine-api doesn't know about it.  Someone manually approved the CSR, and then the (kube) cloud provider added that IP to the interfaces on vSphere.

This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774

Comment 7 milti leonard 2021-03-25 14:37:23 UTC
@mgugino, I shared the customer's comment above. Would the addition of a second NIC on the node contribute to the unapproved CSRs being seen?

Comment 8 Michael Gugino 2021-03-25 14:40:21 UTC
(In reply to milti leonard from comment #7)
> @mgugino, I shared the customer's comment above. Would the addition of a
> second NIC on the node contribute to the unapproved CSRs being seen?

Yes. There are a handful of known issues around the CSR process right now; unfortunately, this is one of them.

Comment 11 milti leonard 2021-03-29 15:00:49 UTC
@mgugino, the customer confirms that the node with the additional NIC is the one whose CSR did not auto-approve. The customer also offers to recreate the issue for log capture. Please let me know if you would want that, and which deliverables (beyond the must-gather/WMCO inspection) you want attached.

Comment 12 Michael Gugino 2021-03-29 17:59:33 UTC
(In reply to milti leonard from comment #11)
> @mgugino, the customer confirms that the node with the additional NIC is the
> one whose CSR did not auto-approve. The customer also offers to recreate the
> issue for log capture. Please let me know if you would want that, and which
> deliverables (beyond the must-gather/WMCO inspection) you want attached.

This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774

We don't require any further information at this time.

For now, the workaround is to manually approve the serving certificate CSR for any instance that has multiple network interfaces. This should only be required for the first approval of a particular machine; subsequent renewals should be handled automatically by the CSR approver logic.
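
As a rough illustration of the manual approval workaround, the Python sketch below picks out pending CSRs from the JSON that `oc get csr -o json` returns and builds the matching `oc adm certificate approve` command for each. It assumes a pending CSR is one with no `status.conditions`. The sample JSON is made up for the example (only `csr-7wvgl` comes from this bug; `csr-example-approved` is hypothetical), and the helper names are hypothetical too. Review each CSR before approving it rather than approving everything that is pending.

```python
import json


def pending_csr_names(csr_list_json):
    """Return names of CSRs with no status conditions (i.e. still pending).

    Expects the JSON text produced by `oc get csr -o json`.
    """
    doc = json.loads(csr_list_json)
    names = []
    for item in doc.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        if not conditions:
            names.append(item["metadata"]["name"])
    return names


def approve_command(name):
    """Build the argument list for approving one CSR with the oc CLI."""
    return ["oc", "adm", "certificate", "approve", name]


# Made-up sample standing in for real `oc get csr -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "csr-7wvgl"}, "status": {}},
        {
            "metadata": {"name": "csr-example-approved"},
            "status": {"conditions": [{"type": "Approved"}]},
        },
    ]
})

for name in pending_csr_names(sample):
    # Print rather than execute, so each CSR can be reviewed first.
    print(" ".join(approve_command(name)))
```

The commands are printed instead of executed so that each pending CSR can be checked against the node it belongs to before it is approved.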

Comment 14 Michael Gugino 2021-04-22 14:38:39 UTC

*** This bug has been marked as a duplicate of bug 1860774 ***

