Bug 1942120

Summary: windows node fails to approve CSRs when several nodes are being added to vSphere cluster
Product: OpenShift Container Platform
Reporter: milti leonard <mleonard>
Component: Node
Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: aos-bugs, aravindh, mgugino
Version: 4.7
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Windows
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-22 14:38:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description milti leonard 2021-03-23 17:16:40 UTC
Description of problem:
When adding multiple Windows nodes to an OCP cluster, the CSRs for nodes added after the first one are not approved automatically.

Version-Release number of selected component (if applicable):
windows-machine-config-operator.v2.0.0

How reproducible:
unsure

Steps to Reproduce:
1. Create a vSphere cluster
2. Add multiple Windows nodes to it via WMCO

Actual results:
CSRs for the first node are approved automatically; CSRs for subsequent nodes are not.

Expected results:
All nodes are added with auto-approved CSRs.

Additional info:

Comment 1 milti leonard 2021-03-23 17:19:13 UTC
Attached case #02893295; the must-gather is already attached. Requested the WMCO inspection file bundle.

Comment 2 milti leonard 2021-03-23 18:10:01 UTC
The customer has attached the WMCO namespace inspection, which should include logs and configuration YAMLs. It is available now in supportshell.

Comment 3 Aravindh Puthiyaparambil 2021-03-23 20:02:36 UTC
Looking at the WMCO logs, I see that it prepared the Windows nodes successfully. The CSR approval is not done by WMCO but by cluster-machine-approver and I see the following in its logs: 

2021-03-22T15:00:42.148817811Z I0322 15:00:42.148004       1 main.go:147] CSR csr-7wvgl added
2021-03-22T15:00:42.189493822Z I0322 15:00:42.189443       1 csr_check.go:419] retrieving serving cert from esp01-win-8sdb4 (10.10.30.73:10250)
2021-03-22T15:00:42.191126441Z I0322 15:00:42.191089       1 csr_check.go:163] Found existing serving cert for esp01-win-8sdb4
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191242       1 csr_check.go:172] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191259       1 csr_check.go:173] Current SAN Values: [esp01-win-8sdb4 10.10.30.73], CSR SAN Values: [esp01-win-8sdb4 10.10.30.73 10.9.2.106]
2021-03-22T15:00:42.191275980Z I0322 15:00:42.191267       1 csr_check.go:183] Falling back to machine-api authorization for esp01-win-8sdb4
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282       1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191288       1 main.go:218] Error syncing csr csr-7wvgl: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73

I am going to assign this to the cloud team to figure out what the message "CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses" implies. I am not sure why this is not an issue for the first Windows Machine in the MachineSet and only affects the subsequent ones.
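
The rejection in the log above can be illustrated with a minimal sketch. This is not the cluster-machine-approver code; it only assumes, based on the log message, that every IP address in the CSR's SAN list must also appear in the addresses recorded for the Machine. The function name `unauthorized_san_ips` is hypothetical, and the input values are copied from the log lines.

```python
import ipaddress


def unauthorized_san_ips(csr_sans, machine_addresses):
    """Return the CSR SAN entries that are IPs missing from the Machine addresses.

    Hypothetical helper for illustration only; non-IP entries such as
    hostnames are ignored on both sides.
    """
    known = set()
    for addr in machine_addresses:
        try:
            known.add(ipaddress.ip_address(addr))
        except ValueError:
            pass  # not an IP (for example a DNS name), skip it

    missing = []
    for san in csr_sans:
        try:
            ip = ipaddress.ip_address(san)
        except ValueError:
            continue  # DNS SAN, not part of the IP check
        if ip not in known:
            missing.append(san)
    return missing


# Values taken from the log excerpt above.
csr_sans = ["esp01-win-8sdb4", "10.10.30.73", "10.9.2.106"]
approver_view = ["fe80::cdcd:7083:fc20:b944", "10.10.30.73"]

print(unauthorized_san_ips(csr_sans, approver_view))  # ['10.9.2.106']
```

Under that assumption, a single SAN IP that the approver does not see on the Machine is enough to leave the CSR unapproved.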

Comment 4 Michael Gugino 2021-03-23 21:27:40 UTC
This looks like another instance of stale machine object cache in the cluster-machine-approver code.

cluster-machine-approver log message:

2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282       1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73



Yet, we have the following on the machine in question:

addresses:
- address: fe80::cdcd:7083:fc20:b944
  type: InternalIP
- address: 10.10.30.73
  type: InternalIP
- address: fe80::4085:5291:d793:4dff
  type: InternalIP
- address: 10.9.2.106
  type: InternalIP
- address: esp01-win-8sdb4
  type: InternalDNS


I've seen one other instance of this in CI: https://bugzilla.redhat.com/show_bug.cgi?id=1940899

In 4.7, we're using a typed client to list machines on each CSR reconcile. There's no reason for the list of machines to be stale, so it appears the API is sending us stale data.

Comment 5 Michael Gugino 2021-03-24 00:54:25 UTC
I reviewed the logs from the must gather.  The IP 10.9.2.106 was added to the host by something other than vSphere, thus the machine-api doesn't know about it.  Someone manually approved the CSR, and then the (kube) cloud provider added that IP to the interfaces on vSphere.

This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774

Comment 7 milti leonard 2021-03-25 14:37:23 UTC
@mgugino, I shared the customer comment above. Would the addition of a second NIC on the node contribute to the unapproved-CSR issue being seen?

Comment 8 Michael Gugino 2021-03-25 14:40:21 UTC
(In reply to milti leonard from comment #7)
> @mgugino, I shared the customer comment above. Would the addition of a second NIC
> on the node contribute to the unapproved-CSR issue being seen?

Yes. There are a handful of known issues around the CSR process right now; unfortunately, this is one of them.

Comment 11 milti leonard 2021-03-29 15:00:49 UTC
@mgugino, the customer confirms that the node with the additional NIC is the one whose CSR was not auto-approved. The customer also offers to recreate the issue for log capture. Please let me know if you would want that, and which deliverables (beyond the must-gather and WMCO inspection) you want attached.

Comment 12 Michael Gugino 2021-03-29 17:59:33 UTC
(In reply to milti leonard from comment #11)
> @mgugino, the customer confirms that the node with the additional NIC is the
> one whose CSR was not auto-approved. The customer also offers to recreate the
> issue for log capture. Please let me know if you would want that, and which
> deliverables (beyond the must-gather and WMCO inspection) you want attached.

This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774

We don't require any further information at this time.

For now, the workaround is to manually approve the serving certificate CSR for any instance that has multiple network interfaces. This should only be required for the first approval of a particular machine; subsequent renewals should be handled automatically by the CSR approver logic.
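
A rough sketch of how the manual approval could be scripted follows. It assumes the input is the JSON produced by `oc get csr -o json`, that kubelet serving CSRs carry the signer name `kubernetes.io/kubelet-serving` and the username `system:node:<node name>`, and that a pending CSR has no entries under `status.conditions`. The helper name `pending_serving_csrs` and the sample data (including the already-approved `csr-example`) are hypothetical; the sketch only prints `oc adm certificate approve` commands and does not run them.

```python
import json

SERVING_SIGNER = "kubernetes.io/kubelet-serving"


def pending_serving_csrs(csr_list, node_name):
    """Return names of pending kubelet serving CSRs requested by one node.

    Hypothetical helper: csr_list is the parsed output of
    `oc get csr -o json`.
    """
    names = []
    for item in csr_list.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        if spec.get("signerName") != SERVING_SIGNER:
            continue  # not a serving certificate request
        if spec.get("username") != f"system:node:{node_name}":
            continue  # requested by a different node
        if status.get("conditions"):
            continue  # already approved or denied
        names.append(item["metadata"]["name"])
    return names


# Illustrative sample standing in for `oc get csr -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "csr-7wvgl"},
     "spec": {"signerName": "kubernetes.io/kubelet-serving",
              "username": "system:node:esp01-win-8sdb4"},
     "status": {}},
    {"metadata": {"name": "csr-example"},
     "spec": {"signerName": "kubernetes.io/kubelet-serving",
              "username": "system:node:esp01-win-8sdb4"},
     "status": {"conditions": [{"type": "Approved"}]}}
  ]
}
""")

for name in pending_serving_csrs(sample, "esp01-win-8sdb4"):
    print(f"oc adm certificate approve {name}")
```

It is worth inspecting each listed CSR (for example with `oc describe csr <name>`) before approving it, since manual approval bypasses the approver's address check.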

Comment 14 Michael Gugino 2021-04-22 14:38:39 UTC

*** This bug has been marked as a duplicate of bug 1860774 ***