Description of problem:
When adding multiple Windows nodes to an OCP cluster, the CSRs of nodes added after the first one are not approved automatically.

Version-Release number of selected component (if applicable):
windows-machine-config-operator.v2.0.0

How reproducible:
Unsure

Steps to Reproduce:
1. Create a vSphere cluster
2. Add multiple WMCO nodes to it
3.

Actual results:
The first node has its CSRs approved automatically; successive nodes do not.

Expected results:
All nodes are added with auto-approved CSRs.

Additional info:
attached case#02893295, must-gather already attached; requested the WMCO inspection file bundle
cu has attached the WMCO namespace inspection: should include logs and configuration YAMLs; available now in supportshell
Looking at the WMCO logs, I see that it prepared the Windows nodes successfully. The CSR approval is not done by WMCO but by cluster-machine-approver, and I see the following in its logs:

2021-03-22T15:00:42.148817811Z I0322 15:00:42.148004 1 main.go:147] CSR csr-7wvgl added
2021-03-22T15:00:42.189493822Z I0322 15:00:42.189443 1 csr_check.go:419] retrieving serving cert from esp01-win-8sdb4 (10.10.30.73:10250)
2021-03-22T15:00:42.191126441Z I0322 15:00:42.191089 1 csr_check.go:163] Found existing serving cert for esp01-win-8sdb4
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191242 1 csr_check.go:172] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
2021-03-22T15:00:42.191275980Z W0322 15:00:42.191259 1 csr_check.go:173] Current SAN Values: [esp01-win-8sdb4 10.10.30.73], CSR SAN Values: [esp01-win-8sdb4 10.10.30.73 10.9.2.106]
2021-03-22T15:00:42.191275980Z I0322 15:00:42.191267 1 csr_check.go:183] Falling back to machine-api authorization for esp01-win-8sdb4
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282 1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73
2021-03-22T15:00:42.191293984Z I0322 15:00:42.191288 1 main.go:218] Error syncing csr csr-7wvgl: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73

I am going to assign this to the cloud team to figure out what the message "CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses" implies. I am not sure why this is not an issue for the first Windows Machine in the MachineSet and only affects the subsequent ones.
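For readers following the log above, here is a minimal Go sketch of the kind of check that would produce the "not in machine addresses" message. It is an illustration only, not the cluster-machine-approver source: the function name authorizeIPSANs and its string-slice parameters are hypothetical, and the sketch assumes that every IP SAN in the CSR must appear in the Machine's address list.

```go
package main

import (
	"fmt"
	"net"
)

// authorizeIPSANs is a hypothetical helper (not the approver's real code).
// It returns an error if any IP SAN requested in the CSR is missing from
// the addresses recorded on the Machine object.
func authorizeIPSANs(csrIPs, machineAddrs []string) error {
	for _, raw := range csrIPs {
		ip := net.ParseIP(raw)
		if ip == nil {
			return fmt.Errorf("invalid IP SAN %q", raw)
		}
		found := false
		for _, addr := range machineAddrs {
			// Non-IP entries, such as an InternalDNS name, parse to nil and are skipped.
			if parsed := net.ParseIP(addr); parsed != nil && parsed.Equal(ip) {
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("IP address '%s' not in machine addresses", raw)
		}
	}
	return nil
}

func main() {
	// Machine addresses as listed in the approver log line.
	machineAddrs := []string{"fe80::cdcd:7083:fc20:b944", "10.10.30.73"}
	// IP SANs requested by csr-7wvgl according to the log.
	csrIPs := []string{"10.10.30.73", "10.9.2.106"}

	if err := authorizeIPSANs(csrIPs, machineAddrs); err != nil {
		fmt.Println("CSR not authorized:", err)
		return
	}
	fmt.Println("CSR authorized")
}
```

With the values from the log, the sketch rejects the CSR because 10.9.2.106 is not among the addresses the approver saw for the machine.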
This looks like another instance of a stale machine object cache in the cluster-machine-approver code.

cluster-machine-approver log message:

2021-03-22T15:00:42.191293984Z I0322 15:00:42.191282 1 main.go:182] CSR csr-7wvgl not authorized: IP address '10.9.2.106' not in machine addresses: fe80::cdcd:7083:fc20:b944 10.10.30.73

Yet, we have the following on the machine in question:

addresses:
- address: fe80::cdcd:7083:fc20:b944
  type: InternalIP
- address: 10.10.30.73
  type: InternalIP
- address: fe80::4085:5291:d793:4dff
  type: InternalIP
- address: 10.9.2.106
  type: InternalIP
- address: esp01-win-8sdb4
  type: InternalDNS

I've seen one other instance of this in CI: https://bugzilla.redhat.com/show_bug.cgi?id=1940899

In 4.7, we're using a typed client to list machines on each CSR reconcile. There's no reason for the list of machines to be stale; it appears the API is sending us stale data.
I reviewed the logs from the must gather. The IP 10.9.2.106 was added to the host by something other than vSphere, thus the machine-api doesn't know about it. Someone manually approved the CSR, and then the (kube) cloud provider added that IP to the interfaces on vSphere. This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774
@mgugino, i shared the cu comment above; would the addition of a second NIC on the node contribute to the issue being seen w unapproved CSRs?
(In reply to milti leonard from comment #7)
> @mgugino, i shared the cu comment above, would the addition of a second nic
> on the node contribute to the issue being seen w unapproved CSRs?

Yes. There's a handful of known issues around the CSR process right now; unfortunately, this is one of them.
@mgugino, cu confirms that the CSR on the node w additional NIC is the one where the CSR did not auto-approve. cu also offers to recreate the issue for log capture. pls let me know if you would want that and the deliverables (beyond the must-gather/WMCO inspection) that you want attached.
(In reply to milti leonard from comment #11)
> @mgugino, cu confirms that the CSR on the node w additional NIC is the one
> where the CSR did not auto-approve. cu also offers to recreate the issue for
> log capture. pls let me know if you would want that and the deliverables
> (beyond the must-gather/WMCO inspection) that you want attached.

This is a known issue we're tracking here: https://bugzilla.redhat.com/show_bug.cgi?id=1860774

We don't require any further information at this time. For now, the workaround is to manually approve the serving certificate CSR for any instance that has multiple network interfaces. This should only be required for the first approval of a particular machine; subsequent renewals should be handled automatically by the CSR approver logic.
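To illustrate why only the first approval should need manual action, here is a small Go sketch of the renewal comparison suggested by the earlier approver warning ("CSR Subject Alternate Name values do not match current certificate"). The function sameSANs is hypothetical, and the sketch assumes the renewal path compares the two SAN lists as unordered sets; the real approver may compare them differently.

```go
package main

import (
	"fmt"
	"sort"
)

// sameSANs is a hypothetical helper: it reports whether the current serving
// certificate and the new CSR carry the same SAN values, ignoring order.
func sameSANs(current, requested []string) bool {
	if len(current) != len(requested) {
		return false
	}
	// Sort copies so the caller's slices are left untouched.
	a := append([]string(nil), current...)
	b := append([]string(nil), requested...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// SANs requested by the node with the second NIC, taken from the log.
	csr := []string{"esp01-win-8sdb4", "10.10.30.73", "10.9.2.106"}

	// Existing certificate without the second NIC address: renewal path is not usable.
	before := []string{"esp01-win-8sdb4", "10.10.30.73"}
	fmt.Println("renewal path usable before manual approval:", sameSANs(before, csr))

	// Assumption: the manually approved certificate carries all requested SANs.
	after := []string{"esp01-win-8sdb4", "10.10.30.73", "10.9.2.106"}
	fmt.Println("renewal path usable after manual approval:", sameSANs(after, csr))
}
```

Under these assumptions, once a certificate covering both NIC addresses exists, a later renewal CSR with the same SANs matches it and would not need to fall back to machine-api authorization.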
*** This bug has been marked as a duplicate of bug 1860774 ***