Bug 2106666
| Field | Value |
| --- | --- |
| Summary | Windows workers in Not Ready state when deploying CCM cluster on vSphere due to missing Internal IP |
| Product | OpenShift Container Platform |
| Component | Cloud Compute |
| Cloud Compute sub component | Cloud Controller Manager |
| Version | 4.10 |
| Target Release | 4.10.z |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Keywords | TestBlocker |
| Reporter | Jose Luis Franco <jfrancoa> |
| Assignee | Joel Speed <jspeed> |
| QA Contact | sunzhaohua <zhsun> |
| CC | jspeed, wsun |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Bug Depends On | 2087042 |
| Last Closed | 2022-11-22 07:19:44 UTC |
Description
Jose Luis Franco
2022-07-13 09:22:24 UTC
Hey, it seems like it works. I created a job to test this bug in our CI: https://github.com/openshift/release/pull/31108 and it passed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/31108/rehearse-31108-pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-ccm-install/1556645589649723392

Looking at the original issues, I can see from the nodelink logs that the Machine was first observed at 05:45 and the Node was linked by providerID at 06:19, which is within the bounds of our expectations, time-wise, for a vSphere install. The CCM found the instance and initialised it by 06:05, so node linking was a bit slow, which may imply that Machine API wasn't reporting the correct status for the Machine until 06:19. The CCM claims to have set the IP addresses, but it also seems to set them several times; I'm wondering if something may have cleared them by accident, or if this is an eventual-consistency error. The WMCO logs are complaining that the Machine is invalid and that it has no internal IP set, which also makes me want to look at Machine API. According to the Machine controller logs, the VM went into Provisioned at 05:45, which can only happen once MAPI has configured the IP addresses on the Machine. Looking further, it seems that, initially at least, it is only setting the internal DNS address of the VM. I can see at 05:49 that the Machine API reports it is adding the correct internal IP addresses. So where does that leave us? As far as I can tell, one of the two components is lying about having set the IP addresses.

@jfrancoa Have you seen this recently? I want to make sure this is still happening on 4.12/4.13 builds. It's possible that, given Mike's CI job is passing, this has been fixed in newer releases.

Ok, I've managed to reproduce and understand the issue today. What is happening is that the machine is created and comes online. When it comes online, the CCM initializes it and sets the IP addresses on the host. You can see in the attached WMCO logs that it observes the IP address on the Node:

{"level":"info","ts":1657691294.7190697,"logger":"controller.windowsmachine","msg":"processing","windowsmachine":"openshift-machine-api/winworker-lms4l","address":"172.31.249.231"}
{"level":"info","ts":1657691325.7952998,"logger":"wc 172.31.249.231","msg":"configuring"}

However, shortly after this, the CCM removes the IP addresses from the Node, which then leads to the WMCO reporting issues again. Eventually the kubelet stops reporting status and the node goes into an unready state. I found https://github.com/kubernetes/cloud-provider-vsphere/issues/576 upstream, which explains the issue; it was fixed in https://github.com/kubernetes/cloud-provider-vsphere/pull/585. We don't have this fix in our 4.10 branch, but we do have it in 4.11 and 4.12, which explains why CI on the newer versions is passing. I will look to backport the fix to 4.10 to resolve this bug.

Thanks for the clear explanation; this explains why it was only being observed in 4.10. I was trying to rerun the job on AWS with CCM enabled to take vSphere out of the equation (I have also observed some issues with the InternalIP missing from the Machine but present on the Node, although I think that has something to do with the golden image being used), but seeing that you have already narrowed down the issue, I will stop the deployment.

Verified: tested three times on clusterversion 4.10.0-0.ci-2022-11-14-173021, all successful.
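For illustration only (not part of the bug report): a minimal client-go sketch of how the InternalIP flapping described above could be observed, by polling the affected Node and logging whenever the InternalIP in its status.addresses changes. The node name is taken from the WMCO log lines above; the kubeconfig path and polling interval are assumptions.

```go
// Hypothetical diagnostic, not part of the fix: log whenever a node's
// InternalIP appears or disappears, to observe the CCM removing it.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// internalIP returns the node's InternalIP, or "" if none is set
// (the state in which the WMCO reports the Machine as invalid).
func internalIP(node *corev1.Node) string {
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			return addr.Address
		}
	}
	return ""
}

func main() {
	// Assumes a reachable cluster via the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	last := ""
	for {
		// "winworker-lms4l" is the node name from the WMCO logs above.
		node, err := client.CoreV1().Nodes().Get(context.TODO(), "winworker-lms4l", metav1.GetOptions{})
		if err != nil {
			log.Fatal(err)
		}
		if ip := internalIP(node); ip != last {
			fmt.Printf("%s InternalIP changed: %q -> %q\n", time.Now().Format(time.RFC3339), last, ip)
			last = ip
		}
		time.Sleep(10 * time.Second)
	}
}
```

On an affected 4.10 cluster, a transition from the address to "" and back would match the behaviour fixed upstream in cloud-provider-vsphere PR 585.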
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.42 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8496