Created attachment 1896643 [details]
must-gather

Description of problem:

Windows workers are in NotReady state after deploying a 4.10 OCP cluster with CCM enabled in vSphere:

[jfrancoa@localhost 120243]$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
jfrancoa-1307-410ccm-qgh2m-master-0 Ready master 3h23m v1.23.5+3afdacb 172.31.249.242 172.31.249.242 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-master-1 Ready master 3h23m v1.23.5+3afdacb 172.31.249.210 172.31.249.210 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-master-2 Ready master 3h23m v1.23.5+3afdacb 172.31.249.180 172.31.249.180 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-worker-9g7sv Ready worker 3h9m v1.23.5+3afdacb 172.31.249.40 172.31.249.40 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-worker-z6dht Ready worker 3h9m v1.23.5+3afdacb 172.31.249.9 172.31.249.9 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
winworker-lms4l NotReady,SchedulingDisabled worker 149m v1.23.5-rc.0.2060+1f952b3ba7b9ff <none> <none> Windows Server 2022 Datacenter 10.0.20348.230 docker://20.10.9
winworker-m2f8x NotReady,SchedulingDisabled worker 165m v1.23.5-rc.0.2060+1f952b3ba7b9ff <none> <none> Windows Server 2022 Datacenter 10.0.20348.230 docker://20.10.9

As can be observed, both Windows nodes are also missing their internal IPs.

However, CCM is reporting the node to have an IP:

$ oc logs cluster-cloud-controller-manager-operator-675dcd89f-4wkk9 -n openshift-cloud-controller-manager-operator
I0713 06:05:06.107737 1 instances.go:197] VM=422c381a-4ec3-bbc6-880d-c108d5ea7859 IsActive=true
I0713 06:05:10.397669 1 search.go:69] WhichVCandDCByNodeID by UUID
I0713 06:05:10.397743 1 search.go:76] WhichVCandDCByNodeID nodeID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.397671 1 node_controller.go:391] Initializing node winworker-lms4l with cloud provider
I0713 06:05:10.398084 1 search.go:74] WhichVCandDCByNodeID by Name
I0713 06:05:10.398129 1 search.go:76] WhichVCandDCByNodeID nodeID: winworker-lms4l
I0713 06:05:10.418240 1 search.go:208] Found node 422c1516-a5f5-bf05-4876-00fbf9e51542 as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter
I0713 06:05:10.418277 1 search.go:210] Hostname: winworker-lms4l, UUID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.418296 1 nodemanager.go:148] Discovered VM using normal UUID format
I0713 06:05:10.419407 1 search.go:208] Found node winworker-lms4l as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter
I0713 06:05:10.419430 1 search.go:210] Hostname: winworker-lms4l, UUID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.419443 1 nodemanager.go:127] Discovered VM using FQDN or short-hand name
I0713 06:05:10.421638 1 nodemanager.go:252] Adding Hostname: winworker-lms4l
I0713 06:05:10.421717 1 nodemanager.go:408] Adding Internal IP: 172.31.249.231
I0713 06:05:10.421735 1 nodemanager.go:413] Adding External IP: 172.31.249.231
I0713 06:05:10.421753 1 nodemanager.go:319] Found node 422c1516-a5f5-bf05-4876-00fbf9e51542 as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter

And openshift-machine-api reports the very same IP when moving the machine into the Running state:

$ oc logs machine-api-controllers-dbf4bd94d-644v4 -c machine-controller -n openshift-machine-api
I0713 06:05:13.279996 1 controller.go:504] winworker-lms4l: going into phase "Running"
I0713 06:05:13.291357 1 controller.go:175] winworker-lms4l: reconciling Machine
I0713 06:05:13.291382 1 actuator.go:109] winworker-lms4l: actuator checking if machine exists
I0713 06:05:13.301068 1 session.go:137] Find template by instance uuid: ff5b31b0-0a9a-447f-8df6-6a38034f907d
I0713 06:05:13.303685 1 reconciler.go:222] winworker-lms4l: already exists
I0713 06:05:13.303709 1 controller.go:319] winworker-lms4l: reconciling machine triggers idempotent update
I0713 06:05:13.303714 1 actuator.go:123] winworker-lms4l: actuator updating machine
I0713 06:05:13.316852 1 session.go:137] Find template by instance uuid: ff5b31b0-0a9a-447f-8df6-6a38034f907d
I0713 06:05:13.462864 1 reconciler.go:980] winworker-lms4l: Reconciling attached tags
I0713 06:05:15.870684 1 reconciler.go:353] winworker-lms4l: reconciling machine with cloud state
I0713 06:05:16.220780 1 reconciler.go:361] winworker-lms4l: reconciling providerID
I0713 06:05:16.223441 1 reconciler.go:366] winworker-lms4l: reconciling network
I0713 06:05:16.227519 1 reconciler.go:1079] Getting network status: object reference: vm-889422
I0713 06:05:16.227541 1 reconciler.go:1088] Getting network status: device: DVSwitch: 94 3e 2a 59 c6 ff 49 a8-af 6a 92 f0 00 07 e1 4f, macAddress: 00:50:56:ac:32:1b
I0713 06:05:16.227546 1 reconciler.go:1093] Getting network status: getting guest info
I0713 06:05:16.227550 1 reconciler.go:1095] Getting network status: getting guest info: network: {DynamicData:{} Network:qe-segment IpAddress:[fe80::c9ee:69bf:e078:ab3c 172.31.249.231] MacAddress:00:50:56:ac:32:1b Connected:true DeviceConfigId:4000 DnsConfig:0xc000a17e60 IpConfig:0xc00027ac00 NetBIOSConfig:<nil>}
I0713 06:05:16.230009 1 reconciler.go:471] winworker-lms4l: reconciling network: IP addresses: [{InternalIP fe80::c9ee:69bf:e078:ab3c} {InternalIP 172.31.249.231} {InternalDNS winworker-lms4l}]
I0713 06:05:16.230039 1 reconciler.go:371] winworker-lms4l: reconciling powerstate annotation
I0713 06:05:16.231997 1 reconciler.go:832] winworker-lms4l: Updating provider status

But WMCO expects the node list (https://github.com/openshift/windows-machine-config-operator/blob/92c7cd53c2b738fbe9c50802f6593fdce6979058/pkg/nodeconfig/nodeconfig.go#L365-L375) to contain the Windows node IP, so it times out waiting for it:

{"level":"error","ts":1657693161.8716238,"logger":"controller.machine","msg":"Reconciler error","reconciler group":"machine.openshift.io","reconciler kind":"Machine","name":"winworker-lms4l","namespace":"openshift-machine-api","error":"unable to configure instance 422c1516-a5f5-bf05-4876-00fbf9e51542: error getting node object: unable to find node with address 172.31.249.231: timed out waiting for the condition","errorVerbose":"timed out waiting for the condition\nunable to find node with address 172.31.249.231\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).setNode\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:381\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:255

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy an OCP 4.10 cluster with CCM enabled on vSphere, applying the following manifest:

cat <<EOF >manifests/manifest_feature_gate.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
EOF

2. Install WMCO and create Windows machines.
3. The Windows workers go into NotReady state.

The following job can be used to reproduce it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/119693

Or the flexy-templates profile: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_10/ipi-on-vsphere/versioned-installer-vmc7-ovn-winc_ccm-ci

Actual results:
Windows workers go into NotReady state and their internal IP is missing.

Expected results:
Windows workers join the cluster and become Ready.

Additional info:
The very same test works fine with CCM disabled using the same 4.10 in vSphere.
Hey, it seems like it works. I created a job to test this bug in our CI: https://github.com/openshift/release/pull/31108 and it passed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/31108/rehearse-31108-pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-ccm-install/1556645589649723392
Looking at the original issue, I can see from the nodelink logs that the Machine was first observed at 05:45 and the Node was linked by providerID at 06:19, which is within the bounds of our time expectations for a vSphere install. The CCM found the instance and initialised it by 06:05, which means node link was a bit slow; that may imply Machine API wasn't reporting the correct status for the Machine until 06:19.

The CCM claims to have set the IP addresses, but it also seems to set them a few times. I'm wondering if something may have cleared them by accident, or if this is an eventually-consistent error.

The WMCO logs complain that the Machine is invalid and has no internal IP set, which also makes me want to look at Machine API. According to the Machine controller logs, the VM went into Provisioned at 05:45, which can only happen once MAPI has configured the IP addresses on the Machine. Looking further, it seems that initially it only sets the internal DNS address of the VM. I can see at 05:49 that the Machine API reports it is adding the correct internal IP addresses.

So where does that leave us? As far as I can tell, one of the two components is lying about having set the IP addresses.

@jfrancoa Have you seen this recently? I want to make sure this is still happening on 4.12/4.13 builds. It's possible that, given Mike's CI job is passing, this has been fixed in newer releases.
Ok, I've managed to reproduce and understand the issue today.

What is happening is that the machine is created and comes online. When it comes online, the CCM initializes it and sets the IP addresses on the host. You can see in the attached WMCO logs that it observes the IP address on the Node:

{"level":"info","ts":1657691294.7190697,"logger":"controller.windowsmachine","msg":"processing","windowsmachine":"openshift-machine-api/winworker-lms4l","address":"172.31.249.231"}
{"level":"info","ts":1657691325.7952998,"logger":"wc 172.31.249.231","msg":"configuring"}

However, shortly after this, the CCM removes the IP addresses from the Node, which then leads to the WMCO reporting issues again. Eventually the kubelet stops reporting status and the node goes into a NotReady state.

I found https://github.com/kubernetes/cloud-provider-vsphere/issues/576 upstream, which explains the issue; it was fixed in https://github.com/kubernetes/cloud-provider-vsphere/pull/585. We don't have this fix in our 4.10 branch, but we do have it in 4.11 and 4.12, which explains why CI on the newer versions is passing. I will look to backport the fix to 4.10 to resolve this bug.
Thanks for the clear explanation, this explains why it was only being observed in 4.10. I was trying to rerun the job on AWS with CCM enabled to take vSphere out of the equation (I have also observed some issues with the InternalIP missing from the machine but present in the node, although I think that has something to do with the golden image being used), but seeing that you have already narrowed down the issue, I will stop the deployment.
Verified: tested 3 times on clusterversion 4.10.0-0.ci-2022-11-14-173021, all successful.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.42 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8496