Created attachment 1896643 [details]
must-gather

Description of problem:

Windows workers are in NotReady state after deploying a 4.10 OCP cluster with CCM enabled in vSphere:

[jfrancoa@localhost 120243]$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
jfrancoa-1307-410ccm-qgh2m-master-0 Ready master 3h23m v1.23.5+3afdacb 172.31.249.242 172.31.249.242 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-master-1 Ready master 3h23m v1.23.5+3afdacb 172.31.249.210 172.31.249.210 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-master-2 Ready master 3h23m v1.23.5+3afdacb 172.31.249.180 172.31.249.180 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-worker-9g7sv Ready worker 3h9m v1.23.5+3afdacb 172.31.249.40 172.31.249.40 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
jfrancoa-1307-410ccm-qgh2m-worker-z6dht Ready worker 3h9m v1.23.5+3afdacb 172.31.249.9 172.31.249.9 Red Hat Enterprise Linux CoreOS 410.84.202206071439-0 (Ootpa) 4.18.0-305.49.1.el8_4.x86_64 cri-o://1.23.3-3.rhaos4.10.git5fe1720.el8
winworker-lms4l NotReady,SchedulingDisabled worker 149m v1.23.5-rc.0.2060+1f952b3ba7b9ff <none> <none> Windows Server 2022 Datacenter 10.0.20348.230 docker://20.10.9
winworker-m2f8x NotReady,SchedulingDisabled worker 165m v1.23.5-rc.0.2060+1f952b3ba7b9ff <none> <none> Windows Server 2022 Datacenter 10.0.20348.230 docker://20.10.9

As can be observed, both Windows nodes are also missing their internal IPs.

However, CCM is reporting the node to have an IP:

$ oc logs cluster-cloud-controller-manager-operator-675dcd89f-4wkk9 -n openshift-cloud-controller-manager-operator
I0713 06:05:06.107737 1 instances.go:197] VM=422c381a-4ec3-bbc6-880d-c108d5ea7859 IsActive=true
I0713 06:05:10.397669 1 search.go:69] WhichVCandDCByNodeID by UUID
I0713 06:05:10.397743 1 search.go:76] WhichVCandDCByNodeID nodeID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.397671 1 node_controller.go:391] Initializing node winworker-lms4l with cloud provider
I0713 06:05:10.398084 1 search.go:74] WhichVCandDCByNodeID by Name
I0713 06:05:10.398129 1 search.go:76] WhichVCandDCByNodeID nodeID: winworker-lms4l
I0713 06:05:10.418240 1 search.go:208] Found node 422c1516-a5f5-bf05-4876-00fbf9e51542 as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter
I0713 06:05:10.418277 1 search.go:210] Hostname: winworker-lms4l, UUID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.418296 1 nodemanager.go:148] Discovered VM using normal UUID format
I0713 06:05:10.419407 1 search.go:208] Found node winworker-lms4l as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter
I0713 06:05:10.419430 1 search.go:210] Hostname: winworker-lms4l, UUID: 422c1516-a5f5-bf05-4876-00fbf9e51542
I0713 06:05:10.419443 1 nodemanager.go:127] Discovered VM using FQDN or short-hand name
I0713 06:05:10.421638 1 nodemanager.go:252] Adding Hostname: winworker-lms4l
I0713 06:05:10.421717 1 nodemanager.go:408] Adding Internal IP: 172.31.249.231
I0713 06:05:10.421735 1 nodemanager.go:413] Adding External IP: 172.31.249.231
I0713 06:05:10.421753 1 nodemanager.go:319] Found node 422c1516-a5f5-bf05-4876-00fbf9e51542 as vm=VirtualMachine:vm-889422 in vc=vcenter.sddc-44-236-21-251.vmwarevmc.com and datacenter=SDDC-Datacenter

And openshift-machine-api reports the very same IP when moving the machine into the Running state:

$ oc logs machine-api-controllers-dbf4bd94d-644v4 -c machine-controller -n openshift-machine-api
I0713 06:05:13.279996 1 controller.go:504] winworker-lms4l: going into phase "Running"
I0713 06:05:13.291357 1 controller.go:175] winworker-lms4l: reconciling Machine
I0713 06:05:13.291382 1 actuator.go:109] winworker-lms4l: actuator checking if machine exists
I0713 06:05:13.301068 1 session.go:137] Find template by instance uuid: ff5b31b0-0a9a-447f-8df6-6a38034f907d
I0713 06:05:13.303685 1 reconciler.go:222] winworker-lms4l: already exists
I0713 06:05:13.303709 1 controller.go:319] winworker-lms4l: reconciling machine triggers idempotent update
I0713 06:05:13.303714 1 actuator.go:123] winworker-lms4l: actuator updating machine
I0713 06:05:13.316852 1 session.go:137] Find template by instance uuid: ff5b31b0-0a9a-447f-8df6-6a38034f907d
I0713 06:05:13.462864 1 reconciler.go:980] winworker-lms4l: Reconciling attached tags
I0713 06:05:15.870684 1 reconciler.go:353] winworker-lms4l: reconciling machine with cloud state
I0713 06:05:16.220780 1 reconciler.go:361] winworker-lms4l: reconciling providerID
I0713 06:05:16.223441 1 reconciler.go:366] winworker-lms4l: reconciling network
I0713 06:05:16.227519 1 reconciler.go:1079] Getting network status: object reference: vm-889422
I0713 06:05:16.227541 1 reconciler.go:1088] Getting network status: device: DVSwitch: 94 3e 2a 59 c6 ff 49 a8-af 6a 92 f0 00 07 e1 4f, macAddress: 00:50:56:ac:32:1b
I0713 06:05:16.227546 1 reconciler.go:1093] Getting network status: getting guest info
I0713 06:05:16.227550 1 reconciler.go:1095] Getting network status: getting guest info: network: {DynamicData:{} Network:qe-segment IpAddress:[fe80::c9ee:69bf:e078:ab3c 172.31.249.231] MacAddress:00:50:56:ac:32:1b Connected:true DeviceConfigId:4000 DnsConfig:0xc000a17e60 IpConfig:0xc00027ac00 NetBIOSConfig:<nil>}
I0713 06:05:16.230009 1 reconciler.go:471] winworker-lms4l: reconciling network: IP addresses: [{InternalIP fe80::c9ee:69bf:e078:ab3c} {InternalIP 172.31.249.231} {InternalDNS winworker-lms4l}]
I0713 06:05:16.230039 1 reconciler.go:371] winworker-lms4l: reconciling powerstate annotation
I0713 06:05:16.231997 1 reconciler.go:832] winworker-lms4l: Updating provider status

But WMCO expects the node list (https://github.com/openshift/windows-machine-config-operator/blob/92c7cd53c2b738fbe9c50802f6593fdce6979058/pkg/nodeconfig/nodeconfig.go#L365-L375) to contain the Windows node IP, so it times out waiting for it:

{"level":"error","ts":1657693161.8716238,"logger":"controller.machine","msg":"Reconciler error","reconciler group":"machine.openshift.io","reconciler kind":"Machine","name":"winworker-lms4l","namespace":"openshift-machine-api","error":"unable to configure instance 422c1516-a5f5-bf05-4876-00fbf9e51542: error getting node object: unable to find node with address 172.31.249.231: timed out waiting for the condition","errorVerbose":"timed out waiting for the condition\nunable to find node with address 172.31.249.231\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).setNode\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:381\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:255

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy an OCP 4.10 cluster with CCM enabled on vSphere, applying the following manifest:

cat <<EOF >manifests/manifest_feature_gate.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
EOF

2. Install WMCO and create Windows machines.
3. The Windows workers go into NotReady state.

The following job can be used to reproduce it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/119693

Or the flexy-templates profile: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_10/ipi-on-vsphere/versioned-installer-vmc7-ovn-winc_ccm-ci

Actual results:
Windows workers go into NotReady state and their internal IP is missing.

Expected results:
Windows workers join the cluster and become Ready.

Additional info:
The very same test works fine with CCM disabled using the same 4.10 in vSphere.
Hey, it seems like it works. I created a job to test this bug in our CI: https://github.com/openshift/release/pull/31108 and it passed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/31108/rehearse-31108-pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-ccm-install/1556645589649723392
Looking at the original issue, I can see from the nodelink logs that the Machine was first observed at 05:45 and the Node was linked by providerID at 06:19, which is within the bounds of our time expectations for a vSphere install. The CCM found the instance and initialised it by 06:05, which means node link was a bit slow; that may imply Machine API wasn't reporting the correct status for the Machine until 06:19.

The CCM claims to have set the IP addresses, but it also seems to set them a few times. I'm wondering if something may have cleared them by accident, or if this is an eventually-consistent error.

The WMCO logs complain that the Machine is invalid and has no internal IP set, which also makes me want to look at Machine API. According to the Machine controller logs, the VM went into Provisioned at 05:45, which can only happen once MAPI has configured the IP addresses on the Machine. Looking further, it seems that initially it only sets the internal DNS address of the VM. I can see at 05:49 that the Machine API reports it is adding the correct internal IP addresses.

So where does that leave us? As far as I can tell, one of the two components is lying about having set the IP addresses.

@jfrancoa Have you seen this recently? I want to make sure this is still happening on 4.12/4.13 builds. It's possible that, given Mike's CI job is passing, this has been fixed in newer releases.
Ok, I've managed to reproduce and understand the issue today.

What is happening is that the machine is created and comes online. When it comes online, the CCM initializes it and sets the IP addresses on the host. You can see in the attached WMCO logs that it observes the IP address on the Node:

{"level":"info","ts":1657691294.7190697,"logger":"controller.windowsmachine","msg":"processing","windowsmachine":"openshift-machine-api/winworker-lms4l","address":"172.31.249.231"}
{"level":"info","ts":1657691325.7952998,"logger":"wc 172.31.249.231","msg":"configuring"}

However, shortly after this, the CCM removes the IP addresses from the Node, which then leads to the WMCO reporting issues again. Eventually the kubelet stops reporting status and the node goes into a NotReady state.

I found https://github.com/kubernetes/cloud-provider-vsphere/issues/576 upstream, which explains the issue; it was fixed in https://github.com/kubernetes/cloud-provider-vsphere/pull/585. We don't have this fix in our 4.10 branch, but we do have it in 4.11 and 4.12, which explains why CI on the newer versions is passing. I will look to backport the fix to 4.10 to resolve this bug.
Thanks for the clear explanation, this explains why it was only being observed in 4.10. I was trying to rerun the job on AWS with CCM enabled to take vSphere out of the equation (I have also observed some issues with the InternalIP missing from the machine but present in the node, although I think that has something to do with the golden image being used), but seeing that you have already narrowed down the issue, I will stop the deployment.
Verified: tested 3 times on clusterversion 4.10.0-0.ci-2022-11-14-173021, all successful.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.42 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8496