Description of problem:

On a UPI cluster, after running wmcb initialize-kubelet, the Windows worker is using the node IP, see 10.0.75.176 in [1]. After WMCO configured OVNKubernetesHybridOverlayNetwork, an overlay network IP is added to Windows, see 10.132.0.51 in [3]. Then, after running wmcb.exe configure-cni, the Windows node IP is replaced by the hybrid overlay IP, see [2]. This leaves the Windows node in SchedulingDisabled status and keeps WMCO reconciling it on the UPI cluster.

[1]
# oc get node -owide
NAME             STATUS                     ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
...
sgao-winworker   Ready,SchedulingDisabled   worker   3m14s   v1.21.1-1397+a678cfd2c37e87   10.0.75.176   <none>        Windows Server 2019 Datacenter   10.0.17763.2061

[2]
# oc get node -owide
NAME             STATUS                     ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
...
sgao-winworker   Ready,SchedulingDisabled   worker   3m26s   v1.21.1-1397+a678cfd2c37e87   10.132.0.51   <none>        Windows Server 2019 Datacenter   10.0.17763.2061   docker://20.10.6

[3]
PS C:\Users\Administrator> ipconfig

Windows IP Configuration

Ethernet adapter vEthernet (Ethernet 2):

   Connection-specific DNS Suffix  . : us-east-2.compute.internal
   Link-local IPv6 Address . . . . . : fe80::1932:9ff0:36d3:8b02%15
   IPv4 Address. . . . . . . . . . . : 10.0.75.176
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . : 10.0.64.1

Ethernet adapter vEthernet (VIPEndpoint):

   Connection-specific DNS Suffix  . : us-east-2.compute.internal
   Link-local IPv6 Address . . . . . : fe80::819b:4b41:708:cb05%31
   IPv4 Address. . . . . . . . . . . : 10.132.0.51
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . :

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::5cb9:4b8e:63ec:d3c%10
   IPv4 Address. . . . . . . . . . . : 192.168.192.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::19c2:2df3:8584:173%10
   IPv4 Address. . . . . . . . . . . : 172.19.16.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Version-Release number of selected component (if applicable):
OCP version: 4.8.0-0.nightly-2021-08-05-031749
WMCO master commit: ccae1dd992a0f34702df23c76f3659f796ec64e0

How reproducible:
Always

Steps to Reproduce:
1. Install a UPI cluster on bare metal
2. Create a Windows machine manually, change the hostname to lowercase, and install OpenSSH
3. Add the Windows IP to the windows-instances configmap (see the example after this report)
4. Wait and check WMCO bootstrapping the Windows machine

Actual results:
Windows node IP is replaced by the hybrid overlay IP

Expected results:
Windows node IP should not be replaced by the hybrid overlay IP

Additional info:
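For step 3, a minimal sketch of registering the instance, assuming WMCO is installed in the default openshift-windows-machine-config-operator namespace and that the instance's SSH user is Administrator (both are assumptions here, not details from this report):

# oc create configmap windows-instances -n openshift-windows-machine-config-operator \
      --from-literal=10.0.75.176='username=Administrator'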
I've created an upstream issue for work that needs to be done in the kubelet: https://github.com/kubernetes/kubernetes/issues/104269

This issue does not affect all UPI clusters; it is only present in clusters with platform `none`. It is entirely possible to have a UPI cluster with a platform such as vSphere. I was able to add a BYOH node to a vSphere UPI cluster with no issue.
When the cloud provider is set to none, the kubelet picks the first DNS entry that meets the node IP criteria. Here's what an example DNS lookup from a VM looks like:

```
PS C:\Users\Administrator> Resolve-DnsName -Name winhost

Name     Type   TTL    Section    IPAddress
----     ----   ---    -------    ---------
winhost  AAAA   1200   Question   fe80::4d3a:3fc1:320a:6b
winhost  AAAA   1200   Question   fe80::51b3:e88:9465:abfd
winhost  AAAA   1200   Question   fe80::c825:26be:4a2:308f
winhost  A      1200   Question   10.132.0.153
winhost  A      1200   Question   172.31.251.232
winhost  A      1200   Question   172.29.144.1
```

The IP of the VM is 172.31.251.232, and that is the IP the kubelet should set nodeIP to. The IP 10.132.0.153 is the IP given to the hybrid overlay HNS endpoint. When the kubelet goes to pick the IP, it chooses the hybrid overlay HNS endpoint IP because it is the first IPv4 result. This happens in the code here: https://github.com/openshift/kubernetes/blob/9b1230e88478e693f3a3a9a19fdecd3ec524788b/pkg/kubelet/nodestatus/setters.go#L224-L236

A possible solution is removing the hybrid overlay IP from the DNS entry.
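For illustration, a minimal Go sketch of the selection loop in the linked setters.go: DNS results are walked in order and the first address that passes validation wins, with IPv4 preferred. The function name and the permissive validator are illustrative, not the real kubelet symbols.

```
package main

import (
	"fmt"
	"net"
)

// firstUsableIP mirrors the cited loop: iterate DNS results in order and
// return the first valid IPv4 address, falling back to the first valid IPv6.
func firstUsableIP(hostname string, valid func(net.IP) bool) net.IP {
	addrs, err := net.LookupIP(hostname)
	if err != nil {
		return nil
	}
	var fallback net.IP
	for _, addr := range addrs {
		if !valid(addr) {
			continue
		}
		if addr.To4() != nil {
			return addr // first IPv4 wins, no matter which interface owns it
		}
		if fallback == nil {
			fallback = addr // remember the first usable IPv6
		}
	}
	return fallback
}

func main() {
	// With the hybrid overlay endpoint registered in DNS, 10.132.0.153 sorts
	// ahead of the VM address 172.31.251.232 and gets picked.
	fmt.Println(firstUsableIP("winhost", func(net.IP) bool { return true }))
}
```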
The removal of the hybrid overlay IP from the DNS entry will make things better, but it doesn't completely solve this issue. If another network interface is added, or the ordering of DNS entries changes for whatever reason, the node's IP is likely to change. I think the only way this can be truly fixed is by prescribing the node's IP via the `node-ip` flag. There's a lot to take into account here, so whether this should be done right now remains to be seen.
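To sketch why a prescribed IP is more robust: instead of trusting DNS ordering, the kubelet only has to confirm that the operator-supplied address is actually bound to a local interface. The Go example below captures the shape of that check; the function name and the exact set of checks are illustrative, not the kubelet's real implementation.

```
package main

import (
	"fmt"
	"net"
)

// validateNodeIP accepts an operator-prescribed address only if it is a
// usable unicast IP assigned to a local interface. This is independent of
// DNS record ordering, so extra interfaces or reordered entries cannot
// silently change the node IP.
func validateNodeIP(nodeIP net.IP) error {
	if nodeIP == nil || nodeIP.IsLoopback() || nodeIP.IsLinkLocalUnicast() || nodeIP.IsUnspecified() {
		return fmt.Errorf("node IP %s is not a usable unicast address", nodeIP)
	}
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return err
	}
	for _, addr := range addrs {
		if ipNet, ok := addr.(*net.IPNet); ok && ipNet.IP.Equal(nodeIP) {
			return nil // the prescribed IP really is bound to this host
		}
	}
	return fmt.Errorf("node IP %s is not assigned to any local interface", nodeIP)
}

func main() {
	// e.g. the VM address from the DNS lookup above
	fmt.Println(validateNodeIP(net.ParseIP("172.31.251.232")))
}
```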
I have also run into this running on vSphere using `platform: none`.
Marking the bug VERIFIED for the release-4.8 PR to merge; will move back to ON_QA.
This bug has been verified on OCP 4.9.0-0.nightly-2021-09-05-204238 and passed, thanks. On a baremetal cluster with `platform: none`, the BYOH Windows node bootstrapped with the correct IP address.

# oc get nodes -owide -l kubernetes.io/os=windows
NAME        STATUS   ROLES    AGE   VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
sgao-win1   Ready    worker   17m   v1.21.1-1398+98073871f173ba   10.0.55.187   <none>        Windows Server 2019 Datacenter   10.0.17763.2061   docker://20.10.6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.0 product release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3702