Description of problem:

On a UPI cluster, after running wmcb initialize-kubelet, the Windows worker is using the node IP, see 10.0.75.176 in [1]. After WMCO configured OVNKubernetesHybridOverlayNetwork, an overlay network IP is added to Windows, see 10.132.0.51 in [3]. Then, after running wmcb.exe configure-cni, the Windows node IP is replaced by the hybrid overlay IP, see [2]. This leaves the Windows node in SchedulingDisabled status and keeps WMCO reconciling it on the UPI cluster.

[1]
# oc get node -owide
NAME             STATUS                     ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
...
sgao-winworker   Ready,SchedulingDisabled   worker   3m14s   v1.21.1-1397+a678cfd2c37e87   10.0.75.176   <none>        Windows Server 2019 Datacenter   10.0.17763.2061

[2]
# oc get node -owide
NAME             STATUS                     ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
...
sgao-winworker   Ready,SchedulingDisabled   worker   3m26s   v1.21.1-1397+a678cfd2c37e87   10.132.0.51   <none>        Windows Server 2019 Datacenter   10.0.17763.2061   docker://20.10.6

[3]
PS C:\Users\Administrator> ipconfig

Windows IP Configuration

Ethernet adapter vEthernet (Ethernet 2):

   Connection-specific DNS Suffix  . : us-east-2.compute.internal
   Link-local IPv6 Address . . . . . : fe80::1932:9ff0:36d3:8b02%15
   IPv4 Address. . . . . . . . . . . : 10.0.75.176
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . : 10.0.64.1

Ethernet adapter vEthernet (VIPEndpoint):

   Connection-specific DNS Suffix  . : us-east-2.compute.internal
   Link-local IPv6 Address . . . . . : fe80::819b:4b41:708:cb05%31
   IPv4 Address. . . . . . . . . . . : 10.132.0.51
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . :

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::5cb9:4b8e:63ec:d3c%10
   IPv4 Address. . . . . . . . . . . : 192.168.192.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::19c2:2df3:8584:173%10
   IPv4 Address. . . . . . . . . . . : 172.19.16.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Version-Release number of selected component (if applicable):
OCP version: 4.8.0-0.nightly-2021-08-05-031749
WMCO master commit: ccae1dd992a0f34702df23c76f3659f796ec64e0

How reproducible:
Always

Steps to Reproduce:
1. Install a UPI cluster on bare metal
2. Create a Windows machine manually, change the hostname to lowercase, and install OpenSSH
3. Add the Windows IP to the windows-instances configmap (see the example after this report)
4. Wait and check WMCO bootstrapping the Windows machine

Actual results:
Windows node IP is replaced by the hybrid overlay IP

Expected results:
Windows node IP should not be replaced by the hybrid overlay IP

Additional info:
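For step 3, a minimal sketch of registering the instance, assuming WMCO is installed in the default openshift-windows-machine-config-operator namespace and that the instance's SSH user is Administrator (both are assumptions here, not details from this report):

# oc create configmap windows-instances -n openshift-windows-machine-config-operator \
      --from-literal=10.0.75.176='username=Administrator'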
I've created an upstream issue for work that needs to be done in the kubelet: https://github.com/kubernetes/kubernetes/issues/104269

This issue does not affect all UPI clusters; it is only present in clusters with platform `none`. It is entirely possible to have a UPI cluster with a platform such as vSphere. I was able to add a BYOH node to a vSphere UPI cluster with no issue.
When the cloud provider is set to none, the kubelet picks the first DNS entry that meets the node IP criteria. Here's what an example DNS lookup from a VM looks like:

```
PS C:\Users\Administrator> Resolve-DnsName -Name winhost

Name     Type   TTL    Section    IPAddress
----     ----   ---    -------    ---------
winhost  AAAA   1200   Question   fe80::4d3a:3fc1:320a:6b
winhost  AAAA   1200   Question   fe80::51b3:e88:9465:abfd
winhost  AAAA   1200   Question   fe80::c825:26be:4a2:308f
winhost  A      1200   Question   10.132.0.153
winhost  A      1200   Question   172.31.251.232
winhost  A      1200   Question   172.29.144.1
```

The IP of the VM is 172.31.251.232, and that is the IP the kubelet should set nodeIP to. The IP 10.132.0.153 is the IP given to the hybrid overlay HNS endpoint. When the kubelet goes to pick the IP, it chooses the hybrid overlay HNS endpoint IP because it is the first IPv4 result. This happens in the code here: https://github.com/openshift/kubernetes/blob/9b1230e88478e693f3a3a9a19fdecd3ec524788b/pkg/kubelet/nodestatus/setters.go#L224-L236

A possible solution is removing the hybrid overlay IP from the DNS entry.
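For illustration, a minimal Go sketch of the selection loop in the linked setters.go: DNS results are walked in order and the first address that passes validation wins, with IPv4 preferred. The function name and the permissive validator are illustrative, not the real kubelet symbols.

```
package main

import (
	"fmt"
	"net"
)

// firstUsableIP mirrors the cited loop: iterate DNS results in order and
// return the first valid IPv4 address, falling back to the first valid IPv6.
func firstUsableIP(hostname string, valid func(net.IP) bool) net.IP {
	addrs, err := net.LookupIP(hostname)
	if err != nil {
		return nil
	}
	var fallback net.IP
	for _, addr := range addrs {
		if !valid(addr) {
			continue
		}
		if addr.To4() != nil {
			return addr // first IPv4 wins, no matter which interface owns it
		}
		if fallback == nil {
			fallback = addr // remember the first usable IPv6
		}
	}
	return fallback
}

func main() {
	// With the hybrid overlay endpoint registered in DNS, 10.132.0.153 sorts
	// ahead of the VM address 172.31.251.232 and gets picked.
	fmt.Println(firstUsableIP("winhost", func(net.IP) bool { return true }))
}
```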
The removal of the hybrid overlay IP from the DNS entry will make things better, but it doesn't completely solve this issue. If another network interface is added, or the ordering of DNS entries changes for whatever reason, the node's IP is likely to change. I think the only way this can be truly fixed is by prescribing the node's IP via the `node-ip` flag. There's a lot to take into account here, so whether this should be done right now remains to be seen.
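To sketch why a prescribed IP is more robust: instead of trusting DNS ordering, the kubelet only has to confirm that the operator-supplied address is actually bound to a local interface. The Go example below captures the shape of that check; the function name and the exact set of checks are illustrative, not the kubelet's real implementation.

```
package main

import (
	"fmt"
	"net"
)

// validateNodeIP accepts an operator-prescribed address only if it is a
// usable unicast IP assigned to a local interface. This is independent of
// DNS record ordering, so extra interfaces or reordered entries cannot
// silently change the node IP.
func validateNodeIP(nodeIP net.IP) error {
	if nodeIP == nil || nodeIP.IsLoopback() || nodeIP.IsLinkLocalUnicast() || nodeIP.IsUnspecified() {
		return fmt.Errorf("node IP %s is not a usable unicast address", nodeIP)
	}
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return err
	}
	for _, addr := range addrs {
		if ipNet, ok := addr.(*net.IPNet); ok && ipNet.IP.Equal(nodeIP) {
			return nil // the prescribed IP really is bound to this host
		}
	}
	return fmt.Errorf("node IP %s is not assigned to any local interface", nodeIP)
}

func main() {
	// e.g. the VM address from the DNS lookup above
	fmt.Println(validateNodeIP(net.ParseIP("172.31.251.232")))
}
```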
I have also run into this running on vSphere using `platform: none`.
Marking the bug VERIFIED for the release-4.8 PR to merge; will move back to ON_QA.
This bug has been verified on OCP 4.9.0-0.nightly-2021-09-05-204238 and passed, thanks. On a baremetal cluster with `platform: none`, the BYOH Windows node bootstrapped with the correct IP address.

# oc get nodes -owide -l kubernetes.io/os=windows
NAME        STATUS   ROLES    AGE   VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
sgao-win1   Ready    worker   17m   v1.21.1-1398+98073871f173ba   10.0.55.187   <none>        Windows Server 2019 Datacenter   10.0.17763.2061   docker://20.10.6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.0 product release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3702