Description of problem:

Consider the following scenario:
1. A Windows instance was configured as a Windows node by WMCO.
2. This node was then deconfigured, which resulted in the Node object being deleted.
3. If this same Windows instance is reconfigured, configuration fails.

The reason is that when hybrid-overlay-node runs on the reconfiguration, it detects that the HNS networks it previously created are present and moves on, but does not update the new Node object.

How reproducible:
Always

Steps to Reproduce:
1. Described in the description above

Actual results:
Reconfiguration of previously configured Windows instances fails.

Expected results:
Reconfiguration of previously configured Windows instances should succeed.

Additional info:
What do you mean by "does not update the new Node object" more specifically? What fields do you expect to be updated, that are not? Also what does deconfiguration do? Does it simply remove Kube API resources for the node, but leaves the Windows instance itself configured? Do you have startup logs of the hybrid-overlay-node process when it does the wrong thing?
(In reply to Dan Williams from comment #1)
> What do you mean by "does not update the new Node object" more specifically?
> What fields do you expect to be updated, that are not?

The annotation k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac is not set on the new Node object.

> Also what does deconfiguration do? Does it simply remove Kube API resources
> for the node, but leaves the Windows instance itself configured?

It uninstalls all the k8s components and removes all the logs and binaries, i.e. it stops and deletes the hybrid-overlay, kubelet, and kube-proxy services, and so on. WMCO reverts all the steps it performed to configure the instance into a node.

> Do you have startup logs of the hybrid-overlay-node process when it does the
> wrong thing?

I am attaching the log.
Created attachment 1804313 [details] hybrid-overlay-node log on reconfiguration
This is what the "new" node object looks like which gets created on reconfiguration:
```
# oc describe node winhost
Name:               winhost
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=windows
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=winhost
                    kubernetes.io/os=windows
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/windows-build=10.0.19041
                    node.openshift.io/os_id=Windows
Annotations:        k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.2.0/24
                    machine.openshift.io/machine: openshift-machine-api/winbyoh-tqhfq
                    volumes.kubernetes.io/controller-managed-attach-detach: true
                    windowsmachineconfig.openshift.io/byoh: true
                    windowsmachineconfig.openshift.io/pub-key-hash: 7c00ba8122aa764a192fe7d2d9ac4d3627b9c443c09480b18c055c2e178a6019
                    windowsmachineconfig.openshift.io/username: Administrator
CreationTimestamp:  Wed, 21 Jul 2021 18:10:56 -0700
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    os=Windows:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  winhost
  AcquireTime:     <unset>
  RenewTime:       Wed, 21 Jul 2021 18:23:43 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----             ------  -----------------                 ------------------                ------                      -------
  MemoryPressure   False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure     False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure      False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready            True    Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletReady                kubelet is posting ready status
Addresses:
  ExternalIP:  172.31.251.162
  InternalIP:  172.31.251.162
  Hostname:    winhost
Capacity:
  cpu:                4
  ephemeral-storage:  41428988Ki
  memory:             16776692Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  37107213454
  memory:             15625716Ki
  pods:               250
System Info:
  Machine ID:                 winhost
  System UUID:                DB992C42-0D1F-E12D-7BA2-5439752D8140
  Boot ID:
  Kernel Version:             10.0.19041.508
  OS Image:                   Windows Server Standard
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.6
  Kubelet Version:            v1.21.1-1397+a678cfd2c37e87
  Kube-Proxy Version:         v1.21.1-1397+a678cfd2c37e87
ProviderID:                   vsphere://422c99db-1f0d-2de1-7ba2-5439752d8140
Non-terminated Pods:          (0 in total)
  Namespace  Name  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------  ----  ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:
  Type     Reason                    Age                From        Message
  ----     ------                    ----               ----        -------
  Normal   Starting                  92m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  92m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   92m (x2 over 92m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 92m                kubelet     Node winhost status is now: NodeReady
  Normal   Starting                  88m                kubelet     Starting kubelet.
  Normal   NodeHasSufficientMemory   88m                kubelet     Node winhost status is now: NodeHasSufficientMemory
  Normal   Starting                  88m                kube-proxy  Starting kube-proxy.
  Normal   NodeSchedulable           87m                kubelet     Node winhost status is now: NodeSchedulable
  Normal   NodeNotSchedulable        83m (x2 over 88m)  kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  33m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  33m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   33m (x2 over 33m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 33m                kubelet     Node winhost status is now: NodeReady
  Normal   NodeHasSufficientMemory   30m                kubelet     Node winhost status is now: NodeHasSufficientMemory
  Normal   Starting                  30m                kubelet     Starting kubelet.
  Normal   Starting                  29m                kube-proxy  Starting kube-proxy.
  Normal   NodeSchedulable           29m                kubelet     Node winhost status is now: NodeSchedulable
  Normal   NodeNotSchedulable        26m (x2 over 30m)  kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  12m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  12m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   12m (x2 over 12m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 12m                kubelet     Node winhost status is now: NodeReady
  Normal   NodeNotSchedulable        12m                kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  11m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  11m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   11m (x2 over 11m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 11m                kubelet     Node winhost status is now: NodeReady
```
It is unclear to me why GetExistingNetwork() [0] is returning nil, which I suspect is the root cause of this issue.

[0] https://github.com/openshift/ovn-kubernetes/blob/a10433c51f38c2b73ea4019e69512d427d88542d/go-controller/hybrid-overlay/pkg/controller/node_windows.go#L259
Probably because the node object got deleted, so the node's subnet changed, so GetExistingNetwork() doesn't think it's the same network. I guess the code should just get the network by name, and if it's got the same subnet + GW then re-use it. Otherwise delete + recreate it.
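The reuse-or-recreate decision suggested above could be sketched roughly as follows. This is a minimal Go illustration, not the real node_windows.go code: the `hnsNetwork` type, the `ensureNetwork` helper, and the network name are placeholders standing in for the hcsshim types the hybrid-overlay controller actually uses.

```go
package main

import "fmt"

// hnsNetwork is a placeholder for the HNS network state hybrid-overlay
// cares about; the real code uses github.com/Microsoft/hcsshim types.
type hnsNetwork struct {
	Name    string
	Subnet  string // e.g. "10.132.2.0/24"
	Gateway string
}

// ensureNetwork looks the network up by name and reuses it only when the
// subnet and gateway still match the node's current annotations; a stale
// network left over from a previous configuration is dropped and recreated.
// It returns the network to use plus what action was taken.
func ensureNetwork(existing *hnsNetwork, name, wantSubnet, wantGW string) (*hnsNetwork, string) {
	if existing != nil && existing.Name == name {
		if existing.Subnet == wantSubnet && existing.Gateway == wantGW {
			return existing, "reused"
		}
		// Stale network from a previous configuration: fall through
		// to recreate it (the real fix would delete the HNS network here).
	}
	return &hnsNetwork{Name: name, Subnet: wantSubnet, Gateway: wantGW}, "recreated"
}

func main() {
	old := &hnsNetwork{Name: "HybridOverlay", Subnet: "10.132.2.0/24", Gateway: "10.132.2.1"}

	// Same subnet and gateway after reconfiguration: safe to reuse.
	_, action := ensureNetwork(old, "HybridOverlay", "10.132.2.0/24", "10.132.2.1")
	fmt.Println(action) // reused

	// Node object was recreated and got a new subnet: recreate the network.
	n, action := ensureNetwork(old, "HybridOverlay", "10.132.3.0/24", "10.132.3.1")
	fmt.Println(action, n.Subnet) // recreated 10.132.3.0/24
}
```

The key point is that matching on name alone (what the buggy path effectively did) reuses a network whose subnet no longer matches the freshly created Node object, so the gateway MAC annotation is never set; comparing subnet and gateway before reusing avoids that.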
(In reply to Dan Williams from comment #5)
> Probably because the node object got deleted, so the node's subnet changed,
> so GetExistingNetwork() doesn't think it's the same network.

Yes. That must be it.

> I guess the code should just get the network by name, and if it's got the
> same subnet + GW then re-use it. Otherwise delete + recreate it.

Agree. The other, orthogonal question we have: should hybrid-overlay-node have a "deconfigure" option that directs it to delete the networks it created as part of configuring the node?
This landed in 4.10 development via https://github.com/openshift/ovn-kubernetes/pull/796
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.4 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0811