Bug 1984201 - hybrid-overlay-node does not update Node object if HNS networks are present
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Windows
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jacob Tanenbaum
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-20 23:17 UTC by Aravindh Puthiyaparambil
Modified: 2022-03-16 11:12 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 11:12:07 UTC
Target Upstream Version:
Embargoed:


Attachments
hybrid-overlay-node log on reconfiguration (38.32 KB, text/plain)
2021-07-22 01:19 UTC, Aravindh Puthiyaparambil


Links
System ID Private Priority Status Summary Last Updated
Github ovn-org ovn-kubernetes pull 2374 0 None open Bug 1984201 ensure that named networks are valid for windows 2021-07-29 16:00:50 UTC
Red Hat Product Errata RHBA-2022:0811 0 None None None 2022-03-16 11:12:32 UTC

Description Aravindh Puthiyaparambil 2021-07-20 23:17:45 UTC
Description of problem:

Consider the following scenario:

1. A Windows instance was configured as a Windows node by WMCO
2. This node was then deconfigured, which resulted in the Node object being deleted.
3. If this same Windows instance is reconfigured, configuration fails.

The reason is that when hybrid-overlay-node runs during reconfiguration, it detects that the HNS networks it previously created are still present and moves on without updating the new Node object.

How reproducible: Always

Steps to Reproduce:
1. Follow the scenario described above.

Actual results: Reconfiguration of a previously configured Windows instance fails.


Expected results: Reconfiguration of previously configured Windows instances should succeed.


Additional info:

Comment 1 Dan Williams 2021-07-21 13:30:58 UTC
What do you mean by "does not update the new Node object" more specifically? What fields do you expect to be updated, that are not?

Also, what does deconfiguration do? Does it simply remove Kube API resources for the node, but leave the Windows instance itself configured?

Do you have startup logs of the hybrid-overlay-node process when it does the wrong thing?

Comment 2 Aravindh Puthiyaparambil 2021-07-22 01:18:50 UTC
(In reply to Dan Williams from comment #1)
> What do you mean by "does not update the new Node object" more specifically?
> What fields do you expect to be updated, that are not?

The annotation k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac
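For anyone triaging a similar node, a quick check for the missing annotation could look like the sketch below. It is illustrative only: the helper missingAnnotations is hypothetical and not part of ovn-kubernetes, and the annotation map is hand-built from the two hybrid-overlay annotation keys that appear in this report rather than fetched from the API server.

```go
package main

import "fmt"

// Annotation keys as they appear in this bug report.
const (
	nodeSubnetAnnotation = "k8s.ovn.org/hybrid-overlay-node-subnet"
	drMACAnnotation      = "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac"
)

// missingAnnotations is a hypothetical helper: it returns the required keys
// that are absent from the given Node annotations, in the order requested.
func missingAnnotations(annotations map[string]string, required ...string) []string {
	var missing []string
	for _, key := range required {
		if _, ok := annotations[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// Hand-built example: a Node that carries the subnet annotation but
	// not the distributed router gateway MAC annotation.
	annotations := map[string]string{
		nodeSubnetAnnotation: "10.132.2.0/24",
	}
	fmt.Println(missingAnnotations(annotations, nodeSubnetAnnotation, drMACAnnotation))
}
```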

> 
> Also what does deconfiguration do? Does it simply remove Kube API resources
> for the node, but leaves the Windows instance itself configured?

It uninstalls all the k8s components and removes all the logs and binaries, i.e. it stops and deletes the hybrid-overlay, kubelet and kube-proxy services, and so on. So WMCO reverts all the steps it performed to configure the instance into a node.

> Do you have startup logs of the hybrid-overlay-node process when it does the
> wrong thing?

I am attaching the log.

Comment 3 Aravindh Puthiyaparambil 2021-07-22 01:19:55 UTC
Created attachment 1804313 [details]
hybrid-overlay-node log on reconfiguration

Comment 4 Aravindh Puthiyaparambil 2021-07-22 01:27:27 UTC
This is what the "new" Node object created on reconfiguration looks like:

```
# oc describe node winhost 
Name:               winhost
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=windows
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=winhost
                    kubernetes.io/os=windows
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/windows-build=10.0.19041
                    node.openshift.io/os_id=Windows
Annotations:        k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.2.0/24
                    machine.openshift.io/machine: openshift-machine-api/winbyoh-tqhfq
                    volumes.kubernetes.io/controller-managed-attach-detach: true
                    windowsmachineconfig.openshift.io/byoh: true
                    windowsmachineconfig.openshift.io/pub-key-hash: 7c00ba8122aa764a192fe7d2d9ac4d3627b9c443c09480b18c055c2e178a6019
                    windowsmachineconfig.openshift.io/username: Administrator
CreationTimestamp:  Wed, 21 Jul 2021 18:10:56 -0700
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    os=Windows:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  winhost
  AcquireTime:     <unset>
  RenewTime:       Wed, 21 Jul 2021 18:23:43 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 21 Jul 2021 18:22:03 -0700   Wed, 21 Jul 2021 18:12:02 -0700   KubeletReady                 kubelet is posting ready status
Addresses:
  ExternalIP:  172.31.251.162
  InternalIP:  172.31.251.162
  Hostname:    winhost
Capacity:
  cpu:                4
  ephemeral-storage:  41428988Ki
  memory:             16776692Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  37107213454
  memory:             15625716Ki
  pods:               250
System Info:
  Machine ID:                 winhost
  System UUID:                DB992C42-0D1F-E12D-7BA2-5439752D8140
  Boot ID:                    
  Kernel Version:             10.0.19041.508
  OS Image:                   Windows Server Standard
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.6
  Kubelet Version:            v1.21.1-1397+a678cfd2c37e87
  Kube-Proxy Version:         v1.21.1-1397+a678cfd2c37e87
ProviderID:                   vsphere://422c99db-1f0d-2de1-7ba2-5439752d8140
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:
  Type     Reason                    Age                From        Message
  ----     ------                    ----               ----        -------
  Normal   Starting                  92m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  92m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   92m (x2 over 92m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 92m                kubelet     Node winhost status is now: NodeReady
  Normal   Starting                  88m                kubelet     Starting kubelet.
  Normal   NodeHasSufficientMemory   88m                kubelet     Node winhost status is now: NodeHasSufficientMemory
  Normal   Starting                  88m                kube-proxy  Starting kube-proxy.
  Normal   NodeSchedulable           87m                kubelet     Node winhost status is now: NodeSchedulable
  Normal   NodeNotSchedulable        83m (x2 over 88m)  kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  33m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  33m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   33m (x2 over 33m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 33m                kubelet     Node winhost status is now: NodeReady
  Normal   NodeHasSufficientMemory   30m                kubelet     Node winhost status is now: NodeHasSufficientMemory
  Normal   Starting                  30m                kubelet     Starting kubelet.
  Normal   Starting                  29m                kube-proxy  Starting kube-proxy.
  Normal   NodeSchedulable           29m                kubelet     Node winhost status is now: NodeSchedulable
  Normal   NodeNotSchedulable        26m (x2 over 30m)  kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  12m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  12m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   12m (x2 over 12m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 12m                kubelet     Node winhost status is now: NodeReady
  Normal   NodeNotSchedulable        12m                kubelet     Node winhost status is now: NodeNotSchedulable
  Normal   Starting                  11m                kubelet     Starting kubelet.
  Warning  CheckLimitsForResolvConf  11m                kubelet     open c:\k\etc\resolv.conf: The system cannot find the file specified.
  Normal   NodeHasSufficientMemory   11m (x2 over 11m)  kubelet     Node winhost status is now: NodeHasSufficientPID
  Normal   NodeReady                 11m                kubelet     Node winhost status is now: NodeReady
```

It is unclear to me why GetExistingNetwork() [0] is returning nil, which I suspect is the root cause of this issue.

[0] https://github.com/openshift/ovn-kubernetes/blob/a10433c51f38c2b73ea4019e69512d427d88542d/go-controller/hybrid-overlay/pkg/controller/node_windows.go#L259

Comment 5 Dan Williams 2021-07-22 14:24:42 UTC
Probably because the node object got deleted, so the node's subnet changed, so GetExistingNetwork() doesn't think it's the same network.

I guess the code should just get the network by name, and if it's got the same subnet + GW then re-use it. Otherwise delete + recreate it.
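A minimal sketch of that decision, to make the proposal concrete. This is not the ovn-kubernetes code or the eventual fix: the hnsNetwork type, the decide function, the network name and the addresses are all hypothetical stand-ins, and the real code would query HNS rather than a map.

```go
package main

import "fmt"

// hnsNetwork is a hypothetical, simplified stand-in for an HNS network record.
type hnsNetwork struct {
	Name    string
	Subnet  string
	Gateway string
}

type action string

const (
	actionCreate   action = "create"   // no network with that name exists
	actionReuse    action = "reuse"    // same name, same subnet and gateway
	actionRecreate action = "recreate" // same name, different subnet or gateway
)

// decide looks the network up by name only, then compares subnet and
// gateway to choose between re-using it and deleting + recreating it.
func decide(existing map[string]hnsNetwork, name, wantSubnet, wantGateway string) action {
	nw, found := existing[name]
	if !found {
		return actionCreate
	}
	if nw.Subnet == wantSubnet && nw.Gateway == wantGateway {
		return actionReuse
	}
	return actionRecreate
}

func main() {
	// Hypothetical leftover network from the previous configuration.
	existing := map[string]hnsNetwork{
		"example-overlay": {Name: "example-overlay", Subnet: "10.132.1.0/24", Gateway: "10.132.1.1"},
	}
	// The recreated Node object was assigned a different subnet.
	fmt.Println(decide(existing, "example-overlay", "10.132.2.0/24", "10.132.2.1"))
}
```

Whichever branch is taken, the caller would presumably still need to write the annotations to the new Node object, since that is the step that was being skipped in this bug.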

Comment 6 Aravindh Puthiyaparambil 2021-07-22 15:59:44 UTC
(In reply to Dan Williams from comment #5)
> Probably because the node object got deleted, so the node's subnet changed,
> so GetExistingNetwork() doesn't think it's the same network.

Yes. That must be it.

> I guess the code should just get the network by name, and if it's got the
> same subnet + GW then re-use it. Otherwise delete + recreate it.

Agree.

The other, orthogonal question we have is: should hybrid-overlay-node have a "deconfigure" option that directs it to delete the networks it created while "configuring" the node?

Comment 9 Dan Williams 2022-02-28 15:42:39 UTC
This landed in 4.10 development via https://github.com/openshift/ovn-kubernetes/pull/796

Comment 14 errata-xmlrpc 2022-03-16 11:12:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0811

