Bug 1984201
| Summary: | hybrid-overlay-node does not update Node object if HNS networks are present | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Aravindh Puthiyaparambil <aravindh> |
| Component: | Networking | Assignee: | Jacob Tanenbaum <jtanenba> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | anbhat, dcbw, team-winc |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | x86_64 | | |
| OS: | Windows | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-16 11:12:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Aravindh Puthiyaparambil
2021-07-20 23:17:45 UTC
What do you mean by "does not update the new Node object" more specifically? What fields do you expect to be updated, that are not?

Also what does deconfiguration do? Does it simply remove Kube API resources for the node, but leaves the Windows instance itself configured?

Do you have startup logs of the hybrid-overlay-node process when it does the wrong thing?

(In reply to Dan Williams from comment #1)
> What do you mean by "does not update the new Node object" more specifically?
> What fields do you expect to be updated, that are not?

The annotation k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac

> Also what does deconfiguration do? Does it simply remove Kube API resources
> for the node, but leaves the Windows instance itself configured?

It uninstalls all the k8s components and removes all the logs and binaries, i.e. it stops and deletes the hybrid-overlay, kubelet, and kube-proxy services, and so on. So WMCO reverts all the steps it performed to configure the instance into a node.

> Do you have startup logs of the hybrid-overlay-node process when it does the
> wrong thing?

I am attaching the log.

Created attachment 1804313
hybrid-overlay-node log on reconfiguration
This is what the "new" node object, which gets created on reconfiguration, looks like:
```
# oc describe node winhost
Name: winhost
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=windows
kubernetes.io/arch=amd64
kubernetes.io/hostname=winhost
kubernetes.io/os=windows
node-role.kubernetes.io/worker=
node.kubernetes.io/windows-build=10.0.19041
node.openshift.io/os_id=Windows
Annotations: k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.2.0/24
machine.openshift.io/machine: openshift-machine-api/winbyoh-tqhfq
volumes.kubernetes.io/controller-managed-attach-detach: true
windowsmachineconfig.openshift.io/byoh: true
windowsmachineconfig.openshift.io/pub-key-hash: 7c00ba8122aa764a192fe7d2d9ac4d3627b9c443c09480b18c055c2e178a6019
windowsmachineconfig.openshift.io/username: Administrator
CreationTimestamp: Wed, 21 Jul 2021 18:10:56 -0700
Taints: node.kubernetes.io/unschedulable:NoSchedule
os=Windows:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: winhost
AcquireTime: <unset>
RenewTime: Wed, 21 Jul 2021 18:23:43 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 21 Jul 2021 18:22:03 -0700 Wed, 21 Jul 2021 18:12:02 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 21 Jul 2021 18:22:03 -0700 Wed, 21 Jul 2021 18:12:02 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 21 Jul 2021 18:22:03 -0700 Wed, 21 Jul 2021 18:12:02 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 21 Jul 2021 18:22:03 -0700 Wed, 21 Jul 2021 18:12:02 -0700 KubeletReady kubelet is posting ready status
Addresses:
ExternalIP: 172.31.251.162
InternalIP: 172.31.251.162
Hostname: winhost
Capacity:
cpu: 4
ephemeral-storage: 41428988Ki
memory: 16776692Ki
pods: 250
Allocatable:
cpu: 3500m
ephemeral-storage: 37107213454
memory: 15625716Ki
pods: 250
System Info:
Machine ID: winhost
System UUID: DB992C42-0D1F-E12D-7BA2-5439752D8140
Boot ID:
Kernel Version: 10.0.19041.508
OS Image: Windows Server Standard
Operating System: windows
Architecture: amd64
Container Runtime Version: docker://20.10.6
Kubelet Version: v1.21.1-1397+a678cfd2c37e87
Kube-Proxy Version: v1.21.1-1397+a678cfd2c37e87
ProviderID: vsphere://422c99db-1f0d-2de1-7ba2-5439752d8140
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 92m kubelet Starting kubelet.
Warning CheckLimitsForResolvConf 92m kubelet open c:\k\etc\resolv.conf: The system cannot find the file specified.
Normal NodeHasSufficientMemory 92m (x2 over 92m) kubelet Node winhost status is now: NodeHasSufficientPID
Normal NodeReady 92m kubelet Node winhost status is now: NodeReady
Normal Starting 88m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 88m kubelet Node winhost status is now: NodeHasSufficientMemory
Normal Starting 88m kube-proxy Starting kube-proxy.
Normal NodeSchedulable 87m kubelet Node winhost status is now: NodeSchedulable
Normal NodeNotSchedulable 83m (x2 over 88m) kubelet Node winhost status is now: NodeNotSchedulable
Normal Starting 33m kubelet Starting kubelet.
Warning CheckLimitsForResolvConf 33m kubelet open c:\k\etc\resolv.conf: The system cannot find the file specified.
Normal NodeHasSufficientMemory 33m (x2 over 33m) kubelet Node winhost status is now: NodeHasSufficientPID
Normal NodeReady 33m kubelet Node winhost status is now: NodeReady
Normal NodeHasSufficientMemory 30m kubelet Node winhost status is now: NodeHasSufficientMemory
Normal Starting 30m kubelet Starting kubelet.
Normal Starting 29m kube-proxy Starting kube-proxy.
Normal NodeSchedulable 29m kubelet Node winhost status is now: NodeSchedulable
Normal NodeNotSchedulable 26m (x2 over 30m) kubelet Node winhost status is now: NodeNotSchedulable
Normal Starting 12m kubelet Starting kubelet.
Warning CheckLimitsForResolvConf 12m kubelet open c:\k\etc\resolv.conf: The system cannot find the file specified.
Normal NodeHasSufficientMemory 12m (x2 over 12m) kubelet Node winhost status is now: NodeHasSufficientPID
Normal NodeReady 12m kubelet Node winhost status is now: NodeReady
Normal NodeNotSchedulable 12m kubelet Node winhost status is now: NodeNotSchedulable
Normal Starting 11m kubelet Starting kubelet.
Warning CheckLimitsForResolvConf 11m kubelet open c:\k\etc\resolv.conf: The system cannot find the file specified.
Normal NodeHasSufficientMemory 11m (x2 over 11m) kubelet Node winhost status is now: NodeHasSufficientPID
Normal NodeReady 11m kubelet Node winhost status is now: NodeReady
```
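The output above carries the k8s.ovn.org/hybrid-overlay-node-subnet annotation but not k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac, which is the field reported as missing. A minimal Go sketch of that check follows. It works on a plain map rather than a real Node object, and the helper name `missingAnnotations` is hypothetical, not something taken from ovn-kubernetes or WMCO.

```go
package main

import "fmt"

// missingAnnotations returns the required keys that are absent from
// annotations, preserving the order of required.
// Hypothetical helper for illustration only.
func missingAnnotations(annotations map[string]string, required []string) []string {
	var missing []string
	for _, key := range required {
		if _, ok := annotations[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// Subset of the annotations copied from the describe output above.
	annotations := map[string]string{
		"k8s.ovn.org/hybrid-overlay-node-subnet": "10.132.2.0/24",
		"machine.openshift.io/machine":           "openshift-machine-api/winbyoh-tqhfq",
	}
	// Annotations named in this bug report.
	required := []string{
		"k8s.ovn.org/hybrid-overlay-node-subnet",
		"k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac",
	}
	fmt.Println(missingAnnotations(annotations, required))
}
```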
It is unclear to me why GetExistingNetwork() [0] is returning nil, which I suspect is the root cause of this issue.
[0] https://github.com/openshift/ovn-kubernetes/blob/a10433c51f38c2b73ea4019e69512d427d88542d/go-controller/hybrid-overlay/pkg/controller/node_windows.go#L259
Probably because the node object got deleted, so the node's subnet changed, so GetExistingNetwork() doesn't think it's the same network.

I guess the code should just get the network by name, and if it's got the same subnet + GW then re-use it. Otherwise delete + recreate it.

(In reply to Dan Williams from comment #5)
> Probably because the node object got deleted, so the node's subnet changed,
> so GetExistingNetwork() doesn't think it's the same network.

Yes. That must be it.

> I guess the code should just get the network by name, and if it's got the
> same subnet + GW then re-use it. Otherwise delete + recreate it.

Agree. The other, orthogonal question we have is: should hybrid-overlay-node have a "deconfigure" option which will direct it to delete the networks it created as part of "configuring" the node?

This landed in 4.10 development via https://github.com/openshift/ovn-kubernetes/pull/796

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.4 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0811
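The comments above suggest a reuse-or-recreate approach: look the network up by name, re-use it when the subnet and gateway match, otherwise delete and recreate it. It can be sketched in Go as below. This is a minimal illustration under stated assumptions, not the code from the linked pull request: the `Network` type, the `Action` values, and `Decide` are hypothetical names, and the gateway addresses are made-up example values.

```go
package main

import "fmt"

// Network is a hypothetical stand-in for an HNS network record; it is not
// the type used by hybrid-overlay-node.
type Network struct {
	Name    string
	Subnet  string
	Gateway string
}

// Action describes what the caller should do with a network found by name.
type Action int

const (
	Create   Action = iota // no network with that name exists
	Reuse                  // existing network matches the desired subnet and gateway
	Recreate               // existing network is stale: delete it, then create a new one
)

func (a Action) String() string {
	return [...]string{"create", "reuse", "recreate"}[a]
}

// Decide compares a network looked up by name (nil if none was found) with
// the subnet and gateway the node should currently have.
func Decide(existing *Network, subnet, gateway string) Action {
	switch {
	case existing == nil:
		return Create
	case existing.Subnet == subnet && existing.Gateway == gateway:
		return Reuse
	default:
		return Recreate
	}
}

func main() {
	// A network left over from before the Node object was deleted
	// (illustrative values; only 10.132.2.0/24 appears in this report).
	stale := &Network{Name: "example-network", Subnet: "10.132.1.0/24", Gateway: "10.132.1.1"}
	fmt.Println(Decide(stale, "10.132.2.0/24", "10.132.2.1"))
	fmt.Println(Decide(nil, "10.132.2.0/24", "10.132.2.1"))
}
```

Keeping the decision separate from the HNS calls makes the stale-subnet case from this bug easy to test without a Windows host.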