Bug 2079012

Summary: egressIP not migrated to correct workers after deleting machineset it was assigned
Product: OpenShift Container Platform
Component: Networking (sub component: ovn-kubernetes)
Reporter: Anand T N <atn>
Assignee: Surya Seetharaman <surya>
QA Contact: huirwang
CC: anusaxen, ffernand, huirwang, pdiak, surya
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone: 2105657 (view as bug list)
Type: Bug
Last Closed: 2022-08-10 11:08:39 UTC
Bug Depends On: 2094039
Bug Blocks: 2105657

Description Anand T N 2022-04-26 17:22:56 UTC
How reproducible:

Steps to Reproduce:

Step 1: Create new machinesets

Step 2: Delete the old machineset that contains the egressIPs
oc get egressip
NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-pdcfw   128.205.248.7
ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-nf2bz   128.205.248.8

Step 3: Scale the caas-v6lqn-worker machineset to 0:

$ omg get machineset -A
NAMESPACE              NAME                DESIRED  CURRENT  READY  AVAILABLE  AGE
openshift-machine-api  caas-v6lqn-infra    3        3        3      3          26d
openshift-machine-api  caas-v6lqn-worker   0        0                          27d          <<<<<<---------
openshift-machine-api  caas-v6lqn-worker1  10       10       10     10         12h
openshift-machine-api  caas-v6lqn-worker2  1        1        1      1          11h
openshift-machine-api  caas-v6lqn-worker3  9        9        9      9          11h


Step 4: Is the egressIP moved to a new worker? Yes, but to one that is already being deleted:
 spr(ocp:openshift-config) $ oc get egressip
NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-8wtbz   128.205.248.7
ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-8wtbz   128.205.248.8

$ omg get nodes | grep caas-v6lqn-worker-8wtbz | wc -l
0
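The failure in Step 4 suggests the assignment path did not re-check that the chosen node still exists. This is not the actual ovn-kubernetes code; it is a minimal Go sketch, with hypothetical types, of the kind of staleness check the symptom points to:

```go
package main

import "fmt"

// egressAssignment pairs an egress IP with the node it is assigned to.
// Hypothetical type for illustration only.
type egressAssignment struct {
	egressIP string
	node     string
}

// staleAssignments returns assignments whose node is no longer in the
// cluster's node set, i.e. candidates that must be reassigned.
func staleAssignments(assignments []egressAssignment, nodes map[string]bool) []egressAssignment {
	var stale []egressAssignment
	for _, a := range assignments {
		if !nodes[a.node] {
			stale = append(stale, a)
		}
	}
	return stale
}

func main() {
	// Nodes still present after the scale-down (names from the bug report).
	nodes := map[string]bool{
		"caas-v6lqn-worker2-9v9g5": true,
		"caas-v6lqn-worker3-vkg2k": true,
	}
	assignments := []egressAssignment{
		{"128.205.248.7", "caas-v6lqn-worker-8wtbz"}, // this node was already deleted
		{"128.205.248.8", "caas-v6lqn-worker-8wtbz"},
	}
	for _, a := range staleAssignments(assignments, nodes) {
		fmt.Printf("egress IP %s needs reassignment (node %s is gone)\n", a.egressIP, a.node)
	}
}
```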


Step 5: All "worker" machines are now deleted, but the egressIP is still assigned to the above "worker" node. Does the egressIP still work correctly? No.

Step 6: ovnkube-masters are showing errors:
ovnkube-master-bbppx ovnkube-master E0413 13:59:28.811334       1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluster-health-qa, err: unable to create logical router policy, err: unable to retrieve gateway IP for node: caas-v6lqn-worker-8wtbz, protocol is IPv6: false, err: could not find node caas-v6lqn-worker-8wtbz gateway router: no IPv4 value available
ovnkube-master-bbppx ovnkube-master E0413 13:59:28.815447       1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluste

----
ovnkube-master-7h5mk ovnkube-master E0413 14:01:58.755580       1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluster-health-qa, err: unable to create logical router policy, err: unable to retrieve gateway IP for node: caas-v6lqn-worker-8wtbz, protocol is IPv6: false, err: could not find node caas-v6lqn-worker-8wtbz gateway router: no IPv4 value available
ovnkube-master-7h5mk ovnkube-master E0413 14:01:58.764757       1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluste

Step 7: Restart the ovnkube-masters; the egressIP is still not moved, even though other nodes have the egress-assignable label:
spr(ocp:openshift-config) $ oc get egressip
NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-8wtbz   128.205.248.7
ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-8wtbz   128.205.248.8

$ omg get nodes --show-labels  | grep egress-assignable | wc -l
20


Step 8: Delete the egressIPs to have them re-assigned:
oc delete $(oc get egressip -o name)

Step 9: Re-create the egressIPs; they are now assigned to existing workers:
spr(ocp:openshift-config) $ oc get egressip
NAME                             EGRESSIPS       ASSIGNED NODE              ASSIGNED EGRESSIPS
ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker3-vkg2k   128.205.248.7
ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker2-9v9g5   128.205.248.8
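For reference, the re-created objects would look roughly like the manifest below. The apiVersion and kind are the real OVN-Kubernetes EgressIP CRD; the namespaceSelector labels are an assumption, since the bug report does not show the spec:

```yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: ub-egress-ip-production
spec:
  egressIPs:
    - 128.205.248.7
  namespaceSelector:
    matchLabels:
      env: production   # assumed label; not shown in the bug report
```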

Comment 6 Surya Seetharaman 2022-05-10 17:11:48 UTC
initial status:

$ oc get machineset -A
NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   3         3         3       3           20m
$ oc get nodes -owide -A
NAME                                     STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ci-ln-8tni372-c1627-72fgn-master-0       Ready    master   20m   v1.23.5+1f952b3   192.168.51.12   192.168.51.12   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
ci-ln-8tni372-c1627-72fgn-master-1       Ready    master   20m   v1.23.5+1f952b3   192.168.51.21   192.168.51.21   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
ci-ln-8tni372-c1627-72fgn-master-2       Ready    master   20m   v1.23.5+1f952b3   192.168.51.20   192.168.51.20   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
ci-ln-8tni372-c1627-72fgn-worker-952kx   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.14   192.168.51.14   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
ci-ln-8tni372-c1627-72fgn-worker-pndm8   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.19   192.168.51.19   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
ci-ln-8tni372-c1627-72fgn-worker-svzph   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.30   192.168.51.30   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
$ oc get machineset -A
NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   3         3         3       3           3h3m


EgressIP object (truncated describe output):
Spec:
  Egress I Ps:
    192.168.51.15
    192.168.51.16
  Namespace Selector:
    Match Labels:
      Env:  prod
Status:
  Items:
    Egress IP:  192.168.51.15
    Node:       ci-ln-8tni372-c1627-72fgn-worker-952kx
    Egress IP:  192.168.51.16
    Node:       ci-ln-8tni372-c1627-72fgn-worker-pndm8
Events:         <none>



I tried a test with egressIPs where the machineset is scaled down, and reassignment does happen:

[surya@hidden-temple openshift]$ oc scale machineset -n openshift-machine-api  ci-ln-8tni372-c1627-72fgn-worker --replicas=2
machineset.machine.openshift.io/ci-ln-8tni372-c1627-72fgn-worker scaled
[surya@hidden-temple openshift]$ oc get machineset -A
NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   2         2         2       2           3h4m
[surya@hidden-temple openshift]$ oc get nodes
NAME                                     STATUS                     ROLES    AGE    VERSION
ci-ln-8tni372-c1627-72fgn-master-0       Ready                      master   3h4m   v1.23.5+1f952b3
ci-ln-8tni372-c1627-72fgn-master-1       Ready                      master   3h4m   v1.23.5+1f952b3
ci-ln-8tni372-c1627-72fgn-master-2       Ready                      master   3h4m   v1.23.5+1f952b3
ci-ln-8tni372-c1627-72fgn-worker-952kx   Ready,SchedulingDisabled   worker   174m   v1.23.5+1f952b3
ci-ln-8tni372-c1627-72fgn-worker-pndm8   Ready                      worker   174m   v1.23.5+1f952b3
ci-ln-8tni372-c1627-72fgn-worker-svzph   Ready                      worker   174m   v1.23.5+1f952b3

EgressIP object after the scale-down (truncated describe output):
Spec:
  Egress I Ps:
    192.168.51.15
    192.168.51.16
  Namespace Selector:
    Match Labels:
      Env:  prod
Status:
  Items:
    Egress IP:  192.168.51.16
    Node:       ci-ln-8tni372-c1627-72fgn-worker-pndm8
    Egress IP:  192.168.51.15
    Node:       ci-ln-8tni372-c1627-72fgn-worker-svzph
Events:         <none>


E0510 17:05:32.588294       1 egressip.go:747] Allocator error: EgressIP: egressips-prod assigned to node: ci-ln-8tni372-c1627-72fgn-worker-952kx which is not reachable, will attempt rebalancing
I0510 17:05:32.588697       1 egressip.go:1353] Successful assignment of egress IP: 192.168.51.15 on node: &{egressIPConfig:0xc000180600 mgmtIPs:[[10 128 2 2]] allocations:map[] isReady:true isReachable:true isEgressAssignable:true name:ci-ln-8tni372-c1627-72fgn-worker-svzph}
I0510 17:05:32.588825       1 kube.go:244] Patching status on EgressIP egressips-prod

It moves correctly to the other node.

Now let me see how I can create a new machineset and do the same test as shown in the bug.
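The rebalancing path seen in the logs above ("not reachable, will attempt rebalancing") can be sketched as follows. This is not the ovn-kubernetes allocator; it is a minimal, hypothetical Go illustration of moving egress IPs off unreachable nodes:

```go
package main

import "fmt"

// rebalance reassigns every egress IP held by an unreachable node to a
// reachable, egress-assignable node, round-robin across the healthy set.
// Purely illustrative; the real allocator also checks capacity and subnets.
func rebalance(assignments map[string]string, reachable map[string]bool) map[string]string {
	var healthy []string
	for node, ok := range reachable {
		if ok {
			healthy = append(healthy, node)
		}
	}
	out := make(map[string]string)
	i := 0
	for ip, node := range assignments {
		if reachable[node] {
			out[ip] = node // current holder is fine; keep it
			continue
		}
		if len(healthy) > 0 {
			out[ip] = healthy[i%len(healthy)] // move to a healthy node
			i++
		}
	}
	return out
}

func main() {
	assignments := map[string]string{"192.168.51.15": "worker-952kx"}
	reachable := map[string]bool{"worker-952kx": false, "worker-svzph": true}
	fmt.Println(rebalance(assignments, reachable)) // 192.168.51.15 moves to worker-svzph
}
```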

Comment 17 Surya Seetharaman 2022-06-01 13:46:00 UTC
Hey Huiran!

Could you please try to reproduce the same issue with this image: quay.io/itssurya/dev-images:e5a4884b-83c7-4728-9117-936093a25c7d.

Basically: install OCP, scale down the CVO, edit the CNO deployment to change the OVN-Kubernetes image to the one provided above, and redo the machineset test you did in comment 9: https://bugzilla.redhat.com/show_bug.cgi?id=2079012#c9

Comment 22 Surya Seetharaman 2022-06-06 16:14:19 UTC
We need to solve https://bugzilla.redhat.com/show_bug.cgi?id=2094039#c0 first: every time we try to reproduce this using machineset deletion, we hit that bug, which causes panics and restarts in ovnkube, making it difficult to actually verify this bug's fix. Marking this as dependent on bug 2094039.

Comment 30 errata-xmlrpc 2022-08-10 11:08:39 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069