How reproducible:

Steps to Reproduce:

Step 1: Create new machinesets.

Step 2: Delete the old machineset that contains the egressIPs:

  $ oc get egressip
  NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
  ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-pdcfw   128.205.248.7
  ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-nf2bz   128.205.248.8

Step 3: Scale the caas-v6lqn-worker machineset to 0:

  $ omg get machineset -A
  NAMESPACE               NAME                 DESIRED   CURRENT   READY   AVAILABLE   AGE
  openshift-machine-api   caas-v6lqn-infra     3         3         3       3           26d
  openshift-machine-api   caas-v6lqn-worker    0         0                             27d   <<<<<<---------
  openshift-machine-api   caas-v6lqn-worker1   10        10        10      10          12h
  openshift-machine-api   caas-v6lqn-worker2   1         1         1       1           11h
  openshift-machine-api   caas-v6lqn-worker3   9         9         9       9           11h

Step 4: Is the egressIP moved to a new worker? Yes, but to a node that is already being deleted:

  spr(ocp:openshift-config) $ oc get egressip
  NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
  ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-8wtbz   128.205.248.7
  ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-8wtbz   128.205.248.8

  $ omg get nodes | grep caas-v6lqn-worker-8wtbz | wc -l
  0

Step 5: All "worker" machines are now deleted, but the egressIP is still assigned to the "worker" node above. Does the egressIP still work correctly? No.

Step 6: The ovnkube-masters are showing errors:

  ovnkube-master-bbppx ovnkube-master E0413 13:59:28.811334 1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluster-health-qa, err: unable to create logical router policy, err: unable to retrieve gateway IP for node: caas-v6lqn-worker-8wtbz, protocol is IPv6: false, err: could not find node caas-v6lqn-worker-8wtbz gateway router: no IPv4 value available
  ovnkube-master-bbppx ovnkube-master E0413 13:59:28.815447 1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluste
  ----
  ovnkube-master-7h5mk ovnkube-master E0413 14:01:58.755580 1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluster-health-qa, err: unable to create logical router policy, err: unable to retrieve gateway IP for node: caas-v6lqn-worker-8wtbz, protocol is IPv6: false, err: could not find node caas-v6lqn-worker-8wtbz gateway router: no IPv4 value available
  ovnkube-master-7h5mk ovnkube-master E0413 14:01:58.764757 1 ovn.go:1102] Unable to add egress IP matching pod: tools/cluste

Step 7: Restart the ovnkube-masters; the egressIP is still not moved, even though other nodes have the egress-assignable label:

  spr(ocp:openshift-config) $ oc get egressip
  NAME                             EGRESSIPS       ASSIGNED NODE             ASSIGNED EGRESSIPS
  ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker-8wtbz   128.205.248.7
  ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker-8wtbz   128.205.248.8

  $ omg get nodes --show-labels | grep egress-assignable | wc -l
  20

Step 8: Delete the egressIPs to have them re-assigned:

  $ oc delete $(oc get egressip -o name)

Step 9: Re-create the egressIPs; they are now assigned to existing nodes:

  spr(ocp:openshift-config) $ oc get egressip
  NAME                             EGRESSIPS       ASSIGNED NODE              ASSIGNED EGRESSIPS
  ub-egress-ip-production          128.205.248.7   caas-v6lqn-worker3-vkg2k   128.205.248.7
  ub-egress-ip-quality-assurance   128.205.248.8   caas-v6lqn-worker2-9v9g5   128.205.248.8
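For reference, the egressIPs above are OVN-Kubernetes EgressIP objects; re-creating one in Step 9 would look roughly like the sketch below. The namespaceSelector labels are an assumption, since the original manifests are not shown in this report:

  # ub-egress-ip-production.yaml (sketch)
  apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    name: ub-egress-ip-production
  spec:
    egressIPs:
    - 128.205.248.7
    namespaceSelector:
      matchLabels:
        env: production   # assumption: the real selector is not shown in this report

  $ oc apply -f ub-egress-ip-production.yaml

Assignment also requires candidate nodes to carry the k8s.ovn.org/egress-assignable label, which is what the grep in Step 7 counts:

  $ oc label node <node-name> k8s.ovn.org/egress-assignable=""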
Initial status:

  $ oc get machineset -A
  NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
  openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   3         3         3       3           20m

  $ oc get nodes -owide -A
  NAME                                     STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                         KERNEL-VERSION                 CONTAINER-RUNTIME
  ci-ln-8tni372-c1627-72fgn-master-0       Ready    master   20m   v1.23.5+1f952b3   192.168.51.12   192.168.51.12   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
  ci-ln-8tni372-c1627-72fgn-master-1       Ready    master   20m   v1.23.5+1f952b3   192.168.51.21   192.168.51.21   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
  ci-ln-8tni372-c1627-72fgn-master-2       Ready    master   20m   v1.23.5+1f952b3   192.168.51.20   192.168.51.20   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
  ci-ln-8tni372-c1627-72fgn-worker-952kx   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.14   192.168.51.14   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
  ci-ln-8tni372-c1627-72fgn-worker-pndm8   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.19   192.168.51.19   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8
  ci-ln-8tni372-c1627-72fgn-worker-svzph   Ready    worker   10m   v1.23.5+1f952b3   192.168.51.30   192.168.51.30   Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-3.rhaos4.10.gitcbe78bd.el8

  $ oc get machineset -A
  NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
  openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   3         3         3       3           3h3m

The egressIP object shows:

  Spec:
    Egress I Ps:
      192.168.51.15
      192.168.51.16
    Namespace Selector:
      Match Labels:
        Env:  prod
  Status:
    Items:
      Egress IP:  192.168.51.15
      Node:       ci-ln-8tni372-c1627-72fgn-worker-952kx
      Egress IP:  192.168.51.16
      Node:       ci-ln-8tni372-c1627-72fgn-worker-pndm8
  Events:  <none>

I tried a test with egressIPs when the machineset is scaled down, and reassignment is indeed happening:

  [surya@hidden-temple openshift]$ oc scale machineset -n openshift-machine-api ci-ln-8tni372-c1627-72fgn-worker --replicas=2
  machineset.machine.openshift.io/ci-ln-8tni372-c1627-72fgn-worker scaled

  [surya@hidden-temple openshift]$ oc get machineset -A
  NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
  openshift-machine-api   ci-ln-8tni372-c1627-72fgn-worker   2         2         2       2           3h4m

  [surya@hidden-temple openshift]$ oc get nodes
  NAME                                     STATUS                     ROLES    AGE    VERSION
  ci-ln-8tni372-c1627-72fgn-master-0       Ready                      master   3h4m   v1.23.5+1f952b3
  ci-ln-8tni372-c1627-72fgn-master-1       Ready                      master   3h4m   v1.23.5+1f952b3
  ci-ln-8tni372-c1627-72fgn-master-2       Ready                      master   3h4m   v1.23.5+1f952b3
  ci-ln-8tni372-c1627-72fgn-worker-952kx   Ready,SchedulingDisabled   worker   174m   v1.23.5+1f952b3
  ci-ln-8tni372-c1627-72fgn-worker-pndm8   Ready                      worker   174m   v1.23.5+1f952b3
  ci-ln-8tni372-c1627-72fgn-worker-svzph   Ready                      worker   174m   v1.23.5+1f952b3

After the scale-down, the egressIP moved off the cordoned node:

  Spec:
    Egress I Ps:
      192.168.51.15
      192.168.51.16
    Namespace Selector:
      Match Labels:
        Env:  prod
  Status:
    Items:
      Egress IP:  192.168.51.16
      Node:       ci-ln-8tni372-c1627-72fgn-worker-pndm8
      Egress IP:  192.168.51.15
      Node:       ci-ln-8tni372-c1627-72fgn-worker-svzph
  Events:  <none>

The ovnkube-master logs confirm the rebalancing:

  E0510 17:05:32.588294 1 egressip.go:747] Allocator error: EgressIP: egressips-prod assigned to node: ci-ln-8tni372-c1627-72fgn-worker-952kx which is not reachable, will attempt rebalancing
  I0510 17:05:32.588697 1 egressip.go:1353] Successful assignment of egress IP: 192.168.51.15 on node: &{egressIPConfig:0xc000180600 mgmtIPs:[[10 128 2 2]] allocations:map[] isReady:true isReachable:true isEgressAssignable:true name:ci-ln-8tni372-c1627-72fgn-worker-svzph}
  I0510 17:05:32.588825 1 kube.go:244] Patching status on EgressIP egressips-prod

It moves correctly to the other node. Now let me see how I can create a new machineset and do the same test as shown in the bug.
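Before re-running the machineset variant of the test, a quick one-liner can flag the stale state from the original report (a sketch; it only checks that each node named in an EgressIP status still exists as a Node object):

  $ for n in $(oc get egressip -o jsonpath='{.items[*].status.items[*].node}'); do oc get node "$n" >/dev/null 2>&1 || echo "stale assignment: $n"; done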
Hey Huiran! Could you please try to reproduce the same issue with this image: quay.io/itssurya/dev-images:e5a4884b-83c7-4728-9117-936093a25c7d? Basically: install OCP, scale down the CVO, edit the CNO deployment to change the OVNK image to the one provided above, and redo the MC test you did in https://bugzilla.redhat.com/show_bug.cgi?id=2079012#c9. A rough sketch of the image swap follows.
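The swap could be done roughly like this (a sketch: OVN_IMAGE is my assumption for the relevant env var on the network-operator deployment; confirm the name with `oc set env deployment/network-operator -n openshift-network-operator --list`):

  # Stop the CVO so it does not revert the CNO deployment
  $ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
  # Point the CNO at the test OVNK image
  $ oc set env deployment/network-operator -n openshift-network-operator \
      OVN_IMAGE=quay.io/itssurya/dev-images:e5a4884b-83c7-4728-9117-936093a25c7d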
We need to solve https://bugzilla.redhat.com/show_bug.cgi?id=2094039#c0 first, because every time we try to reproduce this using machineset deletion, we seem to hit that bug, which causes panics and restarts in OVNK and makes it difficult to actually verify this bug's fix. Marking this as dependent on 2094039.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069