Description of problem:
Two egress nodes are configured with egressCIDRs as an HA pair, and the netnamespace has a single egressIP configured. Pods in that netnamespace use the first egress node for outgoing traffic. After the first egress node is shut down and the egressIP moves to the second egress node, the egressIP stops working and the pods can no longer reach external hosts.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-21-192636   True        False         18m     Cluster version is 4.10.0-0.nightly-2022-01-21-192636

$ oc get node
NAME                                                        STATUS   ROLES    AGE   VERSION
jechen-0121d-564rn-master-0.c.openshift-qe.internal         Ready    master   34m   v1.23.0+112af52
jechen-0121d-564rn-master-1.c.openshift-qe.internal         Ready    master   34m   v1.23.0+112af52
jechen-0121d-564rn-master-2.c.openshift-qe.internal         Ready    master   34m   v1.23.0+112af52
jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   Ready    worker   25m   v1.23.0+112af52
jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   Ready    worker   26m   v1.23.0+112af52
jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   Ready    worker   25m   v1.23.0+112af52

$ oc describe node jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal
Name:               jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=n1-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        cloud.network.openshift.io/egress-ipconfig:
                      [{"interface":"nic0","ifaddr":{"ipv4":"10.0.128.0/17"},"capacity":{"ip":10}}]
                    csi.volume.kubernetes.io/nodeid: {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/jechen-0121d-564rn-worker-a-89mhb"}

How reproducible:

Steps to Reproduce:

1. Configure two nodes as egress nodes.

$ oc patch hostsubnet jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal --type=merge -p '{"egressCIDRs":["10.0.128.0/17"]}'
hostsubnet.network.openshift.io/jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal patched

$ oc patch hostsubnet jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal --type=merge -p '{"egressCIDRs":["10.0.128.0/17"]}'
hostsubnet.network.openshift.io/jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal patched

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0121d-564rn-master-0.c.openshift-qe.internal         jechen-0121d-564rn-master-0.c.openshift-qe.internal         10.0.0.5     10.130.0.0/23
jechen-0121d-564rn-master-1.c.openshift-qe.internal         jechen-0121d-564rn-master-1.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0121d-564rn-master-2.c.openshift-qe.internal         jechen-0121d-564rn-master-2.c.openshift-qe.internal         10.0.0.7     10.129.0.0/23
jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   10.0.128.4   10.129.2.0/23   ["10.0.128.0/17"]
jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   10.0.128.2   10.131.0.0/23   ["10.0.128.0/17"]
jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23

2. Create a test project, set an egressIP on its netnamespace, and create test pods in the project.

$ oc new-project test

$ oc patch netnamespace test --type=merge -p '{"egressIPs":["10.0.128.100"]}'
netnamespace.network.openshift.io/test patched

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0121d-564rn-master-0.c.openshift-qe.internal         jechen-0121d-564rn-master-0.c.openshift-qe.internal         10.0.0.5     10.130.0.0/23
jechen-0121d-564rn-master-1.c.openshift-qe.internal         jechen-0121d-564rn-master-1.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0121d-564rn-master-2.c.openshift-qe.internal         jechen-0121d-564rn-master-2.c.openshift-qe.internal         10.0.0.7     10.129.0.0/23
jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   10.0.128.4   10.129.2.0/23   ["10.0.128.0/17"]   ["10.0.128.100"]
jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   10.0.128.2   10.131.0.0/23   ["10.0.128.0/17"]
jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23

$ oc create -f ./SDN-1332-test/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created

$ oc get pod -owide
NAME            READY   STATUS              RESTARTS   AGE   IP       NODE                                                        NOMINATED NODE   READINESS GATES
test-rc-kzlh7   0/1     ContainerCreating   0          5s    <none>   jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   <none>           <none>
test-rc-qcdrx   0/1     ContainerCreating   0          5s    <none>   jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   <none>           <none>
test-rc-tpktp   0/1     ContainerCreating   0          5s    <none>   jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   <none>           <none>

3. While the egressIP is on node jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal, curl the external ip-echo service from each test pod. The egressIP is correctly returned as the source IP.

$ oc rsh test-rc-qcdrx
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $
~ $ exit

$ oc rsh test-rc-kzlh7
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit

$ oc rsh test-rc-tpktp
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit

4. Shut down the current egress node, jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal.

$ oc debug node/jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal
Starting pod/jechen-0121d-564rn-worker-a-89mhbcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# shutdown
Shutdown scheduled for Sat 2022-01-22 01:54:27 UTC, use 'shutdown -c' to cancel.
sh-4.4#
Removing debug pod ...

5. Wait a short while, then check hostsubnet. The egressIP has moved to the second egress node correctly.

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0121d-564rn-master-0.c.openshift-qe.internal         jechen-0121d-564rn-master-0.c.openshift-qe.internal         10.0.0.5     10.130.0.0/23
jechen-0121d-564rn-master-1.c.openshift-qe.internal         jechen-0121d-564rn-master-1.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0121d-564rn-master-2.c.openshift-qe.internal         jechen-0121d-564rn-master-2.c.openshift-qe.internal         10.0.0.7     10.129.0.0/23
jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   jechen-0121d-564rn-worker-a-89mhb.c.openshift-qe.internal   10.0.128.4   10.129.2.0/23   ["10.0.128.0/17"]
jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   jechen-0121d-564rn-worker-b-78mbp.c.openshift-qe.internal   10.0.128.2   10.131.0.0/23   ["10.0.128.0/17"]   ["10.0.128.100"]
jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   jechen-0121d-564rn-worker-c-x4b85.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23

6. From the test pods, curl the external ip-echo service again. The pod on the shut-down node (test-rc-qcdrx) is unreachable, and curl from the pods on the other two nodes hangs until interrupted.

$ oc rsh test-rc-qcdrx
Error from server: error dialing backend: dial tcp 10.0.128.4:10250: i/o timeout

$ oc rsh test-rc-kzlh7
~ $ curl 10.0.0.2:8888
^C
~ $ curl 10.0.0.2:8888
^C
~ $ exit
command terminated with exit code 130

$ oc rsh test-rc-tpktp
~ $ curl 10.0.0.2:8888
^C~ $ exit
command terminated with exit code 130

Actual results:
The egressIP no longer works after it is switched to the second node.

Expected results:
The egressIP should continue working after it is switched to the second node, and a curl to the external ip-echo service should return the egressIP as the source IP.

Additional info:
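Steps 2 and 5 read the EGRESS IPS column by eye to see which node currently holds the egressIP. A minimal sketch of the same check as a shell helper follows. The function name egress_holder is hypothetical and is not part of oc. The parsing assumes the column layout of the `oc get hostsubnet` output shown in this report: node name in the first column, egress values printed as quoted list entries in the columns after SUBNET.

```shell
#!/bin/sh
# egress_holder IP
# Reads `oc get hostsubnet` table output on stdin and prints the NAME of
# each row that lists IP among its egress values.
# Assumption (not confirmed beyond this report): egress values are printed
# as quoted list entries such as ["10.0.128.100"] in field 5 or later.
egress_holder() {
    awk -v ip="$1" '
        NR == 1 { next }                  # skip the header row
        {
            # Match the IP with its surrounding quotes so that a CIDR entry
            # such as "10.0.128.0/17" or a longer IP does not match.
            needle = "\"" ip "\""
            for (i = 5; i <= NF; i++) {
                if (index($i, needle) > 0) { print $1; break }
            }
        }
    '
}
```

For example, `oc get hostsubnet | egress_holder 10.0.128.100` would print the worker-a node name after step 2 and the worker-b node name after step 5, given the outputs above.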
Please share the must-gather
Verified in 4.10.0-0.nightly-2022-01-27-104747

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         2m39s   Cluster version is 4.10.0-0.nightly-2022-01-27-104747

$ oc get node
NAME                                                        STATUS   ROLES    AGE   VERSION
jechen-0127b-55qb8-master-0.c.openshift-qe.internal         Ready    master   17m   v1.23.0+d30ebbc
jechen-0127b-55qb8-master-1.c.openshift-qe.internal         Ready    master   17m   v1.23.0+d30ebbc
jechen-0127b-55qb8-master-2.c.openshift-qe.internal         Ready    master   17m   v1.23.0+d30ebbc
jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   Ready    worker   10m   v1.23.0+d30ebbc
jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   Ready    worker   10m   v1.23.0+d30ebbc
jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   Ready    worker   10m   v1.23.0+d30ebbc

$ oc patch hostsubnet jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal --type=merge -p '{"egressCIDRs":["10.0.128.0/17"]}'
hostsubnet.network.openshift.io/jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal patched

$ oc patch hostsubnet jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal --type=merge -p '{"egressCIDRs":["10.0.128.0/17"]}'
hostsubnet.network.openshift.io/jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal patched

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0127b-55qb8-master-0.c.openshift-qe.internal         jechen-0127b-55qb8-master-0.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0127b-55qb8-master-1.c.openshift-qe.internal         jechen-0127b-55qb8-master-1.c.openshift-qe.internal         10.0.0.7     10.130.0.0/23
jechen-0127b-55qb8-master-2.c.openshift-qe.internal         jechen-0127b-55qb8-master-2.c.openshift-qe.internal         10.0.0.5     10.129.0.0/23
jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   10.0.128.2   10.129.2.0/23   ["10.0.128.0/17"]
jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23   ["10.0.128.0/17"]
jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   10.0.128.4   10.131.0.0/23

$ oc new-project test

$ oc patch netnamespace test --type=merge -p '{"egressIPs":["10.0.128.100"]}'
netnamespace.network.openshift.io/test patched

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0127b-55qb8-master-0.c.openshift-qe.internal         jechen-0127b-55qb8-master-0.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0127b-55qb8-master-1.c.openshift-qe.internal         jechen-0127b-55qb8-master-1.c.openshift-qe.internal         10.0.0.7     10.130.0.0/23
jechen-0127b-55qb8-master-2.c.openshift-qe.internal         jechen-0127b-55qb8-master-2.c.openshift-qe.internal         10.0.0.5     10.129.0.0/23
jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   10.0.128.2   10.129.2.0/23   ["10.0.128.0/17"]   ["10.0.128.100"]
jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23   ["10.0.128.0/17"]
jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   10.0.128.4   10.131.0.0/23

$ oc create -f ./SDN-1332-test/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created

$ oc get pod -owide
NAME            READY   STATUS              RESTARTS   AGE   IP       NODE                                                        NOMINATED NODE   READINESS GATES
test-rc-48s46   0/1     ContainerCreating   0          5s    <none>   jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   <none>           <none>
test-rc-5l6nc   0/1     ContainerCreating   0          5s    <none>   jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   <none>           <none>
test-rc-fz5l6   0/1     ContainerCreating   0          5s    <none>   jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   <none>           <none>

$ oc rsh test-rc-48s46
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit

$ oc rsh test-rc-5l6nc
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit

$ oc rsh test-rc-fz5l6
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit

$ oc debug node/jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal
Starting pod/jechen-0127b-55qb8-worker-a-8m784copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# shutdown
Shutdown scheduled for Thu 2022-01-27 16:06:15 UTC, use 'shutdown -c' to cancel.
sh-4.4#
Removing debug pod ...

$ oc get hostsubnet
NAME                                                        HOST                                                        HOST IP      SUBNET          EGRESS CIDRS        EGRESS IPS
jechen-0127b-55qb8-master-0.c.openshift-qe.internal         jechen-0127b-55qb8-master-0.c.openshift-qe.internal         10.0.0.6     10.128.0.0/23
jechen-0127b-55qb8-master-1.c.openshift-qe.internal         jechen-0127b-55qb8-master-1.c.openshift-qe.internal         10.0.0.7     10.130.0.0/23
jechen-0127b-55qb8-master-2.c.openshift-qe.internal         jechen-0127b-55qb8-master-2.c.openshift-qe.internal         10.0.0.5     10.129.0.0/23
jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   jechen-0127b-55qb8-worker-a-8m784.c.openshift-qe.internal   10.0.128.2   10.129.2.0/23   ["10.0.128.0/17"]
jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal   10.0.128.3   10.128.2.0/23   ["10.0.128.0/17"]   ["10.0.128.100"]
jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   jechen-0127b-55qb8-worker-c-c89pw.c.openshift-qe.internal   10.0.128.4   10.131.0.0/23

$ oc rsh test-rc-48s46
Error from server: error dialing backend: dial tcp 10.0.128.2:10250: i/o timeout

$ oc rsh test-rc-5l6nc
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $
~ $ exit

$ oc rsh test-rc-fz5l6
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ curl 10.0.0.2:8888
10.0.128.100~ $
~ $ exit
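
The pause between shutting down the first egress node and re-checking hostsubnet in this verification was done by hand. A minimal sketch of a polling helper that could replace the manual wait follows. The name wait_until, its arguments, and the timeout values in the usage comment are all hypothetical choices for illustration, not something taken from this report or from oc.

```shell
#!/bin/sh
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Runs COMMAND every INTERVAL seconds until it exits 0 or until at least
# TIMEOUT seconds of sleeping have accumulated.
# Returns 0 if COMMAND succeeded, 1 on timeout.
# Note: plain sh has no local variables, so the _wu_* names are global.
wait_until() {
    _wu_timeout=$1
    _wu_interval=$2
    shift 2
    _wu_elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$_wu_elapsed" -ge "$_wu_timeout" ]; then
            return 1
        fi
        sleep "$_wu_interval"
        _wu_elapsed=$((_wu_elapsed + _wu_interval))
    done
    return 0
}

# Hypothetical usage (assumes a logged-in oc session; the 120 s timeout and
# 5 s interval are arbitrary):
#   node_has_egress_ip() { oc get hostsubnet "$1" | grep -q "\"$2\""; }
#   wait_until 120 5 node_has_egress_ip \
#       jechen-0127b-55qb8-worker-b-hzrfx.c.openshift-qe.internal 10.0.128.100
```

The elapsed time counts only the sleeps, not the run time of the polled command, so the real wall-clock limit can be somewhat longer than TIMEOUT.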
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056