Description of problem:
Tested on an SDN cluster on AWS. After rebooting the egress IP node, the egress IP was lost from the node and egress traffic from the pod in the configured namespace was broken.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
Always

Steps to Reproduce:
1. Patch one node as the egress node and patch the egress IP onto the namespace "test":

$ oc get netnamespace test
NAME   NETID      EGRESS IPS
test   13098326   ["10.0.57.100"]

$ oc get hostsubnet
NAME                                        HOST                                        HOST IP       SUBNET          EGRESS CIDRS   EGRESS IPS
ip-10-0-51-186.us-east-2.compute.internal   ip-10-0-51-186.us-east-2.compute.internal   10.0.51.186   10.129.0.0/23
ip-10-0-57-103.us-east-2.compute.internal   ip-10-0-57-103.us-east-2.compute.internal   10.0.57.103   10.129.2.0/23                  ["10.0.57.100"]
ip-10-0-57-202.us-east-2.compute.internal   ip-10-0-57-202.us-east-2.compute.internal   10.0.57.202   10.128.2.0/23                  []
ip-10-0-67-247.us-east-2.compute.internal   ip-10-0-67-247.us-east-2.compute.internal   10.0.67.247   10.128.0.0/23
ip-10-0-71-99.us-east-2.compute.internal    ip-10-0-71-99.us-east-2.compute.internal    10.0.71.99    10.131.0.0/23                  []
ip-10-0-73-87.us-east-2.compute.internal    ip-10-0-73-87.us-east-2.compute.internal    10.0.73.87    10.130.0.0/23

2. From the pod, confirm the egress IP works:

$ oc rsh -n test hello-pod
/ # curl -s --connect-timeout 10 10.0.12.118:9095
10.0.57.100

3. Reboot the egress node ip-10-0-57-103.us-east-2.compute.internal.

4. Wait for the egress node to become Ready again:

$ oc get nodes
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-51-186.us-east-2.compute.internal   Ready    master   5h12m   v1.22.1+6859754
ip-10-0-57-103.us-east-2.compute.internal   Ready    worker   5h3m    v1.22.1+6859754
ip-10-0-57-202.us-east-2.compute.internal   Ready    worker   5h5m    v1.22.1+6859754
ip-10-0-67-247.us-east-2.compute.internal   Ready    master   5h13m   v1.22.1+6859754
ip-10-0-71-99.us-east-2.compute.internal    Ready    worker   5h5m    v1.22.1+6859754
ip-10-0-73-87.us-east-2.compute.internal    Ready    master   5h12m   v1.22.1+6859754

Actual results:
The egress IP was lost from the node ip-10-0-57-103.us-east-2.compute.internal:

$ oc debug node/ip-10-0-57-103.us-east-2.compute.internal
Starting pod/ip-10-0-57-103us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.57.103
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ip a show ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:00:1a:c3:a3:c6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.57.103/20 brd 10.0.63.255 scope global dynamic noprefixroute ens5
       valid_lft 3300sec preferred_lft 3300sec
    inet6 fe80::af97:f722:f5ff:883f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
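A quick way to re-check whether the egress IP is still programmed on the node is to grep for it on the primary interface. This is a minimal sketch, assuming ens5 is the primary interface as in the output above (the egress IP normally appears there as an additional inet entry):

$ oc debug node/ip-10-0-57-103.us-east-2.compute.internal -- chroot /host ip -4 addr show ens5 | grep 10.0.57.100

Before the reboot this prints the 10.0.57.100 address line; after the reboot it prints nothing.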
Egress traffic from the pod was also broken:

$ oc rsh -n test hello-pod
/ # while true; do curl --connect-timeout 2 10.0.12.118:9095; sleep 2; date; done
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:19 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:23 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:27 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:31 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:35 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:39 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:43 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:47 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:51 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:55 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:59 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:11:03 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:11:07 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds

SDN logs:

E0110 06:06:53.427240    1869 egressip.go:250] Ignoring invalid HostSubnet ip-10-0-57-103.us-east-2.compute.internal (host: "ip-10-0-57-103.us-east-2.compute.internal", ip: "10.0.57.103", subnet: "10.129.2.0/23"): error retrieving related node object, err: node "ip-10-0-57-103.us-east-2.compute.internal" not found

The CloudPrivateIPConfig still reports the address as assigned to the node:

$ oc get cloudprivateipconfigs 10.0.57.100 -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  creationTimestamp: "2022-01-10T06:03:48Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.57.100
  resourceVersion: "123351"
  uid: feb729da-53f0-4d2a-9f9a-8df07813e21b
spec:
  node: ip-10-0-57-103.us-east-2.compute.internal
status:
  conditions:
  - lastTransitionTime: "2022-01-10T06:03:48Z"
    message: IP address successfully added
    observedGeneration: 1
    reason: CloudResponseSuccess
    status: "True"
    type: Assigned
  node: ip-10-0-57-103.us-east-2.compute.internal

Expected results:
The egress IP is added back to the node after the reboot and egress traffic works again.

Additional info:
The workaround is to delete the sdn pod; see the sketch below.
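A minimal sketch of the workaround, assuming the sdn pod that needs to be restarted is the one running on the affected egress node (the DaemonSet recreates the pod, which reprograms the egress IP):

$ oc get pods -n openshift-sdn -o wide --field-selector spec.nodeName=ip-10-0-57-103.us-east-2.compute.internal -l app=sdn
$ oc delete pod -n openshift-sdn <sdn-pod-on-ip-10-0-57-103>

Once the replacement sdn pod is Running, `ip a show ens5` on the node should list 10.0.57.100 again and the curl test above should return the egress IP.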
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056