Description of problem:
Tested on an SDN cluster on AWS. After rebooting the egress IP node, the egress IP was lost from the node and egress traffic from the pod in the configured namespace was broken.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
Always

Steps to Reproduce:
1. Patch one node as the egress node and patch the egress IP onto the namespace "test":

$ oc get netnamespace test
NAME   NETID      EGRESS IPS
test   13098326   ["10.0.57.100"]

$ oc get hostsubnet
NAME                                        HOST                                        HOST IP       SUBNET          EGRESS CIDRS   EGRESS IPS
ip-10-0-51-186.us-east-2.compute.internal   ip-10-0-51-186.us-east-2.compute.internal   10.0.51.186   10.129.0.0/23
ip-10-0-57-103.us-east-2.compute.internal   ip-10-0-57-103.us-east-2.compute.internal   10.0.57.103   10.129.2.0/23                  ["10.0.57.100"]
ip-10-0-57-202.us-east-2.compute.internal   ip-10-0-57-202.us-east-2.compute.internal   10.0.57.202   10.128.2.0/23                  []
ip-10-0-67-247.us-east-2.compute.internal   ip-10-0-67-247.us-east-2.compute.internal   10.0.67.247   10.128.0.0/23
ip-10-0-71-99.us-east-2.compute.internal    ip-10-0-71-99.us-east-2.compute.internal    10.0.71.99    10.131.0.0/23                  []
ip-10-0-73-87.us-east-2.compute.internal    ip-10-0-73-87.us-east-2.compute.internal    10.0.73.87    10.130.0.0/23

2. From the pod, confirm the egress IP works:

$ oc rsh -n test hello-pod
/ # curl -s --connect-timeout 10 10.0.12.118:9095
10.0.57.100

3. Reboot the egress node ip-10-0-57-103.us-east-2.compute.internal.

4. Wait for the egress node to become Ready again:

$ oc get nodes
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-51-186.us-east-2.compute.internal   Ready    master   5h12m   v1.22.1+6859754
ip-10-0-57-103.us-east-2.compute.internal   Ready    worker   5h3m    v1.22.1+6859754
ip-10-0-57-202.us-east-2.compute.internal   Ready    worker   5h5m    v1.22.1+6859754
ip-10-0-67-247.us-east-2.compute.internal   Ready    master   5h13m   v1.22.1+6859754
ip-10-0-71-99.us-east-2.compute.internal    Ready    worker   5h5m    v1.22.1+6859754
ip-10-0-73-87.us-east-2.compute.internal    Ready    master   5h12m   v1.22.1+6859754

Actual results:
The egress IP was lost from the node ip-10-0-57-103.us-east-2.compute.internal:

$ oc debug node/ip-10-0-57-103.us-east-2.compute.internal
Starting pod/ip-10-0-57-103us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.57.103
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ip a show ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:00:1a:c3:a3:c6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.57.103/20 brd 10.0.63.255 scope global dynamic noprefixroute ens5
       valid_lft 3300sec preferred_lft 3300sec
    inet6 fe80::af97:f722:f5ff:883f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
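A quick way to re-check whether the egress IP is still programmed on the node is to grep for it on the primary interface. This is a minimal sketch, assuming ens5 is the primary interface as in the output above (the egress IP normally appears there as an additional inet entry):

$ oc debug node/ip-10-0-57-103.us-east-2.compute.internal -- chroot /host ip -4 addr show ens5 | grep 10.0.57.100

Before the reboot this prints the 10.0.57.100 address line; after the reboot it prints nothing.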
Egress traffic from the pod was also broken:

$ oc rsh -n test hello-pod
/ # while true; do curl --connect-timeout 2 10.0.12.118:9095; sleep 2; date; done
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:19 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:23 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:27 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:31 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:35 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:39 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:43 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:10:47 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:51 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:55 UTC 2022
curl: (28) Connection timed out after 2001 milliseconds
Mon Jan 10 06:10:59 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:11:03 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds
Mon Jan 10 06:11:07 UTC 2022
curl: (28) Connection timed out after 2000 milliseconds

SDN logs:

E0110 06:06:53.427240    1869 egressip.go:250] Ignoring invalid HostSubnet ip-10-0-57-103.us-east-2.compute.internal (host: "ip-10-0-57-103.us-east-2.compute.internal", ip: "10.0.57.103", subnet: "10.129.2.0/23"): error retrieving related node object, err: node "ip-10-0-57-103.us-east-2.compute.internal" not found

The CloudPrivateIPConfig still reports the address as assigned to the node:

$ oc get cloudprivateipconfigs 10.0.57.100 -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  creationTimestamp: "2022-01-10T06:03:48Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.57.100
  resourceVersion: "123351"
  uid: feb729da-53f0-4d2a-9f9a-8df07813e21b
spec:
  node: ip-10-0-57-103.us-east-2.compute.internal
status:
  conditions:
  - lastTransitionTime: "2022-01-10T06:03:48Z"
    message: IP address successfully added
    observedGeneration: 1
    reason: CloudResponseSuccess
    status: "True"
    type: Assigned
  node: ip-10-0-57-103.us-east-2.compute.internal

Expected results:
The egress IP is added back to the node after the reboot and egress traffic works again.

Additional info:
The workaround is to delete the sdn pod; see the sketch below.
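A minimal sketch of the workaround, assuming the sdn pod that needs to be restarted is the one running on the affected egress node (the DaemonSet recreates the pod, which reprograms the egress IP):

$ oc get pods -n openshift-sdn -o wide --field-selector spec.nodeName=ip-10-0-57-103.us-east-2.compute.internal -l app=sdn
$ oc delete pod -n openshift-sdn <sdn-pod-on-ip-10-0-57-103>

Once the replacement sdn pod is Running, `ip a show ens5` on the node should list 10.0.57.100 again and the curl test above should return the egress IP.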
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056