Bug 1948436

Summary: Outbound traffic was intermittently broken after shutting down one egressIP node
Product: OpenShift Container Platform Reporter: huirwang
Component: Networking Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aconstan
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:58:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description huirwang 2021-04-12 07:37:21 UTC
Description of problem:
Outbound traffic was intermittently broken after shutting down one egressIP node.

Version-Release number of selected component (if applicable):
4.8.0-0.ci-2021-04-12-041028

How reproducible:
Always

Steps to Reproduce:
1. Manually assign EgressIPs to 3 nodes.
oc get hostsubnet
NAME              HOST              HOST IP         SUBNET          EGRESS CIDRS   EGRESS IPS
compute-0         compute-0         172.31.248.75   10.128.2.0/23                  ["172.31.248.202"]
compute-1         compute-1         172.31.248.80   10.129.2.0/23                  
compute-2         compute-2         172.31.248.86   10.131.0.0/23                  ["172.31.248.203"]
control-plane-0   control-plane-0   172.31.248.81   10.130.0.0/23                  ["172.31.248.201"]
control-plane-1   control-plane-1   172.31.248.83   10.128.0.0/23                  
control-plane-2   control-plane-2   172.31.248.85   10.129.0.0/23 

2. Create a namespace named "test", assign multiple EgressIPs to it, then create a pod in that namespace.
oc get netnamespace test
NAME   NETID      EGRESS IPS
test   15436181   ["172.31.248.201","172.31.248.203","172.31.248.202"]

oc get pods -n test -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
hello-pod   1/1     Running   0          21m   10.129.2.34   compute-1   <none>           <none>


3. Check the source IP of the outbound traffic by accessing an ip-echo service that runs outside the cluster.
oc rsh -n test hello-pod
/ # while true; do  curl 172.31.249.80:9095 --connect-timeout 2 ;sleep 2 ;echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.203
The source IPs show that outbound traffic was load-balanced across the EgressIPs hosted on the different nodes.

4. Shut down one EgressIP node, in this case compute-2.
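
Steps 1 and 2 do not show the patch commands that were used. A minimal sketch of how the assignments could be made, assuming the merge-patch form documented for openshift-sdn (the helper names and the DRY_RUN switch are illustrative and are not part of the report):

```shell
#!/usr/bin/env bash
# Sketch only: the report does not show the exact commands that were run.
# With DRY_RUN=1 (the default here) the oc commands are printed, not executed.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: assign one egress IP to a node's HostSubnet.
assign_node_egress_ip() {
  local node="$1" ip="$2"
  run oc patch hostsubnet "$node" --type=merge -p "{\"egressIPs\":[\"$ip\"]}"
}

# Step 2: assign a list of egress IPs to a namespace's NetNamespace.
assign_namespace_egress_ips() {
  local ns="$1"; shift
  local json="" ip
  for ip in "$@"; do
    # Build a comma-separated list of quoted IPs.
    json="${json:+$json,}\"$ip\""
  done
  run oc patch netnamespace "$ns" --type=merge -p "{\"egressIPs\":[$json]}"
}

# Values taken from the hostsubnet and netnamespace output in steps 1 and 2.
assign_node_egress_ip control-plane-0 172.31.248.201
assign_node_egress_ip compute-0 172.31.248.202
assign_node_egress_ip compute-2 172.31.248.203
assign_namespace_egress_ips test 172.31.248.201 172.31.248.203 172.31.248.202
```

Setting DRY_RUN=0 would run the commands against the cluster. Creating the namespace and the pod is left out because the report does not say how they were created.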

Actual results:
The outbound traffic was intermittently broken.

while true; do  curl 172.31.249.80:9095 --connect-timeout 2 ;sleep 2 ;echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
......
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds

172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds

172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
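
To quantify how often requests fail, the curl loop output can be tallied per source IP. A minimal sketch, assuming the loop output has been saved to a file (the function name and the file name curl.log are hypothetical):

```shell
#!/usr/bin/env bash
# Summarize the output of the curl loop read from stdin:
# one line per egress IP with its reply count, then timeouts and total requests.
summarize_egress_log() {
  awk '
    # curl exit code 28 is a timeout; count it as a failed request.
    /^curl: \(28\)/ { timeouts++; total++; next }
    # A bare IPv4 address is a reply from the ip-echo service.
    /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { seen[$0]++; total++; next }
    END {
      for (ip in seen) printf "%s %d\n", ip, seen[ip]
      printf "timeouts %d\n", timeouts
      printf "total %d\n", total
    }
  ' | LC_ALL=C sort
}
```

Usage: `summarize_egress_log < curl.log`. Blank lines and the "......" elision marker are ignored, so the counts cover only the requests that appear in the saved output.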

Expected results:
Traffic should be routed through the remaining available EgressIP nodes, and outbound traffic should not be broken.

Additional info:

Comment 2 Alexander Constantinescu 2021-06-01 18:36:58 UTC
Setting the bug to blocker for 4.8. I thought this PR would get in a while ago (seeing as how I posted it months ago) and hence didn't mark it as such. However, given that we are fast approaching code freeze, that this is a regression from 4.7, and that we cannot ship openshift-sdn with this problem, I am setting it to blocker so that it shows up on people's radar.

Comment 7 errata-xmlrpc 2021-07-27 22:58:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438