Bug 1948436 - The outbound traffic was broken intermittently after shutting down one egressIP node
Summary: The outbound traffic was broken intermittently after shutting down one egressIP node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Alexander Constantinescu
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-12 07:37 UTC by huirwang
Modified: 2021-07-27 22:59 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:58:59 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 311 0 None closed Bug 1948436: remove vxlan_monitor and OVS packet stat parsing 2021-06-10 12:04:50 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:59:18 UTC

Description huirwang 2021-04-12 07:37:21 UTC
Description of problem:
The outbound traffic was broken intermittently after shutting down one egressIP node.

Version-Release number of selected component (if applicable):
4.8.0-0.ci-2021-04-12-041028

How reproducible:
Always

Steps to Reproduce:
1. Patch EgressIPs to 3 nodes manually.
oc get hostsubnet
NAME              HOST              HOST IP         SUBNET          EGRESS CIDRS   EGRESS IPS
compute-0         compute-0         172.31.248.75   10.128.2.0/23                  ["172.31.248.202"]
compute-1         compute-1         172.31.248.80   10.129.2.0/23                  
compute-2         compute-2         172.31.248.86   10.131.0.0/23                  ["172.31.248.203"]
control-plane-0   control-plane-0   172.31.248.81   10.130.0.0/23                  ["172.31.248.201"]
control-plane-1   control-plane-1   172.31.248.83   10.128.0.0/23                  
control-plane-2   control-plane-2   172.31.248.85   10.129.0.0/23 

2. Create a namespace test, patch multiple EgressIPs to "test", then create a pod under test.
oc get netnamespace test
NAME   NETID      EGRESS IPS
test   15436181   ["172.31.248.201","172.31.248.203","172.31.248.202"]

oc get pods -n test -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
hello-pod   1/1     Running   0          21m   10.129.2.34   compute-1   <none>           <none>
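
The report does not show the commands used for the manual patching in steps 1 and 2. A minimal sketch, assuming the `oc patch ... --type=merge` approach for openshift-sdn HostSubnet and NetNamespace objects, could look like the following. The helper names `egress_patch_json` and `patch_egress_ips` and the `OC` override variable are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the OC variable are assumptions, not part of the report.

# Build a merge-patch body such as {"egressIPs":["a","b"]} from the IPs given as arguments.
egress_patch_json() {
  local ips="" ip
  for ip in "$@"; do
    # Append each IP as a quoted JSON string, comma-separated.
    ips="${ips:+$ips,}\"$ip\""
  done
  printf '{"egressIPs":[%s]}\n' "$ips"
}

# Apply the patch to a hostsubnet or netnamespace object.
# Set OC=echo to print the command instead of running it against a cluster.
patch_egress_ips() {
  local kind="$1" name="$2"
  shift 2
  "${OC:-oc}" patch "$kind" "$name" --type=merge -p "$(egress_patch_json "$@")"
}

# Usage matching the values shown in steps 1 and 2:
#   patch_egress_ips hostsubnet compute-0 172.31.248.202
#   patch_egress_ips hostsubnet compute-2 172.31.248.203
#   patch_egress_ips hostsubnet control-plane-0 172.31.248.201
#   patch_egress_ips netnamespace test 172.31.248.201 172.31.248.203 172.31.248.202
```

Calling `patch_egress_ips` with no IP arguments sends an empty `egressIPs` list, which would clear the assignment on that object.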


3. Check the source IP of the outbound traffic by accessing an ip-echo service outside the cluster.
oc rsh -n test hello-pod
/ # while true; do  curl 172.31.249.80:9095 --connect-timeout 2 ;sleep 2 ;echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.203
We can see that the EgressIPs were load-balanced across the different nodes.

4. Shut down one EgressIP node, in this case compute-2.

Actual results:
The outbound traffic was intermittently broken.

while true; do  curl 172.31.249.80:9095 --connect-timeout 2 ;sleep 2 ;echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
......
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds

172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds

172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds

curl: (28) Connection timed out after 2001 milliseconds

172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
curl: (28) Connection timed out after 2001 milliseconds

172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
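
To quantify how often the loop above fails, its output could be saved to a file and tallied per source IP. The following is a rough sketch rather than part of the original test procedure; the function name `summarize_egress` and the file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: counts responses per source IP and curl timeouts in captured loop output.
# Reads the files given as arguments, or standard input when none are given.
summarize_egress() {
  awk '
    # curl exit code 28 marks a connection timeout.
    /^curl: \(28\)/ { timeouts++; next }
    # A line holding only a dotted-quad address is one successful response.
    /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { seen[$0]++ }
    END {
      for (ip in seen) print ip, seen[ip]
      print "timeouts", timeouts + 0
    }
  ' "$@" | LC_ALL=C sort
}

# Usage, assuming the loop output was saved to a file:
#   summarize_egress curl-loop.log
```

Each output line pairs a source IP with its response count, followed by a final `timeouts` line, so a missing or shrinking entry for a given EgressIP is easy to spot between runs.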

Expected results:
The remaining available EgressIP nodes should be used, and outbound traffic should not be broken.

Additional info:

Comment 2 Alexander Constantinescu 2021-06-01 18:36:58 UTC
Setting the bug to blocker for 4.8. I thought this PR would get in a while ago (given that I posted it months ago) and hence didn't mark it as such. However, since we are fast approaching code freeze, this is a regression from 4.7, and we cannot ship openshift-sdn with this problem, I am setting it to blocker so that it shows up on people's radar.

Comment 7 errata-xmlrpc 2021-07-27 22:58:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

