Description of problem:
The outbound traffic was broken intermittently after shutting down one EgressIP node.

Version-Release number of selected component (if applicable):
4.8.0-0.ci-2021-04-12-041028

How reproducible:
Always

Steps to Reproduce:
1. Patch EgressIPs onto 3 nodes manually.

oc get hostsubnet
NAME              HOST              HOST IP         SUBNET          EGRESS CIDRS   EGRESS IPS
compute-0         compute-0         172.31.248.75   10.128.2.0/23                  ["172.31.248.202"]
compute-1         compute-1         172.31.248.80   10.129.2.0/23
compute-2         compute-2         172.31.248.86   10.131.0.0/23                  ["172.31.248.203"]
control-plane-0   control-plane-0   172.31.248.81   10.130.0.0/23                  ["172.31.248.201"]
control-plane-1   control-plane-1   172.31.248.83   10.128.0.0/23
control-plane-2   control-plane-2   172.31.248.85   10.129.0.0/23

2. Create a namespace "test", patch multiple EgressIPs onto "test", then create a pod in that namespace.

oc get netnamespace test
NAME   NETID      EGRESS IPS
test   15436181   ["172.31.248.201","172.31.248.203","172.31.248.202"]

oc get pods -n test -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
hello-pod   1/1     Running   0          21m   10.129.2.34   compute-1   <none>           <none>

3. Check the source IP of the outbound traffic by accessing an ip-echo service outside the cluster.

oc rsh -n test hello-pod
/ # while true; do curl 172.31.249.80:9095 --connect-timeout 2; sleep 2; echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.203

We can see the EgressIPs were load-balanced among the different nodes.

4. Shut down one EgressIP node, here compute-2.

Actual results:
The outbound traffic was intermittently broken.

while true; do curl 172.31.249.80:9095 --connect-timeout 2; sleep 2; echo ""; done
172.31.248.202
172.31.248.203
172.31.248.201
172.31.248.203
172.31.248.203
172.31.248.202
......
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds
172.31.248.201
curl: (28) Connection timed out after 2000 milliseconds
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.202
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.201
172.31.248.202
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.201
172.31.248.202
172.31.248.202
172.31.248.201
172.31.248.201

Expected results:
The remaining available EgressIP nodes should be used, and outbound traffic should not be broken.

Additional info:
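The intermittent breakage above can be quantified by capturing the probe loop's output to a file and summarizing it. Below is a minimal sketch; the log file name "egress-probe.log" and its sample contents are illustrative only (in a real run, capture the loop with `2>&1 | tee egress-probe.log`, since curl writes its timeout messages to stderr).

```shell
# Illustrative sample of captured probe output (in practice this file
# would be produced by the curl loop with stderr redirected: 2>&1).
cat > egress-probe.log <<'EOF'
172.31.248.202
172.31.248.201
curl: (28) Connection timed out after 2001 milliseconds
172.31.248.201
172.31.248.202
172.31.248.201
EOF

# Total probes = every line in the log (each probe prints exactly one line).
total=$(wc -l < egress-probe.log)

# Failed probes = curl timeout messages.
failures=$(grep -c 'Connection timed out' egress-probe.log)

echo "total probes: $total"
echo "timed out:    $failures"

# Count how often each egress IP answered, most frequent first.
echo "replies per egress IP:"
grep -E '^[0-9]+(\.[0-9]+){3}$' egress-probe.log | sort | uniq -c | sort -rn
```

With the sample log above this reports 6 probes, 1 timeout, and per-IP reply counts, making it easy to see both the failure rate and whether the surviving EgressIPs are still being load-balanced.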
Setting the bug to blocker for 4.8. I thought this PR would get in a while ago (seeing as how I posted it months ago) and hence didn't mark it as such. However, we are fast approaching code freeze, this is a regression from 4.7, and we cannot ship openshift-sdn with this problem, so I am setting it to blocker so that it shows up on people's radar.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438