Bug 1728342

Summary: [3.11] Random outages with egressIP
Product: OpenShift Container Platform Reporter: shiyang.wang <shiywang>
Component: Networking Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, danw, huirwang, jack.ottofaro, joboyer, openshift-bugs-escalate, sburke, travi, zzhao
Version: 3.11.0 Keywords: Reopened
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: If a pod using an egress IP tries to contact an external host that is not responding, the egress IP monitoring code may mistakenly interpret that as meaning that the node hosting the egress IP is not responding.
Consequence: High-availability egress IPs might get switched from one node to another spuriously.
Fix: The monitoring code now distinguishes the case of "egress node not responding" from "final destination not responding".
Result: High-availability egress IPs will not be switched between nodes unnecessarily.
Story Points: ---
Clone Of: 1718542 Environment:
Last Closed: 2019-07-23 19:56:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1717639, 1718542    
Bug Blocks:    
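
The distinction described in the Doc Text above can be illustrated with a minimal sketch. This is not the openshift-sdn code: the names Sample, Verdict, classify, and should_move_egress_ip are hypothetical, and the idea of combining per-interval packet counts with a direct probe of the egress node is an assumption made only to show the decision.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    OK = "no sign of trouble in this interval"
    DESTINATION_UNRESPONSIVE = "final destination not responding"
    EGRESS_NODE_UNRESPONSIVE = "egress node not responding"


@dataclass
class Sample:
    """One monitoring interval for a single egress node (hypothetical shape)."""
    sent: int            # packets forwarded to the egress node this interval
    received: int        # reply packets that came back through it
    node_probe_ok: bool  # did a direct liveness probe of the node succeed?


def classify(sample: Sample) -> Verdict:
    # No egress traffic, or replies are flowing: nothing suggests a failure.
    if sample.sent == 0 or sample.received > 0:
        return Verdict.OK
    # Traffic went out and nothing came back. That alone is ambiguous,
    # so consult the direct probe before blaming the egress node.
    if sample.node_probe_ok:
        return Verdict.DESTINATION_UNRESPONSIVE
    return Verdict.EGRESS_NODE_UNRESPONSIVE


def should_move_egress_ip(sample: Sample) -> bool:
    # Only an unresponsive egress node justifies moving the egress IP.
    return classify(sample) is Verdict.EGRESS_NODE_UNRESPONSIVE
```

In this sketch, a pod pinging a blocked external IP yields sent > 0 and received == 0 while the probe of the node still succeeds, so the egress IP stays where it is.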

Comment 2 zhaozhanqi 2019-07-11 04:45:34 UTC
Verified this bug on oc v3.11.128 with the following steps:

1. Create a cluster on 3.11 with the networkpolicy plugin.
2. Create a new project.
3. Add an egress IP to the namespace, e.g.:
   oc patch netnamespace z1 -p '{"egressIPs":["10.0.76.100"]}'
4. Add the egress IP to one node, e.g.:
   oc patch hostsubnet preserve-zzhao-311nrr-1 -p '{"egressIPs":["10.0.76.100"]}'

5. Create a test pod and make sure it is scheduled to a node other than the egress IP node.
6. rsh into the test pod and ping a blocked IP.
7. Check the SDN logs on the node hosting the test pod; there are no log entries reporting a node as offline.
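
For repeated runs, the two patch commands from steps 3 and 4 could be generated rather than typed by hand. The following is a small sketch under that assumption; the helper name egress_ip_patch is hypothetical, and the namespace, node name, and IP are the ones quoted in the steps above. It only builds and prints the commands and does not contact a cluster.

```python
import json
import shlex
from typing import List


def egress_ip_patch(resource: str, name: str, ips: List[str]) -> List[str]:
    """Build the argv for: oc patch <resource> <name> -p '{"egressIPs": [...]}'"""
    # json.dumps guarantees the patch body is valid JSON.
    payload = json.dumps({"egressIPs": ips})
    return ["oc", "patch", resource, name, "-p", payload]


commands = [
    # Step 3: assign the egress IP to the project's NetNamespace.
    egress_ip_patch("netnamespace", "z1", ["10.0.76.100"]),
    # Step 4: host the same egress IP on one node's HostSubnet.
    egress_ip_patch("hostsubnet", "preserve-zzhao-311nrr-1", ["10.0.76.100"]),
]

for argv in commands:
    # shlex.join quotes the JSON payload so the line can be pasted into a shell.
    print(shlex.join(argv))
```

To apply the patches instead of printing them, each argv could be passed to subprocess.run(argv, check=True) on a host where oc is already logged in to the cluster.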

Comment 4 errata-xmlrpc 2019-07-23 19:56:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753