Bug 1717639

Summary: Random outages with egressIP
Product: OpenShift Container Platform
Reporter: Pedro Amoedo <pamoedom>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: anusaxen, aos-bugs, danw, erich, jack.ottofaro, jcrumple, misalunk, nchavan, openshift-bugs-escalate, weliang
Version: 3.11.0
Target Milestone: ---
Target Release: 4.2.0
Hardware: x86_64
OS: Linux
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a pod using an egress IP tries to contact an external host that is not responding, the egress IP monitoring code may mistakenly interpret that as meaning that the node hosting the egress IP is not responding.
Consequence: High-availability egress IPs might get switched from one node to another spuriously.
Fix: The monitoring code now distinguishes the case of "egress node not responding" from "final destination not responding".
Result: High-availability egress IPs will not be switched between nodes unnecessarily.
Story Points: ---
Clone Of:
Clones: 1718541 1718542 1732486 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:31:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1718541, 1718542, 1728342, 1732486

Description Pedro Amoedo 2019-06-05 20:42:35 UTC
Description of problem:

We have a customer suffering strange egressIP outages when vxlan_monitor performs its "egressVXLANMonitor" checks. *Note the timestamps* of the following log entries (full log attached):

I0601 16:02:34.045175   14443 vxlan_monitor.go:231] Node may be offline... retrying
I0602 02:56:09.040928   14443 vxlan_monitor.go:231] Node may be offline... retrying
W0602 05:26:09.042415   14443 vxlan_monitor.go:226] Node is offline
I0602 05:26:14.040664   14443 vxlan_monitor.go:213] Node is back online
I0602 05:31:39.039763   14443 vxlan_monitor.go:231] Node may be offline... retrying
I0602 16:13:44.039254   14443 vxlan_monitor.go:231] Node may be offline... retrying
W0603 03:56:34.039591   14443 vxlan_monitor.go:226] Node is offline
I0603 03:56:39.039668   14443 vxlan_monitor.go:213] Node is back online

According to the source code, those retry checks should happen within seconds of each other, right?

Version-Release number of selected component (if applicable):

OCP v3.11.98

How reproducible:

As per customer comment:

EgressIP needs to be set up together with a blocking egressnetworkpolicy. Then, from a pod using the egressIP, try to send a packet to a blocked destination. This only increments the outgoing packet counter; no reply packet is ever received. Per the code, only the counters are checked again, so the check keeps failing, and after maxRetries failures the egress node is treated as being down. I suspect this removes routing towards that node. Only after this does the code start pinging the egress node at all, at which point it finds that the node is back online. I think that at the very first occurrence where we suspect the node might be down, we should start pinging it; that would avoid this behavior. Or, more simply, a background process sending pings every second could be set up, which would also simplify the code.

This is a rare case: it occurs when the source node has light egress traffic, so a single packet that gets no response can trigger it.

Steps to Reproduce:

Actual results:

Unusually long delays between egressVXLANMonitor retries.

Expected results:

Retry checks performed within seconds. Maybe a node.retries counter reset as follows?

diff --git a/pkg/network/node/vxlan_monitor.go b/pkg/network/node/vxlan_monitor.go
index b7b1c8c2f1..37b329efcc 100644
--- a/pkg/network/node/vxlan_monitor.go
+++ b/pkg/network/node/vxlan_monitor.go
@@ -232,6 +232,8 @@ func (evm *egressVXLANMonitor) check(retryOnly bool) bool {
                                        retry = true
+                                } else {
+                                        node.retries = 0
                                 }

Additional info:


Comment 17 Miheer Salunke 2019-06-07 03:31:22 UTC
A PR has been sent for this issue by rkojedzinszky. PTAL.


Comment 19 zhaozhanqi 2019-06-10 10:02:07 UTC
Reproduction steps used to verify this bug:

1. Create a cluster on 3.11 with the networkpolicy plugin
2. Create a new project
3. Add an egressIP to the namespace, e.g.:
   oc patch netnamespace z1 -p '{"egressIPs":[""]}'
4. Add the egressIP to one node, e.g.:
   oc patch hostsubnet preserve-zzhao-311nrr-1 -p '{"egressIPs":[""]}'

5. Create a test pod and make sure it is scheduled to a node other than the egress IP node
6. rsh into the test pod and ping a blocked IP
7. Check the sdn logs of the node hosting the test pod

Comment 24 Weibin Liang 2019-06-25 19:14:40 UTC
The above two PRs are for v3.11; I cannot find the PR number for v4.2.

Moving the bug from ON_QA back to ASSIGNED.

Comment 25 Dan Winship 2019-06-25 20:40:42 UTC
(In reply to Weibin Liang from comment #24)
> Above two PRs are for v3.11, can not find the PR number for v4.2

The PRs in the comments are the customer's original 3.11 PR; the PR linked from the "External Trackers" table is correct: https://github.com/openshift/origin/pull/23089

Comment 26 Weibin Liang 2019-06-25 21:49:50 UTC
Following the steps in comment 19, testing passed on 4.2.0-0.nightly-2019-06-21-041727.

Comment 30 Dan Winship 2019-07-23 12:41:09 UTC
Created bug 1732486 for the 3.10 backport.

Comment 33 errata-xmlrpc 2019-10-16 06:31:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.