Description of problem:

We have a customer suffering strange egressIP outages when vxlan_monitor performs the "egressVXLANMonitor" checks. *Look at the timestamps* of the following checks (please find the full log attached):

~~~
I0601 16:02:34.045175   14443 vxlan_monitor.go:231] Node 192.168.52.153 may be offline... retrying
I0602 02:56:09.040928   14443 vxlan_monitor.go:231] Node 192.168.52.153 may be offline... retrying
W0602 05:26:09.042415   14443 vxlan_monitor.go:226] Node 192.168.52.153 is offline
I0602 05:26:14.040664   14443 vxlan_monitor.go:213] Node 192.168.52.153 is back online
I0602 05:31:39.039763   14443 vxlan_monitor.go:231] Node 192.168.52.153 may be offline... retrying
I0602 16:13:44.039254   14443 vxlan_monitor.go:231] Node 192.168.52.153 may be offline... retrying
W0603 03:56:34.039591   14443 vxlan_monitor.go:226] Node 192.168.52.153 is offline
I0603 03:56:39.039668   14443 vxlan_monitor.go:213] Node 192.168.52.153 is back online
~~~

Per the source code, those retry checks should be performed within seconds, right?

Version-Release number of selected component (if applicable):
OCP v3.11.98

How reproducible:
As per customer comment:
--------------------------
EgressIP needs to be set up with a blocking egressnetworkpolicy. Then, from a pod using an egressIP, try to send a packet to a blocked destination. This only increments the outgoing packet counter; no packet is ever received back. Per the code, only the counters are checked again, so the check keeps failing, and after maxRetries failures the egress node is treated as being down. I suspect this removes routing towards that node. Only after this does the code start pinging the egress router at all, which then finds that the node is back online.

I think that at the very first occurrence where we suspect the node might be down, we should start pinging it. That would avoid this behavior. Or, more simply, a background process could be set up to send pings every second; that would also simplify the code. This is a rare case: it requires a source node with light egress traffic, where a single packet without a response can trigger it.
--------------------------

Steps to Reproduce:
1.
2.
3.

Actual results:
Unusually long delays between egressVXLANMonitor retries.

Expected results:
Maybe a node.retries counter reset, as follows?

~~~
diff --git a/pkg/network/node/vxlan_monitor.go b/pkg/network/node/vxlan_monitor.go
index b7b1c8c2f1..37b329efcc 100644
--- a/pkg/network/node/vxlan_monitor.go
+++ b/pkg/network/node/vxlan_monitor.go
@@ -232,6 +232,8 @@ func (evm *egressVXLANMonitor) check(retryOnly bool) bool {
 				retry = true
 				continue
 			}
+		} else {
+			node.retries = 0
 		}
 	}
~~~

Additional info:
https://github.com/openshift/origin/blob/master/pkg/network/node/vxlan_monitor.go
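To illustrate the effect of the one-line fix, here is a minimal, self-contained sketch (not the actual origin code) of the monitor's counter-based check: if packets went out to the egress node but none came back since the last poll, a retry counter is bumped, and after maxRetries the node is declared offline. Without the reset, unrelated failures days apart accumulate until the node is wrongly marked offline; with it, any healthy poll clears the slate. The `node` struct and `check` function here are simplified stand-ins for the real `egressVXLANMonitor` state.

```go
package main

import "fmt"

const maxRetries = 2 // illustrative; the real monitor has its own constant

// node loosely mirrors the per-node state kept by egressVXLANMonitor:
// last observed VXLAN tx/rx packet counts plus a retry counter.
type node struct {
	out, in int // packets sent to / received back from the egress node
	retries int
	offline bool
}

// check is a simplified sketch of the monitor's poll: tx grew but rx did
// not means the node may be offline, so bump retries; past maxRetries,
// mark it offline. The fix from this report is the else branch: a healthy
// poll resets retries, so stale failures cannot accumulate over days.
func check(n *node, out, in int) {
	sentMore := out > n.out
	gotMore := in > n.in
	n.out, n.in = out, in

	if sentMore && !gotMore {
		n.retries++
		if n.retries > maxRetries {
			n.offline = true
		}
	} else {
		n.retries = 0 // the proposed one-line fix
	}
}

func main() {
	n := &node{}
	check(n, 10, 5)  // healthy: tx and rx both grew
	check(n, 20, 5)  // tx grew, rx did not -> retries becomes 1
	check(n, 30, 10) // healthy again -> retries resets to 0
	fmt.Println(n.retries, n.offline)
}
```

Without the else branch, the second call's retry would survive indefinitely, and a handful of isolated blocked-destination packets spread across days would eventually push the counter past maxRetries, matching the timestamps in the log above.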
A PR for this issue has been sent by rkojedzinszky. PTAL. https://github.com/openshift/origin/pull/23069
Reproduction steps used to verify this bug:
1. Create a cluster on 3.11 with the networkpolicy plugin
2. Create a new project
3. Add an egressIP to the namespace, e.g.: oc patch netnamespace z1 -p '{"egressIPs":["10.0.76.100"]}'
4. Add the egressIP to one node, e.g.: oc patch hostsubnet preserve-zzhao-311nrr-1 -p '{"egressIPs":["10.0.76.100"]}'
5. Create a test pod and make sure it is scheduled to a node other than the egress IP node
6. rsh into the test pod and ping a blocked IP
7. Check the SDN logs of the node hosting the test pod
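The steps above can be consolidated into a rough script. The project name (z1), egress IP (10.0.76.100), and node name come from the example commands; `TEST_POD` and `BLOCKED_IP` are placeholders to adjust, and pod creation (step 5) is omitted. The whole body is skipped when `oc` is not on the PATH, since this is only an outline, not a turnkey reproducer.

```shell
#!/bin/sh
# Sketch of the verification steps. TEST_POD and BLOCKED_IP are
# placeholders; the other values are from the example commands above.

PROJECT="z1"
EGRESS_IP="10.0.76.100"
EGRESS_NODE="preserve-zzhao-311nrr-1"
TEST_POD="test-pod"        # placeholder: a pod NOT scheduled on the egress node
BLOCKED_IP="203.0.113.10"  # placeholder: an IP denied by the egressnetworkpolicy

if command -v oc >/dev/null 2>&1; then
  oc new-project "$PROJECT"
  # Steps 3-4: assign the egress IP to the netnamespace and to one node.
  oc patch netnamespace "$PROJECT" -p "{\"egressIPs\":[\"$EGRESS_IP\"]}"
  oc patch hostsubnet "$EGRESS_NODE" -p "{\"egressIPs\":[\"$EGRESS_IP\"]}"
  # Step 6: from the test pod, ping a blocked IP (the ping is expected to fail).
  oc rsh "$TEST_POD" ping -c 3 "$BLOCKED_IP" || true
  # Step 7: inspect the SDN logs on the test pod's node for repeated
  # "may be offline... retrying" messages and compare their timestamps.
else
  echo "oc not found; this script is only a sketch of the manual steps"
fi
SKETCH_DONE=yes
```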
The above two PRs are for v3.11; I cannot find the PR number for v4.2.

Moving the bug from ON_QA back to ASSIGNED.
(In reply to Weibin Liang from comment #24)
> Above two PRs are for v3.11, can not find the PR number for v4.2

The PRs in the comments are the customer's original 3.11 PRs; the PR linked from the "External Trackers" table is the correct one: https://github.com/openshift/origin/pull/23089
Following the steps in comment 19, testing passed on 4.2.0-0.nightly-2019-06-21-041727.
Created bug 1732486 for the 3.10 backport.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922