Bug 1571430 - [3.9] Update of Egress Network Policy causes temporary egress failure when using dnsName
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ravi Sankar
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-24 18:54 UTC by Ravi Sankar
Modified: 2018-05-17 06:44 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Updating an egress policy required blocking outgoing traffic, patching the OVS flows, and then re-enabling traffic, but OVS flow generation for DNS names was slow. Consequence: Several seconds of egress traffic downtime, which may not be acceptable. Fix: Egress policy updates now pre-populate all new OVS flows before blocking the outgoing traffic. Result: Reduced downtime during egress policy updates.
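The ordering change described above can be sketched as follows. This is a hypothetical illustration, not the actual origin code: flow rules are plain strings, and resolve_dns stands in for the slow per-dnsName resolution that made the blocking window long in the old ordering.

```python
import time

def resolve_dns(name):
    # Stand-in for a slow DNS resolution; the real delay was seconds.
    time.sleep(0.01)
    return "10.0.0.1"

def update_policy_old(table, rules):
    """Old ordering: install the drop rule first, then generate the new
    flows -- the slow DNS work happens while egress is blocked."""
    table[:] = ["priority=65535 actions=drop"]                    # egress blocked here
    new_flows = ["nw_dst=%s actions=output:2" % resolve_dns(r) for r in rules]
    table[:] = new_flows                                          # egress restored

def update_policy_new(table, rules):
    """Fixed ordering: pre-populate all new flows first, so the blocking
    window no longer includes the slow DNS resolution."""
    new_flows = ["nw_dst=%s actions=output:2" % resolve_dns(r) for r in rules]
    table[:] = ["priority=65535 actions=drop"]                    # brief block only
    table[:] = new_flows
```

Both orderings end in the same flow table; only the duration of the drop window differs.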
Clone Of:
Environment:
Last Closed: 2018-05-17 06:43:40 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1566 None None None 2018-05-17 06:44:00 UTC

Description Ravi Sankar 2018-04-24 18:54:54 UTC
+++ This bug was initially created as a clone of Bug #1558484 +++

Description of problem:
Every 30 minutes the egress network policies are updated/re-written, even when no policy has changed. As part of updating a policy in a project, a maximum-priority drop rule is applied to the project's OpenFlow tables and then the rules are rewritten. For the duration of this rewrite, no egress traffic is permitted from any pods in the project, and no DNS lookups are permitted either. We are seeing occasions where this rewrite of rules can take on the order of 5-6 seconds, which will potentially impact our apps.

The customer confirmed that the redhat/openshift-ovs-multitenant plugin is configured on masters and nodes, and that they followed the notes in our documentation for this specific configuration:

https://docs.openshift.com/container-platform/3.7/admin_guide/managing_networking.html#admin-guide-limit-pod-access-egress


Expected results:
Provide a way to control when EgressNetworkPolicies are updated, and/or avoid application downtime, since traffic appears to stop while the policies are updated.

Additional info:
OCP is using vSphere Cloud Provider.

I've been looking at this file:

https://raw.githubusercontent.com/openshift/origin/master/api/swagger-spec/oapi-v1.json

I don't see any variable that would help with this, or anything we can configure in the policy.json used to create the EgressNetworkPolicy object to control or block the update timing.

I also don't know whether this might be related to this documented behavior:

"Domain name updates are polled based on the TTL (time to live) value of the domain of the local non-authoritative server, or 30 minutes if the TTL is unable to be fetched. The pod should also resolve the domain from the same local non-authoritative server when necessary, otherwise the IP addresses for the domain perceived by the egress network policy controller and the pod will be different, and the egress network policy may not be enforced as expected. In the above example, suppose www.foo.com resolved to 10.11.12.13 and has a DNS TTL of one minute, but was later changed to 20.21.22.23. OpenShift Container Platform will then take up to one minute to adapt to these changes."
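The polling rule quoted above (re-resolve a domain after its DNS TTL, falling back to 30 minutes when the TTL cannot be fetched) can be sketched as a small helper. This is illustrative only, not the actual controller code:

```python
# 30-minute fallback when the TTL cannot be fetched, per the docs quoted above.
DEFAULT_REQUERY_SECONDS = 30 * 60

def next_requery_seconds(ttl_seconds):
    """Return how long to wait before re-resolving a dnsName.

    ttl_seconds: the TTL of the resolved record, or None if unavailable.
    """
    if ttl_seconds is None or ttl_seconds <= 0:
        return DEFAULT_REQUERY_SECONDS
    return ttl_seconds
```

So with a one-minute TTL (the www.foo.com example in the quote), the controller takes up to one minute to notice an address change; with no TTL, up to 30 minutes.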

Comment 1 Ravi Sankar 2018-04-24 19:38:06 UTC
https://github.com/openshift/ose/pull/1228

Comment 5 Hongan Li 2018-05-02 10:11:26 UTC
Cannot reproduce the issue in the old version, so I tested with atomic-openshift-3.9.27-1.git.0.964617d and verified the code change from the node logs.

Test steps:
1. create a project
2. journalctl -u atomic-openshift-node.service -f | grep 'table=101\|Correcting CIDRSelector'
3. create egressnetworkpolicy with some DNS name and cidrSelector: "0.0.0.0/32"
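A minimal policy matching step 3 might look like the following. The names are illustrative (the logs below show namespace lha and policy policy-test); in 3.9 the object can also be created via the legacy oapi with apiVersion: v1.

```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: policy-test
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.example.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/32   # the node logs show this being corrected to 0.0.0.0/0
```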

In the old version, the logs look like:
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: I0502 05:49:33.592344   23194 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=2012612, cookie=1, priority=65535, actions=drop
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: I0502 05:49:33.600145   23194 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 table=101, reg0=2012612, cookie=0/1
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: I0502 05:49:33.606833   23194 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=2012612, priority=13, ip, nw_dst=123.125.116.16, actions=output:2
<---snip--->
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: W0502 05:49:33.841086   23194 ovscontroller.go:478] Correcting CIDRSelector '0.0.0.0/32' to '0.0.0.0/0' in EgressNetworkPolicy lha:policy-test
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: I0502 05:49:33.841116   23194 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=2012612, priority=1, ip, actions=drop
May 02 05:49:33 qe-hongli-39old-master-etcd-1 atomic-openshift-node[23194]: I0502 05:49:33.848476   23194 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 table=101, reg0=2012612, cookie=1/1

In the 3.9.27 version, the logs look like:

May 02 05:54:16 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: W0502 05:54:16.160220   17460 ovscontroller.go:476] Correcting CIDRSelector '0.0.0.0/32' to '0.0.0.0/0' in EgressNetworkPolicy lha:policy-test
May 02 05:54:16 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: I0502 05:54:16.160243   17460 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=3896081, cookie=1, priority=65535, actions=drop
May 02 05:54:16 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: I0502 05:54:16.166923   17460 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 table=101, reg0=3896081, cookie=0/1
May 02 05:54:16 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: I0502 05:54:16.172930   17460 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=3896081, priority=4, ip, nw_dst=98.138.219.232, actions=output:2
<---snip--->
May 02 05:54:17 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: I0502 05:54:17.299352   17460 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, reg0=3896081, priority=1, ip, actions=drop
May 02 05:54:17 qe-hongli-39-node-registry-router-1 atomic-openshift-node[17460]: I0502 05:54:17.306265   17460 ovs.go:145] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 table=101, reg0=3896081, cookie=1/1

test env:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Linux qe-39-node 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 8 errata-xmlrpc 2018-05-17 06:43:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566

