Bug 1411712

Summary: [3.3] [Critical] Need help in investigating SF#01767652. "openvswitch rules are not applied"
Product: OpenShift Container Platform Reporter: Alexander Koksharov <akokshar>
Component: Networking Assignee: Dan Winship <danw>
Status: CLOSED ERRATA QA Contact: Meng Bo <bmeng>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.3.0 CC: aloughla, aos-bugs, clichybi, danw, erich, pmorey, yadu
Target Milestone: ---   
Target Release: 3.3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: atomic-openshift-3.3.1.11-1.git.0.cba037c.el7 Doc Type: Bug Fix
Doc Text:
Previously, the EgressNetworkPolicy functionality might stop working on a node after restarting the node service. This has been fixed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-26 20:43:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 1 Alexander Koksharov 2017-01-10 11:31:51 UTC
Description of problem:

The customer creates an EgressNetworkPolicy in a project, but the system removes all OVS rules for that netnamespace and adds a single
"drop all" rule. The node logs show:
atomic-openshift-node[39469]: E0106 17:40:05.187734   39469 controller.go:506] multiple EgressNetworkPolicies in same network namespace (vwc-rec:default, m4d-rec:default) is not allowed; dropping all traffic

We have checked:
- no global projects have egress policy defined.
- there are no joined projects.
- none of the projects have more than one egress policy defined.
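
The checks above can be expressed as a small script. This is only a sketch of the rule implied by the error message (at most one EgressNetworkPolicy per network namespace, i.e. per VNID); it is not the node's actual code. The VNID values below are made up for illustration, and the inputs are assumed to have been collected by hand from the cluster's NetNamespace and EgressNetworkPolicy objects.

```python
from collections import defaultdict


def find_conflicts(netids, policies):
    """Return {vnid: ["project:policy", ...]} for VNIDs with more than one policy.

    netids:   {project_name: vnid}  (hypothetical input, gathered by hand)
    policies: iterable of (project_name, policy_name) pairs
    """
    by_vnid = defaultdict(list)
    for project, name in policies:
        by_vnid[netids[project]].append(f"{project}:{name}")
    return {vnid: names for vnid, names in by_vnid.items() if len(names) > 1}


# Project and policy names are from the log line above; the VNIDs are invented.
netids = {"vwc-rec": 101, "m4d-rec": 102}
policies = [("vwc-rec", "default"), ("m4d-rec", "default")]

print(find_conflicts(netids, policies))
```

If this reports no conflicts while the node still logs "multiple EgressNetworkPolicies in same network namespace", the configuration itself is not at fault.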

Two separate environments (3.3.1.7 and 3.3.1.3) are affected.
Initially only one node was affected, but now both nodes are. The issue appears to be tied to particular projects rather than to particular nodes.

Please advise on what to check/trace.

Comment 2 Dan Winship 2017-01-10 14:18:46 UTC
This is https://github.com/openshift/origin/pull/12045 and it's fixed in 3.4 (v3.4.0.32). We did not backport the fix to 3.3. The relevant code didn't change much between 3.3 and 3.4 so it would be possible to do, but I don't know what the policy is for 3.3 bugfixes at this point...

(There is no way to work around the bug other than backporting the bugfix.)

Comment 11 Meng Bo 2017-01-22 08:08:34 UTC
Tested on OCP 3.3.1.11

After adding multiple EgressNetworkPolicy objects to a single namespace, the existing OpenFlow rules are not affected; a new rule is added to drop the traffic for that specific project.

From node log:
Jan 22 03:03:10 node1 atomic-openshift-node[27026]: E0122 03:03:10.885757   27026 controller.go:506] multiple EgressNetworkPolicies in same network namespace (bmengp1:default, bmengp1:default2) is not allowed; dropping all traffic
Jan 22 03:03:10 node1 atomic-openshift-node[27026]: I0122 03:03:10.885809   27026 ovs.go:37] Executing: /usr/bin/ovs-ofctl -O OpenFlow13 del-flows br0 table=9, reg0=720494
Jan 22 03:03:10 node1 atomic-openshift-node[27026]: I0122 03:03:10.891489   27026 ovs.go:37] Executing: /usr/bin/ovs-ofctl -O OpenFlow13 add-flow br0 table=9, reg0=720494, priority=1, actions=drop


Check the openflow rules:
# ovs-ofctl dump-flows br0 -O openflow13
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x0, duration=221.113s, table=0, n_packets=0, n_bytes=0, priority=200,arp,in_port=1,arp_spa=10.1.0.0/16,arp_tpa=10.1.1.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1
 cookie=0x0, duration=221.110s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=1,nw_src=10.1.0.0/16,nw_dst=10.1.1.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1
 cookie=0x0, duration=221.105s, table=0, n_packets=45, n_bytes=1890, priority=200,arp,in_port=2,arp_spa=10.1.1.1,arp_tpa=10.1.0.0/16 actions=goto_table:5
 cookie=0x0, duration=221.102s, table=0, n_packets=3871, n_bytes=2493051, priority=200,ip,in_port=2 actions=goto_table:5
 cookie=0x0, duration=221.095s, table=0, n_packets=2, n_bytes=84, priority=200,arp,in_port=3,arp_spa=10.1.1.0/24 actions=goto_table:5
 cookie=0x0, duration=221.085s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=3,nw_src=10.1.1.0/24 actions=goto_table:5
 cookie=0x0, duration=221.108s, table=0, n_packets=0, n_bytes=0, priority=150,in_port=1 actions=drop
 cookie=0x0, duration=221.098s, table=0, n_packets=16, n_bytes=1296, priority=150,in_port=2 actions=drop
 cookie=0x0, duration=221.058s, table=0, n_packets=38, n_bytes=3132, priority=150,in_port=3 actions=drop
 cookie=0x0, duration=221.050s, table=0, n_packets=41, n_bytes=1722, priority=100,arp actions=goto_table:2
 cookie=0x0, duration=221.044s, table=0, n_packets=2231, n_bytes=239877, priority=100,ip actions=goto_table:2
 cookie=0x0, duration=221.004s, table=0, n_packets=45, n_bytes=3558, priority=0 actions=drop
 cookie=0x0, duration=220.782s, table=1, n_packets=0, n_bytes=0, priority=100,tun_src=10.8.174.9 actions=goto_table:5
 cookie=0x0, duration=221.001s, table=1, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=220.613s, table=2, n_packets=2, n_bytes=84, priority=100,arp,in_port=11,arp_spa=10.1.1.5,arp_sha=02:42:0a:01:01:05 actions=load:0->NXM_NX_REG0[],goto_table:5
 cookie=0x0, duration=220.604s, table=2, n_packets=318, n_bytes=28460, priority=100,ip,in_port=11,nw_src=10.1.1.5 actions=load:0->NXM_NX_REG0[],goto_table:3
 cookie=0x0, duration=220.992s, table=2, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=220.990s, table=3, n_packets=299, n_bytes=75681, priority=100,ip,nw_dst=172.30.0.0/16 actions=goto_table:4
 cookie=0x0, duration=220.981s, table=3, n_packets=1932, n_bytes=164196, priority=0 actions=goto_table:5
 cookie=0x0, duration=220.958s, table=4, n_packets=299, n_bytes=75681, priority=200,reg0=0 actions=output:2
...
...
...
 cookie=0x0, duration=220.913s, table=8, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=22.721s, table=9, n_packets=0, n_bytes=0, priority=1,reg0=0xafe6e actions=drop
 cookie=0x0, duration=220.911s, table=9, n_packets=496, n_bytes=35890, priority=0 actions=output:2
 cookie=0x0, duration=220.822s, table=253, n_packets=0, n_bytes=0, actions=note:01.01.00.00.00.00
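
Note that the node log prints the VNID in decimal (reg0=720494) while dump-flows prints it in hexadecimal (reg0=0xafe6e). A minimal helper for converting between the two when matching log lines to flows might look like this; the function names are illustrative only.

```python
def vnid_to_reg0(vnid: int) -> str:
    """Format a decimal VNID the way `ovs-ofctl dump-flows` prints the reg0 match."""
    return f"reg0={vnid:#x}"


def reg0_to_vnid(match: str) -> int:
    """Parse a 'reg0=0x...' match field back into a decimal VNID."""
    return int(match.split("=", 1)[1], 16)


print(vnid_to_reg0(720494))          # reg0=0xafe6e
print(reg0_to_vnid("reg0=0xafe6e"))  # 720494
```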

Comment 12 Meng Bo 2017-01-22 11:05:32 UTC
Please ignore comment 11 above.

Tested with the following steps.

To reproduce (tested on build 3.3.1.9):
1. Create 10 projects
2. Add an egress policy to each project
3. Check the OpenFlow rules
4. Restart the OpenShift node service
5. Check the OpenFlow rules again
Result:
In step 3, the OpenFlow rules for the project are created in table 9 with the following contents:
 cookie=0x0, duration=1.722s, table=9, n_packets=0, n_bytes=0, priority=2,ip,reg0=0x5d687c,nw_dst=172.16.120.0/24 actions=output:2
 cookie=0x0, duration=1.715s, table=9, n_packets=0, n_bytes=0, priority=1,ip,reg0=0x5d687c,nw_dst=10.66.140.0/24 actions=drop
In step 5, after the restart, the project's rules have been replaced by a single drop rule:
 cookie=0x0, duration=1.704s, table=9, n_packets=0, n_bytes=0, priority=1,reg0=0x5d687c actions=drop

To verify, repeated the same steps on build 3.3.1.11.
The OpenFlow rules for projects with an EgressNetworkPolicy are no longer corrupted by the restart.
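
The comparison in steps 3 and 5 can be automated roughly as follows. This is a sketch, not part of the product: it assumes the text format of `ovs-ofctl dump-flows` shown above, and it flags any VNID that has a table 9 rule dropping traffic with no nw_dst match. Such a rule is also the expected result when a namespace really does have multiple policies, so a hit means "look closer", not "bug confirmed".

```python
from collections import defaultdict


def parse_table9(dump: str):
    """Group table=9 flows by VNID: {vnid: [(match_fields, action), ...]}."""
    flows = defaultdict(list)
    for line in dump.splitlines():
        if "table=9," not in line or " actions=" not in line:
            continue
        head, action = line.rsplit(" actions=", 1)
        fields = head.split()[-1].split(",")  # e.g. ["priority=1", "reg0=0x5d687c"]
        reg0 = next((f.split("=", 1)[1] for f in fields if f.startswith("reg0=")), None)
        if reg0 is None:
            continue  # default rule that does not match on a VNID
        flows[int(reg0, 0)].append((fields, action.strip()))
    return flows


def drop_all_vnids(dump: str):
    """Return sorted VNIDs that have a drop rule with no nw_dst restriction."""
    hits = []
    for vnid, rules in parse_table9(dump).items():
        if any(action == "drop" and not any(f.startswith("nw_dst=") for f in fields)
               for fields, action in rules):
            hits.append(vnid)
    return sorted(hits)


# Flow lines copied from steps 3 and 5 above.
BEFORE = """\
 cookie=0x0, duration=1.722s, table=9, n_packets=0, n_bytes=0, priority=2,ip,reg0=0x5d687c,nw_dst=172.16.120.0/24 actions=output:2
 cookie=0x0, duration=1.715s, table=9, n_packets=0, n_bytes=0, priority=1,ip,reg0=0x5d687c,nw_dst=10.66.140.0/24 actions=drop
"""
AFTER = """\
 cookie=0x0, duration=1.704s, table=9, n_packets=0, n_bytes=0, priority=1,reg0=0x5d687c actions=drop
"""

print([hex(v) for v in drop_all_vnids(BEFORE)])  # []
print([hex(v) for v in drop_all_vnids(AFTER)])   # ['0x5d687c']
```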

Comment 16 errata-xmlrpc 2017-01-26 20:43:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0199