Description of problem:
When the node IP is updated, AddHostSubnet() creates OVS rules for the current host with the old node IP, which causes problems. We should never create HostSubnet OVS rules for the current host.

Version-Release number of selected component (if applicable):
Observed on:
oc v3.9.0-alpha.2+bbe94ca-19-dirty
kubernetes v1.9.0-beta1
but this could happen on older versions as well.

How reproducible:
This can happen when the openshift master is heavily loaded and is slow to update HostIP in the HostSubnet record for the node.

Steps to Reproduce:
1. Configure the openshift-sdn plugin on the master and the node.
2. Run the openshift master and node.
3. Make the openshift master heavily loaded (or simulate this by adding a few seconds of sleep in the handling of HostSubnet events).
4. Restart the openshift node with a new node IP.

Actual results:
(172.17.0.3 is the old IP on the node)

[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
...
cookie=0x0, duration=62.426s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=172.17.0.3 actions=goto_table:30
cookie=0x0, duration=62.423s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
cookie=0x0, duration=62.420s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
...

Expected results:
[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
No OVS rules matching the current node IP in tables 10, 50 and 90.

Additional info:
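For illustration, here is a minimal Go sketch of the invariant described above (this is not the actual origin code; HostSubnet and addHostSubnetRules are simplified, hypothetical stand-ins): HostSubnet handling should skip the record belonging to the local host, so a stale HostIP can never be programmed into tables 10, 50 and 90.

package main

import "fmt"

// HostSubnet is a simplified, hypothetical stand-in for the real record.
type HostSubnet struct {
	Host   string // node name that owns the subnet
	HostIP string // node IP as recorded by the master
	Subnet string // pod subnet, e.g. "10.128.0.0/23"
}

// addHostSubnetRules sketches the guard this bug asks for: tunnel flows in
// tables 10/50/90 should only ever be programmed for *remote* hosts, so a
// stale HostIP in the local record can never leak into the flow table.
func addHostSubnetRules(localHostName string, hs HostSubnet) {
	if hs.Host == localHostName {
		// Local subnet: nothing to do; local pod traffic never uses a tunnel.
		return
	}
	// Illustrative only; the real plugin programs OVS directly.
	fmt.Printf("table=10 priority=100,tun_src=%s actions=goto_table:30\n", hs.HostIP)
	fmt.Printf("table=50 priority=100,arp,arp_tpa=%s actions=...,set_field:%s->tun_dst,output:1\n", hs.Subnet, hs.HostIP)
	fmt.Printf("table=90 priority=100,ip,nw_dst=%s actions=...,set_field:%s->tun_dst,output:1\n", hs.Subnet, hs.HostIP)
}

func main() {
	local := HostSubnet{Host: "openshift-node-1", HostIP: "172.17.0.3", Subnet: "10.128.0.0/23"}
	remote := HostSubnet{Host: "openshift-node-2", HostIP: "172.17.0.4", Subnet: "10.128.2.0/23"}
	addHostSubnetRules("openshift-node-1", local)  // skipped: record belongs to this node
	addHostSubnetRules("openshift-node-1", remote) // programmed: remote host
}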
Proposed fix: openshift-node should not consume the local HostSubnet record until its HostIP has been updated to the node's current IP.
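A minimal Go sketch of that idea (hypothetical names throughout; the actual fix is the origin PR linked below): re-fetch the local HostSubnet until the master has recorded the node's current IP, rather than acting on a stale record.

package main

import (
	"errors"
	"fmt"
	"time"
)

// HostSubnet is the same simplified, hypothetical stand-in as above.
type HostSubnet struct {
	Host   string
	HostIP string
}

// waitForUpdatedHostSubnet keeps re-fetching the local HostSubnet until its
// HostIP matches the node's current IP, or the timeout expires. fetch,
// interval and timeout are assumptions for this sketch.
func waitForUpdatedHostSubnet(fetch func() (HostSubnet, error), nodeIP string, interval, timeout time.Duration) (HostSubnet, error) {
	deadline := time.Now().Add(timeout)
	for {
		hs, err := fetch()
		if err == nil && hs.HostIP == nodeIP {
			return hs, nil // master has caught up with the new node IP
		}
		if time.Now().After(deadline) {
			return HostSubnet{}, errors.New("timed out waiting for HostSubnet HostIP " + nodeIP)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a slow master: the record carries the old IP for a while.
	start := time.Now()
	fetch := func() (HostSubnet, error) {
		ip := "172.17.0.3" // old IP
		if time.Since(start) > 300*time.Millisecond {
			ip = "172.17.0.4" // master finally records the new IP
		}
		return HostSubnet{Host: "openshift-node-1", HostIP: ip}, nil
	}
	hs, err := waitForUpdatedHostSubnet(fetch, "172.17.0.4", 100*time.Millisecond, 2*time.Second)
	fmt.Println(hs, err)
}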
This is very easy to reproduce; the openshift master doesn't need to be heavily loaded. Still wondering why this issue was not filed earlier. Do we need to back-port this fix?

Fixed in https://github.com/openshift/origin/pull/18117
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f6e67a0d61b6597ba281ac44fb7311fb2c74ee3d
Bug 1534720 - SDN node should fetch latest local HostSubnet for the node

https://github.com/openshift/origin/commit/0201b094575868e7f79af6ee672fead00828cc33
Merge pull request #18117 from pravisankar/fix-subnets

Automatic merge from submit-queue (batch tested with PRs 18117, 18049).

Bug 1534720 - SDN node should fetch latest local HostSubnet for the node
*** Bug 1530931 has been marked as a duplicate of this bug. ***
Verified in openshift v3.9.0-0.38.0, but the node and the pods on it could not be reached from the master. After updating nodeIP in node-config.yaml, I checked the HostSubnet and the OVS rules: the HostSubnet is updated, and there are no OVS rules for the current node IP in tables 10, 50 and 90. The problem is that the node and the pods on it cannot be reached. If the nodeIP in node-config.yaml is reverted to the original value, the problem goes away.
@hongli, testing in a dind env with v3.9.0-0.41, I cannot reproduce the original problem: both the HostSubnet and the new OVS rules are updated with the new nodeIP and work fine. After switching to the new nodeIP, the master and the other nodes have no issue reaching the testing node at that new nodeIP. Here is the dind command to create two NICs (eth0 and eth1) in one node:

./dind-cluster.sh start -ar -n redhat/openshift-ovs-multitenant
@Weibin, thanks for your help and verification. I was just using a secondary IP on eth0 to test it, like:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:86:27:7c brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.12/24 brd 172.16.1.255 scope global dynamic eth0
       valid_lft 73436sec preferred_lft 73436sec
    inet 172.16.1.13/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

So does that mean we cannot use a secondary IP for nodeIP?
@hongli, before trying the dind setup, I defined a secondary IP on eth0 the same way you did, and I found the other nodes could not communicate with the testing node through that secondary IP. I checked with our developers about using a secondary IP in an openshift env, and they confirmed that because of the network security configuration in AWS or OpenStack, the security policy may block secondary-IP traffic. Also, Ravi's original bug was found by configuring a second NIC (eth1), not a secondary IP under the same NIC (eth0).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489