Bug 1534720 - Invalid OVS rules when the node IP is updated
Summary: Invalid OVS rules when the node IP is updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Ravi Sankar
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1530931
Depends On:
Blocks:
 
Reported: 2018-01-15 19:41 UTC by Ravi Sankar
Modified: 2018-03-29 05:47 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Problem: A node IP update created invalid OVS rules, which resulted in unexpected traffic behavior. Fix: The node IP update is now handled correctly by waiting for the latest HostSubnet record, so no unnecessary OVS flow rules are created.
Clone Of:
Environment:
Last Closed: 2018-03-28 14:19:05 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2018:0489, last updated 2018-03-28 14:19:32 UTC

Description Ravi Sankar 2018-01-15 19:41:03 UTC
Description of problem:
When the node IP is updated, AddHostSubnet() creates OVS rules for the current host using the old node IP, which causes problems. We should never create hostsubnet OVS rules for the current host.
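
For illustration, a minimal sketch of the guard this implies; the names (nodeSDN, handleAddHostSubnet, the simplified HostSubnet struct) are hypothetical stand-ins, not the actual openshift/origin code:

package sdn

// HostSubnet is a simplified stand-in for the real openshift/origin type.
type HostSubnet struct {
    Host   string // name of the node that owns this subnet
    HostIP string // node IP used as the VXLAN tun_src/tun_dst
    Subnet string // pod subnet, e.g. "10.128.0.0/23"
}

type nodeSDN struct {
    hostName string
}

// handleAddHostSubnet sketches the intended guard: tunnel flows in OVS
// tables 10, 50, and 90 are only programmed for remote nodes, never for
// the local host, whose HostIP in the record may still be stale.
func (n *nodeSDN) handleAddHostSubnet(hs *HostSubnet) {
    if hs.Host == n.hostName {
        return // never create hostsubnet OVS rules for the current host
    }
    n.addRemoteSubnetFlows(hs)
}

func (n *nodeSDN) addRemoteSubnetFlows(hs *HostSubnet) {
    // e.g. table=10 matches tun_src=<HostIP>; tables 50/90 set <HostIP> as tun_dst
}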

Version-Release number of selected component (if applicable):
Observed on:
oc v3.9.0-alpha.2+bbe94ca-19-dirty
kubernetes v1.9.0-beta1
But this could happen on older versions as well.

How reproducible:
This can happen when the OpenShift master is heavily loaded and slow to update HostIP in the node's HostSubnet record.

Steps to Reproduce:
1. Configure the openshift-sdn plugin on the master and node.
2. Run the OpenShift master and node.
3. Put the master under heavy load (or simulate it by adding a few seconds of sleep to the HostSubnet event handling; a sketch follows these steps).
4. Restart the OpenShift node with a new node IP.
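
A hypothetical way to simulate the delay in step 3, reusing the simplified HostSubnet type from the sketch in the description; this is illustrative only, not the actual origin event-handling code:

package sdn

import "time"

// slowHostSubnetHandler wraps a HostSubnet event handler with an
// artificial delay, simulating a heavily loaded master: while the
// handler sleeps, the node restarts and still sees the stale HostIP.
func slowHostSubnetHandler(inner func(*HostSubnet)) func(*HostSubnet) {
    return func(hs *HostSubnet) {
        time.Sleep(5 * time.Second) // the race window that triggers the bug
        inner(hs)
    }
}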

Actual results:
(172.17.0.3 is the old IP on the node)
[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
...
cookie=0x0, duration=62.426s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=172.17.0.3 actions=goto_table:30
cookie=0x0, duration=62.423s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
cookie=0x0, duration=62.420s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
...
Expected results:
[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
No OVS rules for the current node IP in tables 10, 50, and 90.

Additional info:

Comment 1 Ravi Sankar 2018-01-15 19:43:51 UTC
Proposed fix: openshift-node should not consume the local HostSubnet record until its HostIP has been updated to the node's new IP.
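
A minimal sketch of that approach, assuming a hypothetical getter for the node's own HostSubnet and reusing the simplified struct from the description; the merged PR may differ in detail:

package sdn

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// waitForCurrentHostSubnet polls the master until the local HostSubnet
// record carries the node's current IP, so a stale record is never
// turned into OVS flows. The get function is a hypothetical fetch of
// this node's HostSubnet from the master.
func waitForCurrentHostSubnet(get func() (*HostSubnet, error), localIP string) (*HostSubnet, error) {
    var hs *HostSubnet
    err := wait.Poll(100*time.Millisecond, 30*time.Second, func() (bool, error) {
        var err error
        hs, err = get()
        if err != nil {
            return false, err
        }
        return hs.HostIP == localIP, nil
    })
    if err != nil {
        return nil, fmt.Errorf("HostSubnet never updated to HostIP %s: %v", localIP, err)
    }
    return hs, nil
}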

Comment 2 Ravi Sankar 2018-01-15 21:32:46 UTC
This is very easy to reproduce; the OpenShift master doesn't even need to be heavily loaded. Still wondering why this issue was not filed earlier. Do we need to back-port this fix?

Fixed in https://github.com/openshift/origin/pull/18117

Comment 3 openshift-github-bot 2018-01-17 07:17:59 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f6e67a0d61b6597ba281ac44fb7311fb2c74ee3d
Bug 1534720 - SDN node should fetch latest local HostSubnet for the node

https://github.com/openshift/origin/commit/0201b094575868e7f79af6ee672fead00828cc33
Merge pull request #18117 from pravisankar/fix-subnets

Automatic merge from submit-queue (batch tested with PRs 18117, 18049).

Bug 1534720 - SDN node should fetch latest local HostSubnet for the node

Comment 9 Dan Williams 2018-01-31 23:11:51 UTC
*** Bug 1530931 has been marked as a duplicate of this bug. ***

Comment 10 Hongan Li 2018-02-06 10:17:52 UTC
Verified in openshift v3.9.0-0.38.0, but failed to reach the node and the pods on it from the master.

Checked the HostSubnet and OVS rules after updating nodeIP in node-config.yaml: the HostSubnet is updated and there are no OVS rules for the current node IP in tables 10, 50, and 90. The problem is that the node and the pods on it cannot be reached; reverting to the original nodeIP in node-config.yaml makes the problem go away.

Comment 11 Weibin Liang 2018-02-07 21:39:38 UTC
@hongli, testing in a dind environment on v3.9.0-0.41, I cannot reproduce the original problem; both the HostSubnet and the OVS rules are updated with the new nodeIP and work fine.

After switching to the new nodeIP, the master and the other nodes have no issue reaching the testing node at that new nodeIP.

Here is the dind command to create two NICs (eth0 and eth1) in one node:
 ./dind-cluster.sh start -ar -n redhat/openshift-ovs-multitenant

Comment 12 Hongan Li 2018-02-08 11:42:13 UTC
@Weibin, thanks for your help and verification. I was just using a secondary IP on eth0 to test it, like:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:86:27:7c brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.12/24 brd 172.16.1.255 scope global dynamic eth0
       valid_lft 73436sec preferred_lft 73436sec
    inet 172.16.1.13/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

So does that mean we cannot use a secondary IP for nodeIP?

Comment 13 Weibin Liang 2018-02-08 14:37:15 UTC
@hongli, before trying the dind setup, I did the same thing you did and defined a secondary IP on eth0, and I found that the other nodes cannot communicate with the testing node through this secondary IP.

I checked with our developers about using a secondary IP in an OpenShift environment, and they confirmed that, because of the network security configuration in AWS or OpenStack, the security policy may block secondary IP traffic.

At the same time, Ravi's original bug was found by configuring a second NIC (eth1), not a secondary IP on the same NIC (eth0).

Comment 16 errata-xmlrpc 2018-03-28 14:19:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

