Description of problem:
When the node IP is updated, AddHostSubnet() creates OVS rules for the current host with the old node IP, which causes problems. We should never create HostSubnet OVS rules for the current host.

Version-Release number of selected component (if applicable):
Observed on:
oc v3.9.0-alpha.2+bbe94ca-19-dirty
kubernetes v1.9.0-beta1
but this could happen on older versions as well.

How reproducible:
This can happen when the openshift master is heavily loaded and is slow to update HostIP in the HostSubnet record for the node.

Steps to Reproduce:
1. Configure the openshift-sdn plugin on the master and the node.
2. Run the openshift master and node.
3. Make the openshift master heavily loaded (or simulate this by adding a few seconds of sleep in the handling of HostSubnet events).
4. Restart the openshift node with a new node IP.

Actual results:
(172.17.0.3 is the old IP on the node)

[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
...
cookie=0x0, duration=62.426s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=172.17.0.3 actions=goto_table:30
cookie=0x0, duration=62.423s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
cookie=0x0, duration=62.420s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.17.0.3->tun_dst,output:1
...

Expected results:
[root@openshift-node-1 /]# ovs-ofctl -O openflow13 dump-flows br0
No OVS rules matching the current node IP in tables 10, 50 and 90.

Additional info:
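For illustration, here is a minimal Go sketch of the invariant described above (this is not the actual origin code; HostSubnet and addHostSubnetRules are simplified, hypothetical stand-ins): HostSubnet handling should skip the record belonging to the local host, so a stale HostIP can never be programmed into tables 10, 50 and 90.

package main

import "fmt"

// HostSubnet is a simplified, hypothetical stand-in for the real record.
type HostSubnet struct {
	Host   string // node name that owns the subnet
	HostIP string // node IP as recorded by the master
	Subnet string // pod subnet, e.g. "10.128.0.0/23"
}

// addHostSubnetRules sketches the guard this bug asks for: tunnel flows in
// tables 10/50/90 should only ever be programmed for *remote* hosts, so a
// stale HostIP in the local record can never leak into the flow table.
func addHostSubnetRules(localHostName string, hs HostSubnet) {
	if hs.Host == localHostName {
		// Local subnet: nothing to do; local pod traffic never uses a tunnel.
		return
	}
	// Illustrative only; the real plugin programs OVS directly.
	fmt.Printf("table=10 priority=100,tun_src=%s actions=goto_table:30\n", hs.HostIP)
	fmt.Printf("table=50 priority=100,arp,arp_tpa=%s actions=...,set_field:%s->tun_dst,output:1\n", hs.Subnet, hs.HostIP)
	fmt.Printf("table=90 priority=100,ip,nw_dst=%s actions=...,set_field:%s->tun_dst,output:1\n", hs.Subnet, hs.HostIP)
}

func main() {
	local := HostSubnet{Host: "openshift-node-1", HostIP: "172.17.0.3", Subnet: "10.128.0.0/23"}
	remote := HostSubnet{Host: "openshift-node-2", HostIP: "172.17.0.4", Subnet: "10.128.2.0/23"}
	addHostSubnetRules("openshift-node-1", local)  // skipped: record belongs to this node
	addHostSubnetRules("openshift-node-1", remote) // programmed: remote host
}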
Proposed fix: openshift-node should not consume the local HostSubnet record until its HostIP has been updated to the node's current IP.
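A minimal Go sketch of that idea (hypothetical names throughout; the actual fix is the origin PR linked below): re-fetch the local HostSubnet until the master has recorded the node's current IP, rather than acting on a stale record.

package main

import (
	"errors"
	"fmt"
	"time"
)

// HostSubnet is the same simplified, hypothetical stand-in as above.
type HostSubnet struct {
	Host   string
	HostIP string
}

// waitForUpdatedHostSubnet keeps re-fetching the local HostSubnet until its
// HostIP matches the node's current IP, or the timeout expires. fetch,
// interval and timeout are assumptions for this sketch.
func waitForUpdatedHostSubnet(fetch func() (HostSubnet, error), nodeIP string, interval, timeout time.Duration) (HostSubnet, error) {
	deadline := time.Now().Add(timeout)
	for {
		hs, err := fetch()
		if err == nil && hs.HostIP == nodeIP {
			return hs, nil // master has caught up with the new node IP
		}
		if time.Now().After(deadline) {
			return HostSubnet{}, errors.New("timed out waiting for HostSubnet HostIP " + nodeIP)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a slow master: the record carries the old IP for a while.
	start := time.Now()
	fetch := func() (HostSubnet, error) {
		ip := "172.17.0.3" // old IP
		if time.Since(start) > 300*time.Millisecond {
			ip = "172.17.0.4" // master finally records the new IP
		}
		return HostSubnet{Host: "openshift-node-1", HostIP: ip}, nil
	}
	hs, err := waitForUpdatedHostSubnet(fetch, "172.17.0.4", 100*time.Millisecond, 2*time.Second)
	fmt.Println(hs, err)
}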
This is very easy to reproduce; the openshift master doesn't need to be heavily loaded. Still wondering why this issue was not filed earlier. Do we need to back-port this fix?

Fixed in https://github.com/openshift/origin/pull/18117
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f6e67a0d61b6597ba281ac44fb7311fb2c74ee3d
Bug 1534720 - SDN node should fetch latest local HostSubnet for the node

https://github.com/openshift/origin/commit/0201b094575868e7f79af6ee672fead00828cc33
Merge pull request #18117 from pravisankar/fix-subnets

Automatic merge from submit-queue (batch tested with PRs 18117, 18049).

Bug 1534720 - SDN node should fetch latest local HostSubnet for the node
*** Bug 1530931 has been marked as a duplicate of this bug. ***
Verified in openshift v3.9.0-0.38.0, but the node and the pods on it could not be reached from the master. After updating nodeIP in node-config.yaml, I checked the HostSubnet and the OVS rules: the HostSubnet is updated, and there are no OVS rules for the current node IP in tables 10, 50 and 90. The problem is that the node and the pods on it cannot be reached. If the nodeIP in node-config.yaml is reverted to the original value, the problem goes away.
@hongli, testing in a dind env with v3.9.0-0.41, I cannot reproduce the original problem: both the HostSubnet and the new OVS rules are updated with the new nodeIP and work fine. After switching to the new nodeIP, the master and the other nodes have no issue reaching the testing node at that new nodeIP. Here is the dind command to create two NICs (eth0 and eth1) in one node:

./dind-cluster.sh start -ar -n redhat/openshift-ovs-multitenant
@Weibin, thanks for your help and verification. I was just using a secondary IP on eth0 to test it, like:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:86:27:7c brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.12/24 brd 172.16.1.255 scope global dynamic eth0
       valid_lft 73436sec preferred_lft 73436sec
    inet 172.16.1.13/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

So does that mean we cannot use a secondary IP for nodeIP?
@hongli, before trying the dind setup, I defined a secondary IP on eth0 the same way you did, and I found the other nodes could not communicate with the testing node through that secondary IP. I checked with our developers about using a secondary IP in an openshift env, and they confirmed that because of the network security configuration in AWS or OpenStack, the security policy may block secondary-IP traffic. Also, Ravi's original bug was found by configuring a second NIC (eth1), not a secondary IP under the same NIC (eth0).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489