Bug 1323275 - Adding multiple nodes at once to a Native HA Environment will assign the same hostsubnet to nodes.
Summary: Adding multiple nodes at once to a Native HA Environment will assign the same...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2016-04-01 16:41 UTC by Ryan Howe
Modified: 2017-10-06 18:06 UTC (History)
16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-27 16:52:33 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2245651 0 None None None 2016-04-07 23:49:12 UTC

Description Ryan Howe 2016-04-01 16:41:40 UTC
Created attachment 1142642 [details]
hostsubnet and flows

Description of problem:

 - Adding multiple nodes to an HA environment at once will assign the same hostsubnet to some of the new nodes.


Version-Release number of selected component (if applicable):
3.1.1.6

How reproducible:
- Have not reproduced
- Hit the issue once during a scale up. 


Steps to Reproduce:
1.  Added 7+ nodes to an environment with the scale up playbook
   /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/scaleup.yml

2.  All nodes succeed in starting and enter a ready state


Actual results:

 2 nodes get assigned the same hostsubnet. Only one node gets the correct flows passed to Open vSwitch; the other node stays in a READY state, but pods are unable to be scheduled to it and it is unable to access other hosts on the SDN.

Expected results:

 Each node gets its own hostsubnet 

Additional info:

 I speculate this happens when the nodes hit the API on separate masters via the LB at the same time.

# oc get hostsubnet


osenode117.example.com        10.16.188.88      10.1.26.0/24  # DUPLICATE NOT WORKING
osenode118.example.com        10.16.188.89      10.1.36.0/24
osenode119.example.com        10.16.188.93      10.1.33.0/24
osenode122.example.com        10.16.188.91      10.1.31.0/24
osenode123.example.com        10.16.188.94      10.1.34.0/24
osenode125.example.com        10.16.188.96      10.1.32.0/24
osenode126.example.com        10.16.188.92      10.1.26.0/24  # DUPLICATE WORKING

# ovs-ofctl -O OpenFlow13 dump-flows br0
 cookie=0xa10bc5c, duration=344.266s, table=8, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.26.0/24 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.16.188.92->tun_dst,output:1

 cookie=0xa10bc5c, duration=344.263s, table=9, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.1.26.0/24 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.16.188.92->tun_dst,output:1
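The suspected race can be sketched as follows. This is a hypothetical simplification, not the actual openshift-sdn allocator code: two masters each snapshot the current HostSubnet allocations before either one's write becomes visible, so both pick the same "first free" subnet.

```python
# Hypothetical sketch of the suspected race: two masters each read the
# current HostSubnet allocations, then both pick the same free subnet
# before either write lands in etcd.

def first_free_subnet(allocated, pool):
    """Pick the lowest-numbered subnet not yet allocated."""
    for subnet in pool:
        if subnet not in allocated:
            return subnet
    raise RuntimeError("subnet pool exhausted")

pool = [f"10.1.{i}.0/24" for i in range(20, 40)]
etcd_state = {"10.1.25.0/24"}          # subnets already handed out

# Both masters snapshot the allocations *before* either one writes back.
snapshot_a = set(etcd_state)
snapshot_b = set(etcd_state)

choice_a = first_free_subnet(snapshot_a, pool)  # master A's pick
choice_b = first_free_subnet(snapshot_b, pool)  # master B's pick

print(choice_a, choice_b, choice_a == choice_b)
```

With a single serialized allocator the second request would have seen the first write and picked the next subnet instead.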


More info in attachment

Comment 1 Dan Williams 2016-04-04 16:39:05 UTC
upstream fix:

https://github.com/openshift/openshift-sdn/pull/282

Comment 2 Dan Williams 2016-04-04 19:25:47 UTC
Posted fix is incorrect and will be dropped.

Since you're running in HA mode, you need to make sure that only *one* of the masters is running the controller-manager with the SDN plugin.  The rest cannot be running the SDN master plugin, because then yes, multiple masters can allocate the same subnet to nodes.

Comment 3 Ryan Howe 2016-04-07 23:35:58 UTC
@Dan Williams at this point in time with a Native HA setup etcd ensures that one atomic-openshift-master-controller.service is running. Does the SDN plugin run within this service or the api service? 

Is there any way to make sure only one master is running the plugin?

Comment 4 Clayton Coleman 2016-04-12 14:08:53 UTC
If the SDN allocator is treated as a controller, only one will be run.

Comment 5 Dan Williams 2016-04-15 17:14:56 UTC
Ok, I see now where each controller instance blocks waiting on the lease from etcd.

We're going to need more logging in openshift-sdn to figure out what hostnames and IP addresses the nodes are sending to the master, and when leases are handed out to clients.  After some analysis I can't see where the problem might happen, as the HostSubnet allocations are all serialized in the master, and the master reads existing ones out of etcd when it starts up.  If one master loses the lease, then the next master is allowed to start, and it will read all of the existing HostSubnets from etcd and add them to the allocation map to ensure it doesn't hand them out again.
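The startup behavior described above can be sketched roughly like this. The names are illustrative only, not the real openshift-sdn API: the newly-leading master seeds its allocation map from the HostSubnets already in etcd before handing out anything new.

```python
# Hedged sketch of the described startup path: a newly-leading master
# reads all existing HostSubnet records and marks their subnets as
# allocated before serving new allocation requests.

existing_hostsubnets = {
    "osenode118.example.com": "10.1.36.0/24",
    "osenode119.example.com": "10.1.33.0/24",
}

allocated = set()
for node, subnet in existing_hostsubnets.items():
    allocated.add(subnet)   # never re-allocate a subnet already in etcd

def allocate(pool):
    """Return the first subnet not already recorded as allocated."""
    for subnet in pool:
        if subnet not in allocated:
            allocated.add(subnet)
            return subnet
    raise RuntimeError("pool exhausted")

pool = [f"10.1.{i}.0/24" for i in range(30, 40)]
print(allocate(pool))   # first free subnet; 10.1.33.0/24 and 10.1.36.0/24 are skipped
```

As long as only one master runs this loop at a time, duplicates cannot occur, which is why the concurrent-controllers scenario below matters.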

Comment 7 Dan Williams 2016-04-20 22:44:16 UTC
There is some more logging for openshift-sdn to help debug this in https://github.com/openshift/openshift-sdn/pull/296

Comment 8 Dan Williams 2016-04-28 13:02:36 UTC
These logging updates are part of OpenShift 3.2

Comment 9 Dan Williams 2016-04-28 13:05:24 UTC
Ryan, the next step here would be attempting to reproduce this with OpenShift 3.2 and grabbing the logs from all the [atomic-]openshift-master-controllers services with something like "journalctl -b -u atomic-openshift-master-controllers".

Comment 11 Dan Williams 2016-05-04 22:30:13 UTC
Ryan, can you also report the exact atomic-openshift RPM version that is being run here?  I need that so I can create a scratch build with additional debugging patches applied.  Thanks!

Comment 13 Ryan Howe 2016-05-05 20:31:52 UTC
atomic-openshift-3.1.1.6-3.git.16.5327e56.el7aos.x86_64 
atomic-openshift-clients-3.1.1.6-3.git.16.5327e56.el7aos.x86_64 
atomic-openshift-node-3.1.1.6-3.git.16.5327e56.el7aos.x86_64 
atomic-openshift-sdn-ovs-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch 
tuned-profiles-atomic-openshift-node-3.1.1.6-3.git.16.5327e56.el7aos.x86_64

Comment 14 Dan Williams 2016-05-05 21:50:39 UTC
Testing RPMs are here:

http://people.redhat.com/dcbw/openshift/rh1323275/

Procedure:

1) "rpm -Fvh *.rpm" on any machine running the 'atomic-openshift-master-controllers' service

2) systemctl restart atomic-openshift-master-controllers on each master

3) scale down, then scale up and attempt to reproduce the problem

4) when reproduced, on each master grab:

journalctl -b -u atomic-openshift-master-controllers
/etc/origin/master/master-config.yaml
oc get hostsubnets

Comment 17 Brennan Vincello 2016-05-24 01:15:34 UTC
Client is requesting RPMs for 3.1.1.6-4 instead.

Comment 18 Dan Williams 2016-05-27 20:50:25 UTC
Updated testing RPMs are here:

http://people.redhat.com/dcbw/openshift/rh1323275/

Procedure:

1) "rpm -Fvh *.rpm" on any machine running the 'atomic-openshift-master-controllers' service

2) systemctl restart atomic-openshift-master-controllers on each master

3) scale down, then scale up and attempt to reproduce the problem

4) when reproduced, on each master grab:

journalctl -b -u atomic-openshift-master-controllers
/etc/origin/master/master-config.yaml
oc get hostsubnets

Comment 19 Dan Williams 2016-06-03 14:51:52 UTC
Was the customer able to install and test out the debug RPMs?

Comment 23 Dan Williams 2016-09-09 23:12:09 UTC
This happened on another cluster.  Multiple masters were concurrently running the SDN master code as evidenced by SDN node watch timeouts happening close to each other on different nodes:

Sep 09 00:00:31 node3 atomic-openshift-master-controllers[101101]: W0909 00:00:31.626195  101101 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331719549/331719546]) [331720548]

Sep 09 00:00:36 node2 atomic-openshift-master-controllers[3139]: W0909 00:00:36.297243    3139 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331719549/331719534]) [331720548]

Sep 09 00:01:11 node1 atomic-openshift-master-controllers[2789]: W0909 00:01:11.497431    2789 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331720078/331720074]) [331721077]

This is likely because this cluster's config did not specify ControllerLeaseTTL > 0 in the master configuration, which causes all master-controllers to run concurrently and not take turns being the leader.

This clearly doesn't work well for the SDN, and we need to find some way of exiting with an error when the SDN is enabled, and ControllerLeaseTTL <= 0.

Comment 24 Dan Williams 2016-09-14 23:10:36 UTC
Fix for discussion opened at https://github.com/openshift/origin/pull/10918

Comment 25 Dan Williams 2016-09-15 17:03:07 UTC
To reiterate, after upstream discussion:

When running multiple master controllers (HA), the configuration *MUST* set controllerLeaseTTL > 0, and not just for the SDN; otherwise it's apparently expected that things will be "insanely broken".  It would be nice if the ansible installer helped here, but if you're rolling your own configuration this needs to be done manually.
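For a manual configuration, the setting is a top-level field in each master's config file. A minimal fragment, assuming the 3.x master-config.yaml schema (other required fields omitted):

```yaml
# Fragment of /etc/origin/master/master-config.yaml on each master.
# A TTL > 0 makes the controllers take turns holding the etcd lease
# instead of all running the SDN allocator concurrently.
apiVersion: v1
kind: MasterConfig
controllerLeaseTTL: 30
```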

Comment 26 Andrew Butcher 2016-09-15 17:09:18 UTC
The ansible installer configures "controllerLeaseTTL: 30" by default with multi-master environments.

Comment 27 Dan Williams 2016-09-23 22:00:58 UTC
(In reply to Andrew Butcher from comment #26)
> The ansible installer configures "controllerLeaseTTL: 30" by default with
> multi-master environments.

Ok, good.  We need to make sure users who custom-install or custom-configure do the same.

Comment 28 Dan Williams 2016-10-08 21:01:03 UTC
Proposed fix was deemed insufficient by Clayton, so back to assigned for reworking.

Comment 29 Dan Williams 2016-11-01 20:16:58 UTC
In the upstream github issue, Clayton said:

"Having leases doesn't guarantee one is active at a time. It just reduces the potential conflict. All controllers need to be tolerant of racing to some degree. An allocation controller is the hardest kind to write - it must sync to the underlying map allocation object before every write and check afterwards (effectively a 2PC). The work ravi had done would address some of that, but these are notoriously difficult to get correct. I would recommend focusing on recovery (minimize / document / automate in the controller the process when multiples start working), or focus on getting the allocator in place. We actually don't have an API resource for it today, but we should. A config map or secret could be used in this scenario."

So we have more work to do here, but in the short-term ensuring that controllerLeaseTTL is > 0 is an acceptable workaround.
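The "sync before every write and check afterwards" pattern Clayton describes can be sketched as below. This is purely illustrative (the class and function names are invented for this example, not the origin codebase): allocate optimistically, then re-read the shared map and back off if another controller claimed the same subnet first.

```python
# Hedged sketch of a write-then-verify (2PC-style) allocation check:
# claim a subnet, re-read the shared allocation object, and yield if
# another controller's claim won.

class SharedMap:
    """Stand-in for the etcd-backed allocation object."""
    def __init__(self):
        self.claims = {}            # subnet -> owning node

    def claim(self, subnet, node):
        # First writer wins; later claims for the same subnet are no-ops,
        # modeling two uncoordinated masters racing on one record.
        self.claims.setdefault(subnet, node)

shared = SharedMap()

def allocate_with_check(subnet, node):
    shared.claim(subnet, node)           # phase 1: optimistic write
    winner = shared.claims[subnet]       # phase 2: re-read and verify
    if winner != node:
        return None                      # lost the race; caller must retry
    return subnet

print(allocate_with_check("10.1.26.0/24", "osenode126"))
print(allocate_with_check("10.1.26.0/24", "osenode117"))
```

The second caller detects the conflict instead of silently keeping a duplicate subnet, which is the recovery property missing from the behavior reported in this bug.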

Comment 31 Ben Bennett 2017-01-27 16:52:33 UTC
There's a trello card open to track the larger issue:
  https://trello.com/c/B56OdzdS

But since there's a work-around and the installer by default sets it up the right way, closing this.

