Created attachment 1142642 [details]
hostsubnet and flows

Description of problem:
When adding multiple nodes to an HA environment at once, some of the new nodes are assigned the same hostsubnet.

Version-Release number of selected component (if applicable):
3.1.1.6

How reproducible:
Have not reproduced - hit the issue once during a scale up.

Steps to Reproduce:
1. Add 7+ nodes to an environment with the scale up playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/scaleup.yml
2. All nodes succeed in starting and enter a Ready state.

Actual results:
Two nodes get assigned the same hostsubnet. Only one of them gets the correct flows passed to Open vSwitch; the other node stays in a Ready state, but pods cannot be scheduled to it and it cannot reach other hosts on the SDN.

Expected results:
Each node gets its own hostsubnet.

Additional info:
I speculate this happens when the nodes hit the API on separate masters via the LB at the same time.

# oc get hostsubnet
osenode117.example.com   10.16.188.88   10.1.26.0/24   # DUPLICATE, NOT WORKING
osenode118.example.com   10.16.188.89   10.1.36.0/24
osenode119.example.com   10.16.188.93   10.1.33.0/24
osenode122.example.com   10.16.188.91   10.1.31.0/24
osenode123.example.com   10.16.188.94   10.1.34.0/24
osenode125.example.com   10.16.188.96   10.1.32.0/24
osenode126.example.com   10.16.188.92   10.1.26.0/24   # DUPLICATE, WORKING

# ovs-ofctl -O OpenFlow13 dump-flows br0
cookie=0xa10bc5c, duration=344.266s, table=8, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.26.0/24 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.16.188.92->tun_dst,output:1
cookie=0xa10bc5c, duration=344.263s, table=9, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.1.26.0/24 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.16.188.92->tun_dst,output:1

More info in attachment
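As a quick way to spot this condition, duplicate subnets can be flagged directly from the `oc get hostsubnet` output. A minimal sketch, assuming the three-column (host, host IP, subnet) layout shown above; the sample data is abbreviated from the report:

```python
from collections import defaultdict

# Abbreviated sample of `oc get hostsubnet` output from this report.
OUTPUT = """\
osenode117.example.com 10.16.188.88 10.1.26.0/24
osenode118.example.com 10.16.188.89 10.1.36.0/24
osenode126.example.com 10.16.188.92 10.1.26.0/24
"""

def find_duplicate_subnets(text):
    """Map each subnet to the hosts that claim it, keeping only duplicates."""
    by_subnet = defaultdict(list)
    for line in text.strip().splitlines():
        host, _host_ip, subnet = line.split()[:3]
        by_subnet[subnet].append(host)
    return {s: hosts for s, hosts in by_subnet.items() if len(hosts) > 1}

print(find_duplicate_subnets(OUTPUT))
```

In a healthy cluster this prints an empty dict; here it reports 10.1.26.0/24 claimed by both osenode117 and osenode126.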
upstream fix: https://github.com/openshift/openshift-sdn/pull/282
Posted fix is incorrect and will be dropped. Since you're running in HA mode, you need to make sure that only *one* of the masters is running the controller-manager with the SDN plugin. The rest cannot be running the SDN master plugin, because then yes, multiple masters can allocate the same subnet to nodes.
@Dan Williams: at this point in time, with a Native HA setup, etcd ensures that only one atomic-openshift-master-controller.service is running. Does the SDN plugin run within this service or the API service? Is there any way to make sure only one master is running the plugin?
If the SDN allocator is treated as a controller, only one will be run.
Ok, I see now where each controller instance blocks waiting on the lease from etcd. We're going to need more logging in openshift-sdn to figure out what hostnames and IP addresses the nodes are sending to the master, and when leases are handed out to clients. After some analysis I can't see where the problem might happen, as the HostSubnet allocations are all serialized in the master, and the master reads existing ones out of etcd when it starts up. If one master loses the lease, then the next master is allowed to start, and it will read all of the existing HostSubnets from etcd and add them to the allocation map to ensure it doesn't hand them out again.
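The startup behavior described above can be sketched roughly as follows. This is a simplified Python model, not the actual openshift-sdn Go code; the subnet carving and class names are illustrative only:

```python
import ipaddress

class SubnetAllocator:
    """Toy model of the master's hostsubnet allocator: on startup it reads
    all existing HostSubnets (from etcd, in the real master) and marks them
    allocated before handing out new ones."""

    def __init__(self, cluster_cidr, host_bits, existing_subnets):
        self.allocated = set(existing_subnets)  # pre-populated at startup
        net = ipaddress.ip_network(cluster_cidr)
        self.candidates = net.subnets(new_prefix=32 - host_bits)

    def allocate(self):
        # Allocations are serialized in the real master, so this loop
        # never races with itself.
        for subnet in self.candidates:
            s = str(subnet)
            if s not in self.allocated:
                self.allocated.add(s)
                return s
        raise RuntimeError("subnet range exhausted")

# A master that takes over the lease re-reads existing allocations first,
# so it will not re-issue 10.1.0.0/24 or 10.1.1.0/24.
alloc = SubnetAllocator("10.1.0.0/16", 8, ["10.1.0.0/24", "10.1.1.0/24"])
print(alloc.allocate())
```

The bug in this report is only possible when two such allocators run concurrently with independent `allocated` sets, which is exactly what happens when more than one master runs the SDN plugin.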
There is some more logging for openshift-sdn to help debug this in https://github.com/openshift/openshift-sdn/pull/296
These logging updates are part of OpenShift 3.2
Ryan, next steps here would be attempting to reproduce this with OpenShift 3.2 and grabbing the logging from all the [atomic-]openshift-master-controllers services with something like "journalctl -b -u atomic-openshift-master-controllers".
Ryan, can you also report the exact atomic-openshift RPM version that is being run here? I need that so I can create a scratch build with additional debugging patches applied. Thanks!
atomic-openshift-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
atomic-openshift-clients-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
atomic-openshift-node-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch
tuned-profiles-atomic-openshift-node-3.1.1.6-3.git.16.5327e56.el7aos.x86_64
Testing RPMs are here: http://people.redhat.com/dcbw/openshift/rh1323275/

Procedure:
1) "rpm -Fvh *.rpm" on any machine running the 'atomic-openshift-master-controllers' service
2) systemctl restart atomic-openshift-master-controllers on each master
3) scale down, then scale up and attempt to reproduce the problem
4) when reproduced, on each master grab:
   journalctl -b -u atomic-openshift-master-controllers
   /etc/origin/master/master-config.yaml
   oc get hostsubnets
Client is requesting RPMs for 3.1.1.6-4 instead.
Updated testing RPMs are here: http://people.redhat.com/dcbw/openshift/rh1323275/

Procedure:
1) "rpm -Fvh *.rpm" on any machine running the 'atomic-openshift-master-controllers' service
2) systemctl restart atomic-openshift-master-controllers on each master
3) scale down, then scale up and attempt to reproduce the problem
4) when reproduced, on each master grab:
   journalctl -b -u atomic-openshift-master-controllers
   /etc/origin/master/master-config.yaml
   oc get hostsubnets
Was the customer able to install and test out the debug RPMs?
This happened on another cluster. Multiple masters were concurrently running the SDN master code, as evidenced by SDN node watch timeouts happening close to each other on different hosts:

Sep 09 00:00:31 node3 atomic-openshift-master-controllers[101101]: W0909 00:00:31.626195 101101 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331719549/331719546]) [331720548]

Sep 09 00:00:36 node2 atomic-openshift-master-controllers[3139]: W0909 00:00:36.297243 3139 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331719549/331719534]) [331720548]

Sep 09 00:01:11 node1 atomic-openshift-master-controllers[2789]: W0909 00:01:11.497431 2789 reflector.go:224] /builddir/build/BUILD/atomic-openshift-git-32.adf8ec9/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/registry.go:528: watch of *api.Node ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [331720078/331720074]) [331721077]

This is likely because this cluster's config did not specify ControllerLeaseTTL > 0 in the master configuration, which causes all master-controllers to run concurrently rather than taking turns being the leader. This clearly doesn't work well for the SDN, and we need to find some way of exiting with an error when the SDN is enabled and ControllerLeaseTTL <= 0.
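The guard proposed above could look something like the following. This is a hypothetical sketch in Python, not the actual origin Go code; the field names mirror master-config.yaml, and the plugin-name prefix check is an assumption about how the SDN plugins are identified:

```python
def validate_master_config(config):
    """Refuse to start when the SDN is enabled but controller leasing is off,
    since concurrent SDN masters can allocate duplicate hostsubnets.
    Returns an error string, or None if the config is acceptable."""
    plugin = config.get("networkConfig", {}).get("networkPluginName", "")
    ttl = config.get("controllerLeaseTTL", 0)
    # "redhat/openshift-ovs-*" prefix is an assumption for illustration.
    if plugin.startswith("redhat/openshift-ovs") and ttl <= 0:
        return ("SDN plugin %r requires controllerLeaseTTL > 0 "
                "(got %d)" % (plugin, ttl))
    return None

err = validate_master_config({
    "networkConfig": {"networkPluginName": "redhat/openshift-ovs-subnet"},
    "controllerLeaseTTL": 0,
})
if err:
    print("error:", err)
```

A check like this would have turned the silent duplicate-subnet failure in this report into a hard startup error on the misconfigured masters.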
Fix for discussion opened at https://github.com/openshift/origin/pull/10918
To reiterate, after upstream discussion: when running multiple master-controllers (HA), the configuration *MUST* set controllerLeaseTTL > 0, and not just for the SDN's sake; otherwise it's apparently expected that things will be "insanely broken". It would be nice if the ansible installer helped here, but if you're rolling your own configuration this needs to be done manually.
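For reference, the setting lives in /etc/origin/master/master-config.yaml on each master; the value shown here is just an example of a positive TTL (comment 26 notes the installer's multi-master default is 30):

```yaml
# /etc/origin/master/master-config.yaml (fragment)
controllerLeaseTTL: 30
```

Restart the atomic-openshift-master-controllers service after changing it.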
The ansible installer configures "controllerLeaseTTL: 30" by default with multi-master environments.
(In reply to Andrew Butcher from comment #26)
> The ansible installer configures "controllerLeaseTTL: 30" by default with
> multi-master environments.

Ok, good. We need to make sure users who custom-install or custom-configure do the same.
Proposed fix was deemed insufficient by Clayton, so back to assigned for reworking.
In the upstream github issue, Clayton said: "Having leases doesn't guarantee one is active at a time. It just reduces the potential conflict. All controllers need to be tolerant of racing to some degree. An allocation controller is the hardest kind to write - it must sync to the underlying map allocation object before every write and check afterwards (effectively a 2PC). The work ravi had done would address some of that, but these are notoriously difficult to get correct. I would recommend focusing on recovery (minimize / document / automate in the controller the process when multiples start working), or focus on getting the allocator in place. We actually don't have an API resource for it today, but we should. A config map or secret could be used in this scenario." So we have more work to do here, but in the short-term ensuring that controllerLeaseTTL is > 0 is an acceptable workaround.
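The "sync to the allocation map before every write and check afterwards" pattern Clayton describes can be sketched like this. This is hypothetical Python, assuming a compare-and-swap primitive on the shared allocation object (in practice an etcd transaction or a resourceVersion-guarded update); it is not how origin currently implements allocation:

```python
class AllocationStore:
    """Toy shared allocation map with compare-and-swap semantics,
    standing in for an etcd-backed object with a revision counter."""
    def __init__(self):
        self.revision = 0
        self.claims = {}  # subnet -> node

    def read(self):
        return self.revision, dict(self.claims)

    def cas_claim(self, expected_revision, subnet, node):
        """The write succeeds only if nobody else wrote since we read."""
        if self.revision != expected_revision or subnet in self.claims:
            return False
        self.claims[subnet] = node
        self.revision += 1
        return True

def allocate(store, node, candidates):
    while True:
        rev, claims = store.read()               # 1) sync the map before writing
        free = [s for s in candidates if s not in claims]
        if not free:
            raise RuntimeError("subnet range exhausted")
        if store.cas_claim(rev, free[0], node):  # 2) write, guarded by revision
            return free[0]                       # 3) success means no race occurred

store = AllocationStore()
a = allocate(store, "node1", ["10.1.0.0/24", "10.1.1.0/24"])
b = allocate(store, "node2", ["10.1.0.0/24", "10.1.1.0/24"])
print(a, b)
```

With this shape, even two racing controllers cannot both claim the same subnet: the loser's CAS fails, it re-reads the map, and it picks the next free subnet instead.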
There's a trello card open to track the larger issue: https://trello.com/c/B56OdzdS But since there's a work-around and the installer by default sets it up the right way, closing this.