Red Hat Bugzilla – Bug 1579025
[OVN][RA] Forceful reset of OVN master node results in promotion/demotion loop of ovndb-servers on other nodes
Last modified: 2018-10-26 11:34:11 EDT
Description of problem:
When controller-0 is reset and controller-1 is chosen as the new master, the IPAddr2 resource is moved to controller-1 (as there is a colocation constraint set). The ovsdb-servers on controller-1, which were previously running as standby, report their status as active as soon as the IPAddr2 address is configured. The ovsdb_server_promote function then returns at L412 and we never record the active master (by running $CRM_MASTER -N $host_name -v ${master_score}). When notify is later called with the "post-promote" op, since we did not record the master node, the check at L150 (https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L150) evaluates to false and we demote again. This results in a loop, with pacemaker promoting controller-1 and controller-2 alternately.

Version-Release number of selected component (if applicable):
openvswitch-ovn-common-2.9.0-20.el7fdp.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Forcefully reset the master node of the ovndb-servers resource on an OpenStack deployment with OVN deployed.

Actual results:
No other node (currently in standby mode) is successfully promoted to master; they keep failing and promoting/demoting in a loop.

Expected results:
One of the slave nodes should be promoted successfully to master.

Additional info:
As a result of this bug, much Neutron-related functionality does not work, including spawning new instances.
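For illustration, below is a minimal, self-contained shell sketch of the failure sequence described above. The function and variable names (promote, notify_post_promote, recorded_master) are illustrative stand-ins for ovsdb_server_promote, the post-promote notify handling and the master score set via $CRM_MASTER in ovndb-servers.ocf; this is not the actual agent code.

  #!/bin/sh
  # Simplified simulation of the promotion/demotion loop described above.
  host_name="controller-1"
  recorded_master=""   # stands in for the master score set via $CRM_MASTER

  promote() {
      echo "promote: ${host_name} already reports active (IPAddr2 just moved here),"
      echo "         returning early without recording the master score"
      # The missing step, equivalent to:
      #   ${CRM_MASTER} -N $host_name -v ${master_score}
      # recorded_master="$host_name"
      return 0
  }

  notify_post_promote() {
      # Stand-in for the "L150" check: the recorded master must match this
      # host, otherwise the agent demotes itself again.
      if [ "$recorded_master" != "$host_name" ]; then
          echo "post-promote: recorded master '${recorded_master}' != '${host_name}', demoting"
          return 1
      fi
      echo "post-promote: ${host_name} confirmed as master"
      return 0
  }

  promote
  notify_post_promote   # fails -> pacemaker tries the next standby -> loop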
Not sure if we want to make this a blocker, since the issue is only seen when the node is brought down ungracefully.
Submitted the patch to fix the issue - https://patchwork.ozlabs.org/patch/915289/
Does this only happen when we forcibly reset the master node? If a failure occurred in ovsdb-server (or the container was stopped), would the failover take place smoothly? Then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?
The patch is merged in OVS master and branch-2.9 (and is part of the 2.9.2 tag) - https://github.com/openvswitch/ovs/commit/c16e265713bef1c701bfce7608d68ab11695e286
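For reference, here is a hedged sketch of the shape of the fix (based on the description in this BZ, not a copy of the merged commit): record the master score before the early return in ovsdb_server_promote, so the later post-promote check passes. The crm_master invocation is echoed rather than executed so the sketch runs outside a cluster; host_name and master_score mirror the variable names quoted in the description.

  #!/bin/sh
  # Sketch only: echo the crm_master call instead of executing it.
  CRM_MASTER="echo crm_master"   # stand-in; the real agent invokes pacemaker's crm_master helper
  host_name=$(hostname)
  master_score=10

  promote_already_active_server() {
      # Previously the agent returned here without recording the master,
      # which made the post-promote check fail and caused the demote loop.
      ${CRM_MASTER} -N "$host_name" -v "$master_score"   # the step the fix adds
      return 0
  }

  promote_already_active_server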
I am assigning the BZ to the openvswitch component.
(In reply to Daniel Alvarez Sanchez from comment #5)
> Does this only happen when we forcibly reset the master node?

That's correct, at least in my tests, but I believe I saw a mention from Numan that he can reproduce it with a clean shutdown of the ovsdb-server master through pacemaker.

> If a failure occurred in ovsdb-server (or the container was stopped), would
> the failover take place smoothly?

I do not think so. I believe that if the container is stopped outside of pacemaker (i.e. using a docker command), it may behave the same way, i.e. with the buggy behaviour.

> Then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?

I managed to get to the point where the above bug could be hit, because the bug in this BZ report was not 100% reproducible.
Backported in the internal build: openvswitch-2.9.0-40.el7fdn
Brew: http://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16484769
Please do not crosstag FDN builds.
We need this for the next zstream release of OSP13. Will we pick this up from upstream, or is it already backported downstream? Can we set "Fixed in version" and the status to MODIFIED?
Hi, it was fixed in openvswitch-2.9.0-40.el7fdn, and FDP 18.06 is aligned to openvswitch-2.9.0-47.el7fdp, so it'll include the fix.
We'll want to bump OVS (+OVN) for z2.