Bug 1579025 - [OVN][RA] Forceful reset of OVN master node results in promotion/demotion loop of ovndb-servers on other nodes
Status: POST
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z2
Target Release: 13.0 (Queens)
Assigned To: Timothy Redaelli
QA Contact: Ofer Blaut
Keywords: Triaged, ZStream
Depends On:
Blocks:
Reported: 2018-05-16 15:42 EDT by Marian Krcmarik
Modified: 2018-07-09 09:35 EDT
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
The OVN pacemaker Resource Agent (RA) script sometimes does not handle the promote action properly when pacemaker tries to promote a slave node. This happens when the ovsdb-servers report their status as "master" to the RA script once the master IP has moved to the node. The issue is fixed upstream. When the issue occurs, the neutron server cannot connect to the OVN North and South DB servers, and all Create/Update/Delete API calls to the neutron server fail. Restarting the ovn-dbs-bundle resource resolves the issue. Run the following command on one of the controller nodes (a short session sketch follows the field list below): "pcs resource restart ovn-dbs-bundle"
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
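
A minimal session sketch of the workaround from the Doc Text above (the status check is only a suggested way to confirm the symptom; the restart command is quoted from the Doc Text):

    # Symptom check: the ovn-dbs-bundle flaps between controllers and
    # never settles on a master.
    pcs status

    # Recovery: restart the whole bundle from any controller node.
    pcs resource restart ovn-dbs-bundle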


Attachments: None
Description Marian Krcmarik 2018-05-16 15:42:35 EDT
Description of problem:
When controller-0 is reset and controller-1 is chosen as master, the IPAddr2 resource is moved to controller-1 (as there is a colocation constraint set). The ovsdb-servers on controller-1, which were previously running as standby, report their status as active as soon as IPAddr2 is configured. The ovsdb_server_promote function then returns early at L412 and we never record the active master (by running $CRM_MASTER -N $host_name -v ${master_score}).

And when notify is called with the "post-promote" op, since we did not record the master node, the check at L150 (https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L150) evaluates to false and we demote back. This results in a loop with pacemaker promoting controller-1 and controller-2 alternately.
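
A simplified sketch of the flow described above (paraphrased; function and variable names follow the description, but this is not the verbatim ovndb-servers.ocf code):

    ovsdb_server_promote() {
        ovsdb_server_check_status
        rc=$?
        if [ "$rc" -eq "$OCF_RUNNING_MASTER" ]; then
            # The ovsdb-servers already report "active" because IPAddr2
            # just moved to this node, so we return early (the L412 case
            # above) WITHOUT recording the master via:
            #     $CRM_MASTER -N $host_name -v ${master_score}
            return "$OCF_SUCCESS"
        fi

        # Normal path: promote the DB servers, then record this node as
        # master so that the post-promote notify check (L150) passes.
        ovn-ctl promote_ovnnb
        ovn-ctl promote_ovnsb
        ${CRM_MASTER} -N "$host_name" -v "${master_score}"
        return "$OCF_SUCCESS"
    }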

Version-Release number of selected component (if applicable):
openvswitch-ovn-common-2.9.0-20.el7fdp.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Forcefully reset Master node of ovndb-servers resource on Openstack deployment with OVN deployed.
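
For example (the reset method is illustrative; any unclean reset of the master node, such as an IPMI power reset, should work):

    # Find which controller currently hosts the ovndb-servers master:
    pcs status | grep -A 2 ovn-dbs-bundle

    # On that node, force an immediate unclean reboot
    # (may first require: echo 1 > /proc/sys/kernel/sysrq):
    echo b > /proc/sysrq-trigger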

Actual results:
None of the other nodes (which are in standby mode) is successfully promoted to master. They keep failing and promoting/demoting in a loop.

Expected results:
One of the slave nodes should be promoted successfully to Master.

Additional info:
As a result of the bug, much neutron-related functionality does not work, including spawning new instances and many other operations.
Comment 3 Numan Siddique 2018-05-17 05:16:22 EDT
Not sure if we want to make this a blocker, since the issue is seen only when the node is brought down ungracefully.
Comment 4 Numan Siddique 2018-05-17 06:05:04 EDT
Submitted the patch to fix the issue - https://patchwork.ozlabs.org/patch/915289/
Comment 5 Daniel Alvarez Sanchez 2018-05-21 11:01:16 EDT
Does this only happen when we forcibly reset the master node? If a failure occurred in ovsdb-server (or the container was stopped), would the failover take place smoothly, and would we then hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?
Comment 6 Numan Siddique 2018-05-28 02:21:57 EDT
The patch is merged in ovs master, branch 2.9 (and is part of 2.9.2 tag) - https://github.com/openvswitch/ovs/commit/c16e265713bef1c701bfce7608d68ab11695e286
Comment 7 Numan Siddique 2018-05-28 02:22:51 EDT
I am assigning the BZ to the openvswitch component.
Comment 8 Marian Krcmarik 2018-05-28 05:59:30 EDT
(In reply to Daniel Alvarez Sanchez from comment #5)
> Does this only happen when we forcibly reset the master node? If a failure
That's correct, at least in my tests, but I believe I saw a mention from Numan that he can reproduce it with a clean shutdown of the ovsdb-server master through pacemaker.
> would occur in ovsdb-server (or the container is stopped), the failover
> takes place smoothly? then we'd hit
I do not think so; I believe if the container is stopped outside of pacemaker (i.e. using a docker command) it may behave the same way, i.e. with the buggy behaviour.
> https://bugzilla.redhat.com/show_bug.cgi?id=1578312?
I managed to get to the point where the above bug could be hit, because the bug in this BZ report was not 100% reproducible.
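
For instance, stopping the container outside of pacemaker would look something like this (the container name here is a guess based on typical pacemaker bundle naming, not taken from an actual deployment):

    # List the ovn-dbs bundle containers:
    docker ps --format '{{.Names}}' | grep ovn-dbs

    # Stop one directly, bypassing pacemaker:
    docker stop ovn-dbs-bundle-docker-0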
Comment 9 Timothy Redaelli 2018-05-28 06:23:08 EDT
Backported in the internal build: openvswitch-2.9.0-40.el7fdn
Brew: http://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16484769

Please do not crosstag FDN builds.
Comment 13 Miguel Angel Ajo 2018-06-19 05:26:00 EDT
We need this in the next zstream release of OSP13. Will we pick this up from upstream, or is it already backported downstream? Can we set "Fixed in version" and move the status to MODIFIED?
Comment 14 Timothy Redaelli 2018-06-19 08:10:15 EDT
Hi,
it was fixed in openvswitch-2.9.0-40.el7fdn, and FDP 18.06 is aligned to openvswitch-2.9.0-47.el7fdp, so it will include the fix.
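
To check whether a given node already carries the fix (version numbers taken from this comment):

    rpm -q openvswitch
    # Fixed if this reports openvswitch-2.9.0-40.el7fdn or later;
    # FDP 18.06 ships openvswitch-2.9.0-47.el7fdp.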
Comment 15 Assaf Muller 2018-07-09 09:35:51 EDT
We'll want to bump OVS (+OVN) for z2.
