Bug 1579025

Summary: [OVN][RA] Forceful reset of OVN master node results in promotion/demotion loop of ovndb-servers on other nodes
Product: Red Hat OpenStack
Reporter: Marian Krcmarik <mkrcmari>
Component: openvswitch
Assignee: Assaf Muller <amuller>
Status: CLOSED CURRENTRELEASE
QA Contact: Roman Safronov <rsafrono>
Severity: urgent
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: amuller, apevec, chrisw, dalvarez, jschluet, lhh, majopela, mariel, mkrcmari, nusiddiq, ragiman, rhos-maint, rkhan, shdunne, srevivo, takito, tredaelli
Target Milestone: z4
Keywords: TestOnly, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch-2.9.0-70.el7fdp.1
Doc Type: Known Issue
Doc Text:
The OVN pacemaker Resource Agent (RA) script sometimes does not handle the promotion action properly when pacemaker tries to promote a slave node. This is seen when the ovsdb-servers report their status as master to the RA script as soon as the master IP is moved to the node. The issue is fixed upstream. When the issue occurs, the neutron server cannot connect to the OVN North and South DB servers, and all Create/Update/Delete API calls to the neutron server fail. Restarting the ovn-dbs-bundle resource resolves the issue. Run the following command on one of the controller nodes: "pcs resource restart ovn-dbs-bundle"
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-15 10:33:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1658631    
Bug Blocks:    

Description Marian Krcmarik 2018-05-16 19:42:35 UTC
Description of problem:
When controller-0 is reset and controller-1 is chosen as the new master, the IPaddr2 resource is moved to controller-1 (as there is a colocation constraint set). The ovsdb-servers on controller-1, which were previously running as standby, report their status as active as soon as the IPaddr2 address is configured. The ovsdb_server_promote function then returns at L412 and we never record the active master (by running $CRM_MASTER -N $host_name -v ${master_score}).

When notify is subsequently called with the "post-promote" op, since we did not record the master node, the check at L150 (https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L150) evaluates to false and we demote the node again. This results in a loop with pacemaker promoting controller-1 and controller-2 alternately.
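For reference, the following is a simplified shell paraphrase of the two code paths described above. It is a sketch of the logic only, not the verbatim upstream ovndb-servers.ocf; names follow the conventions used in that script.

ovsdb_server_promote() {
    ovsdb_server_check_status
    rc=$?
    if [ "$rc" -eq "$OCF_RUNNING_MASTER" ]; then
        # The master IP has just moved here, so ovsdb-server already
        # reports "active": we take the early return (the L412 path)
        # and the line below that records the master never runs.
        return $OCF_SUCCESS
    fi

    # ... actual promotion work elided ...

    # Skipped in the scenario above, so pacemaker has no record of
    # which node it promoted.
    ${CRM_MASTER} -N "$host_name" -v "${master_score}"
    return $OCF_SUCCESS
}

ovsdb_server_notify() {
    # "post-promote" handler (the L150 check, paraphrased): because the
    # master was never recorded, the comparison fails on the freshly
    # promoted node and it demotes itself, producing the loop.
    if [ "$1" = "post-promote" ] && [ "${MASTER_HOST}" != "${host_name}" ]; then
        ovsdb_server_demote
    fi
}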

Version-Release number of selected component (if applicable):
openvswitch-ovn-common-2.9.0-20.el7fdp.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Forcefully reset the master node of the ovndb-servers resource on an OpenStack deployment with OVN deployed (one way to do this is sketched below).
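For example, assuming console or virsh access to the controller that currently holds the ovndb-servers master:

# Find which controller is the current ovndb-servers master:
sudo pcs status | grep -A4 ovn-dbs-bundle

# On that node, trigger an immediate ungraceful reboot; sysrq "b"
# bypasses any clean shutdown, mimicking a hard reset:
echo b | sudo tee /proc/sysrq-trigger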

Actual results:
No other node (the remaining nodes are in standby mode) is successfully promoted to master; pacemaker keeps failing and promoting/demoting in a loop.

Expected results:
One of the slave nodes should be promoted successfully to Master.

Additional info:
As a result of the bug, much neutron-related functionality does not work, including spawning new instances.

Comment 3 Numan Siddique 2018-05-17 09:16:22 UTC
Not sure if we want to make this a blocker, since the issue is seen only when the node is brought down ungracefully.

Comment 4 Numan Siddique 2018-05-17 10:05:04 UTC
Submitted the patch to fix the issue - https://patchwork.ozlabs.org/patch/915289/

Comment 5 Daniel Alvarez Sanchez 2018-05-21 15:01:16 UTC
Does this only happen when we forcibly reset the master node? If a failure occurs in ovsdb-server (or the container is stopped), does the failover take place smoothly? Or would we then hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?

Comment 6 Numan Siddique 2018-05-28 06:21:57 UTC
The patch is merged in OVS master and branch-2.9 (and is part of the 2.9.2 tag) - https://github.com/openvswitch/ovs/commit/c16e265713bef1c701bfce7608d68ab11695e286
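Based on the description in comment 0, the shape of the fix is roughly as follows (a paraphrase, not the verbatim commit): record the master score before taking the early-return path, so that the post-promote notification sees a consistent state and no longer demotes the node.

ovsdb_server_promote() {
    ovsdb_server_check_status
    rc=$?
    if [ "$rc" -eq "$OCF_RUNNING_MASTER" ]; then
        # Already active because the master IP moved here first: still
        # record this node as master before returning, so the
        # "post-promote" check passes.
        ${CRM_MASTER} -N "$host_name" -v "${master_score}"
        return $OCF_SUCCESS
    fi
    # ... rest of the promotion path unchanged ...
}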

Comment 7 Numan Siddique 2018-05-28 06:22:51 UTC
I am assigning the BZ to the openvswitch component.

Comment 8 Marian Krcmarik 2018-05-28 09:59:30 UTC
(In reply to Daniel Alvarez Sanchez from comment #5)
> Does this only happen when we forcibly reset the master node?
That's correct, at least in my tests, but I believe I saw a mention from Numan that he reproduced it with a clean shutdown of the ovsdb-server master through pacemaker.
> If a failure occurs in ovsdb-server (or the container is stopped), does the
> failover take place smoothly?
I do not think so. I believe that if the container is stopped outside of pacemaker (i.e. using a docker command), it may behave the same way, i.e. with the buggy behaviour.
> Or would we then hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?
I managed to get to the point where the above bug could be hit, because the bug in this BZ report was not 100% reproducible.

Comment 9 Timothy Redaelli 2018-05-28 10:23:08 UTC
Backported in the internal build: openvswitch-2.9.0-40.el7fdn
Brew: http://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16484769

Please do not crosstag FDN builds.

Comment 13 Miguel Angel Ajo 2018-06-19 09:26:00 UTC
We need this in the next zstream release of OSP13. Will we pick this up from upstream, or is it already backported downstream? Can we set "Fixed in version" and the status to MODIFIED?

Comment 14 Timothy Redaelli 2018-06-19 12:10:15 UTC
Hi,
it was fixed in openvswitch-2.9.0-40.el7fdn, and FDP 18.06 is aligned to openvswitch-2.9.0-47.el7fdp, so it will include the fix.

Comment 15 Assaf Muller 2018-07-09 13:35:51 UTC
We'll want to bump OVS (+OVN) for z2.

Comment 25 Eran Kuris 2018-12-11 12:59:46 UTC
I reproduced the issue, so I have to re-open it:
[root@controller-1 ~]# pcs status 
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec 11 12:55:36 2018
Last change: Mon Dec 10 13:56:29 2018 by root via cibadmin on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Stopped
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Stopped
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 ip-192.168.24.10	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.114	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.29	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.14	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.3.24	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.13	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Stopped
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp13/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Stopped
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Forcefully reset the master node with virt-manager.

OpenStack/13.0-RHEL-7/2018-12-07.1
[root@controller-1 ~]# rpm -qa | grep -i openvs
openvswitch-ovn-central-2.9.0-81.el7fdp.x86_64
openvswitch-2.9.0-81.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-81.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
openvswitch-ovn-common-2.9.0-81.el7fdp.x86_64
openstack-neutron-openvswitch-12.0.5-2.el7ost.noarch
python-openvswitch-2.9.0-81.el7fdp.x86_64

Comment 26 Marian Krcmarik 2018-12-11 13:17:32 UTC
(In reply to Eran Kuris from comment #25)
> I reproduce the issue so I have to re-open it:

That reminds me that it could be another bug - https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update the pacemaker package on the controller nodes to at least version pacemaker-1.1.19-8.el7_6.2.
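For example (on RHEL 7 controllers; the exact repositories depend on the deployment):

sudo yum update -y pacemaker
rpm -q pacemaker    # expect pacemaker-1.1.19-8.el7_6.2 or later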

Comment 27 Eran Kuris 2018-12-11 15:05:47 UTC
(In reply to Marian Krcmarik from comment #26)
> (In reply to Eran Kuris from comment #25)
> > I reproduced the issue, so I have to re-open it:
> 
> That reminds me that it could be another bug -
> https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update
> the pacemaker package on the controller nodes to at least version
> pacemaker-1.1.19-8.el7_6.2.

Which bug does it remind you of? You attached this bug's ID.

Comment 28 Marian Krcmarik 2018-12-12 10:46:48 UTC
(In reply to Eran Kuris from comment #27)
> (In reply to Marian Krcmarik from comment #26)
> > (In reply to Eran Kuris from comment #25)
> > > I reproduced the issue, so I have to re-open it:
> > 
> > That reminds me that it could be another bug -
> > https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update
> > the pacemaker package on the controller nodes to at least version
> > pacemaker-1.1.19-8.el7_6.2.
> 
> Which bug does it remind you of? You attached this bug's ID.

Sorry, I meant - https://bugzilla.redhat.com/show_bug.cgi?id=1654602

Comment 30 Jon Schlueter 2018-12-20 17:09:36 UTC
You might want to check whether you hit any SELinux issues when you reproduced this.
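For example (assuming auditd is running on the controllers):

# Look for recent AVC denials:
sudo ausearch -m AVC -ts recent

# Or grep the raw audit log:
sudo grep denied /var/log/audit/audit.log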

Comment 33 Lon Hohberger 2019-03-12 10:38:14 UTC
According to our records, this should be resolved by openvswitch-2.9.0-83.el7fdp.1.  This build is available now.

Comment 34 Roman Safronov 2019-03-14 15:39:34 UTC
Verified on puddle 13.0-RHEL-7/2019-03-01.1 with openvswitch-2.9.0-97.el7fdp.x86_64

Setup: environment with 3 controller and 2 compute nodes, fencing enabled.

Verified that after resetting the ovndb-servers master node, another node is promoted to master.
Tested with different reset types:
- echo o > /proc/sysrq-trigger
- force stop/reset via virt-manager
- pcs cluster stop controller-X
- sudo docker stop ovn-dbs-bundle-docker-0
- shutdown -h now
Checked that it is possible to create/delete a VM after each reset and that the VM has network access.
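For each reset type, the promotion can be watched from a surviving controller, e.g.:

# Expect one of the remaining nodes to move from Slave to Master
# within a couple of monitor intervals:
watch -n5 "sudo pcs status | grep -A4 ovn-dbs-bundle"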

[heat-admin@controller-0 ~]$ rpm -qa | grep openvswitch
openvswitch-2.9.0-97.el7fdp.x86_64
openvswitch-ovn-central-2.9.0-97.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-97.el7fdp.x86_64
openstack-neutron-openvswitch-12.0.5-4.el7ost.noarch
python-openvswitch-2.9.0-97.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch
openvswitch-ovn-host-2.9.0-97.el7fdp.x86_64

[heat-admin@controller-0 ~]$ rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-remote-1.1.19-8.el7_6.4.x86_64
pacemaker-cli-1.1.19-8.el7_6.4.x86_64
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.4.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch