Description of problem:
Auto recovery does not work on the OVN master node after killing the OVN DB servers or the ovn-northd service.
When running kill -9 on ovsdb-server [ovnnb_db.pid/ovnsb_db.pid] or ovn-northd,
the expected behavior is that one of the slave nodes takes over the OVN master role:
Pacemaker should detect that the services are down and move the master role to a slave node.
After debugging with dev, it looks like the recovery script on the master is not being called.
Version-Release number of selected component (if applicable):
rpm -qa |grep ovn
(overcloud) [root@controller-2 ~]# rpm -qa |grep pacemaker
Steps to Reproduce:
1. Deploy an HA setup with OVN.
2. kill -9 the ovn-northd service / ovsdb-server on the master node.
3. Verify that one of the slave nodes changes its status to "Master".
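To make the verification in step 3 concrete, here is a hedged sketch of the commands one could run on a controller; the pidfile paths and the `ovndb_servers` resource name are assumptions and may differ per deployment, so check the actual `pcs status` output first:

```shell
# Step 2: kill the DB servers on the current master node.
# Pidfile locations are assumptions; they may differ by packaging.
kill -9 "$(cat /var/run/openvswitch/ovnnb_db.pid)"
kill -9 "$(cat /var/run/openvswitch/ovnsb_db.pid)"

# Step 3: watch pacemaker promote one of the slaves.
# "ovndb_servers" is an assumed resource name.
pcs status | grep -A3 ovndb_servers
crm_mon -1 | grep -i master
```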
One thing which is missing from the OVN pacemaker OCF script is that, on the master node, it doesn't check the health of ovn-northd. This needs to be fixed.
But the main issue here is that, on the node where the OVN DB servers are running as master, pacemaker is not calling the OVN OCF script periodically with the "monitor" action, whereas it does call this script on the slave nodes. When a slave node is promoted to master, we see the same behavior; and when the former master becomes a slave, the OCF script again gets called periodically on it.
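For illustration, here is a minimal sketch (not the actual ovndb-servers OCF script) of the kind of health probe a "monitor" action could run for ovn-northd on the master. The pidfile path is an assumption; only the OCF return codes come from the OCF spec:

```shell
#!/bin/sh
# Assumed default pidfile location; the real path may differ by packaging.
OVN_NORTHD_PIDFILE="${OVN_NORTHD_PIDFILE:-/var/run/ovn/ovn-northd.pid}"

# Standard OCF return codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Return OCF_SUCCESS if the process named in the pidfile is alive,
# OCF_NOT_RUNNING otherwise.
northd_health_check() {
    pidfile="$1"
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only checks that the process exists.
    kill -0 "$pid" 2>/dev/null || return $OCF_NOT_RUNNING
    return $OCF_SUCCESS
}

# Example (inside a monitor action on the master):
#   northd_health_check "$OVN_NORTHD_PIDFILE" || exit $OCF_NOT_RUNNING
```

A check along these lines, wired into the monitor path, would let pacemaker notice a dead ovn-northd and trigger failover instead of reporting the resource healthy.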
It's an OSP12 setup in which all the other pacemaker services run as bundles and only the OVN DB service runs as a baremetal resource.
@Michelle - Do you have any comments on this?
We have a setup. We can definitely look into this anytime you are fine with.
Submitted the patch upstream to fix the issue - https://patchwork.ozlabs.org/patch/839022/
The latest patch - https://patchwork.ozlabs.org/patch/844113/
The patch to fix this issue has been merged into master, branch-2.8, and branch-2.7 - https://github.com/openvswitch/ovs/commit/e7b9b17cd096c569b1c4d408b423ecedb9497c41
[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
12 -p 2018-01-26.2
[root@controller-0 ~]# rpm -qa |grep openvswitch-2.7.3-3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.