Bug 1512568 - OVN services do not recover on the OVN master node after killing the ovsdb-server services [nb/sb] or the ovn-northd service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z1
Target Release: 12.0 (Pike)
Assignee: Numan Siddique
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks: 1433534
Reported: 2017-11-13 14:33 UTC by Eran Kuris
Modified: 2018-02-15 22:33 UTC (History)

Fixed In Version: openvswitch-2.7.3-3.git20180112.el7fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-30 20:25:08 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0248 normal SHIPPED_LIVE Red Hat OpenStack Platform 12 Bug Fix and Enhancement Advisory 2018-02-16 03:46:52 UTC
Launchpad 1731934 None None None 2017-11-13 14:33:52 UTC

Description Eran Kuris 2017-11-13 14:33:53 UTC
Description of problem:
Automatic recovery does not work on the OVN master node after the OVN DB services or the ovn-northd service are killed.
After running kill -9 on ovsdb-server [ovnnb_db.pid/ovnsb_db.pid] or ovn-northd, the expected behavior is that one of the slave nodes takes over the OVN master role.
Pacemaker should detect that the services are down and move the role to a slave node.
After debugging with the developers, it looks like the recovery script on the master is not called.

Version-Release number of selected component (if applicable):
rpm -qa |grep ovn 
openstack-nova-novncproxy-16.0.3-0.20171028031400.60d6e87.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
python-networking-ovn-3.0.1-0.20171005161553.0cde8a5.el7ost.noarch
novnc-0.6.1-1.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
(overcloud) [root@controller-2 ~]# rpm -qa |grep pacemaker
pacemaker-cli-1.1.16-12.el7_4.4.x86_64
ansible-pacemaker-1.0.3-2.el7ost.noarch
pacemaker-1.1.16-12.el7_4.4.x86_64
puppet-pacemaker-0.6.1-0.20171024215340.9a46ecd.el7ost.noarch
pacemaker-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.4.x86_

How reproducible:
100%

Steps to Reproduce:
1. Deploy an HA setup with OVN.
2. Run kill -9 on the ovn-northd service / ovsdb-server on the master node.
3. Verify that one of the slave nodes changes its status to "Master".
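
Step 2 above can be scripted. The following is a minimal Python sketch, not part of the original report. The pid file names ovnnb_db.pid and ovnsb_db.pid come from the description; the run directory /var/run/openvswitch and the file name ovn-northd.pid are assumptions and may differ on a given deployment. The helper names read_pid and kill_service are hypothetical.

```python
import os
import signal
from pathlib import Path

# Assumption: typical Open vSwitch run directory; adjust for your deployment.
RUN_DIR = Path("/var/run/openvswitch")
# ovnnb_db.pid and ovnsb_db.pid are named in the report; ovn-northd.pid is assumed.
PID_FILES = ("ovnnb_db.pid", "ovnsb_db.pid", "ovn-northd.pid")


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        return None


def kill_service(pid_file):
    """Send SIGKILL (the equivalent of kill -9) to the process named in pid_file.

    Returns True if the signal was sent, False if there was nothing to kill.
    """
    pid = read_pid(pid_file)
    if pid is None:
        return False
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```

On the master node, run as root, something like kill_service(RUN_DIR / "ovnsb_db.pid") would reproduce step 2. The cluster state for step 3 can then be inspected with pcs status.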

https://drive.google.com/a/redhat.com/file/d/1v_4oDMM1jQaQ7Ey40vUgrIaiFlGjF4lK/view?usp=sharing

Comment 1 Numan Siddique 2017-11-13 14:48:28 UTC
One thing missing from the OVN Pacemaker OCF script is that, on the master node, it doesn't check the health of ovn-northd. This needs to be fixed.

But the main issue here is that, on the node where the OVN DB servers are running as master, Pacemaker is not calling the OVN OCF script periodically with the "monitor" action, whereas it does call the script on the slave nodes. When a slave node is promoted to master, we see the same behavior on it. And when the node that was master becomes a slave, the OCF script is called periodically there.
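
To illustrate the missing health check, here is a minimal Python sketch of a role-aware monitor decision. It is not the logic of the real OVN OCF script or of the eventual patch. The exit codes are the standard OCF values; the function names pid_alive and monitor, and the choice to report a master without ovn-northd as OCF_FAILED_MASTER, are assumptions made for illustration. It also assumes ovn-northd only needs to be running on the master.

```python
import os

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


def pid_alive(pid):
    """Return True if a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 only probes; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def monitor(is_master, dbs_alive, northd_alive):
    """Hypothetical monitor decision for one node (illustrative only)."""
    if not dbs_alive:
        return OCF_FAILED_MASTER if is_master else OCF_NOT_RUNNING
    if is_master:
        # The check described as missing: a master whose ovn-northd
        # has died should not be reported as healthy.
        return OCF_RUNNING_MASTER if northd_alive else OCF_FAILED_MASTER
    return OCF_SUCCESS
```

A check like this only helps if Pacemaker actually invokes the monitor action on the master, which is the main issue described above.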

It's an OSP12 setup in which all the other Pacemaker services run as bundles and only the OVN DB service runs as a bare-metal resource.

@Michele - Do you have any comments on this?

Comment 3 Numan Siddique 2017-11-13 17:40:50 UTC
Hi Michele,
We have a setup available. We can look into this whenever it suits you.

Thanks
Numan

Comment 7 Numan Siddique 2017-11-22 12:20:22 UTC
Submitted the patch upstream to fix the issue - https://patchwork.ozlabs.org/patch/839022/

Comment 8 Numan Siddique 2017-12-04 05:43:12 UTC
The latest patch - https://patchwork.ozlabs.org/patch/844113/

Comment 9 Numan Siddique 2017-12-06 07:46:20 UTC
The patch to fix this issue is merged in master, branch-2.8, and branch-2.7 - https://github.com/openvswitch/ovs/commit/e7b9b17cd096c569b1c4d408b423ecedb9497c41

Comment 14 Eran Kuris 2018-01-28 15:38:26 UTC
Fix verified.
[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed 
12   -p 2018-01-26.2
[root@controller-0 ~]# rpm -qa |grep openvswitch-2.7.3-3
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64

Comment 17 errata-xmlrpc 2018-01-30 20:25:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0248

