Description of problem:
Auto recovery does not work on the OVN master node after killing the OVN DB servers or the ovn-northd service.
When running kill -9 on ovsdb-server [ovnnb_db.pid/ovnsb_db.pid] or ovn-northd,
the expected behavior is that one of the slave nodes takes over the OVN master role:
Pacemaker should detect that the services are down and move the master role to a slave node.
After debugging with dev, it looks like the recovery script on the master is not being called.
Version-Release number of selected component (if applicable):
rpm -qa |grep ovn
(overcloud) [root@controller-2 ~]# rpm -qa |grep pacemaker
Steps to Reproduce:
1. Deploy an HA setup with OVN.
2. kill -9 the ovn-northd service / ovsdb-server on the master node.
3. Verify that one of the slave nodes changes its status to "Master".
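To make the verification in step 3 concrete, here is a hedged sketch of the commands one could run on a controller; the pidfile paths and the `ovndb_servers` resource name are assumptions and may differ per deployment, so check the actual `pcs status` output first:

```shell
# Step 2: kill the DB servers on the current master node.
# Pidfile locations are assumptions; they may differ by packaging.
kill -9 "$(cat /var/run/openvswitch/ovnnb_db.pid)"
kill -9 "$(cat /var/run/openvswitch/ovnsb_db.pid)"

# Step 3: watch pacemaker promote one of the slaves.
# "ovndb_servers" is an assumed resource name.
pcs status | grep -A3 ovndb_servers
crm_mon -1 | grep -i master
```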
One thing which is missing from the OVN pacemaker OCF script is that, on the master node, it doesn't check the health of ovn-northd. This needs to be fixed.
But the main issue here is that, on the node where the OVN DB servers are running as master, pacemaker is not calling the OVN OCF script periodically with the "monitor" action, whereas it does call this script on the slave nodes. When a slave node is promoted to master, we see the same behavior; and when the former master becomes a slave, the OCF script again gets called periodically on it.
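For illustration, here is a minimal sketch (not the actual ovndb-servers OCF script) of the kind of health probe a "monitor" action could run for ovn-northd on the master. The pidfile path is an assumption; only the OCF return codes come from the OCF spec:

```shell
#!/bin/sh
# Assumed default pidfile location; the real path may differ by packaging.
OVN_NORTHD_PIDFILE="${OVN_NORTHD_PIDFILE:-/var/run/ovn/ovn-northd.pid}"

# Standard OCF return codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Return OCF_SUCCESS if the process named in the pidfile is alive,
# OCF_NOT_RUNNING otherwise.
northd_health_check() {
    pidfile="$1"
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only checks that the process exists.
    kill -0 "$pid" 2>/dev/null || return $OCF_NOT_RUNNING
    return $OCF_SUCCESS
}

# Example (inside a monitor action on the master):
#   northd_health_check "$OVN_NORTHD_PIDFILE" || exit $OCF_NOT_RUNNING
```

A check along these lines, wired into the monitor path, would let pacemaker notice a dead ovn-northd and trigger failover instead of reporting the resource healthy.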
It's an OSP12 setup in which all the other pacemaker services run as bundles and only the OVN DB service runs as a baremetal resource.
@Michelle - Do you have any comments on this?
We have a setup. We can definitely look into this anytime you are fine with.
Submitted the patch upstream to fix the issue - https://patchwork.ozlabs.org/patch/839022/
The latest patch - https://patchwork.ozlabs.org/patch/844113/
The patch to fix this issue has been merged into master, branch-2.8, and branch-2.7 - https://github.com/openvswitch/ovs/commit/e7b9b17cd096c569b1c4d408b423ecedb9497c41
[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
12 -p 2018-01-26.2
[root@controller-0 ~]# rpm -qa |grep openvswitch-2.7.3-3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.