Red Hat Bugzilla – Bug 1512568
OVN services does not recovered on OVN master node after killing ovsdb-server services[nb/sb] or ovn-northd service
Last modified: 2018-02-15 17:33:54 EST
Description of problem:
Automatic recovery does not work on the OVN master node after the OVN DB services or the ovn-northd service are killed. When running kill -9 on ovsdb-server [ovnnb_db.pid/ovnsb_db.pid] or ovn-northd, the expected behavior is that one of the slave nodes takes over the OVN master role: Pacemaker should detect that the services are down and move the role to a slave node. After debugging with dev, it looks like the recovery script on the master is not called.

Version-Release number of selected component (if applicable):
rpm -qa |grep ovn
openstack-nova-novncproxy-16.0.3-0.20171028031400.60d6e87.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
python-networking-ovn-3.0.1-0.20171005161553.0cde8a5.el7ost.noarch
novnc-0.6.1-1.el7ost.noarch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64

(overcloud) [root@controller-2 ~]# rpm -qa |grep pacemaker
pacemaker-cli-1.1.16-12.el7_4.4.x86_64
ansible-pacemaker-1.0.3-2.el7ost.noarch
pacemaker-1.1.16-12.el7_4.4.x86_64
puppet-pacemaker-0.6.1-0.20171024215340.9a46ecd.el7ost.noarch
pacemaker-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.4.x86_

How reproducible:
100%

Steps to Reproduce:
1. Deploy an HA setup with OVN.
2. Run kill -9 on the ovn-northd service or on ovsdb-server on the master node.
3. Verify that one of the slave nodes changes its status to "Master".

https://drive.google.com/a/redhat.com/file/d/1v_4oDMM1jQaQ7Ey40vUgrIaiFlGjF4lK/view?usp=sharing
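For anyone retrying step 2, a minimal shell sketch of the kill step follows. The pid file names come from the description; the RUNDIR directory and the helper names read_pid and kill_db are assumptions made for illustration, so adjust them to the actual deployment.

```shell
#!/bin/sh
# Sketch only. RUNDIR is an ASSUMED location for the pid files named in
# the report (ovnnb_db.pid / ovnsb_db.pid); override it if yours differs.
RUNDIR="${RUNDIR:-/var/run/openvswitch}"

# Hypothetical helper: print the pid stored in a pid file.
# Fails if the file is missing or empty.
read_pid() {
    [ -s "$1" ] || return 1
    head -n 1 "$1"
}

# Hypothetical helper: kill one OVN DB server abruptly (kill -9),
# as in step 2 of the report. $1 is the pid file name.
kill_db() {
    pid=$(read_pid "$RUNDIR/$1") || { echo "no pid file: $1" >&2; return 1; }
    kill -9 "$pid"
}

# Example, run on the current master node, then watch the cluster state:
#   kill_db ovnnb_db.pid
#   pcs status
```

After the kill, the expectation stated above is that pcs status eventually shows a different node holding the Master role.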
One thing missing from the OVN pacemaker OCF script is that, on the master node, it doesn't check the health of ovn-northd. This needs to be fixed. But the main issue here is that, on the node where the OVN DB servers are running as master, pacemaker is not calling the OVN OCF script periodically with the "monitor" action, whereas it does call the script on the slave nodes. When a slave node is promoted to master, we see the same behavior, and when the former master becomes a slave, the OCF script gets called periodically on it. It's an OSP12 setup in which all the other pacemaker services run as bundles and only the OVN DB service runs as a baremetal resource. @Michele - Do you have any comments on this?
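To illustrate the missing health check described in the comment above, here is a rough sketch of the decision a "monitor" action could make. It is not the merged patch: the function name monitor_result and its three arguments are hypothetical, and returning OCF_NOT_RUNNING for a dead ovn-northd is an assumption. Only the numeric return codes are the standard OCF values.

```shell
#!/bin/sh
# Standard OCF return codes relevant to a master/slave resource agent.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical helper: map three observations to a monitor result.
#   $1 role:      "master" or "slave"
#   $2 db_ok:     1 if both ovsdb-server processes are healthy, else 0
#   $3 northd_ok: 1 if ovn-northd is alive, else 0
monitor_result() {
    role=$1
    db_ok=$2
    northd_ok=$3

    # A dead DB server is a failure on any node.
    if [ "$db_ok" -ne 1 ]; then
        return $OCF_NOT_RUNNING
    fi

    if [ "$role" = "master" ]; then
        # ASSUMPTION: on the master, a dead ovn-northd is reported as
        # "not running" so that pacemaker can start recovery.
        [ "$northd_ok" -eq 1 ] || return $OCF_NOT_RUNNING
        return $OCF_RUNNING_MASTER
    fi

    # Healthy slave: ovn-northd is not considered here.
    return $OCF_SUCCESS
}
```

A check like this only helps if pacemaker actually invokes "monitor" on the master, which is the main issue raised in this comment.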
Hi Michele, We have a setup available. We can look into this whenever it is convenient for you. Thanks, Numan
Submitted the patch upstream to fix the issue - https://patchwork.ozlabs.org/patch/839022/
The latest patch - https://patchwork.ozlabs.org/patch/844113/
The patch to fix this issue is merged in master, branch-2.8 and branch-2.7 - https://github.com/openvswitch/ovs/commit/e7b9b17cd096c569b1c4d408b423ecedb9497c41
Fixed, verified.

[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
12 -p 2018-01-26.2

[root@controller-0 ~]# rpm -qa |grep openvswitch-2.7.3-3
python-openvswitch-2.7.3-3.git20180112.el7fdp.noarch
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0248