Description of problem:
This bug is similar to an OVS 2.9 bug on RHEL 7, bz1684363.

Version-Release number of selected component (if applicable):
[root@hp-dl388g8-02 ovn_ha]# uname -a
Linux hp-dl388g8-02.rhts.eng.pek2.redhat.com 4.18.0-64.el8.x86_64 #1 SMP Wed Jan 23 20:50:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl388g8-02 ovn_ha]# rpm -qa | grep openvswitch
openvswitch2.11-ovn-common-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-30.noarch
openvswitch2.11-ovn-central-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch-selinux-extra-policy-1.0-10.el8fdp.noarch
openvswitch2.11-ovn-host-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64

How reproducible:
every time

Steps to Reproduce:
1. Set up a cluster with 3 nodes as ovndb_servers.
2. Restart openvswitch on the master node.

[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar 1 03:40:50 2019
Last change: Fri Mar 1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50   (ocf::heartbeat:IPaddr2):       Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@hp-dl388g8-02 ovn_ha]# systemctl restart openvswitch
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar 1 03:41:21 2019
Last change: Fri Mar 1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50   (ocf::heartbeat:IPaddr2):       Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar 1 03:41:25 2019
Last change: Fri Mar 1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50   (ocf::heartbeat:IPaddr2):       Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers      (ocf::ovn:ovndb-servers):       FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='', last-rc-change='Fri Mar 1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar 1 03:41:27 2019
Last change: Fri Mar 1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50   (ocf::heartbeat:IPaddr2):       Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers      (ocf::ovn:ovndb-servers):       FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='', last-rc-change='Fri Mar 1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]#

Actual results:
The node does not come back after openvswitch is restarted.

Expected results:
The node comes back after openvswitch is restarted.

Additional info:
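For reference, step 1 of the reproducer can be sketched roughly as below, using the ovndb-servers OCF agent from OVN's pacemaker integration. The VIP address, resource names, and option values are illustrative assumptions based on the resource names visible in the pcs status output above, not an exact record of the setup:

```shell
# Assumes a 3-node pacemaker/corosync cluster is already up (pcs cluster setup/start).

# Virtual IP that follows the master OVN DB server (address is illustrative).
pcs resource create ip-70.0.0.50 ocf:heartbeat:IPaddr2 ip=70.0.0.50 \
    op monitor interval=30s

# Promotable ovndb-servers resource; master_ip must match the VIP above.
pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip=70.0.0.50 manage_northd=yes \
    op monitor interval=10s promotable

# Keep the VIP colocated with whichever node is promoted to master.
pcs constraint order promote ovndb_servers-clone then ip-70.0.0.50
pcs constraint colocation add ip-70.0.0.50 with master ovndb_servers-clone
```

With this in place, `pcs status` should show the `ovndb_servers-clone` promotable set with one master and two slaves, as in the output above.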
*** Bug 1723291 has been marked as a duplicate of this bug. ***
The main issue is that when you restart openvswitch, the OVS runtime directory, /var/run/openvswitch, is deleted and recreated by the openvswitch systemd script. Since the OVN ovsdb-servers (and ovn-controller) use the same runtime directory, all of the OVN ovsdb-servers' runtime socket files are deleted as well. After that, the OVN OCF script can no longer stop or monitor the ovsdb-servers. If you run "ps -aef | grep ovsdb-server" you will see that the old ovsdb-server processes are still running. Killing those processes manually and then refreshing the pacemaker resource recovers it. I think this is an expected and known issue. The proper fix is to use a separate runtime directory for OVN.
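The manual recovery described above can be sketched as follows, run on the affected master node. Treat this as an illustrative workaround rather than an official procedure; the pkill patterns assume the default OVN database file names (ovnnb_db, ovnsb_db) on their command lines:

```shell
# The openvswitch restart wiped /var/run/openvswitch, including the OVN
# ovsdb-server control sockets, so the OCF agent can no longer reach the
# daemons even though they are still running. Kill the orphaned servers:
pkill -f 'ovsdb-server.*ovnnb_db' || true
pkill -f 'ovsdb-server.*ovnsb_db' || true

# Then clear the failed actions so pacemaker re-probes and restarts the
# resource (on older pcs releases, "pcs resource cleanup" plays this role).
pcs resource cleanup ovndb_servers
```

After the cleanup, `pcs status` should show the clone set recovering instead of the stuck `ovndb_servers_demote_0` failure.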
*** Bug 1566412 has been marked as a duplicate of this bug. ***
This issue is blocked by bug: https://bugzilla.redhat.com/show_bug.cgi?id=1769202
Verified on ovn2.12.0-7:

[root@dell-per740-12 ovn_ha]# pcs status
Cluster name: my_cluster

WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)

Stack: corosync
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Sat Nov 9 01:20:42 2019
Last change: Sat Nov 9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com

3 nodes configured
4 resources configured

Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]

Full list of resources:

 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@dell-per740-12 ovn_ha]# hostname
dell-per740-12.rhts.eng.pek2.redhat.com
[root@dell-per740-12 ovn_ha]# systemctl restart openvswitch    <==== restart openvswitch
[root@dell-per740-12 ovn_ha]# pcs status
Cluster name: my_cluster

WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)

Stack: corosync
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Sat Nov 9 01:21:07 2019
Last change: Sat Nov 9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com

3 nodes configured
4 resources configured

Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]

Full list of resources:

 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled    <==== pcs is still up

[root@dell-per740-12 ovn_ha]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.12-2.12.0-4.el7fdp.x86_64
ovn2.12-host-2.12.0-7.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-43.noarch
ovn2.12-central-2.12.0-7.el7fdp.x86_64
ovn2.12-2.12.0-7.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch
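Since the fix is to give OVN its own runtime directory, the verification above can also be cross-checked by confirming that the OVN database sockets no longer live under /var/run/openvswitch and therefore survive the restart. The /var/run/ovn path below is an assumption based on the fix description, not something shown in this report:

```shell
# Before the restart: OVN DB sockets should be in OVN's own runtime
# directory (assumed /var/run/ovn here), not in /var/run/openvswitch.
ls /var/run/ovn/

# Restart openvswitch, then confirm the sockets are still present and the
# pacemaker master was not demoted.
systemctl restart openvswitch
ls /var/run/ovn/
pcs status | grep -A 2 ovndb_servers
```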
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:4208