Description of problem:

ovndb_servers fails to start after the master node's NIC is set down and then back up. Build an OVN cluster with pcs on 3 nodes, with node A as the master; set the NIC on A down, sleep a while, then set the NIC back up. Node A then fails to start the resource.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-1

How reproducible:
Always

Steps to Reproduce:
1. Run setenforce 0 and systemctl start openvswitch on all 3 nodes.
2. Set up pcs with the following script:

setenforce 0
systemctl start openvswitch

ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18
ip_v=1.1.1.100

(sleep 2; echo "hacluster"; sleep 2; echo "redhat") | pcs host auth $ip_c1 $ip_c2 $ip_s
sleep 5
pcs cluster setup my_cluster --force --start $ip_c1 $ip_c2 $ip_s
pcs cluster enable --all
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs cluster cib tmp-cib.xml
sleep 10
cp tmp-cib.xml tmp-cib.deltasrc
pcs resource delete ip-$ip_v
pcs resource delete ovndb_servers-clone
sleep 5
pcs status
pcs -f tmp-cib.xml resource create ip-$ip_v ocf:heartbeat:IPaddr2 ip=$ip_v op monitor interval=30s
sleep 5
pcs -f tmp-cib.xml resource create ovndb_servers ocf:ovn:ovndb-servers manage_northd=yes master_ip=$ip_v nb_master_port=6641 sb_master_port=6642 promotable
sleep 5
pcs -f tmp-cib.xml resource meta ovndb_servers-clone notify=true
pcs -f tmp-cib.xml constraint order start ip-$ip_v then promote ovndb_servers-clone
pcs -f tmp-cib.xml constraint colocation add ip-$ip_v with master ovndb_servers-clone
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_s=1500
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_s=1500
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c2=1000
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_c2=1000
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c1=500
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_c1=500
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.deltasrc

4. The master is 1.1.1.16,
then set the interface with IP 1.1.1.16 down.
5. Sleep 15 seconds, then set 1.1.1.16 up.
6. Run pcs status.

Actual results:
1.1.1.16 fails to start.

Expected results:
1.1.1.16 should start.

Additional info:

[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:16:04 2021
  * Last change: Mon Jan 11 22:15:54 2021 by root via crm_attribute on 1.1.1.16
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started 1.1.1.16
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * Masters: [ 1.1.1.16 ]
    * Slaves: [ 1.1.1.17 1.1.1.18 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

[root@wsfd-advnetlab16 bz1614166]# ip addr sh ens1f0
5: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:42:a1:08:0b:1a brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.16/24 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 1.1.1.100/24 scope global secondary ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a1ff:fe08:b1a/64 scope link
       valid_lft forever preferred_lft forever
[root@wsfd-advnetlab16 bz1614166]# ip link set ens1f0 down
[root@wsfd-advnetlab16 bz1614166]# sleep 15
[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition WITHOUT quorum
  * Last updated: Mon Jan 11 22:16:38 2021
  * Last change: Mon Jan 11 22:15:54 2021 by root via crm_attribute on 1.1.1.16
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 ]
  * OFFLINE: [ 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Stopped
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * Masters: [ 1.1.1.16 ]
    * Stopped: [ 1.1.1.17 1.1.1.18 ]

Failed Resource Actions:
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

[root@wsfd-advnetlab16 bz1614166]# ip link set ens1f0 up
[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:17:39 2021
  * Last change: Mon Jan 11 22:16:57 2021 by hacluster via crmd on 1.1.1.17
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Stopped
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * ovndb_servers (ocf::ovn:ovndb-servers): Starting 1.1.1.18
    * ovndb_servers (ocf::ovn:ovndb-servers): FAILED 1.1.1.16 (Monitoring)
    * Slaves: [ 1.1.1.17 ]

Failed Resource Actions:
  * ovndb_servers_monitor_30000 on 1.1.1.16 'not running' (7): call=159, status='complete', exitreason='', last-rc-change='2021-01-11 22:17:37 -05:00', queued=0ms, exec=136ms
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

<==== 1.1.1.16 fails to start

[root@wsfd-advnetlab16 bz1614166]# rpm -qa | grep -E "openvswitch|ovn|pacemaker|pcs"
pacemaker-cli-2.0.4-6.el8.x86_64
ovn2.13-20.12.0-1.el8fdp.x86_64
openvswitch2.13-2.13.0-77.el8fdp.x86_64
pacemaker-schemas-2.0.4-6.el8.noarch
pacemaker-cluster-libs-2.0.4-6.el8.x86_64
pacemaker-2.0.4-6.el8.x86_64
pcs-0.10.6-4.el8.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-20.12.0-1.el8fdp.x86_64
pacemaker-libs-2.0.4-6.el8.x86_64
ovn2.13-host-20.12.0-1.el8fdp.x86_64

[root@wsfd-advnetlab16 bz1614166]# uname -a
Linux wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Status on 1.1.1.18:

[root@wsfd-advnetlab18 ~]# ip addr sh ens1f0
5: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:42:a1:08:0b:02 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.18/24 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 1.1.1.100/24 scope global secondary ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a1ff:fe08:b02/64 scope link
       valid_lft forever preferred_lft forever
[root@wsfd-advnetlab18 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:20:34 2021
  * Last change: Mon Jan 11 22:16:16 2021 by root via crm_attribute on 1.1.1.18
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started 1.1.1.18
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * ovndb_servers (ocf::ovn:ovndb-servers): Demoting 1.1.1.18
    * ovndb_servers (ocf::ovn:ovndb-servers): FAILED 1.1.1.16 (Monitoring)
    * Slaves: [ 1.1.1.17 ]

Failed Resource Actions:
  * ovndb_servers_monitor_30000 on 1.1.1.16 'not running' (7): call=749, status='complete', exitreason='', last-rc-change='2021-01-11 22:20:34 -05:00', queued=0ms, exec=152ms
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled
The issue also exists on 20.I (ovn2.13-20.09.0-17).
Corosync 2 (in RHEL 7) cannot handle interface down/up. This was corrected in corosync 3 (in RHEL 8) as part of a major design overhaul that is unfortunately not backportable. This report should be closed WONTFIX. As an aside, interface down/up is not a good test of network outages, as it does not accurately correspond to what happens in a real outage. Either physically pulling the network cable, or using the firewall to block all incoming and outgoing packets on an interface, is a more accurate test.
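For reference, the firewall-based test suggested above could be sketched roughly as follows. This is a hypothetical helper, not part of this report: the interface name ens1f0 and the use of iptables are assumptions taken from the reproduction environment.

```shell
# Hypothetical sketch: simulate a network outage with the firewall instead
# of flapping the link. The link stays UP, so the node keeps its addresses
# and IPaddr2/findif is not confused by a missing interface.
IFACE=ens1f0

block_iface() {
    # Drop all traffic in and out on the interface.
    iptables -A INPUT  -i "$IFACE" -j DROP
    iptables -A OUTPUT -o "$IFACE" -j DROP
}

unblock_iface() {
    # Delete the same rules to restore connectivity.
    iptables -D INPUT  -i "$IFACE" -j DROP
    iptables -D OUTPUT -o "$IFACE" -j DROP
}
```

Used in place of steps 4-6 of the reproducer, this would look like: block_iface; sleep 15; unblock_iface; pcs status.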
Closing WONTFIX based on Ken's recommendation.