Description of problem:

ovndb_servers fails to start after the master node's NIC is set down and then back up. Build an OVN cluster with pcs on 3 nodes, with node A as the master; set the NIC on A down, sleep a while, then set the NIC back up. Node A then fails to start the resource.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-1

How reproducible:
Always

Steps to Reproduce:
1. Run setenforce 0 and systemctl start openvswitch on all 3 nodes.
2. Set up pcs with the following script:

setenforce 0
systemctl start openvswitch

ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18
ip_v=1.1.1.100

(sleep 2; echo "hacluster"; sleep 2; echo "redhat") | pcs host auth $ip_c1 $ip_c2 $ip_s
sleep 5
pcs cluster setup my_cluster --force --start $ip_c1 $ip_c2 $ip_s
pcs cluster enable --all
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs cluster cib tmp-cib.xml
sleep 10
cp tmp-cib.xml tmp-cib.deltasrc
pcs resource delete ip-$ip_v
pcs resource delete ovndb_servers-clone
sleep 5
pcs status
pcs -f tmp-cib.xml resource create ip-$ip_v ocf:heartbeat:IPaddr2 ip=$ip_v op monitor interval=30s
sleep 5
pcs -f tmp-cib.xml resource create ovndb_servers ocf:ovn:ovndb-servers manage_northd=yes master_ip=$ip_v nb_master_port=6641 sb_master_port=6642 promotable
sleep 5
pcs -f tmp-cib.xml resource meta ovndb_servers-clone notify=true
pcs -f tmp-cib.xml constraint order start ip-$ip_v then promote ovndb_servers-clone
pcs -f tmp-cib.xml constraint colocation add ip-$ip_v with master ovndb_servers-clone
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_s=1500
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_s=1500
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c2=1000
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_c2=1000
pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c1=500
pcs -f tmp-cib.xml constraint location ovndb_servers-clone prefers $ip_c1=500
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.deltasrc

4. The master is 1.1.1.16,
then set the interface with IP 1.1.1.16 down.
5. Sleep 15 seconds, then set 1.1.1.16 up.
6. Run pcs status.

Actual results:
1.1.1.16 fails to start.

Expected results:
1.1.1.16 should start.

Additional info:

[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:16:04 2021
  * Last change: Mon Jan 11 22:15:54 2021 by root via crm_attribute on 1.1.1.16
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started 1.1.1.16
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * Masters: [ 1.1.1.16 ]
    * Slaves: [ 1.1.1.17 1.1.1.18 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

[root@wsfd-advnetlab16 bz1614166]# ip addr sh ens1f0
5: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:42:a1:08:0b:1a brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.16/24 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 1.1.1.100/24 scope global secondary ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a1ff:fe08:b1a/64 scope link
       valid_lft forever preferred_lft forever
[root@wsfd-advnetlab16 bz1614166]# ip link set ens1f0 down
[root@wsfd-advnetlab16 bz1614166]# sleep 15
[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition WITHOUT quorum
  * Last updated: Mon Jan 11 22:16:38 2021
  * Last change: Mon Jan 11 22:15:54 2021 by root via crm_attribute on 1.1.1.16
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 ]
  * OFFLINE: [ 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Stopped
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * Masters: [ 1.1.1.16 ]
    * Stopped: [ 1.1.1.17 1.1.1.18 ]

Failed Resource Actions:
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

[root@wsfd-advnetlab16 bz1614166]# ip link set ens1f0 up
[root@wsfd-advnetlab16 bz1614166]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:17:39 2021
  * Last change: Mon Jan 11 22:16:57 2021 by hacluster via crmd on 1.1.1.17
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Stopped
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * ovndb_servers (ocf::ovn:ovndb-servers): Starting 1.1.1.18
    * ovndb_servers (ocf::ovn:ovndb-servers): FAILED 1.1.1.16 (Monitoring)
    * Slaves: [ 1.1.1.17 ]

Failed Resource Actions:
  * ovndb_servers_monitor_30000 on 1.1.1.16 'not running' (7): call=159, status='complete', exitreason='', last-rc-change='2021-01-11 22:17:37 -05:00', queued=0ms, exec=136ms
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

<==== 1.1.1.16 fails to start

[root@wsfd-advnetlab16 bz1614166]# rpm -qa | grep -E "openvswitch|ovn|pacemaker|pcs"
pacemaker-cli-2.0.4-6.el8.x86_64
ovn2.13-20.12.0-1.el8fdp.x86_64
openvswitch2.13-2.13.0-77.el8fdp.x86_64
pacemaker-schemas-2.0.4-6.el8.noarch
pacemaker-cluster-libs-2.0.4-6.el8.x86_64
pacemaker-2.0.4-6.el8.x86_64
pcs-0.10.6-4.el8.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-20.12.0-1.el8fdp.x86_64
pacemaker-libs-2.0.4-6.el8.x86_64
ovn2.13-host-20.12.0-1.el8fdp.x86_64

[root@wsfd-advnetlab16 bz1614166]# uname -a
Linux wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Status on 1.1.1.18:

[root@wsfd-advnetlab18 ~]# ip addr sh ens1f0
5: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:42:a1:08:0b:02 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.18/24 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 1.1.1.100/24 scope global secondary ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a1ff:fe08:b02/64 scope link
       valid_lft forever preferred_lft forever
[root@wsfd-advnetlab18 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: 1.1.1.16 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
  * Last updated: Mon Jan 11 22:20:34 2021
  * Last change: Mon Jan 11 22:16:16 2021 by root via crm_attribute on 1.1.1.18
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ 1.1.1.16 1.1.1.17 1.1.1.18 ]

Full List of Resources:
  * ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started 1.1.1.18
  * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable):
    * ovndb_servers (ocf::ovn:ovndb-servers): Demoting 1.1.1.18
    * ovndb_servers (ocf::ovn:ovndb-servers): FAILED 1.1.1.16 (Monitoring)
    * Slaves: [ 1.1.1.17 ]

Failed Resource Actions:
  * ovndb_servers_monitor_30000 on 1.1.1.16 'not running' (7): call=749, status='complete', exitreason='', last-rc-change='2021-01-11 22:20:34 -05:00', queued=0ms, exec=152ms
  * ip-1.1.1.100_start_0 on 1.1.1.16 'error' (1): call=23, status='complete', exitreason='[findif] failed', last-rc-change='2021-01-11 22:16:24 -05:00', queued=0ms, exec=92ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled
The issue also exists on 20.I (ovn2.13-20.09.0-17).
Corosync 2 (in RHEL 7) cannot handle interface down/up. This was corrected in corosync 3 (in RHEL 8) as part of a major design overhaul that is unfortunately not backportable. This report should be closed WONTFIX. As an aside, interface down/up is not a good test of network outages, as it does not accurately correspond to what happens in a real outage. Either physically pulling the network cable, or using the firewall to block all incoming and outgoing packets on an interface, is a more accurate test.
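For reference, the firewall-based test suggested above could be sketched roughly as follows. This is a hypothetical helper, not part of this report: the interface name ens1f0 and the use of iptables are assumptions taken from the reproduction environment.

```shell
# Hypothetical sketch: simulate a network outage with the firewall instead
# of flapping the link. The link stays UP, so the node keeps its addresses
# and IPaddr2/findif is not confused by a missing interface.
IFACE=ens1f0

block_iface() {
    # Drop all traffic in and out on the interface.
    iptables -A INPUT  -i "$IFACE" -j DROP
    iptables -A OUTPUT -o "$IFACE" -j DROP
}

unblock_iface() {
    # Delete the same rules to restore connectivity.
    iptables -D INPUT  -i "$IFACE" -j DROP
    iptables -D OUTPUT -o "$IFACE" -j DROP
}
```

Used in place of steps 4-6 of the reproducer, this would look like: block_iface; sleep 15; unblock_iface; pcs status.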
Closing WONTFIX based on Ken's recommendation.