Bug 1684419

Summary:	[ovn_cluster][RHEL 7] master node can't be up after restart openvswitch
Product:	Red Hat Enterprise Linux Fast Datapath	Reporter:	haidong li <haili>
Component:	OVN	Assignee:	Numan Siddique <nusiddiq>
Status:	CLOSED ERRATA	QA Contact:	Jianlin Shi <jishi>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	FDP 19.G	CC:	ctrautma, dceara, ekuris, fhallal, jhsiao, jiji, jishi, kfida, mmichels, nusiddiq, qding, ralongi
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	ovn2.12-2.12.0-2	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1787971		Environment:
Last Closed:	2019-12-11 12:04:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1787971

Description haidong li 2019-03-01 08:49:45 UTC

Description of problem:
This bug is similar to the OVS 2.9 bug bz1684363 on RHEL 7.

Version-Release number of selected component (if applicable):
[root@hp-dl388g8-02 ovn_ha]# uname -a
Linux hp-dl388g8-02.rhts.eng.pek2.redhat.com 4.18.0-64.el8.x86_64 #1 SMP Wed Jan 23 20:50:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl388g8-02 ovn_ha]# rpm -qa | grep openvswitch
openvswitch2.11-ovn-common-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-30.noarch
openvswitch2.11-ovn-central-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch-selinux-extra-policy-1.0-10.el8fdp.noarch
openvswitch2.11-ovn-host-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Set up a cluster with 3 nodes as ovndb_servers
2. Restart openvswitch on the master node
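
The two steps can be scripted roughly as follows. This is a minimal sketch and not part of the original report: the helper names `has_failed_resource` and `reproduce` and the 60-second polling window are assumptions. Only `systemctl restart openvswitch`, `pcs status`, and the resource name `ovndb_servers` come from the transcript in this report.

```shell
#!/bin/sh
# Sketch only: has_failed_resource, reproduce, and the polling window are
# illustrative assumptions, not taken from the original report.

# Read `pcs status` output on stdin; succeed if ovndb_servers is reported FAILED.
has_failed_resource() {
    grep -q 'ovndb_servers.*FAILED'
}

# Restart openvswitch on the master node, then poll pcs for up to ~60 seconds.
reproduce() {
    systemctl restart openvswitch
    i=0
    while [ "$i" -lt 30 ]; do
        if pcs status | has_failed_resource; then
            echo "ovndb_servers FAILED after restart"
            return 1
        fi
        sleep 2
        i=$((i + 1))
    done
    echo "ovndb_servers still healthy"
    return 0
}
```

Run `reproduce` as root on the current master node. A non-zero return means the failure described in this report was observed.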

[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:40:50 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# systemctl restart openvswitch
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:21 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:25 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers	(ocf::ovn:ovndb-servers):	FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='',
    last-rc-change='Fri Mar  1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:27 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers	(ocf::ovn:ovndb-servers):	FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='',
    last-rc-change='Fri Mar  1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# 



Actual results:
The master node does not come back after openvswitch is restarted.

Expected results:
The master node comes back after openvswitch is restarted.

Additional info:

Comment 1 Numan Siddique 2019-06-24 09:41:18 UTC

*** Bug 1723291 has been marked as a duplicate of this bug. ***

Comment 2 Numan Siddique 2019-06-24 09:43:48 UTC

The main issue is that when you restart openvswitch, the OVS runtime directory, /var/run/openvswitch, is deleted and recreated by the openvswitch systemd script.
Since the OVN ovsdb-servers (and ovn-controller) use the same runtime directory, all of the OVN ovsdb-servers' runtime socket files are deleted as well. After that,
the OVN OCF script can no longer stop or monitor the ovsdb-servers.
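
One way to confirm this state is to check whether the OVN database sockets still exist in the shared runtime directory. The sketch below is illustrative only: the function name `check_sockets` is hypothetical, and the socket file names `ovnnb_db.sock` and `ovnsb_db.sock` are assumed to be the defaults for this OVN version, so adjust them if your deployment differs.

```shell
#!/bin/sh
# Sketch only: check_sockets is a hypothetical helper. The socket names are
# assumed defaults; /var/run/openvswitch is the directory named in this report.
RUNDIR="${RUNDIR:-/var/run/openvswitch}"

# Print the state of each OVN database socket; return 1 if any is missing.
check_sockets() {
    missing=0
    for sock in ovnnb_db.sock ovnsb_db.sock; do
        if [ -e "$RUNDIR/$sock" ]; then
            echo "present: $RUNDIR/$sock"
        else
            echo "missing: $RUNDIR/$sock"
            missing=1
        fi
    done
    return "$missing"
}
```

If `check_sockets` reports the sockets as missing while the ovsdb-server processes are still alive, the OCF script has no way to reach them, which matches the failed demote action shown in the description.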

If you run "ps -aef | grep ovsdb-server" you will see that the old ovsdb-server processes are still running. Killing those processes
manually and then refreshing the pacemaker resource recovers the node.
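
The manual recovery could be sketched as below. This is an assumption-laden illustration, not the fix shipped for this bug: `run`, `recover_ovndb`, and `DRY_RUN` are hypothetical names, the `pkill -f` pattern assumes the OVN database servers have `ovnnb_db` or `ovnsb_db` somewhere on their command line, and `pcs resource cleanup` is used here as one way to refresh the resource state. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: run, recover_ovndb, and DRY_RUN are illustrative names.
# The resource name ovndb_servers is taken from the pcs status output above.
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

recover_ovndb() {
    # Kill the orphaned OVN ovsdb-server processes. The pattern is an
    # assumption; it is meant to skip the OVS ovsdb-server serving conf.db.
    run pkill -f 'ovsdb-server.*ovn[ns]b_db'
    # Clear the failed state so pacemaker re-probes and restarts the resource.
    run pcs resource cleanup ovndb_servers
}
```

Call `recover_ovndb` first with the default `DRY_RUN=1` to review the commands, then set `DRY_RUN=0` and call it again as root on the affected node.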

I think this is an expected and known issue. The proper fix is to use a separate runtime directory for OVN.

Comment 3 Numan Siddique 2019-07-01 14:46:26 UTC

*** Bug 1566412 has been marked as a duplicate of this bug. ***

Comment 7 haidong li 2019-11-08 03:03:49 UTC

This issue is blocked by bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1769202

Comment 8 Jianlin Shi 2019-11-09 06:24:47 UTC

Verified on ovn2.12-2.12.0-7:

[root@dell-per740-12 ovn_ha]# pcs status                                                                                                                                                                    
Cluster name: my_cluster                                                                                                                                                                                    
                                                                                                                                                                                                            
WARNINGS:                                                                                                                                                                                                   
Corosync and pacemaker node names do not match (IPs used in setup?)                                                                                                                                         
                                                                                                                                                                                                            
Stack: corosync                                                                                                                                                                                             
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum                                                                                               
Last updated: Sat Nov  9 01:20:42 2019                                                                                                                                                                      
Last change: Sat Nov  9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com                                                                                                  
                                                                                                                                                                                                            
3 nodes configured                                                                                                                                                                                          
4 resources configured                                                                                                                                                                                      
                                                                                                                                                                                                            
Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]                                                                         
                                                                                                                                                                                                            
Full list of resources:                               
                                                
 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]                                                                                
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]
                                              
Daemon Status:                                                                                                                         
  corosync: active/enabled                            
  pacemaker: active/enabled                     
  pcsd: active/enabled                        
[root@dell-per740-12 ovn_ha]# hostname  
dell-per740-12.rhts.eng.pek2.redhat.com 

[root@dell-per740-12 ovn_ha]# systemctl restart openvswitch

<==== restart openvswitch

[root@dell-per740-12 ovn_ha]# pcs status                                                                                               
Cluster name: my_cluster                              
                                                
WARNINGS:                                     
Corosync and pacemaker node names do not match (IPs used in setup?)
                                        
Stack: corosync                                            
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Sat Nov  9 01:21:07 2019                             
Last change: Sat Nov  9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com
               
3 nodes configured                                                                                                                                                                                         
4 resources configured                
                                                                                                                                                                                                           
Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]
                                      
Full list of resources:                                                                                   

 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com                                                                                                            
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]                                   
                                                                                               
Daemon Status:                                         
  corosync: active/enabled                               
  pacemaker: active/enabled                                                                     
  pcsd: active/enabled                 

<==== pcs is still up
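
The before/after comparison above can also be automated. The following is a rough sketch with hypothetical helper names (`masters_of`, `verify_master_survives`) and an assumed 60-second wait. It matches the "Masters:" label exactly as printed by the pcs versions in this report.

```shell
#!/bin/sh
# Sketch only: masters_of and verify_master_survives are illustrative helpers,
# not part of the original verification.

# Read `pcs status` output on stdin; print the node list from the "Masters:" line.
masters_of() {
    sed -n 's/^ *Masters: \[ \(.*\) ] *$/\1/p'
}

# Record the master, restart openvswitch, wait, and compare.
verify_master_survives() {
    before=$(pcs status | masters_of)
    systemctl restart openvswitch
    sleep 60    # assumed wait; the failure in the description appeared within ~30s
    after=$(pcs status | masters_of)
    if [ -n "$before" ] && [ "$before" = "$after" ]; then
        echo "master unchanged: $after"
        return 0
    fi
    echo "master changed or missing: before='$before' after='$after'"
    return 1
}
```

In the failing case from the description, the "Masters:" line disappears from the output, so `masters_of` prints nothing and the check fails.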
                 
[root@dell-per740-12 ovn_ha]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.12-2.12.0-4.el7fdp.x86_64                                                          
ovn2.12-host-2.12.0-7.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-43.noarch
ovn2.12-central-2.12.0-7.el7fdp.x86_64                           
ovn2.12-2.12.0-7.el7fdp.x86_64        
openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch

Comment 10 errata-xmlrpc 2019-12-11 12:04:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:4208