Bug 1684419 - [ovn_cluster][RHEL 7] master node can't be up after restart openvswitch
Summary: [ovn_cluster][RHEL 7] master node can't be up after restart openvswitch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 19.G
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Numan Siddique
QA Contact: Jianlin Shi
URL:
Whiteboard:
Duplicates: 1566412 1723291 (view as bug list)
Depends On:
Blocks: 1787971
 
Reported: 2019-03-01 08:49 UTC by haidong li
Modified: 2020-01-14 21:17 UTC
CC: 12 users

Fixed In Version: ovn2.12-2.12.0-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1787971 (view as bug list)
Environment:
Last Closed: 2019-12-11 12:04:46 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2019:4208 - 2019-12-11 12:04:53 UTC

Description haidong li 2019-03-01 08:49:45 UTC
Description of problem:
This bug is similar to the OVS 2.9 bug bz1684363 on RHEL 7.

Version-Release number of selected component (if applicable):
[root@hp-dl388g8-02 ovn_ha]# uname -a
Linux hp-dl388g8-02.rhts.eng.pek2.redhat.com 4.18.0-64.el8.x86_64 #1 SMP Wed Jan 23 20:50:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl388g8-02 ovn_ha]# rpm -qa | grep openvswitch
openvswitch2.11-ovn-common-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-30.noarch
openvswitch2.11-ovn-central-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch-selinux-extra-policy-1.0-10.el8fdp.noarch
openvswitch2.11-ovn-host-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64

How reproducible:
every time

Steps to Reproduce:
1. Set up a cluster with 3 nodes running ovndb_servers (a minimal setup sketch follows the steps).
2. Restart openvswitch on the master node.
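
For reference, a minimal sketch of how such a cluster is typically created with pcs on the pacemaker 2.x stack used here. The resource names and the VIP 70.0.0.50 match this environment; the exact agent parameters and constraint syntax are assumptions and may need adjusting for your versions:

# virtual IP that should follow the OVN DB master
pcs resource create ip-70.0.0.50 ocf:heartbeat:IPaddr2 ip=70.0.0.50 op monitor interval=30s
# promotable ovsdb-server resource for the OVN NB/SB databases, with northd managed by the agent
pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    manage_northd=yes master_ip=70.0.0.50 nb_master_port=6641 sb_master_port=6642 \
    promotable
# keep the VIP on whichever node holds the master copy
pcs constraint order promote ovndb_servers-clone then ip-70.0.0.50
pcs constraint colocation add ip-70.0.0.50 with master ovndb_servers-clone score=INFINITY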

[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:40:50 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# systemctl restart openvswitch
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:21 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     Masters: [ 70.0.0.2 ]
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:25 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers	(ocf::ovn:ovndb-servers):	FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='',
    last-rc-change='Fri Mar  1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: 70.0.0.2 (version 2.0.1-3.el8-0eb7991564) - partition with quorum
Last updated: Fri Mar  1 03:41:27 2019
Last change: Fri Mar  1 03:33:33 2019 by root via crm_attribute on 70.0.0.2

3 nodes configured
4 resources configured

Online: [ 70.0.0.2 70.0.0.12 70.0.0.20 ]

Full list of resources:

 ip-70.0.0.50	(ocf::heartbeat:IPaddr2):	Started 70.0.0.2
 Clone Set: ovndb_servers-clone [ovndb_servers] (promotable)
     ovndb_servers	(ocf::ovn:ovndb-servers):	FAILED 70.0.0.2
     Slaves: [ 70.0.0.12 70.0.0.20 ]

Failed Resource Actions:
* ovndb_servers_demote_0 on 70.0.0.2 'not running' (7): call=17, status=complete, exitreason='',
    last-rc-change='Fri Mar  1 03:41:22 2019', queued=0ms, exec=97ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@hp-dl388g8-02 ovn_ha]# 



Actual results:
The master node does not come back after restarting openvswitch; the ovndb_servers resource stays FAILED on it.

Expected results:
The node should come back (ovndb_servers running and promoted again) after restarting openvswitch.

Additional info:

Comment 1 Numan Siddique 2019-06-24 09:41:18 UTC
*** Bug 1723291 has been marked as a duplicate of this bug. ***

Comment 2 Numan Siddique 2019-06-24 09:43:48 UTC
The main issue is that when you restart openvswitch, the OVS runtime directory - /var/run/openvswitch - is deleted and recreated by the openvswitch systemd script. Since the OVN ovsdb-servers (and ovn-controller) use the same runtime directory, all of the OVN ovsdb-servers' runtime socket files are deleted as well. After that, the OVN OCF script can no longer stop the ovsdb-servers or monitor their status.
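
A hedged illustration of what that looks like on the master node (the socket/ctl file names are the usual OVN ones; treat this as an outline, not output captured from this system):

ls /var/run/openvswitch/
# before the restart: ovnnb_db.sock, ovnsb_db.sock, ovnnb_db.ctl, ovnsb_db.ctl, ... next to the OVS sockets
systemctl restart openvswitch
ls /var/run/openvswitch/
# after the restart: only freshly created OVS files remain; the OVN socket/ctl files are gone,
# so the OCF agent's monitor/stop actions, which talk to those sockets, can only fail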

If you run "ps -ef | grep ovsdb-server" you will see that the old ovsdb-server processes are still running. Killing those processes
manually and then refreshing the pacemaker resource recovers it, roughly as sketched below.
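
A rough sketch of that manual recovery (standard procps and pcs commands; the resource name matches this cluster, but verify the PIDs before killing anything):

# find the orphaned OVN database servers still holding the deleted sockets
ps -ef | grep '[o]vsdb-server'
# kill the leftover NB/SB ovsdb-server processes
pkill -f 'ovsdb-server .*ovnnb_db'
pkill -f 'ovsdb-server .*ovnsb_db'
# clear the failed actions so pacemaker restarts and re-promotes the resource
pcs resource cleanup ovndb_servers
pcs status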

I think this is an expected and known issue. The proper fix is to use a separate runtime directory for OVN.
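
For context, that is what the later ovn2.12 packaging (see "Fixed In Version" above) does: OVN gets its own runtime directory, so restarting OVS no longer removes the OVN sockets. Roughly, on a fixed system (paths assumed from the split packaging; check the installed units and scriptlets):

ls /var/run/openvswitch/        # OVS only: db.sock, ovs-vswitchd.*, ovsdb-server.* ...
ls /var/run/ovn/                # OVN only: ovnnb_db.sock, ovnsb_db.sock, ovn-northd.* ...
systemctl restart openvswitch   # recreates /var/run/openvswitch; /var/run/ovn is untouched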

Comment 3 Numan Siddique 2019-07-01 14:46:26 UTC
*** Bug 1566412 has been marked as a duplicate of this bug. ***

Comment 7 haidong li 2019-11-08 03:03:49 UTC
This issue is blocked by bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1769202

Comment 8 Jianlin Shi 2019-11-09 06:24:47 UTC
Verified on ovn2.12.0-7:

[root@dell-per740-12 ovn_ha]# pcs status                                                                                                                                                                    
Cluster name: my_cluster                                                                                                                                                                                    
                                                                                                                                                                                                            
WARNINGS:                                                                                                                                                                                                   
Corosync and pacemaker node names do not match (IPs used in setup?)                                                                                                                                         
                                                                                                                                                                                                            
Stack: corosync                                                                                                                                                                                             
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum                                                                                               
Last updated: Sat Nov  9 01:20:42 2019                                                                                                                                                                      
Last change: Sat Nov  9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com                                                                                                  
                                                                                                                                                                                                            
3 nodes configured                                                                                                                                                                                          
4 resources configured                                                                                                                                                                                      
                                                                                                                                                                                                            
Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]                                                                         
                                                                                                                                                                                                            
Full list of resources:                               
                                                
 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]                                                                                
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]
                                              
Daemon Status:                                                                                                                         
  corosync: active/enabled                            
  pacemaker: active/enabled                     
  pcsd: active/enabled                        
[root@dell-per740-12 ovn_ha]# hostname  
dell-per740-12.rhts.eng.pek2.redhat.com 

[root@dell-per740-12 ovn_ha]# systemctl restart openvswitch

<==== restart openvswitch

[root@dell-per740-12 ovn_ha]# pcs status                                                                                               
Cluster name: my_cluster                              
                                                
WARNINGS:                                     
Corosync and pacemaker node names do not match (IPs used in setup?)
                                        
Stack: corosync                                            
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Sat Nov  9 01:21:07 2019                             
Last change: Sat Nov  9 01:16:11 2019 by root via crm_attribute on dell-per740-12.rhts.eng.pek2.redhat.com
               
3 nodes configured                                                                                                                                                                                         
4 resources configured                
                                                                                                                                                                                                           
Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]
                                      
Full list of resources:                                                                                   

 ip-70.11.0.50  (ocf::heartbeat:IPaddr2):       Started dell-per740-12.rhts.eng.pek2.redhat.com                                                                                                            
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ dell-per740-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ hp-dl380pg8-11.rhts.eng.pek2.redhat.com ibm-x3650m5-03.rhts.eng.pek2.redhat.com ]                                   
                                                                                               
Daemon Status:                                         
  corosync: active/enabled                               
  pacemaker: active/enabled                                                                     
  pcsd: active/enabled                 

<==== pcs is still up
                 
[root@dell-per740-12 ovn_ha]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.12-2.12.0-4.el7fdp.x86_64                                                          
ovn2.12-host-2.12.0-7.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-43.noarch
ovn2.12-central-2.12.0-7.el7fdp.x86_64                           
ovn2.12-2.12.0-7.el7fdp.x86_64        
openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch

Comment 10 errata-xmlrpc 2019-12-11 12:04:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:4208

