Bug 1788906

Summary: ovsdb-server running in standby mode reconnects to active because of no probe interval response
Product: Red Hat Enterprise Linux Fast Datapath
Component: openvswitch2.12
Version: RHEL 7.7
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Numan Siddique <nusiddiq>
Assignee: Numan Siddique <nusiddiq>
QA Contact: Jianlin Shi <jishi>
CC: ctrautma, jhsiao, jishi, kfida, ovs-qe, ralongi, tredaelli
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openvswitch2.11-2.11.0-17.el7fdn
Clone Of: 1788800
Last Closed: 2020-03-10 09:36:07 UTC

Attachments:

Description	Flags
ovnnb_db.db file	none

Description Numan Siddique 2020-01-08 11:04:35 UTC

+++ This bug was initially created as a clone of Bug #1788800 +++

Description of problem:

If the active ovsdb-server doesn't respond to the echo request (inactivity probe) from a standby ovsdb-server (in an active/passive deployment) within 5 seconds, the standby ovsdb-server disconnects. If the active ovsdb-server is heavily loaded, this can result in a continuous loop of connects and disconnects.
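
The timeout behaviour described above can be illustrated with a toy shell model. This is only a sketch: the function name `count_probe_timeouts` and the reply delays are hypothetical, and it does not reproduce ovsdb-server's actual reconnect logic. It just counts how many echo replies would arrive after the 5-second limit.

```shell
# Toy model only; this is NOT ovsdb-server's actual reconnect logic.
# count_probe_timeouts INTERVAL DELAY...
#   INTERVAL: probe timeout in whole seconds (5 in this report)
#   DELAY...: hypothetical echo-reply delays in whole seconds
# Prints how many replies would arrive too late and trigger a disconnect.
count_probe_timeouts() {
    interval=$1
    shift
    drops=0
    for delay in "$@"; do
        if [ "$delay" -gt "$interval" ]; then
            drops=$((drops + 1))
        fi
    done
    echo "$drops"
}

# Example: an active server that takes 2, 7, 1 and 12 seconds to reply.
count_probe_timeouts 5 2 7 1 12   # prints 2
```

With a loaded active server, every late reply in this model corresponds to one disconnect/reconnect cycle on the standby.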


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Jianlin Shi 2020-02-05 09:33:06 UTC

Reproduced with the following steps.

Install pcs on two systems:
yum -y install pcs pacemaker fence-agents-all

setenforce 0
systemctl start openvswitch

Then set up pcs with the following script:

setenforce 0
systemctl start openvswitch
ip_c1=20.0.30.26
ip_c2=20.0.30.25
ip_v=20.0.30.100
(sleep 2;echo "hacluster"; sleep 2; echo "redhat" ) |pcs cluster auth  $ip_c1 $ip_c2
sleep 5
pcs cluster setup --force --start --name my_cluster $ip_c1 $ip_c2
pcs cluster enable --all

pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs cluster cib tmp-cib.xml
sleep 10
cp tmp-cib.xml tmp-cib.deltasrc
pcs resource delete ip-$ip_v
pcs resource delete ovndb_servers-master
sleep 5
pcs status

pcs -f tmp-cib.xml resource create ip-$ip_v ocf:heartbeat:IPaddr2 ip=$ip_v op monitor interval=30s
sleep 5
pcs -f tmp-cib.xml resource create ovndb_servers  ocf:ovn:ovndb-servers manage_northd=yes master_ip=$ip_v nb_master_port=6641 sb_master_port=6642 master
sleep 5
pcs -f tmp-cib.xml resource meta ovndb_servers-master notify=true
pcs -f tmp-cib.xml constraint order start ip-$ip_v then promote ovndb_servers-master
pcs -f tmp-cib.xml constraint colocation add ip-$ip_v with master ovndb_servers-master

#pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c2=1000
#pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c2=1000
#pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c1=500
#pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c1=500

pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.deltasrc

Then copy the attached ovnnb_db.db to /etc/ovn.
Then restart the resource with: pcs resource restart ovndb_servers

Reproduced on ovs2.12.0-10:

[root@hp-dl380pg8-12 ovs2.12.0-10]# rpm -ivh *
Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch2.12-2.12.0-10.el7fdp ################################# [100%]
[root@hp-dl380pg8-12 ovn2.12.0-26]# rpm -ivh *
Preparing...                          ################################# [100%]
Updating / installing...
   1:ovn2.12-2.12.0-26.el7fdp         ################################# [ 33%]
Unit ovn-northd.service could not be found.
   2:ovn2.12-central-2.12.0-26.el7fdp ################################# [ 67%]
Unit ovn-controller.service could not be found.
   3:ovn2.12-host-2.12.0-26.el7fdp    ################################# [100%]

[root@hp-dl380pg8-12 bz1788800]# pcs status
Cluster name: my_cluster

WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)

Stack: corosync
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Wed Feb  5 04:05:39 2020
Last change: Wed Feb  5 04:05:02 2020 by root via crm_resource on hp-dl380pg8-12.rhts.eng.pek2.redhat.com

2 nodes configured
3 resources configured

Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-12.rhts.eng.pek2.redhat.com ]

Full list of resources:

 ip-20.0.30.100 (ocf::heartbeat:IPaddr2):       Started hp-dl380pg8-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ hp-dl380pg8-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ dell-per740-12.rhts.eng.pek2.redhat.com ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled

top output on the master (after about 5 minutes):

Tasks: 334 total,   3 running, 331 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.6 us,  0.6 sy,  0.0 ni, 95.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32736216 total, 25530636 free,  4704328 used,  2501252 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 27531560 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32726 root      20   0 3623956   3.4g   1780 R 100.0 10.9   2:20.24 ovsdb-server

Log in ovsdb-server-sb.log on the slave:

2020-02-05T09:07:39.691Z|00048|reconnect|ERR|tcp:20.0.30.100:6642: no response to inactivity probe after 5 seconds, disconnecting
2020-02-05T09:07:39.691Z|00049|reconnect|INFO|tcp:20.0.30.100:6642: connection dropped
2020-02-05T09:07:40.693Z|00050|reconnect|INFO|tcp:20.0.30.100:6642: connecting...
2020-02-05T09:07:40.694Z|00051|reconnect|INFO|tcp:20.0.30.100:6642: connected
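
To gauge how often the standby drops its connection, the ERR lines shown above can be counted with a small shell function. This is a minimal sketch: the name `count_probe_drops` is hypothetical, and the log path depends on the deployment.

```shell
# Hypothetical helper: count inactivity-probe disconnects in an ovsdb-server log.
# Usage: count_probe_drops /path/to/ovsdb-server-sb.log
count_probe_drops() {
    # In a basic regular expression "|" is a literal character, so this matches
    # the "reconnect|ERR|" fields of the log line. grep -c prints the number of
    # matching lines and exits 1 when that number is 0, so "|| true" keeps the
    # function usable under "set -e".
    grep -c 'reconnect|ERR|.*no response to inactivity probe' "$1" || true
}
```

A count that keeps growing while the active server is busy points to the connect/disconnect loop described in this report; a count of 0 matches the verified behaviour.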

Verified on ovs2.12.0-21:

[root@hp-dl380pg8-12 bz1788800]# pcs status
Cluster name: my_cluster

WARNINGS:
Corosync and pacemaker node names do not match (IPs used in setup?)

Stack: corosync
Current DC: dell-per740-12.rhts.eng.pek2.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Wed Feb  5 04:31:59 2020
Last change: Wed Feb  5 04:12:43 2020 by root via crm_resource on dell-per740-12.rhts.eng.pek2.redhat.com

2 nodes configured
3 resources configured

Online: [ dell-per740-12.rhts.eng.pek2.redhat.com hp-dl380pg8-12.rhts.eng.pek2.redhat.com ]

Full list of resources:

 ip-20.0.30.100 (ocf::heartbeat:IPaddr2):       Started hp-dl380pg8-12.rhts.eng.pek2.redhat.com
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ hp-dl380pg8-12.rhts.eng.pek2.redhat.com ]
     Slaves: [ dell-per740-12.rhts.eng.pek2.redhat.com ]

Failed Resource Actions:
* ovndb_servers_monitor_10000 on hp-dl380pg8-12.rhts.eng.pek2.redhat.com 'unknown error' (1): call=34, status=Timed Out, exitreason='',
    last-rc-change='Wed Feb  5 04:13:35 2020', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/disabled
[root@hp-dl380pg8-12 bz1788800]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-7.noarch
ovn2.12-central-2.12.0-26.el7fdp.x86_64
ovn2.12-host-2.12.0-26.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch
kernel-kernel-networking-openvswitch-ovn-basic-1.0-18.noarch
openvswitch2.12-2.12.0-21.el7fdp.x86_64
ovn2.12-2.12.0-26.el7fdp.x86_64

top - 04:32:19 up  7:38,  2 users,  load average: 0.01, 0.04, 0.14
Tasks: 333 total,   1 running, 332 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32736216 total, 25063884 free,  5166016 used,  2506316 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 27067668 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                 
34444 root      rt   0  192140  95920  70836 S   1.0  0.3   0:13.14 corosync                                                                                                                                
    9 root      20   0       0      0      0 S   0.3  0.0   0:22.18 rcu_sched                                                                                                                               
43201 root      20   0  162292   2548   1580 R   0.3  0.0   0:00.03 top                                                                                                                                     
    1 root      20   0  194168   7320   4216 S   0.0  0.0   0:35.60 sys

No reconnect messages appear in ovsdb-server-sb.log on the slave.

Comment 3 Jianlin Shi 2020-02-05 09:35:55 UTC

Created attachment 1657858 [details]
ovnnb_db.db file

Comment 5 errata-xmlrpc 2020-03-10 09:36:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0745