Bug 1836308

Summary: [OVN SCALE] Disable ovsdb raft inactivity probe
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Dumitru Ceara <dceara>
Component: openvswitch2.13
Assignee: Dumitru Ceara <dceara>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: unspecified
Version: RHEL 8.0
CC: ctrautma, dceara, jhsiao, jishi, kfida, mmichels, qding, ralongi, trozet
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch2.13-2.13.0-28.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1822290
Environment:
Last Closed: 2020-07-15 12:58:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1822290
Bug Blocks:

Description Dumitru Ceara 2020-05-15 15:21:33 UTC
+++ This bug was initially created as a clone of Bug #1822290 +++

In large-scale tests we see ovn-controller disconnect from ovsdb because of the inactivity probe timer. There is an upstream effort to disable this probe:

https://patchwork.ozlabs.org/patch/1264446/

This BZ tracks landing that change downstream.

--- Additional comment from OvS team on 2020-05-15 00:57:22 UTC ---

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290)
  [b12acf45a6872dda85642cbc73dd86eb529be17e]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729)
  [bb552cffb89104c2bb19b8aff749b8b825a6db13]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-22
- raft: Fix the problem of stuck in candidate role forever. (#1828639)
  [c5937276691bb90f99fad1871b5e3ca4ac9391e7]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-21
- raft: Fix next_index in install_snapshot reply handling. (#1828639)
  [09ac3c327ec678f36cd9df451b7846acdf734c0f]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-20
- raft: Avoid busy loop during leader election. (#1828639)
  [19683b041e19a49e275a4b42f5bb5b0528de898a]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-19
- raft: Fix raft_is_connected() when there is no leader yet. (#1828639)
  [2dae730162e5e1b084ac0d1fc339d2f09bd8cddb]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-18
- ovsdb-server: Don't disconnect clients after raft install_snapshot. (#1828639)
  [da9680c6095df8d6c477aa10e29baa8f00dc2e25]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-17
- raft-rpc: Fix message format. (#1828639)
  [e9bb63d6190925db63b4cad83e57a945c4ac0629]

Comment 1 OvS team 2020-05-15 21:02:48 UTC
* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-28
- raft: Disable RAFT jsonrpc inactivity probe. (#1836308)
  [3d9b529afb098531190d57d6f35d1622bb4093cd]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-27
- raft: Fix leak of the incomplete command. (#1836307)
  [5c38ccd52fb3925e82eda20f1897ec02abb390d9]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-26
- raft: Fix the problem of stuck in candidate role forever. (#1836305)
  [9c76350e271546eedfeb18720975e35b4e36e1f1]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-25
- raft: Fix next_index in install_snapshot reply handling. (#1836305)
  [cc3d02699203e2fe9d9fd384d09e268ba614828d]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Avoid busy loop during leader election. (#1836305)
  [053b78c8d60ffb4d212fd7894f91be52027f291f]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix raft_is_connected() when there is no leader yet. (#1836305)
  [e732012d7be335650398ff03c2431c64b2c4aaba]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-22
- ovsdb-server: Don't disconnect clients after raft install_snapshot. (#1836305)
  [8ff30dfee6cb075e36ed38b77695ff03321ce12b]

* Fri May 15 2020 Dumitru Ceara <dceara> - 2.13.0-21
- raft-rpc: Fix message format. (#1836305)
  [914d885061c9f7e7e6e5f921065301e08837e122]

Comment 4 Jianlin Shi 2020-06-10 01:32:06 UTC
On openvswitch2.13-2.13.0-25.el8fdp, the ovsdb cluster leader sends echo requests to the other nodes:

[root@dell-per740-12 bz1836308]# rpm -qa | grep -E "openvswitch|ovn"                                  
ovn2.13-host-2.13.0-33.el8fdp.x86_64                                                                  
ovn2.13-central-2.13.0-33.el8fdp.x86_64                                                               
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch                                                 
ovn2.13-2.13.0-33.el8fdp.x86_64                                                                       
openvswitch2.13-2.13.0-25.el8fdp.x86_64

Set up the raft cluster:

leader:

ip_s=20.0.30.25                                                                                       
ip_c1=20.0.30.42                                                                                      
ip_c2=20.0.30.13                                                                                      
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$ip_s --db-nb-create-insecure-remote=yes \                
        --db-sb-addr=$ip_s --db-sb-create-insecure-remote=yes \                                       
        --db-nb-cluster-local-addr=$ip_s --db-sb-cluster-local-addr=$ip_s \                           
        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \                           
        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd

non-leader nodes (the first uses $ip_c1, the second $ip_c2):
ip_s=20.0.30.25
ip_c1=20.0.30.42
ip_c2=20.0.30.13                                                                                      

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$ip_c1 --db-nb-create-insecure-remote=yes \               
        --db-sb-addr=$ip_c1 --db-sb-create-insecure-remote=yes \
        --db-nb-cluster-local-addr=$ip_c1 --db-sb-cluster-local-addr=$ip_c1 \
        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \
        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd

ip_s=20.0.30.25
ip_c1=20.0.30.42                                                                                      
ip_c2=20.0.30.13

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$ip_c2 --db-nb-create-insecure-remote=yes \
        --db-sb-addr=$ip_c2 --db-sb-create-insecure-remote=yes \
        --db-nb-cluster-local-addr=$ip_c2 --db-sb-cluster-local-addr=$ip_c2 \
        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \                         
        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd 

Get the ports the leader uses for cluster (raft) communication:

[root@dell-per740-12 bz1836308]# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | grep -i address
Address: tcp:20.0.30.25:6643 
[root@dell-per740-12 bz1836308]# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep -i address
Address: tcp:20.0.30.25:6644
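
The port can also be pulled out of the "Address:" line with a small filter instead of being read by eye. The following is only a sketch: the function name raft_port is hypothetical, and it assumes the line has the form "Address: <proto>:<ip>:<port>" as in the output above.

```shell
# Hypothetical helper: read cluster/status output on stdin and print
# the port from the "Address: <proto>:<ip>:<port>" line.
# Assumes the line format shown in the transcript above.
raft_port() {
    sed -n 's/^Address: *[a-z]*:.*:\([0-9][0-9]*\) *$/\1/p'
}
```

Possible usage, with the same command as above: ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | raft_port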

Capture packets on ports 6643 and 6644 for 10 seconds each:

[root@dell-per740-12 bz1836308]# timeout 10 tcpdump  -i ens4f0 port 6643 -w 25-6643.pcap
[root@dell-per740-12 bz1836308]# timeout 10 tcpdump  -i ens4f0 port 6644 -w 25-6644.pcap 


[root@dell-per740-12 bz1836308]# tcpdump  -r 25-6644.pcap -nn -A | grep echo
reading from file 25-6644.pcap, link-type EN10MB (Ethernet)
.UR.....{"id":"echo","method":"echo","params":[]}
.....U?.{"id":"echo","method":"echo","params":[]}
.UR.....{"id":"echo","result":[],"error":null}
.....UR.{"id":"echo","result":[],"error":null}                                                        
..F..a..{"id":"echo","method":"echo","params":[]}              
.a.q..F.{"id":"echo","method":"echo","params":[]}                          
.a.q..F.{"id":"echo","result":[],"error":null}                                                        
..F..a.q{"id":"echo","result":[],"error":null}                                        
.....UR.{"id":"echo","method":"echo","params":[]}                                                     
.Uf.....{"id":"echo","method":"echo","params":[]}                                                     
.Uf.....{"id":"echo","result":[],"error":null}                                                                                                                                                             
.....Uf.{"id":"echo","result":[],"error":null}

[root@dell-per740-12 bz1836308]# tcpdump  -r 25-6643.pcap -nn -A | grep echo                          
reading from file 25-6643.pcap, link-type EN10MB (Ethernet)                                           
.bH.....{"id":"echo","method":"echo","params":[]}                                                     
...G.b5]{"id":"echo","method":"echo","params":[]}                                                     
...G.bH.{"id":"echo","result":[],"error":null}                                                        
.bH....G{"id":"echo","result":[],"error":null}                                                        
.U.4..U.{"id":"echo","method":"echo","params":[]}                                                     
..iY.U.4{"id":"echo","result":[],"error":null}                                                        
.b\m...p{"id":"echo","method":"echo","params":[]}                                                     
.....b\m{"id":"echo","method":"echo","params":[]}                                                     
.....b\m{"id":"echo","result":[],"error":null}                                                        
.b\n....{"id":"echo","result":[],"error":null}

<=== echo packets were sent
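
To compare captures more easily, the echo requests and replies can be counted separately rather than just grepped. This is a minimal sketch, not part of the original test: the function name count_echo is hypothetical, and it assumes the JSON-RPC payloads appear on single lines in the "tcpdump -A" output, as they do above.

```shell
# Hypothetical helper: read "tcpdump -A" text on stdin and count
# JSON-RPC echo requests and echo replies separately.
# Assumes each payload sits on one line, as in the captures above.
count_echo() {
    awk '
        /"method":"echo"/        { req++ }   # echo request
        /"id":"echo","result"/   { rep++ }   # echo reply
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

Possible usage: tcpdump -r 25-6644.pcap -nn -A | count_echo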


Verified on openvswitch2.13.0-38:

[root@dell-per740-12 bz1836308]# tcpdump  -r 38-6644.pcap -A -nn | grep echo                          
reading from file 38-6644.pcap, link-type EN10MB (Ethernet)                                           
[root@dell-per740-12 bz1836308]# tcpdump  -r 38-6643.pcap -A -nn | grep echo
reading from file 38-6643.pcap, link-type EN10MB (Ethernet) 

<=== no echo packets sent
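
For repeated runs, the "no echo" check could be turned into a pass/fail step. The sketch below assumes the same "tcpdump -A" text format as above; the function name no_echo_seen is hypothetical.

```shell
# Hypothetical helper: read "tcpdump -A" text on stdin and succeed
# (exit status 0) only if no JSON-RPC echo request or reply is present.
no_echo_seen() {
    if grep -q '"id":"echo"'; then
        return 1    # at least one echo message was captured
    fi
    return 0
}
```

Possible usage: tcpdump -r 38-6643.pcap -A -nn | no_echo_seen && echo "PASS: no echo packets"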

Comment 6 errata-xmlrpc 2020-07-15 12:58:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2948