Bug 1876793 - Raft doesn't work when the cluster is restarted after deleting ovnsb_db.db
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: FDP 20.E
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
Reported: 2020-09-08 07:56 UTC by Jianlin Shi
Modified: 2023-07-13 07:25 UTC (History)

Links
System                 ID      Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  FD-846  0        None      None    None     2021-09-02 12:11:47 UTC

Description Jianlin Shi 2020-09-08 07:56:47 UTC
Description of problem:
Raft doesn't work when the cluster is restarted after ovnsb_db.db has been deleted on every node.

Version-Release number of selected component (if applicable):
ovn20.06.2-3

How reproducible:
Always

Steps to Reproduce:
1. start cluster

# master
ctl_cmd="/usr/share/ovn/scripts/ovn-ctl"
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18
$ctl_cmd --db-nb-addr=$ip_s --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_s --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_s --db-sb-cluster-local-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd

# slave1
ctl_cmd=/usr/share/ovn/scripts/ovn-ctl
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18

$ctl_cmd --db-nb-addr=$ip_c1 --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_c1 --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_c1 --db-sb-cluster-local-addr=$ip_c1 \
                        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd

# slave2
ctl_cmd=/usr/share/ovn/scripts/ovn-ctl
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18

$ctl_cmd --db-nb-addr=$ip_c2 --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_c2 --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_c2 --db-sb-cluster-local-addr=$ip_c2 \
                        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 start_northd

2. delete ovnsb_db.db and restart cluster

ctl_cmd="/usr/share/ovn/scripts/ovn-ctl"
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18

rm /etc/ovn/ovnsb_db.db -f
ssh -q $ip_c1 rm /etc/ovn/ovnsb_db.db -f
ssh -q $ip_c2 rm /etc/ovn/ovnsb_db.db -f



ssh -q $ip_c1 $ctl_cmd --db-nb-addr=$ip_c1 --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_c1 --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_c1 --db-sb-cluster-local-addr=$ip_c1 \
                        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 restart_northd


ssh -q $ip_c2 $ctl_cmd --db-nb-addr=$ip_c2 --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_c2 --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_c2 --db-sb-cluster-local-addr=$ip_c2 \
                        --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 restart_northd

$ctl_cmd --db-nb-addr=$ip_s --db-nb-create-insecure-remote=yes \
                        --db-sb-addr=$ip_s --db-sb-create-insecure-remote=yes \
                        --db-nb-cluster-local-addr=$ip_s --db-sb-cluster-local-addr=$ip_s \
                        --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641 \
                        --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642 restart_northd

3. check sb status with: ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

Actual results:
[root@wsfd-advnetlab17 bz1829109]# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
8caf
Name: OVN_Southbound
Cluster ID: 96cc (96cc6df5-ff59-4f15-9cbf-52446c6148bd)
Server ID: 8caf (8cafc269-8e76-47ca-9bd3-c0dc341b6b1a)
Address: tcp:1.1.1.17:6644
Status: cluster member
Role: follower
Term: 226
Leader: unknown
Vote: 41b1

Election timer: 1000
Log: [2, 10]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->e87c <-41b1
Servers:
    8caf (8caf at tcp:1.1.1.17:6644) (self)
    e87c (e87c at tcp:1.1.1.18:6644)
    dcf6 (dcf6 at tcp:1.1.1.17:6644)
    902e (902e at tcp:1.1.1.16:6644)
    41b1 (41b1 at tcp:1.1.1.18:6644)

<=== there are two server entries each for 1.1.1.17 and 1.1.1.18
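
To make the duplicate entries easier to spot, the Servers section of the cluster/status output can be filtered for addresses that appear more than once. The following is only a sketch: dup_server_addrs is a hypothetical helper name, not an OVN or OVS command, and it assumes the "<id> (<id> at <address>)" line layout shown in the output above.

```shell
#!/bin/bash
# Sketch only: dup_server_addrs is a hypothetical helper, not part of OVN/OVS.
# It reads `cluster/status` output on stdin and prints every address that is
# listed more than once under "Servers:", followed by how often it appears.
dup_server_addrs() {
    awk '
        /^Servers:/ { in_servers = 1; next }
        in_servers && match($0, /at [^)]+\)/) {
            # The match looks like "at tcp:1.1.1.17:6644)"; drop the leading
            # "at " and the trailing ")" to keep only the address.
            addr = substr($0, RSTART + 3, RLENGTH - 4)
            count[addr]++
        }
        END {
            for (a in count)
                if (count[a] > 1)
                    print a, count[a]
        }
    ' | sort
}
```

It could be fed directly from the command in step 3, for example `ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | dup_server_addrs`. An empty result would mean every address is listed once.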

Expected results:
raft works well

Additional info:

If the master is restarted first, the issue doesn't occur.
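
For reference, the restart from step 2 with the master moved to the front could look like the sketch below. It reuses the addresses and ovn-ctl options from the steps above; northd_args and restart_master_first are hypothetical helper names introduced here to avoid repeating the option list. Only the ordering comes from the observation above, and this report does not say whether it is a supported workaround.

```shell
#!/bin/bash
ctl_cmd=/usr/share/ovn/scripts/ovn-ctl
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18

# Hypothetical helper: print the ovn-ctl options used in this report for the
# node at address $1. Any node other than the master ($ip_s) also gets the
# --db-*-cluster-remote-addr options pointing at the master.
northd_args() {
    local ip=$1
    local args="--db-nb-addr=$ip --db-nb-create-insecure-remote=yes"
    args+=" --db-sb-addr=$ip --db-sb-create-insecure-remote=yes"
    args+=" --db-nb-cluster-local-addr=$ip --db-sb-cluster-local-addr=$ip"
    if [ "$ip" != "$ip_s" ]; then
        args+=" --db-nb-cluster-remote-addr=$ip_s --db-sb-cluster-remote-addr=$ip_s"
    fi
    args+=" --ovn-northd-nb-db=tcp:$ip_s:6641,tcp:$ip_c1:6641,tcp:$ip_c2:6641"
    args+=" --ovn-northd-sb-db=tcp:$ip_s:6642,tcp:$ip_c1:6642,tcp:$ip_c2:6642"
    echo "$args"
}

# Hypothetical helper: the same three restarts as in step 2, master first.
# The unquoted $(...) is intentional so the options split into separate words.
restart_master_first() {
    $ctl_cmd $(northd_args "$ip_s") restart_northd
    ssh -q "$ip_c1" $ctl_cmd $(northd_args "$ip_c1") restart_northd
    ssh -q "$ip_c2" $ctl_cmd $(northd_args "$ip_c2") restart_northd
}
```

Calling restart_master_first on the master after the rm commands from step 2 would run the restarts in that order; nothing is executed until the function is called.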

Comment 1 Jianlin Shi 2020-09-08 07:58:29 UTC
[root@wsfd-advnetlab16 bz1829109]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-20.06.2-3.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn_ha-1.0-57.noarch
ovn2.13-central-20.06.2-3.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-host-20.06.2-3.el8fdp.x86_64
openvswitch2.13-2.13.0-58.el8fdp.x86_64
[root@wsfd-advnetlab16 bz1829109]# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
497e
Name: OVN_Southbound
Cluster ID: e1fe (e1fe58fa-6f10-4b4b-8357-f410c189b7bc)
Server ID: 497e (497e68b8-b98b-4fc8-a332-55909876b83b)
Address: tcp:1.1.1.16:6644
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self

Election timer: 1000
Log: [2, 5]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-0000 <-0000
Servers:
    497e (497e at tcp:1.1.1.16:6644) (self) next_index=2 match_index=4

<== only one server is listed on the master
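
The two status outputs in this report also show different Cluster ID values: 96cc on 1.1.1.17 and e1fe on 1.1.1.16. A quick way to compare this field across nodes is sketched below; cluster_id and same_cluster are hypothetical helper names, and the parsing assumes the "Cluster ID: <short> (<uuid>)" line format seen in the outputs above.

```shell
#!/bin/bash
# Sketch only: both helpers are hypothetical, not part of OVN/OVS.

# cluster_id: read `cluster/status` output on stdin and print the full
# Cluster ID (the UUID inside the parentheses).
cluster_id() {
    sed -n 's/^Cluster ID: [0-9a-f]* (\([0-9a-f-]*\)).*$/\1/p'
}

# same_cluster: succeed only if all arguments are the same non-empty ID.
same_cluster() {
    local first=$1 id
    [ -n "$first" ] || return 1
    for id in "$@"; do
        [ "$id" = "$first" ] || return 1
    done
}

# Possible use, run locally on the master and over ssh on one follower:
#   status_cmd="ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound"
#   id_s=$($status_cmd | cluster_id)
#   id_c1=$(ssh -q 1.1.1.17 $status_cmd | cluster_id)
#   same_cluster "$id_s" "$id_c1" || echo "nodes report different cluster IDs"
```

With the outputs quoted in this report, the comparison between 1.1.1.16 and 1.1.1.17 would fail, since the two IDs differ.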

Comment 2 Jianlin Shi 2020-09-08 08:02:14 UTC
Even after running ovn-appctl -t ovn-northd sb-cluster-state-reset on every node, Raft still doesn't work.

