Bug 1963948

Summary: [RHEL7] [RFE] Transfer RAFT leadership during snapshot writing
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Ilya Maximets <i.maximets>
Component: ovsdb2.13 Assignee: Ilya Maximets <i.maximets>
Status: CLOSED ERRATA QA Contact: Zhiqiang Fang <zfang>
Severity: high Docs Contact:
Priority: high    
Version: RHEL 8.0 CC: ctrautma, jhsiao, jishi, kfida, ovs-qe, ovs-team, ralongi, tredaelli, trozet
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openvswitch2.13-2.13.0-94.el7fdp
Clone Of: 1960391 Environment:
Last Closed: 2021-06-21 14:44:07 UTC Type: ---
Bug Depends On: 1960391    
Bug Blocks: 1943631    

Description Ilya Maximets 2021-05-24 13:12:13 UTC
Clone for ovsdb2.13 component.

+++ This bug was initially created as a clone of Bug #1960391 +++

Description of problem:
While the Raft leader is writing its snapshot, it may fail to send Raft heartbeats. To alleviate this, the OVSDB leader should transfer leadership when it needs to write its snapshot.

--- Additional comment from Ilya Maximets on 2021-05-13 18:55:26 UTC ---

Patch sent for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210506124731.3599531-1-i.maximets@ovn.org/

Comment 4 Zhiqiang Fang 2021-06-17 03:01:19 UTC
Verified this RFE on openvswitch2.13-2.13.0-95.el7fdp.x86_64 using the same method as in BZ#1964573.
Test bed: a 3-host OVN Raft cluster and an OVN chassis (a host with ovn-controller installed).
Method to trigger a snapshot:
  A snapshot is triggered once the database has grown by more than 50% and is larger than 10MB. After 10-20 minutes ovsdb-server checks these conditions and decides whether to compact the database (create a snapshot).
  In this test, the database was grown by adding 3000 logical switch ports (lsp) in a short period of time.
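
The trigger rule described above can be written down as a small check. This is only a sketch of the rule as stated in this comment (more than 50% growth and more than 10MB); the function and constant names are made up for illustration, and the actual ovsdb-server logic, including the 10-20 minute timing, may differ.

```python
# Thresholds taken from the test notes above, not from the ovsdb-server source.
MIN_SIZE_BYTES = 10 * 1024 * 1024   # database must be larger than 10MB
GROWTH_FACTOR = 1.5                 # database must have grown by more than 50%


def should_snapshot(size_at_last_snapshot: int, current_size: int) -> bool:
    """Return True when both conditions quoted in the test notes hold.

    Hypothetical helper: the name and signature are assumptions used
    only to illustrate the rule.
    """
    grew_enough = current_size > size_at_last_snapshot * GROWTH_FACTOR
    big_enough = current_size > MIN_SIZE_BYTES
    return grew_enough and big_enough


if __name__ == "__main__":
    mb = 1024 * 1024
    # 8MB -> 13MB: grew by more than 50% and is above 10MB.
    print(should_snapshot(8 * mb, 13 * mb))
    # 4MB -> 9MB: grew enough, but still below 10MB.
    print(should_snapshot(4 * mb, 9 * mb))
```

Adding 3000 lsp in one go is simply a quick way to push the database past both thresholds at once.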

RPMs used:

[root@wsfd-advnetlab35 ~]# rpm -aq | egrep "ovn|openv"
openvswitch-selinux-extra-policy-1.0-18.el7fdp.noarch
ovn2.13-central-20.12.0-135.el7fdp.x86_64
ovn2.13-host-20.12.0-135.el7fdp.x86_64
openvswitch2.13-2.13.0-95.el7fdp.x86_64
ovn2.13-20.12.0-135.el7fdp.x86_64
[root@wsfd-advnetlab35 ~]# 


OVN_Southbound db leadership transfer:

# cat /var/log/ovn/ovsdb-server-sb.log
...
2021-06-16T19:59:30.812Z|00041|raft|INFO|Transferring leadership to write a snapshot.
2021-06-16T19:59:30.812Z|00042|raft|INFO|rejected append_reply (not leader)
...
2021-06-16T19:59:31.860Z|00060|raft|INFO|rejected append_reply (not leader)
2021-06-16T19:59:31.860Z|00061|raft|INFO|server 037c is leader for term 2
2021-06-16T20:02:36.933Z|00062|raft|INFO|received leadership transfer from 037c in term 2
2021-06-16T20:02:36.933Z|00063|raft|INFO|term 3: starting election
2021-06-16T20:02:36.934Z|00064|raft|INFO|term 3: elected leader by 2+ of 3 servers
...
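
To put a number on the handoff, the raft log lines can be parsed for the time between the "Transferring leadership" message and the first "server ... is leader for term ..." message. The snippet below is a rough sketch that assumes the log line layout shown above; the function name and the embedded sample are illustrative only and not part of any OVS tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: timestamp|seq|module|level|message
LOG_LINE = re.compile(r"^(?P<ts>\S+Z)\|\d+\|raft\|INFO\|(?P<msg>.*)$")
NEW_LEADER = re.compile(r"server (\w+) is leader for term (\d+)")

# Lines copied from the excerpt above.
SAMPLE = """\
2021-06-16T19:59:30.812Z|00041|raft|INFO|Transferring leadership to write a snapshot.
2021-06-16T19:59:30.812Z|00042|raft|INFO|rejected append_reply (not leader)
2021-06-16T19:59:31.860Z|00060|raft|INFO|rejected append_reply (not leader)
2021-06-16T19:59:31.860Z|00061|raft|INFO|server 037c is leader for term 2
""".splitlines()


def handoff_duration(lines):
    """Return (new_leader, term, seconds) for the first handoff found, else None."""
    started = None
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip "..." and anything that is not a raft INFO line
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        msg = m["msg"]
        if msg.startswith("Transferring leadership"):
            started = ts
        elif started is not None:
            found = NEW_LEADER.search(msg)
            if found:
                seconds = (ts - started).total_seconds()
                return found.group(1), int(found.group(2)), seconds
    return None


if __name__ == "__main__":
    print(handoff_duration(SAMPLE))
```

For the excerpt above this reports server 037c taking over in term 2 roughly one second after the transfer started, which is in line with the 1000 ms election timer shown in the cluster/status output.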


# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

##############
Wed Jun 16 15:59:30 EDT 2021
079b
Name: OVN_Southbound
Cluster ID: d908 (d9082509-96e1-4d97-b777-7fa7cc472cd1)
Server ID: 079b (079b530e-1030-4204-90e0-3413beac73df)
Address: tcp:wsfd-advnetlab35.xyz:6644
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self

Election timer: 1000
Log: [2, 3002]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-037c ->037c <-1774 ->1774
Servers:
    037c (037c at tcp:netqe5.xyz:6644) next_index=3002 match_index=3001
    079b (079b at tcp:wsfd-advnetlab35.xyz:6644) (self) next_index=2 match_index=3001
    1774 (1774 at tcp:netqe6.xyz:6644) next_index=3002 match_index=3001
##############
Wed Jun 16 15:59:30 EDT 2021
079b
Name: OVN_Southbound
Cluster ID: d908 (d9082509-96e1-4d97-b777-7fa7cc472cd1)
Server ID: 079b (079b530e-1030-4204-90e0-3413beac73df)
Address: tcp:wsfd-advnetlab35.xyz:6644
Status: cluster member
Role: follower
Term: 1
Leader: unknown
Vote: self

Election timer: 1000
Log: [3002, 3002]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-037c ->037c <-1774 ->1774
Servers:
    037c (037c at tcp:netqe5.xyz:6644)
    079b (079b at tcp:wsfd-advnetlab35.xyz:6644) (self)
    1774 (1774 at tcp:netqe6.xyz:6644)
##############
Wed Jun 16 15:59:32 EDT 2021
079b
Name: OVN_Southbound
Cluster ID: d908 (d9082509-96e1-4d97-b777-7fa7cc472cd1)
Server ID: 079b (079b530e-1030-4204-90e0-3413beac73df)
Address: tcp:wsfd-advnetlab35.xyz:6644
Status: cluster member
Role: follower
Term: 2
Leader: 037c
Vote: 037c

Election timer: 1000
Log: [3002, 3003]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-037c ->037c <-1774 ->1774
Servers:
    037c (037c at tcp:netqe5.xyz:6644)
    079b (079b at tcp:wsfd-advnetlab35.xyz:6644) (self)
    1774 (1774 at tcp:netqe6.xyz:6644)
##############
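
The three dumps above can be reduced to the fields that matter for this verification (Role, Term, Leader). The following is a small sketch that assumes the "Key: value" layout printed by cluster/status as shown above; the helper names and the shortened sample are illustrative.

```python
def parse_status(text):
    """Collect the Role, Term and Leader fields from one cluster/status dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("Role", "Term", "Leader"):
            fields[key] = value.strip()
    return fields


def summarize(dumps):
    """Return one (role, term, leader) tuple per dump, in order."""
    return [(d["Role"], int(d["Term"]), d["Leader"])
            for d in map(parse_status, dumps)]


# Shortened copies of the three OVN_Southbound dumps above.
DUMPS = [
    "Name: OVN_Southbound\nRole: leader\nTerm: 1\nLeader: self\nVote: self\n",
    "Name: OVN_Southbound\nRole: follower\nTerm: 1\nLeader: unknown\nVote: self\n",
    "Name: OVN_Southbound\nRole: follower\nTerm: 2\nLeader: 037c\nVote: 037c\n",
]

if __name__ == "__main__":
    for role, term, leader in summarize(DUMPS):
        print(f"role={role} term={term} leader={leader}")
```

The expected sequence for a successful transfer is leader (term 1), then follower with an unknown leader, then follower under the new leader in term 2, which is what server 079b shows here.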



OVN_Northbound db leadership transfer:

# cat /var/log/ovn/ovsdb-server-nb.log
...
2021-06-16T20:04:19.189Z|00224|raft|INFO|Transferring leadership to write a snapshot.
2021-06-16T20:04:19.190Z|00225|raft|INFO|rejected append_reply (not leader)
2021-06-16T20:04:19.703Z|00226|raft|INFO|rejected append_reply (not leader)
2021-06-16T20:04:19.705Z|00227|raft|INFO|server 2e70 is leader for term 2
...



# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

##############
Wed Jun 16 16:04:18 EDT 2021
3793
Name: OVN_Northbound
Cluster ID: d0d5 (d0d58acf-16c4-4a5b-b646-3cd9c2961a16)
Server ID: 3793 (3793f874-4202-4ddb-871d-0544671483df)
Address: tcp:wsfd-advnetlab35.xyz:6643
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self

Election timer: 1000
Log: [2, 5999]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-7f27 ->7f27 <-2e70 ->2e70
Servers:
    2e70 (2e70 at tcp:netqe6.xyz:6643) next_index=5999 match_index=5998
    3793 (3793 at tcp:wsfd-advnetlab35.xyz:6643) (self) next_index=2 match_index=5998
    7f27 (7f27 at tcp:netqe5.xyz:6643) next_index=5999 match_index=5998
##############
Wed Jun 16 16:04:19 EDT 2021
3793
Name: OVN_Northbound
Cluster ID: d0d5 (d0d58acf-16c4-4a5b-b646-3cd9c2961a16)
Server ID: 3793 (3793f874-4202-4ddb-871d-0544671483df)
Address: tcp:wsfd-advnetlab35.xyz:6643
Status: cluster member
Role: follower
Term: 1
Leader: unknown
Vote: self

Election timer: 1000
Log: [5999, 5999]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-7f27 ->7f27 <-2e70 ->2e70
Servers:
    2e70 (2e70 at tcp:netqe6.xyz:6643)
    3793 (3793 at tcp:wsfd-advnetlab35.xyz:6643) (self)
    7f27 (7f27 at tcp:netqe5.xyz:6643)
##############
Wed Jun 16 16:04:20 EDT 2021
3793
Name: OVN_Northbound
Cluster ID: d0d5 (d0d58acf-16c4-4a5b-b646-3cd9c2961a16)
Server ID: 3793 (3793f874-4202-4ddb-871d-0544671483df)
Address: tcp:wsfd-advnetlab35.xyz:6643
Status: cluster member
Role: follower
Term: 2
Leader: 2e70
Vote: 2e70

Election timer: 1000
Log: [5999, 6000]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-7f27 ->7f27 <-2e70 ->2e70
Servers:
    2e70 (2e70 at tcp:netqe6.xyz:6643)
    3793 (3793 at tcp:wsfd-advnetlab35.xyz:6643) (self)
    7f27 (7f27 at tcp:netqe5.xyz:6643)
##############

Comment 6 errata-xmlrpc 2021-06-21 14:44:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2506