Bug 1960391

Summary: [RFE] Transfer RAFT leadership during snapshot writing
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Tim Rozet <trozet>
Component: ovsdb2.15 Assignee: Ilya Maximets <i.maximets>
Status: CLOSED ERRATA QA Contact: Zhiqiang Fang <zfang>
Severity: high Docs Contact:
Priority: high    
Version: RHEL 8.0 CC: ctrautma, i.maximets, jhsiao, kfida, ralongi, tredaelli, zfang
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openvswitch2.15-2.15.0-21.el8fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1963948 (view as bug list) Environment:
Last Closed: 2021-06-21 14:25:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1943631, 1963948    

Description Tim Rozet 2021-05-13 18:40:10 UTC
Description of problem:
While the Raft leader is writing its snapshot, it may fail to send Raft heartbeats. To alleviate this, we should make the OVSDB leader transfer leadership to another server when it needs to write its snapshot.

Comment 4 Zhiqiang Fang 2021-06-16 16:42:39 UTC
Verified this RFE on openvswitch2.15-2.15.0-24.el8fdp.x86_64 using a method similar to the one in BZ#1964573.
For ovs2.15, we created many more resources than for ovs2.13 to trigger a db compaction and snapshot.


RPMs used:

[root@wsfd-advnetlab35 ~]# rpm -qa |egrep "ovn|openvsw"
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.15-2.15.0-24.el8fdp.x86_64
ovn-2021-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-central-21.03.0-40.el8fdp.x86_64
[root@wsfd-advnetlab35 ~]# 


Northbound db and Southbound db both had leadership transfer and snapshot events.


[root@wsfd-advnetlab35 ~]# cat /var/log/ovn/ovsdb-server-nb.log | grep leader
2021-06-16T14:45:47.644Z|00003|raft|INFO|term 1: elected leader by 1+ of 1 servers
2021-06-16T16:24:28.365Z|00921|raft|INFO|Transferring leadership to write a snapshot.           <--------
2021-06-16T16:24:28.369Z|00922|raft|INFO|rejected append_reply (not leader)
2021-06-16T16:24:28.369Z|00923|raft|INFO|rejected append_reply (not leader)
2021-06-16T16:24:28.538Z|00925|raft|INFO|server 6496 is leader for term 3
[root@wsfd-advnetlab35 ~]# 


[root@wsfd-advnetlab35 ~]# cat /var/log/ovn/ovsdb-server-sb.log | grep leader
2021-06-16T14:45:47.753Z|00003|raft|INFO|term 1: elected leader by 1+ of 1 servers
2021-06-16T15:52:17.278Z|00033|raft|INFO|Transferring leadership to write a snapshot.           <--------
2021-06-16T15:52:17.640Z|00034|raft|INFO|server 6881 is leader for term 2
[root@wsfd-advnetlab35 ~]# 
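
The log excerpts above can also be checked mechanically. The following is a minimal Python sketch, not part of the original verification: it pairs each "Transferring leadership to write a snapshot" line with the next "server ... is leader for term ..." line and reports the gap between them. The helper names (parse_ts, handovers) are hypothetical, and the sample text is copied from the ovsdb-server-nb.log output above, with the "<--------" annotation removed.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the ovsdb-server-nb.log excerpt in this comment.
LOG = """\
2021-06-16T14:45:47.644Z|00003|raft|INFO|term 1: elected leader by 1+ of 1 servers
2021-06-16T16:24:28.365Z|00921|raft|INFO|Transferring leadership to write a snapshot.
2021-06-16T16:24:28.369Z|00922|raft|INFO|rejected append_reply (not leader)
2021-06-16T16:24:28.369Z|00923|raft|INFO|rejected append_reply (not leader)
2021-06-16T16:24:28.538Z|00925|raft|INFO|server 6496 is leader for term 3
"""

NEW_LEADER = re.compile(r"server (\w+) is leader for term (\d+)")


def parse_ts(field):
    """Parse the timestamp field of a log line (hypothetical helper)."""
    return datetime.strptime(field, "%Y-%m-%dT%H:%M:%S.%fZ")


def handovers(log_text):
    """Yield (new_leader, term, gap_ms) for each snapshot-triggered transfer."""
    started = None
    for line in log_text.splitlines():
        # Expected layout: timestamp|sequence|module|level|message
        parts = line.split("|", 4)
        if len(parts) != 5 or parts[2] != "raft":
            continue
        ts, msg = parts[0], parts[4]
        if msg.startswith("Transferring leadership to write a snapshot"):
            started = parse_ts(ts)
        elif started is not None:
            match = NEW_LEADER.search(msg)
            if match:
                gap_ms = (parse_ts(ts) - started) // timedelta(milliseconds=1)
                yield match.group(1), int(match.group(2)), gap_ms
                started = None


if __name__ == "__main__":
    for leader, term, gap_ms in handovers(LOG):
        print(f"leadership moved to {leader} in term {term} after {gap_ms} ms")
```

For the Northbound excerpt this reports the handover to server 6496 in term 3, 173 ms after the transfer was announced (16:24:28.365 to 16:24:28.538).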


The Northbound db leadership transfer was captured as shown below. We also observed that ovnnb_db.db was compacted, shrinking from about 10 MB (10485306 bytes) to about 1.2 MB (1224987 bytes).


##############
Wed Jun 16 12:24:28 EDT 2021
59f6
Name: OVN_Northbound
Cluster ID: eb61 (eb61e5f8-d97e-46f3-a796-f4f1b53e9b67)
Server ID: 59f6 (59f6afaf-be98-4d23-8ae7-928f8245dd5d)
Address: tcp:wsfd-advnetlab35.xyz:6643
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self

Last Election started 5920500 ms ago, reason: timeout
Last Election won: 5920499 ms ago
Election timer: 1000
Log: [2, 20380]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-6496 ->6496 <-9585 ->9585
Disconnections: 0
Servers:
    9585 (9585 at tcp:netqe6.xyz:6643) next_index=20380 match_index=20379 last msg 165 ms ago
    6496 (6496 at tcp:netqe5.xyz:6643) next_index=20380 match_index=20379 last msg 165 ms ago
    59f6 (59f6 at tcp:wsfd-advnetlab35.xyz:6643) (self) next_index=2 match_index=20379
total 27588
-rw-r-----. 1 root root 10485306 Jun 16 12:24 ovnnb_db.db
-rw-r-----. 1 root root 10214438 Jun 16 12:24 ovnsb_db.db
##############
Wed Jun 16 12:24:28 EDT 2021
59f6
Name: OVN_Northbound
Cluster ID: eb61 (eb61e5f8-d97e-46f3-a796-f4f1b53e9b67)
Server ID: 59f6 (59f6afaf-be98-4d23-8ae7-928f8245dd5d)
Address: tcp:wsfd-advnetlab35.xyz:6643
Status: cluster member
Role: follower
Term: 3
Leader: 6496
Vote: 6496

Last Election started 5921008 ms ago, reason: timeout
Last Election won: 5921007 ms ago
Election timer: 1000
Log: [20381, 20383]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-6496 ->6496 <-9585 ->9585
Disconnections: 0
Servers:
    9585 (9585 at tcp:netqe6.xyz:6643) last msg 119 ms ago
    6496 (6496 at tcp:netqe5.xyz:6643) last msg 113 ms ago
    59f6 (59f6 at tcp:wsfd-advnetlab35.xyz:6643) (self)
total 18240
-rw-r-----. 1 root root  1224987 Jun 16 12:24 ovnnb_db.db
-rw-r-----. 1 root root 10215771 Jun 16 12:24 ovnsb_db.db
##############
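
The two status captures above can be compared with a short script. This is a minimal Python sketch under stated assumptions: the function names (parse_status, summarize) are hypothetical, the sample strings keep only the Role, Term and Leader lines from the captures, and the byte counts are the ovnnb_db.db sizes from the two directory listings.

```python
# Role/Term/Leader lines taken from the two status captures above.
BEFORE = """\
Role: leader
Term: 1
Leader: self
"""

AFTER = """\
Role: follower
Term: 3
Leader: 6496
"""


def parse_status(text):
    """Extract the Role, Term and Leader fields from a status capture."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("Role", "Term", "Leader"):
            fields[key] = value.strip()
    return fields


def summarize(before, after, size_before, size_after):
    """Summarize the leadership change and the database size reduction."""
    b, a = parse_status(before), parse_status(after)
    return {
        "stepped_down": b["Role"] == "leader" and a["Role"] == "follower",
        "terms": (int(b["Term"]), int(a["Term"])),
        "new_leader": a["Leader"],
        "shrink_pct": round(100 * (1 - size_after / size_before), 1),
    }


if __name__ == "__main__":
    # ovnnb_db.db sizes in bytes, from the listings before and after compaction.
    print(summarize(BEFORE, AFTER, 10485306, 1224987))
```

With the values above, the summary shows the server stepping down from leader to follower between term 1 and term 3, with 6496 as the new leader and a size reduction of roughly 88%.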

Comment 6 errata-xmlrpc 2021-06-21 14:25:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.15 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2509