Bug 1838334 - [OVN][Scale]SB-DB raft cluster doesn't recover at higher election time and 500+ nodes
Summary: [OVN][Scale]SB-DB raft cluster doesn't recover at higher election time and 500+ nodes
Keywords:
Status: CLOSED DUPLICATE of bug 1834838
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 7.7
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ilya Maximets
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-20 23:04 UTC by Anil Vishnoi
Modified: 2020-09-10 10:44 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-10 10:44:38 UTC
Target Upstream Version:
Embargoed:


Attachments
sb-db raft cluster nodes logs (314.96 KB, text/plain), 2020-05-20 23:06 UTC, Anil Vishnoi


Links
Red Hat Bugzilla 1834838 (urgent, CLOSED): nbdb does not release memory after cleanup (last updated 2022-10-13 03:55:07 UTC)

Description Anil Vishnoi 2020-05-20 23:04:17 UTC
Description of problem:

The cluster goes into a bad state at 500 nodes with an election timer of 32 seconds. One of the nodes ends up as a singleton cluster, and the other two nodes are not able to carry out leader election and always end up in the follower state.

Following is the cluster state of each individual node:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
9aca
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 9aca (9aca6b5f-bd2f-455c-a98e-19e7a0226913)
Address: ssl:10.0.210.251:9644
Status: cluster member
Role: follower
Term: 273
Leader: unknown
Vote: unknown

Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->1bbc ->6a8c <-6a8c <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644) (self)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644)
---------
# oc exec -n openshift-ovn-kubernetes ovnkube-master-k9ph2 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
1bbc
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 1bbc (1bbc51cb-4b41-4c27-8407-0ae991b07bbc)
Address: ssl:10.0.181.129:9644
Status: cluster member
Role: candidate
Term: 274
Leader: unknown
Vote: self

Election timer: 32000
Log: [6853, 6853]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->9aca ->6a8c <-9aca <-6a8c
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644) (self) (voted for 1bbc)
    6a8c (6a8c at ssl:10.0.148.1:9644)
------
# oc exec -n openshift-ovn-kubernetes ovnkube-master-nfsvs -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
6a8c
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 6a8c (6a8c859f-f54a-4b61-a028-3c86e3f1bd62)
Address: ssl:10.0.148.1:9644
Status: cluster member
Role: follower
Term: 272
Leader: 9aca
Vote: 9aca

Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->9aca ->1bbc <-9aca <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644) (self)


Looking at the logs, it seems like 32 seconds of election timer was not enough for a scale of 500 nodes. The raft election appears to go into a loop: a leader gets elected, but the leader again gets busy for more than 32 seconds, the followers trigger another election, and the leader changes again, over and over. To break the loop, I tried to increase the election timer to 50 seconds on the leader node, but it seems it was not able to propagate that change to one of the followers. Now, out of the 3 nodes, two have an election timer of 50 seconds and the remaining node has 32 seconds.
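
For reference, the election timer bump on the leader was presumably done with something like the command below (a sketch: the pod name is only an example and has to be the current leader, the value is in milliseconds, and ovsdb-server may refuse an increase of more than double the current value, so large jumps need to be done in steps):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 50000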

To see whether reducing the number of nodes helps the cluster recover, I reduced the node count from 500 to 100. It did not seem to help at first, but eventually there was no churn in the system and sb-db calmed down as well.

It took around 15 minutes to recover with 100 nodes and a 50-second election timer.

It seems like, while moving the scale to 500 nodes, the singleton node ended up incrementing its term, so when the other two followers asked for a vote (at a lower term), it rejected the vote requests.

-----
2020-05-20T22:31:40Z|06733|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06734|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06735|raft|DBG|raft.c:1512 -->9aca vote_request: term=277 last_log_index=6852 last_log_term=2
2020-05-20T22:31:41Z|06736|raft|DBG|raft.c:1512 -->6a8c vote_request: term=277 last_log_index=6852 last_log_term=2
2020-05-20T22:31:41Z|06737|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06738|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06739|raft|DBG|<--9aca vote_reply: term=275 vote=6a8c
2020-05-20T22:31:41Z|06740|raft|INFO|rejecting term 275 < current term 277 received in vote_reply message from server 9aca
2020-05-20T22:31:41Z|06741|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06742|raft|DBG|raft_is_connected: false
----


Once I reduced the number of nodes to 100, I captured the logs in debug mode for two election terms to provide some more data to help with debugging.
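
For completeness, raft debug logging in ovsdb-server can be toggled with something like the following (a sketch; run it inside the sbdb container the same way as the cluster/status commands above):

# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set raft:dbg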


Version-Release number of selected component (if applicable):


How reproducible:
Easily at higher scale. It can probably be reproduced at lower scale (200 nodes) with a lower election timer (5 seconds, with the inactivity probe disabled), but it will probably take longer.

Steps to Reproduce:
1. Set the election timer to 32 seconds for the sb-db raft cluster.
2. Scale the number of nodes to 500 (in increments of 100 nodes per iteration).
3. You should see that partitions start happening because sb-db stays busy for longer than the election timer.


Actual results:
The sb-db raft cluster doesn't recover, no new nodes can join the cluster, and users cannot deploy any pods either. It is basically a network outage.

Expected results:
The cluster should never get into that state, or if it does, it should recover in a reasonable time (in my opinion it should not take more than 2*election-timer).

Additional info:

Comment 1 Anil Vishnoi 2020-05-20 23:06:54 UTC
Created attachment 1690416 [details]
sb-db raft cluster nodes logs

Comment 2 Ilya Maximets 2020-05-22 11:59:19 UTC
I think this might be the same issue as in:
  https://bugzilla.redhat.com/show_bug.cgi?id=1834838

Looking at the logs, I see high memory consumption that mostly comes from the
jsonrpc backlog:

2020-05-20T19:51:48Z|03350|memory|INFO|peak resident set size grew 147% in last 2126.7 seconds, from 5128996 kB to 12655768 kB
2020-05-20T19:51:48Z|03351|memory|INFO|backlog:3947087447 cells:2120596 monitors:2 sessions:114 triggers:1

The lines above show that ovsdb-server consumes ~12 GB of system memory, ~4 GB of
which are messages waiting to be sent inside jsonrpc for the database
(monitors, probably) connections.

I think that most of the other 8 GB are also in the jsonrpc backlog, but for the
raft connections.  I'm working on a patch to add raft memory consumption to the
memory reports.  This will show a clearer picture.

In general, it looks to me like a huge number of messages lands in the raft rpc
backlog and the destination ovsdb-server is simply not able to receive and process
them in time.  This mostly happens because of the huge install_snapshot_request's,
which contain the whole database.  By the time the remote ovsdb-server replies to
votes or other control messages, they are already far outdated, and this way the
server is never able to leave the 'candidate' state.  The same probably happens to
followers, which are not able to elect a new leader due to continuously outdated
votes and the inability to communicate normally.  And it's possible that this
condition will never recover by itself, because each outdated vote generates a
vote_reply that lands in the same backlog and eventually generates further control
messages.

One possible workaround might be to monitor the size of the backlog and simply drop
the connection if it reaches some critical value, but we need to think about how to
avoid such conditions in general.
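
Until something like that exists in ovsdb-server itself, a crude way to watch the backlog from outside is to poll the memory report (a rough sketch, assuming the sbdb control socket path used elsewhere in this bug; the 60-second interval is arbitrary, and the backlog counter may only appear once a backlog has actually built up):

# while true; do ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show | grep -o 'backlog:[0-9]*'; sleep 60; done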


-- I changed the component to OVN, as we usually open raft/OVN-related bugs against
   the OVN component.  Feel free to change it back if needed.

Comment 3 Ilya Maximets 2020-05-25 08:33:36 UTC
Anil, I prepared a patch to collect information about raft memory consumption:

  https://patchwork.ozlabs.org/project/openvswitch/patch/20200522173630.417106-1-i.maximets@ovn.org/

It's Acked, but not merged yet.

Could you please run your tests with it and collect the output of ovs-appctl memory/show
while the system is unstable?  This way we can identify whether it's the same issue
as in BZ 1834838.
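
For example, something like this should dump the report from inside each sbdb container (a sketch; the pod names are the ones from the description above and will differ per cluster):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show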

Comment 4 Anil Vishnoi 2020-05-27 22:57:45 UTC
Sure, I will get to it sometime tomorrow. Do you have any GitHub branch for this patch that I can pull to build my local ovn-kubernetes image?

Comment 5 Ilya Maximets 2020-05-28 10:12:05 UTC
(In reply to Anil Vishnoi from comment #4)
> Sure, I will get to it sometime tomorrow. Do you have any GitHub branch
> for this patch that I can pull to build my local ovn-kubernetes image?

This patch is in upstream master branch already.

BTW, in a couple of hours the second patch (that hopefully fixes the issue)
will be in upstream master branch too:
   https://patchwork.ozlabs.org/project/openvswitch/patch/20200523173412.477681-1-i.maximets@ovn.org/

For your convenience, here is the branch with raft memory report and without
the fix:  https://github.com/igsilya/ovs/tree/tmp-raft-memory-report
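
Checking out and building that branch would look roughly like this (a sketch of the standard OVS build steps; wiring the result into a local ovn-kubernetes image is a separate step not covered here):

$ git clone -b tmp-raft-memory-report https://github.com/igsilya/ovs.git
$ cd ovs && ./boot.sh && ./configure && make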

Comment 6 Anil Vishnoi 2020-06-02 03:59:24 UTC
(In reply to Ilya Maximets from comment #5)
> (In reply to Anil Vishnoi from comment #4)
> > Sure, I will get to it sometime tomorrow. Do you have any GitHub branch
> > for this patch that I can pull to build my local ovn-kubernetes image?
> 
> This patch is in upstream master branch already.
> 
> BTW, in a couple of hours the second patch (that hopefully fixes the issue)
> will be in upstream master branch too:
>    https://patchwork.ozlabs.org/project/openvswitch/patch/20200523173412.477681-1-i.maximets@ovn.org/
> 
> For your convenience, here is the branch with raft memory report and without
> the fix:  https://github.com/igsilya/ovs/tree/tmp-raft-memory-report

Looks like both patches have been merged upstream. I will try the master branch sometime this week and report back.

Comment 7 Ilya Maximets 2020-06-16 11:16:06 UTC
Hi, Anil.

Did you have a chance to test with patches applied?

Comment 8 Dan Williams 2020-07-14 11:41:00 UTC
Ilya's patches are in openvswitch2.13-2.13.0-29.el7fdp, which has been tagged into OCP 4.6 for the last week or so.

