Description of problem:

The cluster goes into a bad state at 500 nodes with an election timer of 32 seconds. One of the nodes ends up as a singleton cluster, and the other two nodes are not able to carry out leader election and always end up in the follower state. Following is the cluster state of each individual node:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
9aca
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 9aca (9aca6b5f-bd2f-455c-a98e-19e7a0226913)
Address: ssl:10.0.210.251:9644
Status: cluster member
Role: follower
Term: 273
Leader: unknown
Vote: unknown
Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->1bbc ->6a8c <-6a8c <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644) (self)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644)

---------

# oc exec -n openshift-ovn-kubernetes ovnkube-master-k9ph2 -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
1bbc
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 1bbc (1bbc51cb-4b41-4c27-8407-0ae991b07bbc)
Address: ssl:10.0.181.129:9644
Status: cluster member
Role: candidate
Term: 274
Leader: unknown
Vote: self
Election timer: 32000
Log: [6853, 6853]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->9aca ->6a8c <-9aca <-6a8c
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644) (self) (voted for 1bbc)
    6a8c (6a8c at ssl:10.0.148.1:9644)

------

# oc exec -n openshift-ovn-kubernetes ovnkube-master-nfsvs -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
6a8c
Name: OVN_Southbound
Cluster ID: 58d4 (58d4e94f-d4e5-4e4f-8a02-4918b7802949)
Server ID: 6a8c (6a8c859f-f54a-4b61-a028-3c86e3f1bd62)
Address: ssl:10.0.148.1:9644
Status: cluster member
Role: follower
Term: 272
Leader: 9aca
Vote: 9aca
Election timer: 50000
Log: [7764, 7855]
Entries not yet committed: 1
Entries not yet applied: 1
Connections: ->9aca ->1bbc <-9aca <-1bbc
Servers:
    9aca (9aca at ssl:10.0.210.251:9644)
    1bbc (1bbc at ssl:10.0.181.129:9644)
    6a8c (6a8c at ssl:10.0.148.1:9644) (self)

Looking at the logs, it seems that a 32-second election timer was not enough at a scale of 500 nodes. The raft election appears to go into a loop: a leader is elected, but it then gets busy for more than 32 seconds, a follower triggers another election, and the leader changes again. This keeps repeating.

To break the loop, I tried to increase the election timer to 500 seconds on the leader node, but it seems those changes did not propagate to one of the followers. Now, out of 3 nodes, two have a 50-second election timer and the remaining node has 32 seconds.

To see if reducing the number of nodes would help the cluster recover, I reduced the node count from 500 to 100. At first it did not seem to help, but eventually there was no more churn in the system and the sb-db calmed down as well. It took around 15 minutes to recover at 100 nodes with a 50-second election timer.

It seems that while scaling to 500 nodes, the singleton cluster kept incrementing its term, so when the other two followers asked for a vote (at a lower term), it rejected the vote.
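For reference, a sketch of how the election timer is normally raised, assuming the leader pod name is known (placeholder below; values are in milliseconds). As far as I know, ovsdb-server accepts the change only on the leader and only allows at most doubling the current value per invocation, which would explain why a single jump from 32 s to 500 s could not be applied:

# oc exec -n openshift-ovn-kubernetes <leader-pod> -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 64000
# oc exec -n openshift-ovn-kubernetes <leader-pod> -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 128000

Each step has to commit through the raft log before the next one is accepted, so this cannot repair a cluster that can no longer commit entries.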
-----
2020-05-20T22:31:40Z|06733|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06734|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06735|raft|DBG|raft.c:1512 -->9aca vote_request: term=277 last_log_index=6852 last_log_term=2
2020-05-20T22:31:41Z|06736|raft|DBG|raft.c:1512 -->6a8c vote_request: term=277 last_log_index=6852 last_log_term=2
2020-05-20T22:31:41Z|06737|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06738|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06739|raft|DBG|<--9aca vote_reply: term=275 vote=6a8c
2020-05-20T22:31:41Z|06740|raft|INFO|rejecting term 275 < current term 277 received in vote_reply message from server 9aca
2020-05-20T22:31:41Z|06741|raft|DBG|raft_is_connected: false
2020-05-20T22:31:41Z|06742|raft|DBG|raft_is_connected: false
----

Once I reduced the number of nodes to 100, I captured the logs in debug mode for two election terms to provide more data to help debugging.

Version-Release number of selected component (if applicable):

How reproducible:
Easily at higher scale. It can probably be reproduced at lower scale (200 nodes) with a lower election timer (5 seconds, with the inactivity probe disabled), but it will probably take longer.

Steps to Reproduce:
1. Set the election timer to 32 seconds for the sb-db raft cluster.
2. Scale the number of nodes to 500 (incrementing by 100 nodes per iteration).
3. You should see partitions start to happen because sb-db is busy for longer than the election timer.

Actual results:
The sb-db raft cluster doesn't recover: no new nodes can join the cluster, and users cannot deploy any pods either. Basically a network outage.

Expected results:
The cluster should never get into this state, or if it does, it should reconverge in a reasonable time (in my opinion, no more than 2x the election timer).

Additional info:
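For reference, a sketch of how the raft debug traces above can be enabled and disabled (pod name is a placeholder; vlog/set is the standard OVS logging control, and raft debug output at this scale is very verbose, so it should be switched back afterwards):

# oc exec -n openshift-ovn-kubernetes <sbdb-pod> -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set raft:dbg
... capture logs for a couple of election terms ...
# oc exec -n openshift-ovn-kubernetes <sbdb-pod> -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set raft:info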
Created attachment 1690416 [details] sb-db raft cluster nodes logs
I think this might be the same issue as in: https://bugzilla.redhat.com/show_bug.cgi?id=1834838

Looking at the logs, I see high memory consumption that mostly comes from the jsonrpc backlog:

2020-05-20T19:51:48Z|03350|memory|INFO|peak resident set size grew 147% in last 2126.7 seconds, from 5128996 kB to 12655768 kB
2020-05-20T19:51:48Z|03351|memory|INFO|backlog:3947087447 cells:2120596 monitors:2 sessions:114 triggers:1

The lines above show that ovsdb-server consumes ~12 GB of system memory, ~4 GB of which are messages waiting to be sent inside jsonrpc for the database (monitor, probably) connections. I think most of the other 8 GB are also in the jsonrpc backlog, but for the raft connections. I'm working on a patch to add raft memory consumption to the memory reports; this will give a clearer picture.

In general, it looks to me like a huge number of messages lands in the raft rpc, and the destination ovsdb-server is simply not able to receive and process them in time. This mostly happens because of the huge install_snapshot_request's, which contain the whole database. By the time the remote ovsdb-server replies to votes or other control messages, they are already far outdated, so this server is never able to leave the 'candidate' state. The same probably happens to the followers, which are not able to elect a new leader due to continuously outdated votes and the inability to communicate normally. And it's possible that this condition will never recover by itself, because each outdated vote generates a vote_reply that lands in the same backlog and eventually generates further control messages.

One possible workaround might be to monitor the size of the backlog and just drop the connection if it reaches some critical value, but we need to think about how to avoid such conditions in general.

--

I changed the component to OVN, as we usually open raft/OVN-related bugs under the OVN component. Feel free to change it back if needed.
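A minimal sketch of that workaround idea, assuming the internals of OVS's ovsdb/raft.c (where struct raft_conn holds its jsonrpc session as 'js'). The helper, threshold, and call site are hypothetical, not an actual patch, though jsonrpc_session_get_backlog() and jsonrpc_session_force_reconnect() are existing OVS APIs:

/* Hypothetical helper for ovsdb/raft.c: drop a raft connection whose
 * jsonrpc send backlog has grown past a critical value.  The threshold
 * is an example, not a tuned number. */
#define RAFT_MAX_BACKLOG_BYTES (64 * 1024 * 1024)

static void
raft_conn_check_backlog(struct raft_conn *conn)
{
    size_t backlog = jsonrpc_session_get_backlog(conn->js);

    if (backlog > RAFT_MAX_BACKLOG_BYTES) {
        /* Disconnecting discards the queued (mostly stale) messages;
         * the peers will reconnect and resync, which should be cheaper
         * than pushing gigabytes of outdated rpc. */
        jsonrpc_session_force_reconnect(conn->js);
    }
}

Something like this would have to be called for each connection from raft_run(), alongside the existing per-connection jsonrpc servicing.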
Anil, I prepared a patch to collect information about raft memory consumption:

https://patchwork.ozlabs.org/project/openvswitch/patch/20200522173630.417106-1-i.maximets@ovn.org/

It's Acked, but not merged yet. Could you please run your tests with it and collect the output of ovs-appctl memory/show while the system is unstable? This way we can identify whether it's the same issue as in BZ 1834838.
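For reference, a sketch of collecting that report from one of the sbdb containers (pod name copied from the status output in the description; ovn-appctl should work equally, since both tools speak the same unixctl protocol):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fxf9h -c sbdb -- ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show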
Sure, I will get to it sometime tomorrow. Do you have a GitHub branch for this patch that I can pull to build my local ovn-kubernetes image?
(In reply to Anil Vishnoi from comment #4)
> Sure, I will get to it sometime tomorrow. Do you have a GitHub branch
> for this patch that I can pull to build my local ovn-kubernetes image?

This patch is in the upstream master branch already.

BTW, in a couple of hours the second patch (which hopefully fixes the issue) will be in the upstream master branch too:

https://patchwork.ozlabs.org/project/openvswitch/patch/20200523173412.477681-1-i.maximets@ovn.org/

For your convenience, here is the branch with the raft memory report and without the fix: https://github.com/igsilya/ovs/tree/tmp-raft-memory-report
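For reference, one way to pull that branch for a local build (a sketch; the actual image-build steps depend on the ovn-kubernetes tooling in use, and the boot.sh/configure/make sequence is the standard OVS build):

# git clone -b tmp-raft-memory-report https://github.com/igsilya/ovs.git
# cd ovs && ./boot.sh && ./configure && make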
(In reply to Ilya Maximets from comment #5)
> (In reply to Anil Vishnoi from comment #4)
> > Sure, I will get to it sometime tomorrow. Do you have a GitHub branch
> > for this patch that I can pull to build my local ovn-kubernetes image?
>
> This patch is in the upstream master branch already.
>
> BTW, in a couple of hours the second patch (which hopefully fixes the issue)
> will be in the upstream master branch too:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20200523173412.477681-1-i.maximets@ovn.org/
>
> For your convenience, here is the branch with the raft memory report and
> without the fix: https://github.com/igsilya/ovs/tree/tmp-raft-memory-report

Looks like both patches have been merged upstream. I will try the master branch sometime this week and report back.
Hi Anil, did you have a chance to test with the patches applied?
Ilya's patches are in openvswitch2.13-2.13.0-29.el7fdp, which has been tagged into OCP 4.6 for the last week or so.