Hi, I have detected SBDB partitioning after running 2k iterations of our cluster-density test on a 250 node cluster based on 4.8.0-0.nightly-2021-03-26-054333. This OCP version includes OVN 20.12.0-25.el8fdp and openvswitch-2.15.0-2.el8fdp, which includes the OVS-DB column-diff patch. Before triggering the test I manually enabled column-diff in both sbdb and nbdb with:

ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb/file/column-diff-enable
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb/file/column-diff-enable

The test shows reduced memory usage in several OVN components; however, at some point during the benchmark, between 1300 and 1500 namespaces, I started to observe sbdb RAFT partitioning and cluster instability, which eventually caused the ovn-controller processes running on each node to fail.

Find nbdb and sbdb dumps and logs at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/ovn-logs.tgz
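For anyone trying to observe the partitioning on their own cluster, a minimal sketch of how to inspect the RAFT state (assuming the same /var/run/ovn control sockets as the commands above, run from inside an ovnkube-master pod):

ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound   # role, term, leader, election timer
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

The "Status", "Leader" and "Election timer" fields in that output are what show whether a member has fallen out of the cluster.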
Reproduced in a 100 node cluster scenario. Attaching more sbdb logs (now with debug logging enabled for json, raft, raft_rpc, reconnect and stream_ssl) and an sbdb dump: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/100nodes/ovn-logs-100nodes-debug.tgz
cc Ilya
I'm taking a look at the logs today, and one thing I see is many connection dropped logs from the ssl module, with a reason of protocol error. Here are a couple of examples from the ovnkube-master-rzlxn-sb.log file:

2021-03-26T14:32:18Z|04927|reconnect|WARN|ssl:10.0.150.200:60938: connection dropped (Protocol error)
2021-03-26T14:32:30Z|04943|reconnect|WARN|ssl:10.0.150.200:32878: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04961|reconnect|WARN|ssl:10.0.193.136:56676: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04963|reconnect|WARN|ssl:10.0.150.200:33200: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04965|reconnect|WARN|ssl:10.0.150.200:33256: connection dropped (Protocol error)

[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rzlxn-sb.log | wc -l
801

There are 801 connection drops in just a little over 4 mins of logs:

1st log timestamp
-----------------
2021-03-26T10:14:17Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log

last log timestamp
------------------
2021-03-26T14:33:39Z|04971|raft|INFO|server 7fbb is leader for term 103

I see similar dropped connection logs in the other *sb.log files:

[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-gcfsg-sb.log | wc -l
752
[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rg4kr-sb.log | wc -l
676
[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rzlxn-sb.log | wc -l
801

@i.maximets Do you have any idea what might be causing these protocol errors?
(In reply to Victor Pickard from comment #3)
> There are 801 connection drops in just a little over 4 mins of logs:
> 
> 1st log timestamp
> -----------------
> 2021-03-26T10:14:17Z|00001|vlog|INFO|opened log file
> /var/log/ovn/ovsdb-server-sb.log
> 
> last log timestamp
> ------------------
> 2021-03-26T14:33:39Z|04971|raft|INFO|server 7fbb is leader for term 103

It's 4 hours, actually, not minutes, so it's not that bad.

> i.maximets Do you have any idea on what might be causing these
> protocol errors?

On reconnection, clients try to close the SSL connection gracefully with SSL_shutdown(), but the success of this call depends on the current state of the SSL connection; it will fail, for example, if re-negotiation is in progress. However, the stream-ssl library has no way to keep the connection open once stream_close() is called, so it closes the connection regardless of the result of the SSL shutdown. This will likely result in a protocol error on the server side. It should be harmless.

What concerns me more is the size of the databases. The Nb DB fully compacted takes 20 MB, and the Sb DB 500 MB. This is likely the main issue here, as it requires a significant amount of time to process that amount of data.

@rsevilla, did these databases have the same huge size on the first iteration of the test, or does each iteration increase the size? Maybe some resources are not cleaned up correctly after each iteration?
Thanks for the quick update, and you are right, this was 4 hours, not minutes! And thanks for the detailed explanation of the SSL connections; it's certainly helpful, at least to me.
Also, looking at the content of the Sb DB, I don't see any Logical_DP_Group records.

So, my second question: are logical datapath groups disabled in this test? If so, we should probably enable them to reduce the size of the Southbound DB.

IIUC, they are enabled now in OpenShift, so they should be enabled in the test, right?
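In case it helps others reading this bug, a minimal sketch of how to check and, if needed, enable the option (assumes direct access to the nbdb from an ovnkube-master pod; in OpenShift this option is normally managed by ovn-kubernetes itself rather than set by hand):

ovn-nbctl --no-leader-only get NB_Global . options:use_logical_dp_groups   # prints "true" when groups are enabled
ovn-nbctl set NB_Global . options:use_logical_dp_groups=true               # hypothetical manual enablement for a test cluster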
(In reply to Ilya Maximets from comment #6)
> Also, looking at content of Sb DB I don't see Logical_DP_Group records.
> 
> So, my second question is:
> Are logical datapath groups disabled in this test? If so, we should,
> probably, enable them to reduce the size of Southbound DB.
> 
> IIUC, they are enabled now in OpenShift, so should be enabled in the
> test, right?

They should be...
https://github.com/openshift/ovn-kubernetes/pull/440/commits/7e7710240c399d4a55dffd1d4a3ef52df06d5489

I wanted to double-check that the right option was set in the NB DB, but on Fedora, even with openvswitch-2.15, I get:

$ ovsdb-tool cluster-to-standalone nb.db ovnkube-master-rzlxn-nb.db
ovsdb-tool: ovsdb error: ovnkube-master-rzlxn-nb.db: unexpected file format
Hi, as far as I can tell, datapath groups are enabled by default in the latest OpenShift builds.

sh-4.4# ovn-nbctl --no-leader-only list NB_Global
_uuid               : eab75a87-72da-409a-9a0b-8e1dbfd396c4
connections         : [7fa433f1-47c3-461d-b3b5-260ed335644c]
external_ids        : {}
hv_cfg              : 0
hv_cfg_timestamp    : 0
ipsec               : false
name                : ""
nb_cfg              : 0
nb_cfg_timestamp    : 0
options             : {controller_event="true", e2e_timestamp="1617187521", mac_prefix="aa:9a:0e", max_tunid="16711680", northd_internal_version="20.12.0-20.16.1-56.0", northd_probe_interval="5000", svc_monitor_mac="be:a7:74:d3:fc:ab", use_logical_dp_groups="true"}
sb_cfg              : 0
sb_cfg_timestamp    : 0
ssl                 : []

sh-4.4# ovn-sbctl --no-leader-only list SB_Global
_uuid               : 45eeb890-3562-48b4-ad68-9544a416905e
connections         : [0e912e7b-a22c-4f5d-96cb-cdaff13a445b]
external_ids        : {}
ipsec               : false
nb_cfg              : 0
options             : {controller_event="true", e2e_timestamp="1617187521", mac_prefix="aa:9a:0e", max_tunid="16711680", northd_internal_version="20.12.0-20.16.1-56.0", northd_probe_interval="5000", svc_monitor_mac="be:a7:74:d3:fc:ab", use_logical_dp_groups="true"}
ssl                 : []

I have run some tests to check database sizes, all of them after enabling column-diff with:

ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb/file/column-diff-enable
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb/file/column-diff-enable

- Steady state with 100 nodes

sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact   # Manually compact databases before checking size
sh-4.4# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
88729
sh-4.4# ls -lh /etc/openvswitch/*
-rw-r-----. 1 root root 1.7M Mar 31 10:41 /etc/openvswitch/ovnnb_db.db
-rw-r-----. 1 root root  21M Mar 31 10:41 /etc/openvswitch/ovnsb_db.db

- After executing 100 cluster-density iterations in the same 100 node cluster

sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact   # Manually compact databases before checking size
sh-4.4# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
297729   # The number of lflows grew ~3.35x, which more or less matches the growth of the sbdb size
sh-4.4# ls -lh /etc/openvswitch/*
-rw-r-----. 1 root root 2.5M Mar 31 10:38 /etc/openvswitch/ovnnb_db.db
-rw-r-----. 1 root root  66M Mar 31 10:38 /etc/openvswitch/ovnsb_db.db
Created attachment 1768091 [details]
OVN DBs from 100 nodes with 50 cluster-density iterations

Attaching smaller OVN DBs to ease analysis.
Ok. The status so far:

1. It's OK that the DBs are growing, because the test only adds more and more resources on each iteration.

2. Some attached DBs are corrupted, at least the one that I looked at initially, which is why I saw no datapath groups there. Looking at other DBs, I see that datapath groups are enabled and work as expected. (Also, some DBs for some reason have DOS-style line breaks that trigger the "unexpected file format" error. They need to be run through dos2unix.)

3. The content of the Sb DB looks fine, but the size is really big: 500 MB. Logical flows contribute most of the size; there are ~850K of them.

4. Datapath groups work fine. Without them the database would have ~1.2M more logical flows, so ~2M in total.

5. There are 840K logical flows that cannot be combined. As Raul pointed out, the majority of them are generated in a single place in northd: ovn-northd.c:9131. These are ARP flows for logical routers generated by build_lrouter_arp_flow(), and the database contains ~655K of such flows. We may try to think about how to reduce their number or otherwise optimize them. Other unique flows are far less frequent:

 655055 ovn-northd.c:9131
  27168 ovn-northd.c:4713
  27168 ovn-northd.c:4676
  14120 ovn-northd.c:5112
  14120 ovn-northd.c:5017
  13905 ovn-northd.c:7609
  13691 ovn-northd.c:10202
  13587 ovn-northd.c:11624
  13584 ovn-northd.c:4635
  13584 ovn-northd.c:4581
  13584 ovn-northd.c:4564
   4721 ovn-northd.c:7198
   4721 ovn-northd.c:7180
   1287 ovn-northd.c:8704
    643 ovn-northd.c:5163
    643 ovn-northd.c:5160
    ...

6. As for the Sb DB itself, it spends too much time processing changes. There are "unreasonably long poll interval" warnings with poll intervals from 5 to 19 seconds. With a 16 second election timer this can easily trigger elections, and the cluster may be unable to recover if every iteration takes that long.
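Regarding point 2, a sketch of the conversion workflow that works around the "unexpected file format" error seen earlier in this bug (the file names are just the ones used there):

dos2unix ovnkube-master-rzlxn-nb.db                                  # strip the DOS-style line breaks first
ovsdb-tool cluster-to-standalone nb.db ovnkube-master-rzlxn-nb.db    # then convert the clustered DB to a standalone one
ovsdb-tool query nb.db '["OVN_Northbound",{"op":"select","table":"NB_Global","where":[]}]'   # sanity check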
Talking with folks, it looks like the 655k lflows are caused by k8s service VIPs multiplied by the number of nodes. I'm going to file a bug against OVN so we can track it there. The combination of too many lflows and too many clients seems to cause the sbdb to lock up for over 16 seconds. We should try to quantify what this limit actually is with ovsdb, so that we have a realistic sense of how far we can scale.
With the patch provided at https://bugzilla.redhat.com/show_bug.cgi?id=1945415#c3 I managed to scale to 250 nodes + 2k cluster-density. However, at a higher scale level (250 nodes + 3K cluster-density) I still observe sbdb instability due to RAFT partitioning. Some info:

# RPM versions
sh-4.4# rpm -qa | egrep "ovn|openvswitch"
openvswitch2.15-2.15.0-2.el8fdp.x86_64
openvswitch2.15-ipsec-2.15.0-2.el8fdp.x86_64
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
python3-openvswitch2.15-2.15.0-2.el8fdp.x86_64
openvswitch2.15-devel-2.15.0-2.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch

# Cluster status
sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
a2cc
Name: OVN_Southbound
Cluster ID: c4bb (c4bb92ca-ca4f-4a59-8fcc-b26d4abb8eb7)
Server ID: a2cc (a2cc23f2-040e-41fb-9e86-52faa9475e0b)
Address: ssl:10.0.198.11:9644
Status: cluster member
Role: follower
Term: 88
Leader: 758b
Vote: 758b

Last Election started 27991320 ms ago, reason: timeout
Election timer: 16000
Log: [141856, 141872]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->43bf ->758b <-43bf <-758b
Disconnections: 1
Servers:
    43bf (43bf at ssl:10.0.145.213:9644) last msg 28014771 ms ago
    758b (758b at ssl:10.0.161.133:9644) last msg 2914 ms ago
    a2cc (a2cc at ssl:10.0.198.11:9644) (self)

# Total lflows
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
8791502

# Flows without a logical datapath group
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | wc -l
352570

# Database sizes
sh-4.4# ls -lh /etc/openvswitch/
total 303M
-rw-r-----. 1 root root  20M Apr  7 07:27 ovnnb_db.db
-rw-r-----. 1 root root 251M Apr  7 07:28 ovnsb_db.db

I have uploaded sbdb logs and databases to: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/ovn-sbdb-04-07.tgz
More interesting info from the test above:

# Breakdown of not grouped lflows
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | grep -oE "ovn-northd.c:[0-9]*" | sort | uniq -c | sort | tail
  24302 ovn-northd.c:11574
  24302 ovn-northd.c:4570
  24302 ovn-northd.c:4587
  24302 ovn-northd.c:4641
  24551 ovn-northd.c:10294
  25049 ovn-northd.c:7615
  25547 ovn-northd.c:5023
  25547 ovn-northd.c:5118
  48604 ovn-northd.c:4682
  48604 ovn-northd.c:4719

sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | grep -oE '"table_id":[0-9]*' | sort | uniq -c | sort | tail
   2330 "table_id":6
  10821 "table_id":3
  25431 "table_id":14
  26003 "table_id":17
  28262 "table_id":0
  28777 "table_id":10
  29290 "table_id":23
  48806 "table_id":9
  48807 "table_id":2
  99929 "table_id":1

# Breakdown of lflows
sh-4.4# ovn-sbctl --no-leader-only lflow-list | grep -oE ' table=[0-9]*' | sort | uniq -c | sort | tail
  26987  table=14
  30075  table=10
  30330  table=23
  31620  table=0
  51661  table=2
  72988  table=9
 103573  table=1
2090955  table=4
2096872  table=13
4190905  table=6
One option may be to limit the time ovsdb-server spends processing actual database requests to a fraction of the election timer, comparable to the ping timeout (election_timer / 3). It is difficult to enforce a hard limit, so as a first step I have made this "best effort". The first version of the patch can be fetched from here:

https://github.com/kot-begemot-uk/ovs/commit/056417310763e6a55468bd954ac5c47e45205390

If we see some improvement, I can try adding some bells and whistles (ensuring that all sessions get the same priority, etc.).
Thanks Anton. May I have a build that includes this patch, using openvswitch2.15-2.15.0-2 as the code base?
Hi, after restarting the components with the patched OVS, sbdb does not come up. I see traces like the ones below:

2021-04-08T23:53:03Z|00013|reconnect|INFO|ssl:10.0.145.213:9644: connecting...
2021-04-08T23:53:03Z|00014|reconnect|INFO|ssl:10.0.145.213:9644: connected
2021-04-08T23:53:11Z|00015|ovsdb_server|INFO|memory trimming after compaction enabled.
2021-04-08T23:53:15Z|00016|raft|INFO|term 89: 16016 ms timeout expired, starting election
2021-04-08T23:53:15Z|00017|raft|INFO|server a2cc is leader for term 90
2021-04-08T23:53:15Z|00018|raft|INFO|rejecting append_request because previous entry 89,211629 not in local log (mismatch past end of log)
2021-04-08T23:53:16Z|00019|jsonrpc|WARN|unix#2: send error: Broken pipe
2021-04-08T23:53:16Z|00020|reconnect|WARN|unix#2: connection dropped (Broken pipe)
2021-04-08T23:53:16Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2021-04-08T23:53:16Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
Waiting for OVN_Southbound to come up.
A few questions:

1. How many clients do you have at that point?
2. What is the size of the database?

If the database is very big, I would have expected the clients to drop a few times until they settle, due to not having enough CPU time. Raft not getting enough time is weird, though; it should be getting extra opportunities to run with the patch.

Once I have some data on the actual scale you are trying to run at, I will try to reproduce this on my rig.
(In reply to Anton Ivanov from comment #18)
> A few questions.
> 
> 1. How many clients do you have at that point?
> 2. What is the size of the database?
> 
> If the database is very big, I would have expected the clients to drop a few
> times until they settle due to not having enough CPU time. Raft not getting
> enough time is weird though. It should be getting extra opportunities to run
> with the patch.
> 
> Once I have some data on the actual scale you are trying to run it at, I
> will try to reproduce this on my rig.

This OVN deployment is running on top of a small OpenShift cluster, with only 35 nodes at the moment. The databases are also quite small (the cluster is almost empty):

sh-4.4# ls -lh /etc/openvswitch/
total 17M
-rw-r-----. 1 root root 5.6M Apr  9 08:49 ovnnb_db.db
-rw-r-----. 1 root root 8.3M Apr  9 08:49 ovnsb_db.db
If you can observe it at that scale, then I should be able to as well; I can spin up something of that size on my rig. I will update the bug once I am clear on what's going on. We should not be seeing this after these changes.
Quick update: after restarting the OVN control plane pods, things have started working properly.

sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --ovsdb-disable-file-column-diff=yes '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
2021-04-09T12:05:56Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2021-04-09T12:05:57Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.1
2021-04-09T12:05:57Z|00003|reconnect|INFO|ssl:10.0.161.133:9644: connecting...
2021-04-09T12:05:57Z|00004|reconnect|INFO|ssl:10.0.198.11:9644: connecting...
2021-04-09T12:05:57Z|00005|reconnect|INFO|ssl:10.0.161.133:9644: connected
2021-04-09T12:05:57Z|00006|reconnect|INFO|ssl:10.0.198.11:9644: connected
2021-04-09T12:05:57Z|00007|raft|INFO|ssl:10.0.161.133:60200: learned server ID 758b
2021-04-09T12:05:57Z|00008|raft|INFO|ssl:10.0.161.133:60200: learned remote address ssl:10.0.161.133:9644
2021-04-09T12:05:58Z|00009|raft|INFO|ssl:10.0.198.11:56254: learned server ID a2cc
2021-04-09T12:05:58Z|00010|raft|INFO|ssl:10.0.198.11:56254: learned remote address ssl:10.0.198.11:9644
2021-04-09T12:06:07Z|00011|memory|INFO|241176 kB peak resident set size after 10.8 seconds
2021-04-09T12:06:07Z|00012|memory|INFO|cells:402248 monitors:1 raft-connections:4 raft-log:2352 sessions:1
2021-04-09T12:06:12Z|00013|raft|WARN|ignoring vote request received after only 15069 ms (minimum election time is 16000 ms)
2021-04-09T12:06:13Z|00014|raft|INFO|term 92: 16700 ms timeout expired, starting election
2021-04-09T12:06:13Z|00015|raft|INFO|term 92: elected leader by 2+ of 3 servers
2021-04-09T12:06:13Z|00016|raft|INFO|rejected append_reply (not leader)
2021-04-09T12:06:13Z|00017|raft|INFO|rejecting term 92 < current term 93 received in append_reply message from server a2cc
2021-04-09T12:06:19Z|00018|jsonrpc|WARN|unix#13: send error: Broken pipe
2021-04-09T12:06:19Z|00019|reconnect|WARN|unix#13: connection dropped (Broken pipe)
2021-04-09T12:06:19Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2021-04-09T12:06:19Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
Waiting for OVN_Southbound to come up.
2021-04-09T12:06:28Z|00020|raft|WARN|ignoring vote request received after only 14707 ms (minimum election time is 16000 ms)

Will scale up now and trigger some performance tests.
Created attachment 1771975 [details] sbdbs with lflow and election priority fixes
Hey Anton, Ilya,
From my understanding the new patch you provided initially helped, but we still see leader elections. I'm not sure how to parse all of the logs; I've attached them to the bz. Do you mind taking a look? There are 27 terms, and some of what I see looks suspicious.

A lot of these replies when sending commands:
2021-04-14T19:24:17Z|5438301|raft|DBG|<--3a75 execute_command_reply: result=143af1ba-7dcb-41c9-8f3a-22afe48a0368 status="prerequisite check failed"

SSL at 100% CPU, then protocol errors (one instance directly precedes the election for term 26):
2021-04-14T19:25:23Z|5446594|raft|INFO|term 26: 16649 ms timeout expired, starting election
2021-04-14T19:25:23Z|5446595|stream_ssl|WARN|SSL_accept: system error (Success)
2021-04-14T19:25:23Z|5446596|timeval|WARN|Unreasonably long 5391ms poll interval (4844ms user, 464ms system)
2021-04-14T19:25:23Z|5446597|timeval|WARN|faults: 8902 minor, 0 major
2021-04-14T19:25:23Z|5446598|timeval|WARN|disk: 0 reads, 8 writes
2021-04-14T19:25:23Z|5446599|timeval|WARN|context switches: 0 voluntary, 29 involuntary
2021-04-14T19:25:23Z|5446600|poll_loop|INFO|Dropped 409 log messages in last 11 seconds (most recently, 5 seconds ago) due to excessive rate
2021-04-14T19:25:23Z|5446601|poll_loop|INFO|wakeup due to [POLLOUT] on fd 101 (10.0.168.65:9642<->10.0.155.96:57786) at ../lib/stream-ssl.c:798 (100% CPU usage)
2021-04-14T19:25:26Z|5446602|jsonrpc|WARN|ssl:10.0.168.65:52744: receive error: Protocol error
2021-04-14T19:25:26Z|5446603|jsonrpc|WARN|ssl:10.0.168.65:52970: receive error: Protocol error
2021-04-14T19:25:26Z|5446604|jsonrpc|WARN|ssl:10.0.168.65:53044: receive error: Protocol error
2021-04-14T19:25:26Z|5446605|jsonrpc|WARN|ssl:10.0.168.65:53274: receive error: Protocol error
2021-04-14T19:25:26Z|5446606|jsonrpc|WARN|ssl:10.0.168.65:53484: receive error: Protocol error
2021-04-14T19:25:26Z|5446607|raft|DBG|raft.c:1601 -->3a75 vote_request: term=26 last_log_index=174963 last_log_term=25
2021-04-14T19:25:26Z|5446608|raft|DBG|raft.c:1601 -->c77c vote_request: term=26 last_log_index=174963 last_log_term=25
2021-04-14T19:25:26Z|5446609|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
2021-04-14T19:25:26Z|5446610|raft|INFO|rejecting term 25 < current term 26 received in vote_reply message from server c77c

It looks like one node still thinks it is on term 25 while another is on 27:
2021-04-14T19:25:40Z|5446662|raft|INFO|term 27: 16757 ms timeout expired, starting election
2021-04-14T19:25:40Z|5446663|raft|DBG|Dropped 16 log messages in last 15121 seconds (most recently, 15120 seconds ago) due to excessive rate
2021-04-14T19:25:40Z|5446664|raft|DBG|raft_is_connected: false
2021-04-14T19:25:40Z|5446665|raft|DBG|raft.c:1601 -->3a75 vote_request: term=27 last_log_index=174963 last_log_term=25
2021-04-14T19:25:40Z|5446666|raft|DBG|raft.c:1601 -->c77c vote_request: term=27 last_log_index=174963 last_log_term=25
2021-04-14T19:25:40Z|5446667|raft|DBG|raft_is_connected: false
2021-04-14T19:25:40Z|5446668|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
2021-04-14T19:25:40Z|5446669|raft|INFO|rejecting term 25 < current term 27 received in vote_reply message from server c77c
2021-04-14T19:25:40Z|5446670|raft|DBG|raft_is_connected: false
(In reply to Tim Rozet from comment #24)
> Hey Anton, Ilya,
> From my understanding initially the new patch you provided helped, but we
> still see leader elections. I'm not sure how to parse all of the logs. I've
> attached them to the bz. Do you mind taking a look? There are 27 terms and
> some stuff I see that looks suspicious:
> 
> A lot of these replies when sending commands:
> 2021-04-14T19:24:17Z|5438301|raft|DBG|<--3a75 execute_command_reply:
> result=143af1ba-7dcb-41c9-8f3a-22afe48a0368 status="prerequisite check
> failed"

These are completely normal. They just mean that there was a race between two transactions that came from, presumably, different servers. The client will retry.

> 
> SSL 100% then protocol error (one instance directly precedes election
> happening for term 26):
> 2021-04-14T19:25:23Z|5446594|raft|INFO|term 26: 16649 ms timeout expired,
> starting election
> 2021-04-14T19:25:23Z|5446595|stream_ssl|WARN|SSL_accept: system error
> (Success)
> 2021-04-14T19:25:23Z|5446596|timeval|WARN|Unreasonably long 5391ms poll
> interval (4844ms user, 464ms system)
> 2021-04-14T19:25:23Z|5446597|timeval|WARN|faults: 8902 minor, 0 major
> 2021-04-14T19:25:23Z|5446598|timeval|WARN|disk: 0 reads, 8 writes
> 2021-04-14T19:25:23Z|5446599|timeval|WARN|context switches: 0 voluntary, 29
> involuntary
> 2021-04-14T19:25:23Z|5446600|poll_loop|INFO|Dropped 409 log messages in last
> 11 seconds (most recently, 5 seconds ago) due to excessive rate
> 2021-04-14T19:25:23Z|5446601|poll_loop|INFO|wakeup due to [POLLOUT] on fd
> 101 (10.0.168.65:9642<->10.0.155.96:57786) at ../lib/stream-ssl.c:798 (100%
> CPU usage)
> 2021-04-14T19:25:26Z|5446602|jsonrpc|WARN|ssl:10.0.168.65:52744: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446603|jsonrpc|WARN|ssl:10.0.168.65:52970: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446604|jsonrpc|WARN|ssl:10.0.168.65:53044: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446605|jsonrpc|WARN|ssl:10.0.168.65:53274: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446606|jsonrpc|WARN|ssl:10.0.168.65:53484: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446607|raft|DBG|raft.c:1601 -->3a75 vote_request:
> term=26 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:26Z|5446608|raft|DBG|raft.c:1601 -->c77c vote_request:
> term=26 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:26Z|5446609|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
> 2021-04-14T19:25:26Z|5446610|raft|INFO|rejecting term 25 < current term 26
> received in vote_reply message from server c77c

Protocol errors for SSL are normal too. They just mean that the client disconnected without waiting for graceful termination of the SSL handshake. That is how stream-ssl is implemented; it doesn't wait.
> 
> 
> Looks like one node is still thinking its on term 25 and another is on 27:
> 2021-04-14T19:25:40Z|5446662|raft|INFO|term 27: 16757 ms timeout expired,
> starting election
> 2021-04-14T19:25:40Z|5446663|raft|DBG|Dropped 16 log messages in last 15121
> seconds (most recently, 15120 seconds ago) due to excessive rate
> 2021-04-14T19:25:40Z|5446664|raft|DBG|raft_is_connected: false
> 2021-04-14T19:25:40Z|5446665|raft|DBG|raft.c:1601 -->3a75 vote_request:
> term=27 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:40Z|5446666|raft|DBG|raft.c:1601 -->c77c vote_request:
> term=27 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:40Z|5446667|raft|DBG|raft_is_connected: false
> 2021-04-14T19:25:40Z|5446668|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
> 2021-04-14T19:25:40Z|5446669|raft|INFO|rejecting term 25 < current term 27
> received in vote_reply message from server c77c
> 2021-04-14T19:25:40Z|5446670|raft|DBG|raft_is_connected: false

There were a lot of elections, and the term can change several times within a short period if servers are fighting for leadership. So that's fine too.

What is not fine is that we still have poll intervals of up to 16 seconds. Looking at all the 10+ second poll intervals, all of them come with lots of disk writes, like this:

|timeval|WARN|Unreasonably long 16148ms poll interval (14428ms user, 1223ms system)
|09384|timeval|WARN|disk: 0 reads, 672264 writes

Also, looking at the times when this happens, I noticed that such big intervals occur every 10-20 minutes, which suggests that it's database snapshotting during automatic compaction. And that makes sense: if we're writing 300-500 MB of data to the disk on compaction, it can take a lot of time.

I added sending of heartbeats to the snapshot-writing functions. I also added some extra logging to see how much time each part of the snapshotting process actually takes. Plus, I fixed some math in Anton's patch. Here is the resulting build (it includes the fix for ARP responder flows):
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36181450

Could you, please, try it out and provide logs? Then we will see how much time compacting actually takes and whether sending heartbeats in these places is enough.

Diff of the changes:
---
diff --git a/ovsdb/jsonrpc-server.c b/ovsdb/jsonrpc-server.c
index 22bc6f38b3..87be21e97c 100644
--- a/ovsdb/jsonrpc-server.c
+++ b/ovsdb/jsonrpc-server.c
@@ -384,7 +384,7 @@ ovsdb_jsonrpc_server_run(struct ovsdb_jsonrpc_server *svr, uint64_t limit)
     uint64_t elapsed = 0, start_time = 0;
 
     if (limit) {
-        start_time = time_now();
+        start_time = time_msec();
     }
 
     SHASH_FOR_EACH (node, &svr->remotes) {
@@ -410,7 +410,7 @@ ovsdb_jsonrpc_server_run(struct ovsdb_jsonrpc_server *svr, uint64_t limit)
 
         if (limit) {
-            elapsed = start_time - time_now();
+            elapsed = time_msec() - start_time;
             if (elapsed >= limit) {
                 poll_immediate_wake();
                 break;
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index b6cc6b6d55..7f367b4185 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -3954,18 +3954,45 @@ raft_save_snapshot(struct raft *raft,
 {
     struct ovsdb_log *new_log;
     struct ovsdb_error *error;
+    uint64_t start_time, elapsed;
+
     error = ovsdb_log_replace_start(raft->log, &new_log);
     if (error) {
         return error;
     }
 
+    if (raft->role == RAFT_LEADER) {
+        /* Leader is about to write database snapshot to the disk and this
+         * might take significant amount of time.  Sending heartbeats now
+         * to avoid falling out of the cluster. */
+        raft_send_heartbeats(raft);
+    }
+
+    start_time = time_msec();
     error = raft_write_snapshot(raft, new_log, new_start, new_snapshot);
     if (error) {
         ovsdb_log_replace_abort(new_log);
         return error;
     }
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("raft_write_snapshot() took %"PRIu64"ms", elapsed);
+    }
 
-    return ovsdb_log_replace_commit(raft->log, new_log);
+    if (raft->role == RAFT_LEADER) {
+        /* Leader is about to fsync database snapshot to ensure that it is
+         * fully written to the disk and this might take even more time.
+         * Sending heartbeats now to avoid falling out of the cluster. */
+        raft_send_heartbeats(raft);
+    }
+
+    start_time = time_msec();
+    error = ovsdb_log_replace_commit(raft->log, new_log);
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("ovsdb_log_replace_commit() took %"PRIu64"ms", elapsed);
+    }
+    return error;
 }
 
 static bool
@@ -4158,6 +4200,7 @@ raft_store_snapshot(struct raft *raft, const struct json *new_snapshot_data)
         return ovsdb_error(NULL, "not storing a duplicate snapshot");
     }
 
+    uint64_t elapsed, start_time = time_msec();
     uint64_t new_log_start = raft->last_applied + 1;
     struct raft_entry new_snapshot = {
         .term = raft_get_term(raft, new_log_start - 1),
@@ -4182,6 +4225,12 @@ raft_store_snapshot(struct raft *raft, const struct json *new_snapshot_data)
     memmove(&raft->entries[0], &raft->entries[new_log_start - raft->log_start],
             (raft->log_end - new_log_start) * sizeof *raft->entries);
     raft->log_start = new_log_start;
+
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("snapshot storing took %"PRIu64"ms", elapsed);
+    }
+
     return NULL;
 }
---
(In reply to Ilya Maximets from comment #25)
> Here is the resulted build (it includes fix for ARP responder flows):

Sorry, the fix for ARP responder flows is in the OVN package, not the OVS one, so it is obviously not included in this build. An OVN build with the ARP responder flow fixes is available in BZ 1945415.
Thanks Ilya! Joe, can you please try with the new OVS rpm and the fixed OVN and see how scale goes then?

Ilya, is there any way we can make saving the snapshot a background operation, or some way to ensure that we still send RAFT messages while we are snapshotting in another thread? Ideally we want to lower the sbdb election timer to something much less than 16 seconds, so if the disk write were to take over 1 or 2 seconds we would still be in trouble.
(In reply to Tim Rozet from comment #27)
> Ilya, is there anyway we can make saving the snapshot as background
> operation? Or some way we can ensure that we still send raft messages while
> we are snapshotting in a another thread? Ideally we want to lower the sbdb
> timer to something much less than 16 seconds. So if disk write was to take
> over 1 or 2 seconds we would still be in trouble.

Yes, I thought about this, and it looks challenging. The problem is that we really can't process or compose any real RAFT messages while snapshotting, simply because the RAFT log and file transactions are wiped out during this process and we would lose all the updates, not to mention that we would crash trying to access concurrently changing data structures. I guess we could keep sending dummy heartbeats from a separate process.

However, maybe we can take another approach and just transfer the leadership before starting creation of a snapshot? This looks better in that the cluster will still be responsive while we're creating the snapshot, because there will be a functioning leader. Of course, this approach means more frequent leader changes, but that should not be a big problem. Clients will not reconnect, except for northd, which needs to be connected to the leader. The 'term' will grow and some transactions might need to be re-sent, but that's a normal thing.

What do you think? I can craft an RPM for PoC purposes.
I think transferring the leadership before a big disk write is a good idea. We just need some log message indicating that the leadership is changing because of compaction, or something like that. Are there any other potential places where you think we can get stalled and miss sending heartbeats?

Joe, maybe we should wait to try out Ilya's OVSDB patch until we get his next rev with the leadership transfer.
Here is a build with a leadership transfer before snapshotting:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36278228

It includes Anton's patch with the adjusted math. I also kept some debug prints for time measurements around the actual snapshot-creation functions to get an idea of how much time it actually takes, so I would like to take a look at logs from the tests. The patch looks like this:
https://github.com/igsilya/ovs/commit/c04244359151b4ae79b928258e320126ee3e4c38
I built an image that has the above fix, as well as the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1945415, from this PR:
https://github.com/openshift/ovn-kubernetes/pull/508

I then used cluster-bot to build an install image, which can be found here:
quay.io/vpickard/ovnk-images:ci-ln-vmhn66k

The scale team used this image and ran the 100 node test, and they see inactivity probe timeouts:

2021-04-22T03:24:42Z|14968|reconnect|ERR|ssl:10.0.180.149:9644: no response to inactivity probe after 17.1 seconds, disconnecting
2021-04-22T03:24:42Z|14969|reconnect|INFO|ssl:10.0.180.149:9644: connection dropped
2021-04-22T03:24:43Z|14970|reconnect|INFO|ssl:10.0.180.149:9644: connecting...
2021-04-22T03:24:44Z|14971|reconnect|INFO|ssl:10.0.180.149:9644: connection attempt timed out
2021-04-22T03:24:44Z|14972|reconnect|INFO|ssl:10.0.180.149:9644: waiting 2 seconds before reconnect
2021-04-22T03:24:46Z|14973|reconnect|INFO|ssl:10.0.180.149:9644: connecting...

@msheth can you please attach the logs from this run?
100 node cluster-density with 1000 projects
start time: 2021-04-21T19:37
end time: 2021-04-21T20:25

250 node cluster-density with 2000 projects (failed due to image expiry)
start time: 2021-04-22T00:22
end time: failed

The ovn sbdb/nbdb logs for the above tests:
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/100and250nodes/

250 node cluster-density with 2000 projects
A pod scheduling error was encountered (https://bugzilla.redhat.com/show_bug.cgi?id=1946420) with 97 pods; it eventually resolved on its own:
```
error adding container to network "ovn-kubernetes": CNI request failed with status 400: <snip> failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
```
start time: 2021-04-22T17:03
end time: 2021-04-22T19:41

250 node cluster-density with 3000 projects (same port binding error as above; this time with 1105 pods; not resolved)
start time: 2021-04-22T20:44
end time: failed

The ovn sbdb/nbdb logs for the above tests:
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/250nodes/
(In reply to Mohit Sheth from comment #33)
> 100 node cluster-density with 1000 projects
> start time: 2021-04-21T19:37
> end time: 2021-04-21T20:25
> 
> 250 node cluster-density with 2000 projects (failed due to image expiry)
> start time: 2021-04-22T00:22
> end time: failed
> The ovn sbdb/nbdb logs for the above tests
> http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/100and250nodes/
> 
> 250 node cluster-density with 2000 projects
> some pod scheduling error encountered
> (https://bugzilla.redhat.com/show_bug.cgi?id=1946420) with 97 pods that was
> eventually resolved on its own
> ```
> error adding container to network "ovn-kubernetes": CNI request failed with
> status 400: <snip> failed to configure pod interface: error while waiting on
> OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting
> for OVS port binding
> ```
> start time: 2021-04-22T17:03
> end time: 2021-04-22T19:41
> 
> 250 node cluster-density with 3000 projects (same port binding error as
> above; this time with 1105 pods; not resolved)
> start time: 22-04-2021T20:44
> end time: failed
> 
> The ovn sbdb/nbdb logs for the above tests
> http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/250nodes/

I'm pretty sure that the above tests were not performed with the openvswitch packages I built, for two reasons:

1. None of the log messages I added are there.
2. In previous tests with Anton's patch, most poll intervals were < 6 seconds (except for the ones with a big number of disk writes), but in these tests they are all 10-16 seconds at the end of the test.

Could you, please, re-check?
(In reply to Ilya Maximets from comment #34)
> I'm pretty sure that above tests was performed not with openvswitch packages
> I built. 2 reasons:
> 1. There are no logs that I added.
> 2. In previous tests with Anton's patch most of poll intervals was < 6 seconds
>    (except for ones with big number of disk writes), but it these tests they
>    are 10-16 seconds all at the end of the test.
> 
> Could you, please, re-check?

To avoid this kind of issue in later tests, I re-built the package:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/2.bz1943631.2.3.el8fdp/

Now it has a descriptive name, and ovsdb-server will print extra version information on startup. The following line will be present in the ovsdb log with the above package:

ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.90-imaximets-bz1943631-start-3-gc0bb30bc9c02

Please use this package for future tests.
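A quick sketch for double-checking which build a given sbdb member is actually running (the log and socket paths assume the default OpenShift layout used earlier in this bug):

grep 'ovsdb-server (Open vSwitch)' /var/log/ovn/ovsdb-server-sb.log | tail -1
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl version   # should report the same build string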
(In reply to Ilya Maximets from comment #35)
> ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.90-imaximets-bz1943631-start-3-gc0bb30bc9c02

s/2.15.90/2.15.1/
I rebuilt the OCP image with the above OVS rpms, and also the OVN rpms from https://bugzilla.redhat.com/show_bug.cgi?id=1945415, in this PR:
https://github.com/openshift/ovn-kubernetes/pull/508

With this image, I see the following OVS and OVN versions:

[vpickard@rippleRider$][~/go/src/github.com/openshift/ovn-kubernetes-1] (scale-raftLeaderChange-lflow=)$ oc exec -it -n openshift-ovn-kubernetes ovnkube-master-fjvkj sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-fjvkj -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# rpm -aq |grep ovn
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64
sh-4.4# rpm -qa |grep openvswitch
openvswitch2.15-2.15.0-2.bz1943631.2.3.el8fdp.x86_64
openvswitch2.15-ipsec-2.15.0-2.bz1943631.2.3.el8fdp.x86_64
openvswitch2.15-devel-2.15.0-2.bz1943631.2.3.el8fdp.x86_64
python3-openvswitch2.15-2.15.0-2.bz1943631.2.3.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
sh-4.4#

I copied the image here:
http://quay.io/vpickard/ovnk-images:ci-ln-vkrtd0k

@msheth please use the image above for the next scale test iteration. Thanks.
Ilya, the results from your patch look promising. All of the leader changes look to be triggered by snapshotting:
https://gist.github.com/jtaleric/c03c65ad09d1cd74c560abc290937131#gistcomment-3730250

I would like to see if we can now lower the election timer from 16 seconds down to something like 2 and see what happens.

Also, on the latest scale-up run, we can see the SBDB has 1.5 million lflows:

[root@ip-10-0-214-218 ~]# ovn-sbctl --no-leader-only lflow-list | grep -oE ' table=[^,]+' | sort | uniq -c | sort | tail
   8517  table=10(ls_out_port_sec_l2 )
   9032  table=0 (ls_in_port_sec_l2 )
  10819  table=3 (lr_in_ip_input )
  12633  table=23(ls_in_l2_lkup )
  13937  table=2 (ls_in_port_sec_nd )
  13937  table=9 (ls_out_port_sec_ip )
  20648  table=1 (ls_in_port_sec_ip )
 343096  table=4 (lr_in_defrag )
 352098  table=13(ls_in_stateful )
 696471  table=6 (lr_in_dnat )

[root@ip-172-31-20-81 trozet]# kubectl get svc -A | wc -l
2116
[root@ip-172-31-20-81 trozet]# kubectl get pod -A | wc -l
11989
[root@ip-172-31-20-81 trozet]# kubectl get namespace -A | wc -l
747
[root@ip-172-31-20-81 trozet]# kubectl get node -A | wc -l
258

Will attach the dbs.
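For the election timer experiment, a sketch of how the value could be changed at runtime (my understanding is that cluster/change-election-timer has to be issued against the current leader, and, IIRC, increases are limited to at most doubling the current value per call):

ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep 'Election timer'
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
# Going back up to 16000 would then take several steps: 2000 -> 4000 -> 8000 -> 16000.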
Created attachment 1779139 [details] dbs with VIP fix and ovsdb raft fix, 1.5 million lflows
(In reply to Tim Rozet from comment #40)
> Ilya, it the results from your patch look promising. All of the leader
> changes look to be triggered by snapshot:
> https://gist.github.com/jtaleric/c03c65ad09d1cd74c560abc290937131#gistcomment-3730250

Sounds good. Do you have the ovsdb logs?

> I would like to see if we can now lower the election timer from like 16
> seconds down to 2 and see what happens.
> 
> Also on the latest scale up run, we can see the SBDB has 1.5 million lflows:
> [root@ip-10-0-214-218 ~]# ovn-sbctl --no-leader-only lflow-list | grep -oE
> ' table=[^,]+' | sort | uniq -c | sort | tail

Yes, the number of lflows "logically" is around 1.5M, but only because 'ovn-sbctl lflow-list' unwinds datapath groups before printing. The actual number of lflows is:

$ ovsdb-tool query ovnsdb_db.db.standalone \
    '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[]}]' | \
    sed 's/{"_uuid"/\n{"_uuid"/g' | grep -c '_uuid'
124684

So, it's only 124K lflows in the database. Out of them, the flows that are not grouped:

$ ovsdb-tool query ovnsdb_db.db.standalone \
    '["OVN_Southbound",{"op":"select","table":"Logical_Flow", "where":[["logical_dp_group", "==", ["set", []]]]}]' | \
    sed 's/{"_uuid"/\n{"_uuid"/g' | grep -c '_uuid'
118895

The remaining 5789 lflows have datapath groups with ~250-260 datapaths each:
5789 * ~250 + 118895 = ~1.5M

This database is actually much smaller than we had in previous tests. We had a 250MB database in the test here:
https://bugzilla.redhat.com/show_bug.cgi?id=1943631#c13
and now it's only 82MB, so it's almost 3 times smaller. Was it a smaller test? Can we scale higher?
Created attachment 1779399 [details]
sbdb logs with Ilya's ovsdb patch and the VIP fix
Hey Ilya, this is the same test according to Mohit: 2k cluster density on 250 nodes. The cluster was unable to reach the max number for this test due to some other reason which I'm trying to figure out now.
(In reply to Tim Rozet from comment #44)
> Hey Ilya, this is the same test according to Mohit: 2k cluster density on
> 250 nodes. The cluster was unable to reach the max number for this test due
> to some other reason which I'm trying to figure out now.

Raul said at https://bugzilla.redhat.com/show_bug.cgi?id=1943631#c13 that he was able to scale up to 250 + 2K just with the fix for ARP responder flows, but the cluster fell apart at 250 nodes + 3K cluster-density. That's why we started to work on additional fixes (Anton's patch and election transfer), IIUC. Can we test with 250 + 3K too?
Ilya, can you please take a look at the must-gather provided by Joe? There are some election timers expiring there; however, the cluster still seemed operational. Just wondering if anything more can be done to ensure we send the heartbeats.
I have sent a rebased version of the "give raft some priority" patch upstream. I have my scale-up rig working again, so I can test it at scale. It tests out fine in a 450 port scenario.
https://bugzilla.redhat.com/show_bug.cgi?id=1960393 looks like it is never going to make any progress. There have been significant fixes in ovsdb-server that make it less likely to hit the root cause of missing RAFT messages. However, since the root cause is not going to be addressed, I'm going to close this as deferred.