Bug 1943631 - Frequent southbound DB leader elections in large scale scenario
Summary: Frequent southbound DB leader elections in large scale scenario
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard: aos-scalability-48, perfscale-ovn
Depends On: 1945415 1960391 1960393 1963948
Blocks:
 
Reported: 2021-03-26 16:57 UTC by Raul Sevilla
Modified: 2022-03-10 13:32 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-22 17:02:08 UTC
Target Upstream Version:
Embargoed:


Attachments
OVN DBs with 100 nodes with 50 cluster-density iterations (6.47 MB, application/gzip)
2021-03-31 13:53 UTC, Raul Sevilla
sbdbs with lflow and election priority fixes (1.63 MB, application/gzip)
2021-04-14 19:59 UTC, Tim Rozet
dbs with VIP fix and ovsdb raft fix, 1.5 million lflows (15.40 MB, application/gzip)
2021-05-03 21:58 UTC, Tim Rozet
sbdb logs ilyas ovsdb patch with VIP fix (199.11 KB, application/gzip)
2021-05-04 14:12 UTC, Tim Rozet

Description Raul Sevilla 2021-03-26 16:57:36 UTC
Hi, I have detected SBDB partitioning after running 2k iterations of our cluster-density test in a 250 node cluster based on 4.8.0-0.nightly-2021-03-26-054333. This OCP version includes OVN 20.12.0-25.el8fdp and openvswitch-2.15.0-2.el8fdp, which includes the OVSDB column-diff patch.

Before triggering the test I manually enabled column-diff in both sbdb and nbdb with:

ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb/file/column-diff-enable
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb/file/column-diff-enable

The test shows improved memory usage in several OVN components; however, at some point in the benchmark, between 1300 and 1500 namespaces, I started to observe sbdb RAFT partitioning and cluster instability, which ended up in failures of the ovn-controller processes running on each node.

Find nbdb, sbdb dumps and logs at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/ovn-logs.tgz

Comment 1 Raul Sevilla 2021-03-29 18:32:53 UTC
Reproduced in a 100 node cluster scenario. Attaching more sbdb logs (now after enabling debug at json, raft, raft_rpc, reconnect and stream_ssl) and an sbdb dump.
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/100nodes/ovn-logs-100nodes-debug.tgz
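
For reference, one way to turn this extra debug logging on and off on a running sbdb is through ovn-appctl vlog/set (a sketch; the socket path and module names are the ones used elsewhere in this bug, adjust as needed):

ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set raft:file:dbg raft_rpc:file:dbg jsonrpc:file:dbg
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set reconnect:file:dbg stream_ssl:file:dbg
# Check the resulting levels, and revert with vlog/set <module>:file:info when done:
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/list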

Comment 2 Dan Williams 2021-03-30 17:38:44 UTC
cc Ilya

Comment 3 Victor Pickard 2021-03-30 19:18:05 UTC
I'm taking a look at the logs today, and one thing I see is many "connection dropped" logs from the SSL module, with a reason of protocol error. Here are a couple of examples from the ovnkube-master-rzlxn-sb.log file:

2021-03-26T14:32:18Z|04927|reconnect|WARN|ssl:10.0.150.200:60938: connection dropped (Protocol error)
2021-03-26T14:32:30Z|04943|reconnect|WARN|ssl:10.0.150.200:32878: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04961|reconnect|WARN|ssl:10.0.193.136:56676: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04963|reconnect|WARN|ssl:10.0.150.200:33200: connection dropped (Protocol error)
2021-03-26T14:32:53Z|04965|reconnect|WARN|ssl:10.0.150.200:33256: connection dropped (Protocol error)


[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rzlxn-sb.log | wc -l
801

There are 801 connection drops in just a little over 4 mins of logs:


1st log timestamp
-----------------
2021-03-26T10:14:17Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log

last log timestamp
------------------
2021-03-26T14:33:39Z|04971|raft|INFO|server 7fbb is leader for term 103


I see similar dropped connection logs in the other *sb.log files:

[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-gcfsg-sb.log | wc -l
752
[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rg4kr-sb.log | wc -l
676
[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ grep "Protocol error" ovnkube-master-rzlxn-sb.log | wc -l
801
[vpickard@rippleRider$][~/bz1943631/ovn-logs]$ 



@i.maximets Do you have any idea on what might be causing these protocol errors?

Comment 4 Ilya Maximets 2021-03-30 19:35:56 UTC
(In reply to Victor Pickard from comment #3)
> There are 801 connection drops in just a little over 4 mins of logs:
> 
> 1st log timestamp
> -----------------
> 2021-03-26T10:14:17Z|00001|vlog|INFO|opened log file
> /var/log/ovn/ovsdb-server-sb.log
> 
> last log timestamp
> ------------------
> 2021-03-26T14:33:39Z|04971|raft|INFO|server 7fbb is leader for term 103

It's 4 hours, actually, not minutes, so it's not that bad.

> i.maximets Do you have any idea on what might be causing these
> protocol errors?

On reconnection, clients try to close the SSL connection gracefully
with SSL_shutdown(), but the success of this function depends on the current
state of the SSL connection.  It will fail if re-negotiation is in progress
or something similar.  However, the stream-ssl library has no way to keep the
connection open once stream_close() is called, so it just closes the connection
regardless of the result of the SSL shutdown.  This will likely result in a
protocol error on the server side.  Should be harmless.

I'm more impressed with the size of all the databases.  The Nb DB fully compacted
takes 20 MB and the Sb DB - 500 MB.  This is likely the main issue here, as
it requires a significant amount of time to process this much data.

@rsevilla, were these databases already this large on the first iteration
of the test, or does each iteration increase the size?  Maybe some resources
are not cleaned up correctly after each iteration?
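
A simple way to answer that (a sketch using the same commands Raul posts in comment 8 below) is to compact both DBs after each iteration and record the on-disk sizes:

ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
ls -lh /etc/openvswitch/ovnnb_db.db /etc/openvswitch/ovnsb_db.db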

Comment 5 Victor Pickard 2021-03-30 19:42:56 UTC
Thanks for the quick update and, you are right, this was 4 hours, not minutes! 

And, thanks for the detailed explanation about the SSL connections, certainly helpful, at least to me.

Comment 6 Ilya Maximets 2021-03-30 19:55:15 UTC
Also, looking at content of Sb DB I don't see Logical_DP_Group records.

So, my second question is:
Are logical datapath groups disabled in this test?  If so, we should,
probably, enable them to reduce the size of Southbound DB.

IIUC, they are enabled now in OpenShift, so should be enabled in the
test, right?

Comment 7 Dan Williams 2021-03-31 01:54:09 UTC
(In reply to Ilya Maximets from comment #6)
> Also, looking at content of Sb DB I don't see Logical_DP_Group records.
> 
> So, my second question is:
> Are logical datapath groups disabled in this test?  If so, we should,
> probably, enable them to reduce the size of Southbound DB.
> 
> IIUC, they are enabled now in OpenShift, so should be enabled in the
> test, right?

They should be... https://github.com/openshift/ovn-kubernetes/pull/440/commits/7e7710240c399d4a55dffd1d4a3ef52df06d5489

I wanted to double-check that the right option was set in NB, but on Fedora even with openvswitch-2.15 I get:

$ ovsdb-tool cluster-to-standalone nb.db ovnkube-master-rzlxn-nb.db
ovsdb-tool: ovsdb error: ovnkube-master-rzlxn-nb.db: unexpected file format

Comment 8 Raul Sevilla 2021-03-31 11:05:08 UTC
Hi, as far as I can tell, datapath groups are enabled by default in latest OpenShift builds.

sh-4.4# ovn-nbctl --no-leader-only list NB_Global
_uuid               : eab75a87-72da-409a-9a0b-8e1dbfd396c4
connections         : [7fa433f1-47c3-461d-b3b5-260ed335644c]
external_ids        : {}
hv_cfg              : 0
hv_cfg_timestamp    : 0
ipsec               : false
name                : ""
nb_cfg              : 0
nb_cfg_timestamp    : 0
options             : {controller_event="true", e2e_timestamp="1617187521", mac_prefix="aa:9a:0e", max_tunid="16711680", northd_internal_version="20.12.0-20.16.1-56.0", northd_probe_interval="5000", svc_monitor_mac="be:a7:74:d3:fc:ab", use_logical_dp_groups="true"}
sb_cfg              : 0
sb_cfg_timestamp    : 0
ssl                 : []

sh-4.4# ovn-sbctl --no-leader-only list SB_Global
_uuid               : 45eeb890-3562-48b4-ad68-9544a416905e
connections         : [0e912e7b-a22c-4f5d-96cb-cdaff13a445b]
external_ids        : {}
ipsec               : false
nb_cfg              : 0
options             : {controller_event="true", e2e_timestamp="1617187521", mac_prefix="aa:9a:0e", max_tunid="16711680", northd_internal_version="20.12.0-20.16.1-56.0", northd_probe_interval="5000", svc_monitor_mac="be:a7:74:d3:fc:ab", use_logical_dp_groups="true"}
ssl                 : []


I have run some tests to check database sizes, all of them after enabling column-diff with:
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb/file/column-diff-enable
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb/file/column-diff-enable


- Steady state with 100 nodes

sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact                # Manually compact databases before checking size
sh-4.4# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
88729
sh-4.4# ls -lh /etc/openvswitch/*
-rw-r-----. 1 root root 1.7M Mar 31 10:41 /etc/openvswitch/ovnnb_db.db
-rw-r-----. 1 root root  21M Mar 31 10:41 /etc/openvswitch/ovnsb_db.db


- After executing 100 cluster-density iterations in the same 100 node cluster

sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact                # Manually compact databases before checking size
sh-4.4# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
297729                                                                              # Number of lflows is ~3.35x higher, which roughly matches the increase in sbdb size
sh-4.4# ls -lh /etc/openvswitch/*
-rw-r-----. 1 root root 2.5M Mar 31 10:38 /etc/openvswitch/ovnnb_db.db
-rw-r-----. 1 root root  66M Mar 31 10:38 /etc/openvswitch/ovnsb_db.db

Comment 10 Raul Sevilla 2021-03-31 13:53:10 UTC
Created attachment 1768091 [details]
OVN DBs with 100 nodes with 50 cluster-density iterations

Attaching smaller OVN DBs to ease analysis

Comment 11 Ilya Maximets 2021-03-31 15:55:36 UTC
Ok.  The status so far:

1. It's OK that DBs are growing because the test only adds more
   and more resources on each iteration.

2. Some attached DBs are corrupted.  At least the one that I looked at
   initially was, so I had no datapath groups there.  Looking at other DBs
   I see that datapath groups are enabled and work as expected.

   (also, some DBs for some reason have DOS-style line breaks that
    trigger the "unexpected file format" error.  Need to dos2unix them.)

3. The content of the Sb DB looks fine, but the size is really big: 500 MB.
   Logical flows contribute most of the size.  There are ~850K of
   them.

4. Datapath groups work fine.  Without them the database would have
   ~1.2M more logical flows, so ~2M in total.

5. There are 840K logical flows that cannot be combined.
   As Raul pointed out, the majority of them are generated in a single
   place in northd: ovn-northd.c:9131.

   These are ARP flows for the logical router generated by the function
   build_lrouter_arp_flow().  The database contains ~655K of such
   flows.  We may try to think about how to reduce their number or
   optimize them somehow.

   Other unique flows are far less frequent:

    655055 ovn-northd.c:9131
     27168 ovn-northd.c:4713
     27168 ovn-northd.c:4676
     14120 ovn-northd.c:5112
     14120 ovn-northd.c:5017
     13905 ovn-northd.c:7609
     13691 ovn-northd.c:10202
     13587 ovn-northd.c:11624
     13584 ovn-northd.c:4635
     13584 ovn-northd.c:4581
     13584 ovn-northd.c:4564
      4721 ovn-northd.c:7198
      4721 ovn-northd.c:7180
      1287 ovn-northd.c:8704
       643 ovn-northd.c:5163
       643 ovn-northd.c:5160
       ...

6. The Sb DB itself spends too much time processing changes.
   There are "unreasonably long poll interval" warnings with poll
   intervals from 5 to 19 seconds.  With a 16-second election timer
   this could easily trigger elections, with no ability to recover if
   every iteration takes that much time.

Comment 12 Tim Rozet 2021-03-31 20:16:05 UTC
Talking with folks, it looks like the 655k lflows are caused by k8s service VIPs * number of nodes. Going to file a bug in OVN on that so we can track it there. The combination of too many lflows and too many clients seems to cause the sbdb to lock up for over 16 seconds. We should try to quantify what this limit actually is with ovsdb, so that we can have a realistic sense of how far we can scale.
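
As a rough first pass at quantifying it from logs we already have (a sketch; the log path is the one used elsewhere in this bug), the stall lengths reported by ovsdb-server can be pulled out and put next to the lflow count:

# Longest ovsdb-server stalls seen so far:
grep -oE 'Unreasonably long [0-9]+ms poll interval' /var/log/ovn/ovsdb-server-sb.log | sort -t' ' -k3 -n | tail
# Logical flow count with datapath groups unwound:
ovn-sbctl --no-leader-only lflow-list | wc -l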

Comment 13 Raul Sevilla 2021-04-07 07:43:52 UTC
With the patch provided at https://bugzilla.redhat.com/show_bug.cgi?id=1945415#c3 I managed to scale to 250 nodes + 2k cluster-density.
However, at higher scale (250 nodes + 3K cluster-density) I still observe sbdb instability due to RAFT partitioning.

Some info:

# RPM version
sh-4.4# rpm -qa | egrep "ovn|openvswitch"
openvswitch2.15-2.15.0-2.el8fdp.x86_64
openvswitch2.15-ipsec-2.15.0-2.el8fdp.x86_64
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
python3-openvswitch2.15-2.15.0-2.el8fdp.x86_64
openvswitch2.15-devel-2.15.0-2.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch

# Cluster status
sh-4.4# ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
a2cc
Name: OVN_Southbound
Cluster ID: c4bb (c4bb92ca-ca4f-4a59-8fcc-b26d4abb8eb7)
Server ID: a2cc (a2cc23f2-040e-41fb-9e86-52faa9475e0b)
Address: ssl:10.0.198.11:9644
Status: cluster member
Role: follower
Term: 88
Leader: 758b
Vote: 758b

Last Election started 27991320 ms ago, reason: timeout
Election timer: 16000
Log: [141856, 141872]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->43bf ->758b <-43bf <-758b
Disconnections: 1
Servers:
    43bf (43bf at ssl:10.0.145.213:9644) last msg 28014771 ms ago
    758b (758b at ssl:10.0.161.133:9644) last msg 2914 ms ago
    a2cc (a2cc at ssl:10.0.198.11:9644) (self)


# Total lflows
sh-4.4# ovn-sbctl --no-leader-only lflow-list  | wc -l
8791502
# Flows without logical datapath group
sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' |  wc -l
352570

# Database sizes
sh-4.4# ls -lh /etc/openvswitch/
total 303M
-rw-r-----. 1 root root  20M Apr  7 07:27 ovnnb_db.db
-rw-r-----. 1 root root 251M Apr  7 07:28 ovnsb_db.db

I have uploaded sbdb logs and databases to:
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/ovn-sbdb-04-07.tgz

Comment 14 Raul Sevilla 2021-04-07 08:13:25 UTC
More interesting info from the test above:

# Breakdown of not grouped lflows
sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | grep -oE "ovn-northd.c:[0-9]*"  | sort | uniq -c | sort| tail 
  24302 ovn-northd.c:11574
  24302 ovn-northd.c:4570
  24302 ovn-northd.c:4587
  24302 ovn-northd.c:4641
  24551 ovn-northd.c:10294
  25049 ovn-northd.c:7615
  25547 ovn-northd.c:5023
  25547 ovn-northd.c:5118
  48604 ovn-northd.c:4682
  48604 ovn-northd.c:4719

sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | grep -oE '"table_id":[0-9]*' | sort | uniq -c | sort| tail 
   2330 "table_id":6
  10821 "table_id":3
  25431 "table_id":14
  26003 "table_id":17
  28262 "table_id":0
  28777 "table_id":10
  29290 "table_id":23
  48806 "table_id":9
  48807 "table_id":2
  99929 "table_id":1


# Breakdown of lflows
sh-4.4# ovn-sbctl --no-leader-only lflow-list  | grep -oE '  table=[0-9]*' | sort | uniq -c  | sort  | tail
  26987   table=14
  30075   table=10
  30330   table=23
  31620   table=0
  51661   table=2
  72988   table=9
 103573   table=1
2090955   table=4
2096872   table=13
4190905   table=6

Comment 15 Anton Ivanov 2021-04-07 13:15:42 UTC
One option may be to limit the time ovsdb spends processing actual database requests to a fraction of the election timer, comparable to the ping timeout (election_timer / 3).

It is difficult to enforce a hard limit, so as a first step I have tried to make this a "best effort".

The first version of patch can be fetched from here: https://github.com/kot-begemot-uk/ovs/commit/056417310763e6a55468bd954ac5c47e45205390

If we see some improvement I can try adding some bells and whistles (ensure that all sessions get the same priority, etc).

Comment 16 Raul Sevilla 2021-04-07 15:10:07 UTC
Thanks Anton, may I have a build including this patch, using openvswitch2.15-2.15.0-2 as the code base?

Comment 17 Raul Sevilla 2021-04-09 08:06:51 UTC
Hi, after restarting the components with the patched OVS, the sbdb does not come up. I see traces like the ones below:

2021-04-08T23:53:03Z|00013|reconnect|INFO|ssl:10.0.145.213:9644: connecting...
2021-04-08T23:53:03Z|00014|reconnect|INFO|ssl:10.0.145.213:9644: connected
2021-04-08T23:53:11Z|00015|ovsdb_server|INFO|memory trimming after compaction enabled.
2021-04-08T23:53:15Z|00016|raft|INFO|term 89: 16016 ms timeout expired, starting election
2021-04-08T23:53:15Z|00017|raft|INFO|server a2cc is leader for term 90
2021-04-08T23:53:15Z|00018|raft|INFO|rejecting append_request because previous entry 89,211629 not in local log (mismatch past end of log)
2021-04-08T23:53:16Z|00019|jsonrpc|WARN|unix#2: send error: Broken pipe
2021-04-08T23:53:16Z|00020|reconnect|WARN|unix#2: connection dropped (Broken pipe)
2021-04-08T23:53:16Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2021-04-08T23:53:16Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
Waiting for OVN_Southbound to come up.

Comment 18 Anton Ivanov 2021-04-09 08:47:06 UTC
A few questions. 

1. How many clients do you have at that point?
2. What is the size of the database?

If the database is very big, I would have expected the clients to drop a few times until they settle due to not having enough CPU time. Raft not getting enough time is weird though. It should be getting extra opportunities to run with the patch.

Once I have some data on the actual scale you are trying to run it at, I will try to reproduce this on my rig.

Comment 19 Raul Sevilla 2021-04-09 08:50:50 UTC
(In reply to Anton Ivanov from comment #18)
> A few questions. 
> 
> 1. How many clients do you have at that point?
> 2. What is the size of the database?
> 
> If the database is very big, I would have expected the clients to drop a few
> times until they settle due to not having enough CPU time. Raft not getting
> enough time is weird though. It should be getting extra opportunities to run
> with the patch.
> 
> Once I have some data on the actual scale you are trying to run it at, I
> will try to reproduce this on my rig.

This OVN deployment is working on top of a small OpenShift cluster, with only 35 nodes at the moment. The databases are also quite small (the cluster is almost empty)

sh-4.4# ls -lh /etc/openvswitch/
total 17M
-rw-r-----. 1 root root 5.6M Apr  9 08:49 ovnnb_db.db
-rw-r-----. 1 root root 8.3M Apr  9 08:49 ovnsb_db.db

Comment 20 Anton Ivanov 2021-04-09 09:19:04 UTC
If you can observe it at that scale, then I should be able to do that too - I can spin up something on that scale on my rig.

I will update the bug once I am clear what's going on. We should not be seeing this after these changes.

Comment 21 Raul Sevilla 2021-04-09 12:09:01 UTC
Quick update. After restarting the OVN control plane pods, things have started working properly:

sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --ovsdb-disable-file-column-diff=yes '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
2021-04-09T12:05:56Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2021-04-09T12:05:57Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.1
2021-04-09T12:05:57Z|00003|reconnect|INFO|ssl:10.0.161.133:9644: connecting...
2021-04-09T12:05:57Z|00004|reconnect|INFO|ssl:10.0.198.11:9644: connecting...
2021-04-09T12:05:57Z|00005|reconnect|INFO|ssl:10.0.161.133:9644: connected
2021-04-09T12:05:57Z|00006|reconnect|INFO|ssl:10.0.198.11:9644: connected
2021-04-09T12:05:57Z|00007|raft|INFO|ssl:10.0.161.133:60200: learned server ID 758b
2021-04-09T12:05:57Z|00008|raft|INFO|ssl:10.0.161.133:60200: learned remote address ssl:10.0.161.133:9644
2021-04-09T12:05:58Z|00009|raft|INFO|ssl:10.0.198.11:56254: learned server ID a2cc
2021-04-09T12:05:58Z|00010|raft|INFO|ssl:10.0.198.11:56254: learned remote address ssl:10.0.198.11:9644
2021-04-09T12:06:07Z|00011|memory|INFO|241176 kB peak resident set size after 10.8 seconds
2021-04-09T12:06:07Z|00012|memory|INFO|cells:402248 monitors:1 raft-connections:4 raft-log:2352 sessions:1
2021-04-09T12:06:12Z|00013|raft|WARN|ignoring vote request received after only 15069 ms (minimum election time is 16000 ms)
2021-04-09T12:06:13Z|00014|raft|INFO|term 92: 16700 ms timeout expired, starting election
2021-04-09T12:06:13Z|00015|raft|INFO|term 92: elected leader by 2+ of 3 servers
2021-04-09T12:06:13Z|00016|raft|INFO|rejected append_reply (not leader)
2021-04-09T12:06:13Z|00017|raft|INFO|rejecting term 92 < current term 93 received in append_reply message from server a2cc
2021-04-09T12:06:19Z|00018|jsonrpc|WARN|unix#13: send error: Broken pipe
2021-04-09T12:06:19Z|00019|reconnect|WARN|unix#13: connection dropped (Broken pipe)
2021-04-09T12:06:19Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2021-04-09T12:06:19Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
Waiting for OVN_Southbound to come up.
2021-04-09T12:06:28Z|00020|raft|WARN|ignoring vote request received after only 14707 ms (minimum election time is 16000 ms)

Will scale up now and trigger some performance tests.

Comment 23 Tim Rozet 2021-04-14 19:59:03 UTC
Created attachment 1771975 [details]
sbdbs with lflow and election priority fixes

Comment 24 Tim Rozet 2021-04-14 20:05:05 UTC
Hey Anton, Ilya,
From my understanding initially the new patch you provided helped, but we still see leader elections. I'm not sure how to parse all of the logs. I've attached them to the bz. Do you mind taking a look? There are 27 terms and some stuff I see that looks suspicious:

A lot of these replies when sending commands:
2021-04-14T19:24:17Z|5438301|raft|DBG|<--3a75 execute_command_reply: result=143af1ba-7dcb-41c9-8f3a-22afe48a0368 status="prerequisite check failed"

SSL 100% then protocol error (one instance directly precedes election happening for term 26):
2021-04-14T19:25:23Z|5446594|raft|INFO|term 26: 16649 ms timeout expired, starting election
2021-04-14T19:25:23Z|5446595|stream_ssl|WARN|SSL_accept: system error (Success)
2021-04-14T19:25:23Z|5446596|timeval|WARN|Unreasonably long 5391ms poll interval (4844ms user, 464ms system)
2021-04-14T19:25:23Z|5446597|timeval|WARN|faults: 8902 minor, 0 major
2021-04-14T19:25:23Z|5446598|timeval|WARN|disk: 0 reads, 8 writes
2021-04-14T19:25:23Z|5446599|timeval|WARN|context switches: 0 voluntary, 29 involuntary
2021-04-14T19:25:23Z|5446600|poll_loop|INFO|Dropped 409 log messages in last 11 seconds (most recently, 5 seconds ago) due to excessive rate
2021-04-14T19:25:23Z|5446601|poll_loop|INFO|wakeup due to [POLLOUT] on fd 101 (10.0.168.65:9642<->10.0.155.96:57786) at ../lib/stream-ssl.c:798 (100% CPU usage)
2021-04-14T19:25:26Z|5446602|jsonrpc|WARN|ssl:10.0.168.65:52744: receive error: Protocol error
2021-04-14T19:25:26Z|5446603|jsonrpc|WARN|ssl:10.0.168.65:52970: receive error: Protocol error
2021-04-14T19:25:26Z|5446604|jsonrpc|WARN|ssl:10.0.168.65:53044: receive error: Protocol error
2021-04-14T19:25:26Z|5446605|jsonrpc|WARN|ssl:10.0.168.65:53274: receive error: Protocol error
2021-04-14T19:25:26Z|5446606|jsonrpc|WARN|ssl:10.0.168.65:53484: receive error: Protocol error
2021-04-14T19:25:26Z|5446607|raft|DBG|raft.c:1601 -->3a75 vote_request: term=26 last_log_index=174963 last_log_term=25
2021-04-14T19:25:26Z|5446608|raft|DBG|raft.c:1601 -->c77c vote_request: term=26 last_log_index=174963 last_log_term=25
2021-04-14T19:25:26Z|5446609|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
2021-04-14T19:25:26Z|5446610|raft|INFO|rejecting term 25 < current term 26 received in vote_reply message from server c77c


Looks like one node is still thinking its on term 25 and another is on 27:
2021-04-14T19:25:40Z|5446662|raft|INFO|term 27: 16757 ms timeout expired, starting election
2021-04-14T19:25:40Z|5446663|raft|DBG|Dropped 16 log messages in last 15121 seconds (most recently, 15120 seconds ago) due to excessive rate
2021-04-14T19:25:40Z|5446664|raft|DBG|raft_is_connected: false
2021-04-14T19:25:40Z|5446665|raft|DBG|raft.c:1601 -->3a75 vote_request: term=27 last_log_index=174963 last_log_term=25
2021-04-14T19:25:40Z|5446666|raft|DBG|raft.c:1601 -->c77c vote_request: term=27 last_log_index=174963 last_log_term=25
2021-04-14T19:25:40Z|5446667|raft|DBG|raft_is_connected: false
2021-04-14T19:25:40Z|5446668|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
2021-04-14T19:25:40Z|5446669|raft|INFO|rejecting term 25 < current term 27 received in vote_reply message from server c77c
2021-04-14T19:25:40Z|5446670|raft|DBG|raft_is_connected: false

Comment 25 Ilya Maximets 2021-04-15 23:25:01 UTC
(In reply to Tim Rozet from comment #24)
> Hey Anton, Ilya,
> From my understanding initially the new patch you provided helped, but we
> still see leader elections. I'm not sure how to parse all of the logs. I've
> attached them to the bz. Do you mind taking a look? There are 27 terms and
> some stuff I see that looks suspicious:
> 
> A lot of these replies when sending commands:
> 2021-04-14T19:24:17Z|5438301|raft|DBG|<--3a75 execute_command_reply:
> result=143af1ba-7dcb-41c9-8f3a-22afe48a0368 status="prerequisite check
> failed"

These are completely normal.  This just means that there was a race between
2 transactions that came from, presumably, different servers.  Client
will retry.

> 
> SSL 100% then protocol error (one instance directly precedes election
> happening for term 26):
> 2021-04-14T19:25:23Z|5446594|raft|INFO|term 26: 16649 ms timeout expired,
> starting election
> 2021-04-14T19:25:23Z|5446595|stream_ssl|WARN|SSL_accept: system error
> (Success)
> 2021-04-14T19:25:23Z|5446596|timeval|WARN|Unreasonably long 5391ms poll
> interval (4844ms user, 464ms system)
> 2021-04-14T19:25:23Z|5446597|timeval|WARN|faults: 8902 minor, 0 major
> 2021-04-14T19:25:23Z|5446598|timeval|WARN|disk: 0 reads, 8 writes
> 2021-04-14T19:25:23Z|5446599|timeval|WARN|context switches: 0 voluntary, 29
> involuntary
> 2021-04-14T19:25:23Z|5446600|poll_loop|INFO|Dropped 409 log messages in last
> 11 seconds (most recently, 5 seconds ago) due to excessive rate
> 2021-04-14T19:25:23Z|5446601|poll_loop|INFO|wakeup due to [POLLOUT] on fd
> 101 (10.0.168.65:9642<->10.0.155.96:57786) at ../lib/stream-ssl.c:798 (100%
> CPU usage)
> 2021-04-14T19:25:26Z|5446602|jsonrpc|WARN|ssl:10.0.168.65:52744: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446603|jsonrpc|WARN|ssl:10.0.168.65:52970: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446604|jsonrpc|WARN|ssl:10.0.168.65:53044: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446605|jsonrpc|WARN|ssl:10.0.168.65:53274: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446606|jsonrpc|WARN|ssl:10.0.168.65:53484: receive
> error: Protocol error
> 2021-04-14T19:25:26Z|5446607|raft|DBG|raft.c:1601 -->3a75 vote_request:
> term=26 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:26Z|5446608|raft|DBG|raft.c:1601 -->c77c vote_request:
> term=26 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:26Z|5446609|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
> 2021-04-14T19:25:26Z|5446610|raft|INFO|rejecting term 25 < current term 26
> received in vote_reply message from server c77c

Protocol errors for SSL are normal too.  They just mean that the client disconnected
without waiting for graceful termination of the SSL handshake.  That is how stream-ssl
is implemented.  It doesn't wait.

> 
> 
> Looks like one node is still thinking its on term 25 and another is on 27:
> 2021-04-14T19:25:40Z|5446662|raft|INFO|term 27: 16757 ms timeout expired,
> starting election
> 2021-04-14T19:25:40Z|5446663|raft|DBG|Dropped 16 log messages in last 15121
> seconds (most recently, 15120 seconds ago) due to excessive rate
> 2021-04-14T19:25:40Z|5446664|raft|DBG|raft_is_connected: false
> 2021-04-14T19:25:40Z|5446665|raft|DBG|raft.c:1601 -->3a75 vote_request:
> term=27 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:40Z|5446666|raft|DBG|raft.c:1601 -->c77c vote_request:
> term=27 last_log_index=174963 last_log_term=25
> 2021-04-14T19:25:40Z|5446667|raft|DBG|raft_is_connected: false
> 2021-04-14T19:25:40Z|5446668|raft|DBG|<--c77c vote_reply: term=25 vote=c77c
> 2021-04-14T19:25:40Z|5446669|raft|INFO|rejecting term 25 < current term 27
> received in vote_reply message from server c77c
> 2021-04-14T19:25:40Z|5446670|raft|DBG|raft_is_connected: false

There were a lot of elections, and term changes can happen several times
within a short period of time if servers are fighting for leadership.
So, it's fine too.

What is not fine is that we still have poll intervals up to 16 seconds.

Looking at all the 10+ second poll intervals, all of them have lots
of disk writes like this:

|timeval|WARN|Unreasonably long 16148ms poll interval (14428ms user, 1223ms system)
|09384|timeval|WARN|disk: 0 reads, 672264 writes

Also, looking at the times when this happens, I noticed that such big intervals
happen every 10-20 minutes, which suggests that it's database
snapshotting during automatic compaction.  And I can understand that:
if we're writing 300-500 MB of data to the disk on compaction, this might
take a lot of time.

I added sending of heartbeats to the snapshot writing functions.  I also
added some extra logging to see how much time each part of the snapshotting
process actually takes.  Plus, I fixed some math in Anton's patch.
Here is the resulting build (it includes the fix for ARP responder flows):

  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36181450

Could you please try it out and provide logs?  Then we will see how much time
compacting actually takes and whether sending heartbeats in these places is
enough.

Diff of the changes:

---
diff --git a/ovsdb/jsonrpc-server.c b/ovsdb/jsonrpc-server.c
index 22bc6f38b3..87be21e97c 100644
--- a/ovsdb/jsonrpc-server.c
+++ b/ovsdb/jsonrpc-server.c
@@ -384,7 +384,7 @@ ovsdb_jsonrpc_server_run(struct ovsdb_jsonrpc_server *svr, uint64_t limit)
     uint64_t elapsed = 0, start_time = 0;
 
     if (limit) {
-        start_time = time_now();
+        start_time = time_msec();
     }
 
     SHASH_FOR_EACH (node, &svr->remotes) {
@@ -410,7 +410,7 @@ ovsdb_jsonrpc_server_run(struct ovsdb_jsonrpc_server *svr, uint64_t limit)
 
 
         if (limit) {
-            elapsed = start_time - time_now();
+            elapsed = time_msec() - start_time;
             if (elapsed >= limit) {
                 poll_immediate_wake();
                 break;
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index b6cc6b6d55..7f367b4185 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -3954,18 +3954,45 @@ raft_save_snapshot(struct raft *raft,
 {
     struct ovsdb_log *new_log;
     struct ovsdb_error *error;
+    uint64_t start_time, elapsed;
+
     error = ovsdb_log_replace_start(raft->log, &new_log);
     if (error) {
         return error;
     }
 
+    if (raft->role == RAFT_LEADER) {
+        /* Leader is about to write database snapshot to the disk and this
+         * might take significant amount of time.  Sending heartbeats now
+         * to avoid falling out of the cluster.  */
+        raft_send_heartbeats(raft);
+    }
+
+    start_time = time_msec();
     error = raft_write_snapshot(raft, new_log, new_start, new_snapshot);
     if (error) {
         ovsdb_log_replace_abort(new_log);
         return error;
     }
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("raft_write_snapshot() took %"PRIu64"ms", elapsed);
+    }
 
-    return ovsdb_log_replace_commit(raft->log, new_log);
+    if (raft->role == RAFT_LEADER) {
+        /* Leader is about to fsync database snapshot to ensure that it is
+         * fully written to the disk and this might take even more time.
+         * Sending heartbeats now to avoid falling out of the cluster.  */
+        raft_send_heartbeats(raft);
+    }
+
+    start_time = time_msec();
+    error = ovsdb_log_replace_commit(raft->log, new_log);
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("ovsdb_log_replace_commit() took %"PRIu64"ms", elapsed);
+    }
+    return error;
 }
 
 static bool
@@ -4158,6 +4200,7 @@ raft_store_snapshot(struct raft *raft, const struct json *new_snapshot_data)
         return ovsdb_error(NULL, "not storing a duplicate snapshot");
     }
 
+    uint64_t elapsed, start_time = time_msec();
     uint64_t new_log_start = raft->last_applied + 1;
     struct raft_entry new_snapshot = {
         .term = raft_get_term(raft, new_log_start - 1),
@@ -4182,6 +4225,12 @@ raft_store_snapshot(struct raft *raft, const struct json *new_snapshot_data)
     memmove(&raft->entries[0], &raft->entries[new_log_start - raft->log_start],
             (raft->log_end - new_log_start) * sizeof *raft->entries);
     raft->log_start = new_log_start;
+
+    elapsed = time_msec() - start_time;
+    if (elapsed > 1000) {
+        VLOG_INFO("snapshot storing took %"PRIu64"ms", elapsed);
+    }
+
     return NULL;
 }
 
---

Comment 26 Ilya Maximets 2021-04-15 23:30:03 UTC
(In reply to Ilya Maximets from comment #25)
> Here is the resulting build (it includes the fix for ARP responder flows):

Sorry, the fix for ARP responder flows is in the OVN package, not the OVS one, so it's,
obviously, not included in this build.

An OVN build with the fixes for ARP responder flows is available in BZ 1945415.

Comment 27 Tim Rozet 2021-04-16 16:19:44 UTC
Thanks Ilya! Joe, can you please try with the new OVS rpm and the fixed OVN and see how scale goes then?

Ilya, is there any way we can make saving the snapshot a background operation? Or some way we can ensure that we still send raft messages while we are snapshotting in another thread? Ideally we want to lower the sbdb election timer to something much less than 16 seconds, so if a disk write were to take over 1 or 2 seconds we would still be in trouble.

Comment 28 Ilya Maximets 2021-04-16 16:45:33 UTC
(In reply to Tim Rozet from comment #27)
> Ilya, is there any way we can make saving the snapshot a background
> operation? Or some way we can ensure that we still send raft messages while
> we are snapshotting in another thread? Ideally we want to lower the sbdb
> election timer to something much less than 16 seconds, so if a disk write were
> to take over 1 or 2 seconds we would still be in trouble.

Yes, I thought about this and it looks challenging.  The problem is that we
really can't process or compose any real raft messages while snapshotting,
just because the raft log and file transactions will be wiped out during this
process and we will lose all the updates, not to mention that we would crash
trying to access concurrently changing data structures.  I guess we could keep
sending dummy heartbeats from a separate process somewhere.  However, maybe
we can take another approach and just transfer the leadership before starting
creation of a snapshot?  It looks better from the point of view that the cluster
will still be responsive while we're creating a snapshot, because there will be
a functioning leader.

Of course, this approach will mean more frequent leader changes, but that
should not be a big problem.  Clients will not re-connect, except for northd,
which needs to be connected to the leader.  The 'term' will grow and some number
of transactions might need to be re-sent, but that's a normal thing.

What do you think?

I can craft an RPM for PoC purposes.

Comment 29 Tim Rozet 2021-04-16 21:29:19 UTC
I think it is a good idea to transfer the leadership before trying to do a big disk write. We just need some log message to indicate that this is why leadership is changing, due to compaction or something like that. Are there any other potential places where you think we can get stalled and miss sending heartbeats?

Maybe, Joe, we should wait to try out Ilya's OVSDB patch until we get his next rev with the leadership transfer.
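
Once that log message exists, compaction-driven leadership changes should be easy to spot in the sbdb logs; a sketch (the exact message wording is an assumption until the patch lands):

# Leadership transfers (wording to be confirmed against the final patch):
grep -iE 'transferring leadership' /var/log/ovn/ovsdb-server-sb.log
# Elections that were triggered by timeouts instead:
grep -c 'timeout expired, starting election' /var/log/ovn/ovsdb-server-sb.log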

Comment 31 Ilya Maximets 2021-04-19 20:24:53 UTC
Here is a build with a leadership transfer before snapshotting:

   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36278228

It includes Anton's patch with adjusted math.  I also kept some debug
prints for time measurements around the actual snapshot creation functions
to get an idea of how much time it actually takes, so I would like to take
a look at the logs from the tests.

Patch looks like this:

   https://github.com/igsilya/ovs/commit/c04244359151b4ae79b928258e320126ee3e4c38

Comment 32 Victor Pickard 2021-04-22 21:27:59 UTC
I built an image that has the above fix, as well as the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1945415 in this PR:
https://github.com/openshift/ovn-kubernetes/pull/508 

I then used cluster-bot to build an install image, which can be found here:
quay.io/vpickard/ovnk-images:ci-ln-vmhn66k

The scale team used this image and ran the 100 node test, and saw inactivity probe timeouts:

2021-04-22T03:24:42Z|14968|reconnect|ERR|ssl:10.0.180.149:9644: no response to inactivity probe after 17.1 seconds, disconnecting
2021-04-22T03:24:42Z|14969|reconnect|INFO|ssl:10.0.180.149:9644: connection dropped
2021-04-22T03:24:43Z|14970|reconnect|INFO|ssl:10.0.180.149:9644: connecting...
2021-04-22T03:24:44Z|14971|reconnect|INFO|ssl:10.0.180.149:9644: connection attempt timed out
2021-04-22T03:24:44Z|14972|reconnect|INFO|ssl:10.0.180.149:9644: waiting 2 seconds before reconnect
2021-04-22T03:24:46Z|14973|reconnect|INFO|ssl:10.0.180.149:9644: connecting...

@msheth can you please attach the logs from this run?

Comment 33 Mohit Sheth 2021-04-23 00:45:00 UTC
100 node cluster-density with 1000 projects
start time: 2021-04-21T19:37
end time: 2021-04-21T20:25

250 node cluster-density with 2000 projects (failed due to image expiry)
start time: 2021-04-22T00:22
end time: failed
The ovn sbdb/nbdb logs for the above tests http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/100and250nodes/


250 node cluster-density with 2000 projects 
   some pod scheduling error encountered (https://bugzilla.redhat.com/show_bug.cgi?id=1946420) with 97 pods that was eventually resolved on its own
```
error adding container to network "ovn-kubernetes": CNI request failed with status 400: <snip> failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
```
start time: 2021-04-22T17:03
end time: 2021-04-22T19:41

250 node cluster-density with 3000 projects (same port binding error as above; this time with 1105 pods; not resolved)
start time: 22-04-2021T20:44
end time: failed

The ovn sbdb/nbdb logs for the above tests http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-logs/250nodes/

Comment 34 Ilya Maximets 2021-04-23 12:29:37 UTC
(In reply to Mohit Sheth from comment #33)
> 100 node cluster-density with 1000 projects
> start time: 2021-04-21T19:37
> end time: 2021-04-21T20:25
> 
> 250 node cluster-density with 2000 projects (failed due to image expiry)
> start time: 2021-04-22T00:22
> end time: failed
> The ovn sbdb/nbdb logs for the above tests
> http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-
> logs/100and250nodes/
> 
> 
> 250 node cluster-density with 2000 projects 
>    some pod scheduling error encountered
> (https://bugzilla.redhat.com/show_bug.cgi?id=1946420) with 97 pods that was
> eventually resolved on its own
> ```
> error adding container to network "ovn-kubernetes": CNI request failed with
> status 400: <snip> failed to configure pod interface: error while waiting on
> OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting
> for OVS port binding
> ```
> start time: 2021-04-22T17:03
> end time: 2021-04-22T19:41
> 
> 250 node cluster-density with 3000 projects (same port binding error as
> above; this time with 1105 pods; not resolved)
> start time: 22-04-2021T20:44
> end time: failed
> 
> The ovn sbdb/nbdb logs for the above tests
> http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/ovn-scale/bz1943631/ovn-
> logs/250nodes/

I'm pretty sure that the above tests were not performed with the openvswitch
packages I built, for 2 reasons:
1. There are no logs that I added.
2. In previous tests with Anton's patch most poll intervals were < 6 seconds
   (except for the ones with a big number of disk writes), but in these tests
   they are 10-16 seconds, all at the end of the test.
Could you, please, re-check?

Comment 35 Ilya Maximets 2021-04-23 13:04:10 UTC
(In reply to Ilya Maximets from comment #34)
> I'm pretty sure that the above tests were not performed with the openvswitch
> packages I built, for 2 reasons:
> 1. There are no logs that I added.
> 2. In previous tests with Anton's patch most poll intervals were < 6 seconds
>    (except for the ones with a big number of disk writes), but in these tests
>    they are 10-16 seconds, all at the end of the test.
> 
> Could you, please, re-check?

To avoid this kind of issue in later tests I re-built the package:

  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/2.bz1943631.2.3.el8fdp/

Now it has a descriptive name and ovsdb-server will print extra version information
on startup.  The following line will be present in the ovsdb log with the above package:

  ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.90-imaximets-bz1943631-start-3-gc0bb30bc9c02

Please use this package for future tests.
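
A quick way to confirm which build actually ran in a given test (using the version line format quoted above):

grep 'ovsdb-server (Open vSwitch)' /var/log/ovn/ovsdb-server-sb.log | head -1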

Comment 36 Ilya Maximets 2021-04-23 13:06:31 UTC
(In reply to Ilya Maximets from comment #35)
> 
>   ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.15.90-imaximets-bz1943631-start-3-gc0bb30bc9c02

s/2.15.90/2.15.1/

Comment 37 Victor Pickard 2021-04-26 18:45:47 UTC
I rebuilt the OCP image with the above ovs rpms, and also, ovn rpms from https://bugzilla.redhat.com/show_bug.cgi?id=1945415 in this PR:
https://github.com/openshift/ovn-kubernetes/pull/508 

With this image, I see the following OVS and OVN versions:


[vpickard@rippleRider$][~/go/src/github.com/openshift/ovn-kubernetes-1] (scale-raftLeaderChange-lflow=)$ oc exec -it -n openshift-ovn-kubernetes ovnkube-master-fjvkj sh 
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead. 
Defaulting container name to northd. 
Use 'oc describe pod/ovnkube-master-fjvkj -n openshift-ovn-kubernetes' to see all of the containers in this pod. 
sh-4.4# rpm -aq |grep ovn 
ovn2.13-20.12.0-99.el8fdp.x86_64 
ovn2.13-host-20.12.0-99.el8fdp.x86_64 
ovn2.13-central-20.12.0-99.el8fdp.x86_64 
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64 
sh-4.4# rpm -qa |grep openvswitch 
openvswitch2.15-2.15.0-2.bz1943631.2.3.el8fdp.x86_64 
openvswitch2.15-ipsec-2.15.0-2.bz1943631.2.3.el8fdp.x86_64 
openvswitch2.15-devel-2.15.0-2.bz1943631.2.3.el8fdp.x86_64 
python3-openvswitch2.15-2.15.0-2.bz1943631.2.3.el8fdp.x86_64 
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch 
sh-4.4#

I copied the image to here:
http://quay.io/vpickard/ovnk-images:ci-ln-vkrtd0k 


@msheth Please use the image above for the next scale test iteration. Thanks

Comment 40 Tim Rozet 2021-05-03 21:56:26 UTC
Ilya, the results from your patch look promising. All of the leader changes look to be triggered by snapshotting:
https://gist.github.com/jtaleric/c03c65ad09d1cd74c560abc290937131#gistcomment-3730250

I would like to see if we can now lower the election timer from like 16 seconds down to 2 and see what happens.
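
For reference, the election timer is inspected and changed through the cluster/ appctl commands on the leader (a sketch; whether ovsdb-server allows dropping from 16000 ms straight to 2000 ms in one call, rather than in gradual steps, is an assumption to verify):

# Current timer, as reported by each member:
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep 'Election timer'
# Request a new value in milliseconds on the leader:
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000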

Also on the latest scale up run, we can see the SBDB has 1.5 million lflows:
[root@ip-10-0-214-218 ~]# ovn-sbctl --no-leader-only lflow-list  | grep -oE '  table=[^,]+' | sort | uniq -c  | sort  | tail
   8517   table=10(ls_out_port_sec_l2 )
   9032   table=0 (ls_in_port_sec_l2  )
  10819   table=3 (lr_in_ip_input     )
  12633   table=23(ls_in_l2_lkup      )
  13937   table=2 (ls_in_port_sec_nd  )
  13937   table=9 (ls_out_port_sec_ip )
  20648   table=1 (ls_in_port_sec_ip  )
 343096   table=4 (lr_in_defrag       )
 352098   table=13(ls_in_stateful     )
 696471   table=6 (lr_in_dnat         )

[root@ip-172-31-20-81 trozet]# kubectl get svc -A | wc -l
2116

[root@ip-172-31-20-81 trozet]# kubectl get pod -A | wc -l
11989

[root@ip-172-31-20-81 trozet]# kubectl get namespace -A | wc -l
747

[root@ip-172-31-20-81 trozet]# kubectl get node -A | wc -l
258

Will attach dbs.

Comment 41 Tim Rozet 2021-05-03 21:58:20 UTC
Created attachment 1779139 [details]
dbs with VIP fix and ovsdb raft fix, 1.5 million lflows

Comment 42 Ilya Maximets 2021-05-04 09:46:07 UTC
(In reply to Tim Rozet from comment #40)
> Ilya, the results from your patch look promising. All of the leader
> changes look to be triggered by snapshotting:
> https://gist.github.com/jtaleric/
> c03c65ad09d1cd74c560abc290937131#gistcomment-3730250

Sounds good.  Do you have ovsdb logs?

> 
> I would like to see if we can now lower the election timer from like 16
> seconds down to 2 and see what happens.
> 
> Also on the latest scale up run, we can see the SBDB has 1.5 million lflows:
> [root@ip-10-0-214-218 ~]# ovn-sbctl --no-leader-only lflow-list  | grep -oE
> '  table=[^,]+' | sort | uniq -c  | sort  | tail

Yes, the number of lflows "logically" is around 1.5M, but that is only because
'ovn-sbctl lflow-list' unwinds datapath groups before printing.  The actual number
of lflows is:

  $ ovsdb-tool query ovnsdb_db.db.standalone \
      '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[]}]' | \
      sed 's/{"_uuid"/\n{"_uuid"/g' | grep -c '_uuid'
  124684

So, there are only 124K lflows in the database.  Out of them, the flows that are
not grouped:

  $ ovsdb-tool query ovnsdb_db.db.standalone \
      '["OVN_Southbound",{"op":"select","table":"Logical_Flow",
                          "where":[["logical_dp_group", "==", ["set", []]]]}]' | \
      sed 's/{"_uuid"/\n{"_uuid"/g' | grep -c '_uuid'
  118895

The remaining 5789 lflows have datapath groups with ~250-260 datapaths each.
5789 * ~250 + 118895 = ~1.5M

This database is actually much smaller than the ones we had in previous tests.  We had
a 250 MB database in the test here: https://bugzilla.redhat.com/show_bug.cgi?id=1943631#c13 ,
and now it's only 82 MB.  So, it's almost 3 times smaller.  Was it a smaller test?
Can we scale higher?

Comment 43 Tim Rozet 2021-05-04 14:12:08 UTC
Created attachment 1779399 [details]
sbdb logs ilyas ovsdb patch with VIP fix

Comment 44 Tim Rozet 2021-05-04 14:20:28 UTC
Hey Ilya, this is the same test according to Mohit: 2k cluster density on 250 nodes. The cluster was unable to reach the max number for this test due to some other reason which I'm trying to figure out now.

Comment 45 Ilya Maximets 2021-05-04 14:36:52 UTC
(In reply to Tim Rozet from comment #44)
> Hey Ilya, this is the same test according to Mohit: 2k cluster density on
> 250 nodes. The cluster was unable to reach the max number for this test due
> to some other reason which I'm trying to figure out now.

Raul said at https://bugzilla.redhat.com/show_bug.cgi?id=1943631#c13 that he
was able to scale up to 250 + 2K just with the fix for ARP responder flows,
but the cluster fell apart at 250 nodes + 3K cluster-density.  That's why we
started working on additional fixes (Anton's patch and leadership transfer), IIUC.

Can we test with 250 + 3K too?

Comment 47 Tim Rozet 2021-05-13 18:53:34 UTC
Ilya, can you please take a look at the must-gather provided by Joe? There are some elections triggering there; however, the cluster still seemed operational. Just wondering if anything more can be done to ensure we send the heartbeats.
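
One way to pull that out of the must-gather sbdb logs (a sketch; the file names inside the must-gather will differ):

# When and how often elections fired, and who won each term:
grep -E 'timeout expired, starting election|is leader for term' ovsdb-server-sb.log
grep -c 'timeout expired, starting election' ovsdb-server-sb.log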

Comment 48 Anton Ivanov 2021-06-09 07:26:36 UTC
I have sent a rebased version of the "give raft some priority" patch upstream.

I have my scale-up rig working again, so I can test it at scale. It tests out fine in a 450-port scenario.

Comment 54 Tim Rozet 2022-02-22 17:02:08 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1960393 looks like it is never going to make any progress. There have been significant fixes in ovsdb-server that make it less likely to hit the root cause of missing raft messages. However, since the root cause is not going to be addressed, I am going to close this as deferred.

