Description of problem:

After scaling an OpenShift deployment up to 179 nodes, one of the new nodes is missing a large number of flows.

Working node:
[root@ip-10-0-169-105 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
72247

Broken node:
[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
28150

ovnkube-node eventually dies on the node because its readiness probes check for flows which are missing:

time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=114"
time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): stdout: \"\""
time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): stderr: \"\""
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): stdout: \"\""
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): stderr: \"\""

There are a lot of errors in ovn-controller and high CPU usage in the SSL handler. If I kill the ovn-controller pod, the missing flows are programmed when it restarts:

[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | grep table=41 | grep 10\.128
 cookie=0xcadf2977, duration=55.227s, table=41, n_packets=247, n_bytes=39439, idle_age=0, priority=11,ip,metadata=0x1b9,nw_src=10.128.0.0/10 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=169.254.33.2))
[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
72257

See attachment for full logs and DBs.
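For context, the readiness probe shown in the logs above boils down to repeatedly dumping br-int and waiting until specific flows appear. A minimal sketch of that kind of poll loop (hypothetical helper names, not the actual ovn-kubernetes code; the `probe` parameter is injected so the logic can be exercised without a live bridge):

```python
import subprocess
import time

def dump_flows(match):
    """Return the non-empty lines of 'ovs-ofctl dump-flows br-int <match>'."""
    out = subprocess.check_output(
        ["ovs-ofctl", "--no-stats", "--no-names", "dump-flows", "br-int", match])
    return [line for line in out.decode().splitlines() if line.strip()]

def wait_gateway_ready(timeout=300.0, interval=1.0, probe=dump_flows):
    """Poll until both checked flows exist; give up after 'timeout' seconds.

    This mirrors the two checks visible in the ovnkube debug logs: the SNAT
    flow in table 41 and the output flow in table 65 (port number assumed).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe("table=41,ip,nw_src=10.128.0.0/10") and probe("table=65,out_port=114"):
            return True
        time.sleep(interval)
    return False
```

On the broken node both dumps return empty output forever, so the probe times out and the pod is killed.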
Created attachment 1666472 [details] logs, dbs
Hi Tim,

Could you give it a try with the scratch build below? It fixes the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1808125 and it would be nice to know if you're hitting the same problem.

Build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=26874857
Upstream patch: https://patchwork.ozlabs.org/patch/1246910/

Thanks,
Dumitru
(In reply to Tim Rozet from comment #0)
> see attachment for full logs, dbs

Could you also please provide the SB DB logs?

Thanks,
Dumitru
After loading the attached SB DB, it seems the SNAT Logical_Flows for the GW router that fails readiness checks are properly created:

$ ovn-sbctl lflow-list GR_ip-10-0-136-224.us-west-2.compute.internal | grep "ip4.src == 10.128.0.0"
  table=1 (lr_out_snat ), priority=11 , match=(ip && ip4.src == 10.128.0.0/10), action=(ct_snat(169.254.33.2);)

From the ovn-kube logs:

time="2020-02-28T18:43:19Z" level=info msg="Waiting for GatewayReady and ManagementPortReady on node ip-10-0-136-224.us-west-2.compute.internal"
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface k8s-ip-10-0-136 ofport"
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): stdout: \"114\\n\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=114"
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): stderr: \"\""

This indicates that the problem is most likely in ovn-controller.
Saw this again today while scaling out the cluster. It looks like forcing recompute does not fix this particular issue, but interestingly restarting the ovn-controller pod does. Attaching all logs, dbs, output.
Created attachment 1670963 [details] Full logs and OVN DBs test 2
A temporary fix for now could be: when we see these healthchecks fail in ovn-k8s and the node is not ready, we kill ovn-controller ourselves.
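The restart policy proposed above can be sketched as a simple counter: after N consecutive readiness failures, kill ovn-controller and reset. This is only an illustrative model (the threshold and the function names are assumptions, not actual ovn-kubernetes code); here it is written as a pure function over a sequence of check results so the policy itself is easy to reason about:

```python
def restarts_needed(results, max_failures=3):
    """Count how many times the workaround would restart ovn-controller.

    'results' is a sequence of booleans, one per readiness check. After every
    run of 'max_failures' consecutive failures we would restart and reset the
    failure counter; a successful check also resets the counter.
    """
    failures = 0
    restarts = 0
    for ok in results:
        if ok:
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restarts += 1
                failures = 0
    return restarts
```

In a real agent the loop body would shell out to the readiness check and restart the ovn-controller container instead of counting.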
Created attachment 1671193 [details] sb dbs from all masters
Created attachment 1671194 [details] core of the screwed up ovn-controller
Hi Tim,

Thanks for all the information. It looks like some Southbound DB updates didn't get sent properly (or got lost).

From the OVN Southbound databases (all 3 are in sync), we see that the logical flow ovnkube checks for exists:

ovn-sbctl --uuid lflow-list GR_ip-10-0-180-140.us-west-2.compute.internal | grep 4c2ef062
  uuid=0x4c2ef062, table=1 (lr_out_snat ), priority=15 , match=(ip && ip4.src == 10.128.0.0/14), action=(ct_snat(169.254.33.2);)

From the ovn-controller core dump:

(gdb) bt
#0  0x00007ff352375c2d in parse_dollars () from /lib64/libc.so.6
#1  0x000055bd128b1170 in ?? ()
#2  0x000055bd128a3001 in ?? ()
#3  0x000055bd11f300cc in main (argc=12, argv=0x7ffcc5c6b088) at controller/ovn-controller.c:2197
(gdb) frame 3
#3  0x000055bd11f300cc in main (argc=12, argv=0x7ffcc5c6b088) at controller/ovn-controller.c:2197
2197            poll_block();
(gdb) p *ovnsb_idl_loop.idl->data.tables[14].class_
$51 = {
  name = 0x55bd120adec2 "Logical_Flow",
  is_root = true,
  is_singleton = false,
  columns = 0x55bd12357280 <sbrec_logical_flow_columns>,
  n_columns = 8,
  allocation_size = 296,
  row_init = 0x55bd11f722f0 <sbrec_logical_flow_init__>
}
(gdb) set $hmap=&ovnsb_idl_loop.idl->data.tables[14].rows
(gdb) set $hash=0x4c2ef062
(gdb) p $hmap->buckets[$hash & $hmap->mask]
$52 = (struct hmap_node *) 0x0   <<<< If the flow were in IDL memory, it would be in this bucket.

Looking at the original logs attached to the BZ, from the ovnkube logs, when the node gateway router is provisioned:

[...]
time="2020-02-28T18:43:19Z" level=info msg="Waiting for GatewayReady and ManagementPortReady on node ip-10-0-136-224.us-west-2.compute.internal"
[...]

From the ovn-controller logs (around the time when ovnkube created the node gateway router binding):

[...]
2020-02-28T15:39:41Z|00042|ovsdb_idl|WARN|Dropped 194 log messages in last 61 seconds (most recently, 13 seconds ago) due to excessive rate
2020-02-28T15:39:41Z|00043|ovsdb_idl|WARN|cannot modify missing row 58979090-c72a-41c3-9ea9-5853202ef037 in table Multicast_Group
2020-02-28T16:03:01Z|00044|ovsdb_idl|WARN|Dropped 410 log messages in last 1400 seconds (most recently, 1379 seconds ago) due to excessive rate
2020-02-28T16:03:01Z|00045|ovsdb_idl|WARN|cannot delete missing row 1e128d07-f57d-4ef1-8068-7d26423c00da from table Logical_Flow
[...]

These warnings indicate that ovn-controller received a "modify" update2 message from the SB ovsdb-server for a row that doesn't exist in local IDL memory. The fact that some rows don't exist suggests that previous updates were lost at some point. This might be a raft, ovsdb-server, or ovsdb protocol implementation bug.

In order to confirm and work around the bug, I prepared a private build that triggers an IDL reconnect (jsonrpc reconnect) in ovn-controller if updates are received for rows that don't exist in local memory. Could you please try it out and see if you still hit the issue of the missing openflow entries?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27360523

Thanks,
Dumitru
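The detection logic in the debug build can be sketched as follows: while applying a server update, treat a modify/delete for a row the client has never seen as evidence of lost updates, and flag the cache as inconsistent so the caller forces a reconnect. This is a simplified toy model of one IDL table, not the actual ovsdb-idl implementation in OVS:

```python
class IdlTable:
    """Toy model of one IDL table: applies server deltas, flags inconsistency."""

    def __init__(self):
        self.rows = {}              # uuid -> column dict
        self.inconsistent = False   # set -> caller should force a jsonrpc reconnect

    def apply_update(self, uuid, op, data=None):
        if op == "insert":
            self.rows[uuid] = dict(data)
        elif op == "modify":
            if uuid not in self.rows:
                # corresponds to "cannot modify missing row ... in table ..."
                self.inconsistent = True
                return
            self.rows[uuid].update(data)
        elif op == "delete":
            if self.rows.pop(uuid, None) is None:
                # corresponds to "cannot delete missing row ... from table ..."
                self.inconsistent = True
```

In stock ovn-controller these cases only produced the rate-limited WARN logs shown above; the debug build additionally tears the connection down so the client can resynchronize.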
I still see it after I scale up:

root@ip-172-31-73-134: ~/scale-ci-deploy # kubectl get node | grep -i notready
ip-10-0-173-79.us-west-2.compute.internal    NotReady   worker   20m   v1.17.1
ip-10-0-191-160.us-west-2.compute.internal   NotReady   worker   19m   v1.17.1
ip-10-0-191-229.us-west-2.compute.internal   NotReady   worker   20m   v1.17.1

2020-03-19T15:39:48Z|00369|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:39:56Z|00370|reconnect|INFO|ssl:10.0.152.100:9642: connected
2020-03-19T15:39:56Z|00371|ovsdb_idl|INFO|ssl:10.0.152.100:9642: clustered database server is disconnected from cluster; trying another server
2020-03-19T15:40:04Z|00372|reconnect|INFO|ssl:10.0.143.119:9642: connected
2020-03-19T15:40:04Z|00373|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:40:04Z|00374|main|INFO|User triggered force recompute.
2020-03-19T15:40:15Z|00375|reconnect|INFO|ssl:10.0.164.58:9642: connected
2020-03-19T15:40:16Z|00376|ovsdb_idl|WARN|Dropped 5 log messages in last 52 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-19T15:40:16Z|00377|ovsdb_idl|WARN|cannot modify missing row 90149052-5968-4e35-b5af-3b0e2df88487 in table Multicast_Group
2020-03-19T15:40:16Z|00378|ovsdb_idl|WARN|Dropped 6 log messages in last 60 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-19T15:40:16Z|00379|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:40:24Z|00380|reconnect|INFO|ssl:10.0.152.100:9642: connected
Created attachment 1671491 [details] ovn controller log from idl reconnect patch
Hi Tim,

Thanks for trying it out and for the logs. It seems that with this change ovn-controller is triggering a reconnect to the Southbound DB, but once reconnected it requests only the changes that happened since the last successful transaction performed by ovn-controller. If the missing updates happen to be in this interval (between the last successful txn and now), it will keep failing and reconnecting.

I built a new rpm where we now request the complete DB in such cases. It would be great if you could give it another try:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27364765

Thanks,
Dumitru
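The change described above amounts to one decision at (re)connect time: resume from the last known transaction ID normally, but fall back to requesting the complete database when the local cache is suspected to be inconsistent. A minimal sketch of that decision (simplified model; in the real protocol the all-zero UUID as last_txn_id makes the server send a full snapshot instead of a delta):

```python
ZERO_UUID = "00000000-0000-0000-0000-000000000000"

def monitor_request(last_txn_id, inconsistent):
    """Choose the monitor_cond_since parameters for a (re)connect.

    Normally resume from the last transaction we saw; if the local cache is
    flagged inconsistent, pass the all-zero txn id so the server must send
    the complete database contents instead of just the changes.
    """
    since = ZERO_UUID if inconsistent else last_txn_id
    return {"method": "monitor_cond_since", "last_txn_id": since}
```

This trades a full (expensive) resync for guaranteed consistency, which is why it was first tried only as a debug workaround.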
Hi Dumitru,
With your latest fix + recompute patch, I did:
1. scale to 176 nodes - works
2. scale to 324 nodes - works
3. scale to 2k nodes - works up to 671 nodes, then will not proceed any further. Many pods crashing. Need to investigate further tomorrow.

In every scenario above I see ovn-controller log the missing flows message many times, followed by an IDL reconnect.

Also, the definition of "works" here is that the cluster operators all eventually go to ready state (not failed) and all of the nodes report ready state. No other pods have been deployed.

Note that when scaling up large numbers of nodes it takes a pretty long time for the DNS pods (scheduled to every node). The nodes all report ready, but it seems like CNI takes a long time to bring up the DNS pods, and I'm usually waiting the longest for the DNS cluster operator to become ready.

I think with these workarounds we can at least deploy a reasonably large cluster. I will probably need to try a few more times to be sure. Then we can try pod scale tests.
(In reply to Tim Rozet from comment #15)
> Hi Dumitru,
> With your latest fix + recompute patch. I did:
> 1. scale to 176 nodes - works
> 2. scale to 324 nodes - works
> 3. scale to 2k nodes - works up to 671 nodes, then will not proceed any
> further. Many pods crashing. Need to investigate further tmrw.
>
> In every scenario above I see many ovn-controller logs the missing flows
> message and then IDL reconnect.
>
> Also definition of "works" here is cluster operators all eventually go to
> ready state and not failed, all of the nodes report ready state. No other
> pods have been deployed.

Thanks for trying it out. OK, so this confirms the suspicion that some SB updates are lost and that the last_txn_id set by ovsdb-servers in update3 messages includes them. This causes ovn-controller to not request again the updates that were lost.

> Note when scaling up large numbers of nodes it takes a pretty long time for
> DNS pods (scheduled to every node). The nodes are all reporting ready, but
> it seems like CNI takes a long time to bring up the DNS pods and im usually
> waiting the longest period of time for the DNS cluster operator to become
> ready.

This might be a completely different issue and we should probably track it with a different BZ. Wdyt?

> I think with these workarounds we can at least deploy a reasonably large
> cluster. I will probably need to try a few more times to be sure. Then we
> can try pod scale tests.

These workarounds are just for debugging for now, to get more information and to pinpoint the real problem. They're not sent for review upstream yet, and I'd like to spend a bit more time digging to find the original reason for the missing updates before adding workarounds like these that might hide the real problem.

Thanks,
Dumitru
Hi Tim,

Just to make sure we don't miss any critical fixes from upstream, could you please run one more test with the following private OVS rpm:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010

This is based on OVS FDN 2.12.0-23 + the most recent ovsdb-server raft fixes from upstream.

Also, let's use the pristine OVN FDN, i.e., without the workarounds I tried in my previous builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1126908

In the meantime I'll keep investigating on my side too.

Thanks,
Dumitru
(In reply to Dumitru Ceara from comment #16)
> (In reply to Tim Rozet from comment #15)
> > Hi Dumitru,
> > With your latest fix + recompute patch. I did:
> > 1. scale to 176 nodes - works
> > 2. scale to 324 nodes - works
> > 3. scale to 2k nodes - works up to 671 nodes, then will not proceed any
> > further. Many pods crashing. Need to investigate further tmrw.
> >
> > In every scenario above I see many ovn-controller logs the missing flows
> > message and then IDL reconnect.
> >
> > Also definition of "works" here is cluster operators all eventually go to
> > ready state and not failed, all of the nodes report ready state. No other
> > pods have been deployed.
>
> Thanks for trying it out. Ok, so this confirms the suspicion that some SB
> updates are lost and that the last_txn_id set by ovsdb-servers in update3
> messages includes those. This causes ovn-controller to not request again the
> updates that were lost.
>
> > Note when scaling up large numbers of nodes it takes a pretty long time for
> > DNS pods (scheduled to every node). The nodes are all reporting ready, but
> > it seems like CNI takes a long time to bring up the DNS pods and im usually
> > waiting the longest period of time for the DNS cluster operator to become
> > ready.
>
> This might be a completely different issue and we should probably track it
> with a different BZ. Wdyt?

Yeah, I think this is probably a different issue. I just wanted to make note of it here so I didn't forget. Once we get past these initial bugs we will need to look at that next.

> > I think with these workarounds we can at least deploy a reasonably large
> > cluster. I will probably need to try a few more times to be sure. Then we
> > can try pod scale tests.
>
> These workarounds are just for debugging for now to get more information and
> to pin point the real problem. They're not sent for review upstream yet and
> I'd like to spend a bit more time digging to find the original reason for
> missing updates before adding workarounds like these that might hide the
> real problem.
>
> Thanks,
> Dumitru
(In reply to Dumitru Ceara from comment #17)
> Hi Tim,
>
> Just to make sure we don't miss any critical fixes from upstream, could you
> please run one more test with the following private OVS rpm:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010
>
> This is based on OVS FDN 2.12.0-23 + most recent ovsdb-server raft fixes
> from upstream.
>
> Also, let's use the pristine OVN FDN, so without the workarounds I tried in
> my previous builds. I.e.:
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1126908

Sorry, the above link for OVN is wrong. The correct links are:

OVN: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1130418
OVS: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010

Thanks,
Dumitru
I tried it and I still see the issue:

2020-03-20T22:01:45Z|00106|main|INFO|User triggered force recompute.
2020-03-20T22:02:15Z|00107|main|INFO|User triggered force recompute.
2020-03-20T22:02:19Z|00108|ovsdb_idl|WARN|Dropped 19 log messages in last 42 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-20T22:02:19Z|00109|ovsdb_idl|WARN|cannot modify missing row 4bf77832-259d-4624-833c-3d4b6f3af03e in table Multicast_Group
2020-03-20T22:02:24Z|00110|ovsdb_idl|INFO|ssl:10.0.129.76:9642: clustered database server is disconnected from cluster; trying another server
2020-03-20T22:02:24Z|00111|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-20T22:02:35Z|00112|reconnect|INFO|ssl:10.0.172.80:9642: connected
2020-03-20T22:02:45Z|00113|main|INFO|User triggered force recompute.
2020-03-20T22:03:15Z|00114|main|INFO|User triggered force recompute.
2020-03-20T22:03:17Z|00115|ovsdb_idl|WARN|Dropped 21 log messages in last 59 seconds (most recently, 13 seconds ago) due to excessive rate
2020-03-20T22:03:17Z|00116|ovsdb_idl|WARN|cannot modify missing row 42e5f4ff-bbe0-49b4-9c34-c57249403a82 in table Multicast_Group
2020-03-20T22:03:45Z|00117|main|INFO|User triggered force recompute.
2020-03-20T22:04:11Z|00118|ovsdb_idl|WARN|Dropped 21 log messages in last 54 seconds (most recently, 2 seconds ago) due to excessive rate
2020-03-20T22:04:11Z|00119|ovsdb_idl|WARN|cannot modify missing row a49c8fce-a1a9-4477-a5c5-cfa885fbe000 in table Multicast_Group
Patch to force ovn-controller to recover from such situations sent upstream: https://patchwork.ozlabs.org/patch/1262361/

However, the above patch still doesn't address the root cause; it only recovers from its effects.
I managed to reproduce the issue on my own setup with plain OVN:
- 3 central nodes running ovn-northd, OVN NB DB, OVN SB DB, databases clustered with RAFT, debug logs enabled for SB DBs.
- try to bring up 200 chassis nodes and provision them one by one with a similar config to what ovn-kube does.

At some point, some of the controllers start missing SB DB updates, e.g.:

2020-04-08T10:56:35.700Z|00059|ovsdb_idl|WARN|cannot modify missing row d0e2eb0e-2223-40da-ba54-de0d963cc393 in table Multicast_Group

The sequence of events is as follows. From the ovn-controller logs on the node that detects the missing update:

2020-04-08T10:54:22.770Z|00052|reconnect|INFO|ssl:192.16.0.2:6642: connected
2020-04-08T10:55:22.726Z|00053|ovsdb_idl|INFO|ssl:192.16.0.2:6642: clustered database server is disconnected from cluster; trying another server
2020-04-08T10:55:22.727Z|00054|main|INFO|OVNSB commit failed, force recompute next time.

The ovn-controller on the node was connected to 192.16.0.2 (node ovn-central-2) but ovn-central-1 is leader. Checking logs on ovn-central-1 (192.16.0.1), this is the update sent to ovn-northd after the Multicast_Group entry is added:

2020-04-08T10:55:29.479Z|3888377|jsonrpc|DBG|ssl:192.16.0.1:55754: send notification, method="update3", params=[["monid","OVN_Southbound"],"a5df72cf-1f74-4dc5-946a-ae285c3f19cb",{"Multicast_Group":{"c12b348d-736a-4975-885b-e35f8f36fcbc":{"insert":{"name":"_MC_flood","tunnel_key":32768,"ports":["uuid","f3a7898c-9c81-4514-bff3-d96b42f9684a"],"datapath":["uuid","220317b7-452e-4de5-b1f6-8d96aa458e81"]}},"d0e2eb0e-2223-40da-ba54-de0d963cc393":{"insert":....

The transaction ID that added the multicast group is a5df72cf-1f74-4dc5-946a-ae285c3f19cb. In the meantime, ovn-controller reconnects to ovn-central-2. From the ovn-controller logs:

2020-04-08T10:56:03.083Z|00055|reconnect|INFO|ssl:192.16.0.2:6642: connected

ovn-controller sends monitor_cond_since to ovn-central-2, and ovn-central-2 replies with:

2020-04-08T10:56:16.399Z|6130548|jsonrpc|DBG|ssl:192.16.0.4:58778: send reply, result=[true,"bbe2f86c-ad92-41ba-b706-9c86909e2fd3",{"Chassis":{"1e8ebd9b-7e49-4210-877c-d59f20d21f5b":{"insert":{"name":"ovn-netlab11-34","hostname":"ovn-netlab11-34","encaps":["uuid","aa7be811-dcb2-4697-bc36-07b4e3255476"....

The reply doesn't contain Multicast_Group d0e2eb0e-2223-40da-ba54-de0d963cc393 even though it was added before transaction bbe2f86c-ad92-41ba-b706-9c86909e2fd3, but that's expected because ovn-controller hadn't yet registered to monitor the datapath that contains the Multicast_Group. Nevertheless, ovn-controller records locally that the last known transaction ID is bbe2f86c-ad92-41ba-b706-9c86909e2fd3, which is *AFTER* transaction a5df72cf-1f74-4dc5-946a-ae285c3f19cb that created the Multicast_Group.

ovn-controller now sends monitor_cond_change and registers for updates to the above datapath. From the logs on ovn-central-2:

2020-04-08T10:56:21.954Z|6196920|jsonrpc|DBG|ssl:192.16.0.4:58778: received request, method="monitor_cond_change", params=[["monid","OVN_Southbound"],["monid","OVN_Southbound"],{"DNS":[{"where":[["datapaths","includes",["uuid","94f4d80b-82d7-4637-9c9b-edc808f16118"]],["datapaths","includes",["uuid","ecf1cbec-a752-40ac-9761-08e9be0dd963"]],["datapaths",....

However, ovn-controller closes the connection to the ovn-central-2 SB DB:

2020-04-08T10:56:21.954Z|00056|ovsdb_idl|INFO|ssl:192.16.0.2:6642: clustered database server is disconnected from cluster; trying another server

The SB DB on ovn-central-2 doesn't detect yet that ovn-controller has disconnected and replies to the monitor_cond_change, but this fails because the connection is already down:

2020-04-08T10:56:24.699Z|6229270|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 256 (5 bytes)
2020-04-08T10:56:24.699Z|6229271|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 257 (1 bytes)
2020-04-08T10:56:24.699Z|6229272|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 256 (5 bytes)
2020-04-08T10:56:24.699Z|6229273|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 257 (1 bytes)
2020-04-08T10:56:24.700Z|6229274|stream_ssl|WARN|SSL_write: system error (Broken pipe)

In the meantime, ovn-northd updates the Multicast_Group, transaction a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc, recorded on ovn-central-1:

2020-04-08T10:56:35.782Z|4201025|jsonrpc|DBG|ssl:192.16.0.1:55976: send notification, method="update3", params=[["monid","OVN_Southbound"],"a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc",{"Multicast_Group":{"b66bb6a0-5362-40eb-816a-069369024fa1":{"insert":{"name":"_MC_flood","tunnel_key":32768,"ports":["set",[["uuid","8d5a48cb-11c2-49a3-bb6b-28a1aa21d43d"],["uuid","cf379ca6-251e-4725-9999-2d8f1c124335"]]],"datapath":["uuid","79550407-032e-487d-818a-70315df6e6de"]}},"59d5d797-c36a-44cf-b5c6-773c96411

ovn-controller reconnects, but this time to ovn-central-1:

2020-04-08T10:56:29.982Z|00058|reconnect|INFO|ssl:192.16.0.1:6642: connected

ovn-controller sends a monitor_cond_since request using the last known transaction ID (bbe2f86c-ad92-41ba-b706-9c86909e2fd3). From the logs on ovn-central-1:

2020-04-08T10:56:29.991Z|4196560|jsonrpc|DBG|ssl:192.16.0.4:36564: received request, method="monitor_cond_since", .... last_txn=bbe2f86c-ad92-41ba-b706-9c86909e2fd3

ovn-central-1 replies with the updates that happened since bbe2f86c-ad92-41ba-b706-9c86909e2fd3, i.e., just the Multicast_Group "modify" operation, because the "insert" happened in an earlier transaction:

2020-04-08T10:56:35.700Z|4199825|jsonrpc|DBG|ssl:192.16.0.4:36564: send notification, method="update3", params=[["monid","OVN_Southbound"],"a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc",{"Multicast_Group":{"d0e2eb0e-2223-40da-ba54-de0d963cc393":{"modify"....

This issue can only be hit when monitor_cond_change is issued by ovn-controller, so enabling ovn-monitor-all=true should avoid it.

Tim, could you please try with the latest OVN2.12/2.13 FDN builds and with the latest openshift/ovn-kubernetes, which enables ovn-monitor-all=true:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1151819
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1157285

Thanks,
Dumitru
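The race described above can be reproduced in miniature. The key point: the client records a last_txn_id from a reply that did not cover a row (its monitor condition did not match that row's datapath yet), so after extending the condition and reconnecting with monitor_cond_since it only receives the later "modify" delta, never the "insert". This is a toy model of the protocol's snapshot/delta behavior, not the actual ovsdb-server implementation:

```python
class SbServer:
    """Toy SB server: an ordered transaction log, condition-filtered deltas."""

    def __init__(self):
        self.log = []   # list of (txn_id, row_uuid, op, datapath, data)

    def commit(self, txn_id, uuid, op, datapath, data):
        self.log.append((txn_id, uuid, op, datapath, data))

    def monitor_cond_since(self, last_txn_id, monitored_datapaths):
        """Return (latest_txn_id, deltas strictly after last_txn_id that match
        the monitor condition). last_txn_id=None means full history."""
        seen = last_txn_id is None
        deltas = []
        for txn_id, uuid, op, dp, data in self.log:
            if seen and dp in monitored_datapaths:
                deltas.append((uuid, op, data))
            if txn_id == last_txn_id:
                seen = True
        return self.log[-1][0], deltas

srv = SbServer()
srv.commit("txnA", "mcast1", "insert", "dp1", {"ports": 1})   # dp1 not monitored yet
srv.commit("txnB", "chassis1", "insert", "dp0", {"name": "n"})

# Client monitors only dp0 and syncs from scratch: it gets txnB's row but
# records last_txn_id = txnB, which is *after* the unseen insert of mcast1.
last, deltas = srv.monitor_cond_since(None, {"dp0"})

srv.commit("txnC", "mcast1", "modify", "dp1", {"ports": 2})

# Client extends its condition to dp1 and reconnects with last_txn_id=txnB:
# it receives only the "modify" for mcast1 -- the earlier "insert" is lost,
# producing exactly a "cannot modify missing row" situation.
last2, deltas2 = srv.monitor_cond_since(last, {"dp0", "dp1"})
```

With ovn-monitor-all=true the condition covers everything from the start, so the "insert" would have been delivered in the initial sync and the race cannot occur.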
Thanks, Dumitru, for the detailed investigation. I will try with monitor-all once we get 4.5 updated with the latest ovn-k8s.

Another issue here is not just that the transaction order got messed up, but that ovn-controller disconnected in the first place. From my discussion with you about this offline, it is due to the raft timers. I filed both of these to track them as well:
https://bugzilla.redhat.com/show_bug.cgi?id=1822290 - disable inactivity probe for raft
https://bugzilla.redhat.com/show_bug.cgi?id=1822296 - adjust election timer
Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1268457/
Hi Dumitru,
I tried with the latest OpenShift 4.5, which should have all the relevant fixes. I scaled first to 200 nodes, which worked OK and all the cluster operators were fine. I then scaled to 400 nodes, and ovnkube-nodes started failing the readiness check:

I0417 20:19:59.095515   35712 ovs.go:246] exec(1821): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=211
I0417 20:19:59.098366   35712 ovs.go:249] exec(1821): stdout: ""
I0417 20:19:59.098383   35712 ovs.go:250] exec(1821): stderr: ""
F0417 20:19:59.098407   35712 ovnkube.go:105] error waiting for node readiness: timed out waiting for the condition

Looking at the flows, there are no flows in table 65:

[root@ip-10-0-139-86 ~]# ovs-ofctl dump-flows br-int table=64
 cookie=0x0, duration=1606.960s, table=64, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,65)
[root@ip-10-0-139-86 ~]# ovs-ofctl dump-flows br-int table=65
[root@ip-10-0-139-86 ~]#

I then tried to force a recompute of the flows via ovn-controller:

2020-04-17T20:18:14Z|00158|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T20:18:40Z|00159|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T20:19:01Z|00160|main|INFO|User triggered force recompute.
2020-04-17T20:20:09Z|00161|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T20:20:09Z|00162|main|INFO|OVNSB commit failed, force recompute next time.
Verified monitor-all was enabled:

[root@ip-10-0-139-86 ~]# ovs-vsctl list open_vswitch
_uuid               : f25e2a18-c2fb-4769-8bea-24a5f0bc46aa
bridges             : [476e1773-be28-4d85-a081-60deb4ac2383, 70ffb876-3b1d-459b-a96f-a52707deddb8]
cur_cfg             : 22
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : none
external_ids        : {hostname=ip-10-0-139-86.us-west-2.compute.internal, ovn-bridge-mappings="physnet:br-local", ovn-encap-ip="10.0.139.86", ovn-encap-type=geneve, ovn-monitor-all="true", ovn-nb="ssl:10.0.136.25:9641,ssl:10.0.156.159:9641,ssl:10.0.162.250:9641", ovn-openflow-probe-interval="180", ovn-remote="ssl:10.0.136.25:9642,ssl:10.0.156.159:9642,ssl:10.0.162.250:9642", ovn-remote-probe-interval="100000", rundir="/var/run/openvswitch", system-id="b3cd14a1-e380-4038-bb59-653e557119c3"}

I'll leave the setup up and we can debug it further on Monday. Will try to collect some relevant logs/dbs and attach them here.
(In reply to Tim Rozet from comment #25)
> Hi Dumitru,
> I tried with latest openshift 4.5, which should have all the relevant fixes.
> I scaled first to 200 nodes, which worked OK and all the cluster operators
> were fine. I then scaled to 400 nodes, and ovnkube-nodes started failing
> readiness check:
> [...]
>
> I'll leave the setup and we can debug it further on Monday. Will try to
> collect some relevant logs/dbs and attach them here.

Hi Tim,

Thanks for trying it out. Just to confirm that we're hitting the same issue even with ovn-monitor-all=true, can you please check if you find the following logs in ovn-controller.log on the nodes failing checks:

"ovsdb_idl|WARN|cannot modify missing row"

Thanks!
I checked a handful of them and do not see that warning. I see:

2020-04-17T21:04:05Z|00204|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:04:32Z|00205|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:04:32Z|00206|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:04:41Z|00207|reconnect|INFO|ssl:10.0.136.25:9642: connected
2020-04-17T21:05:22Z|00208|ovsdb_idl|INFO|ssl:10.0.136.25:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:05:22Z|00209|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:05:36Z|00210|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T21:06:26Z|00211|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:06:26Z|00212|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:06:41Z|00213|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:07:38Z|00214|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:07:38Z|00215|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:07:53Z|00216|reconnect|INFO|ssl:10.0.136.25:9642: connected
2020-04-17T21:08:50Z|00217|ovsdb_idl|INFO|ssl:10.0.136.25:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:08:50Z|00218|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:09:02Z|00219|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T21:10:24Z|00220|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:10:24Z|00221|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:10:35Z|00222|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:10:35Z|00223|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:10:43Z|00224|reconnect|INFO|ssl:10.0.136.25:9642: connected

and from the sbdb:

2020-04-17T21:09:44Z|08341|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:45Z|08342|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:45Z|08343|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:53Z|08344|raft|INFO|server e7de is leader for term 957
2020-04-17T21:10:01Z|08345|reconnect|ERR|ssl:10.0.156.159:9644: no response to inactivity probe after 5 seconds, disconnecting
2020-04-17T21:10:01Z|08346|reconnect|INFO|ssl:10.0.156.159:9644: connection dropped
2020-04-17T21:10:02Z|08347|reconnect|ERR|ssl:10.0.156.159:36322: no response to inactivity probe after 5 seconds, disconnecting
2020-04-17T21:10:02Z|08348|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:03Z|08349|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:03Z|08350|reconnect|INFO|ssl:10.0.156.159:9644: waiting 2 seconds before reconnect
2020-04-17T21:10:05Z|08351|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:07Z|08352|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:07Z|08353|reconnect|INFO|ssl:10.0.156.159:9644: waiting 4 seconds before reconnect
2020-04-17T21:10:11Z|08354|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:15Z|08355|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:15Z|08356|reconnect|INFO|ssl:10.0.156.159:9644: continuing to reconnect in the background but suppressing further logging

Perhaps we need https://bugzilla.redhat.com/show_bug.cgi?id=1822290 ?
Would it be possible to check ovn-controller.log on all nodes just to make sure that indeed we don't hit the original issue due to missing SB DB updates with monitor_cond_change reply lost during reconnect? That should be fixed by https://patchwork.ozlabs.org/patch/1268457/ and should anyway not be hit with ovn-monitor-all=true. If we don't find any "ovsdb_idl|WARN|cannot modify missing row" logs we're probably hitting a different bug. Regarding the sbdb disconnects, yes, https://bugzilla.redhat.com/show_bug.cgi?id=1822290 might address that.
It isn't possible at this point; the auth on the cluster is no longer working and I can't access the logs via the kube client:

root@ip-172-31-73-134: ~/scale-ci-deploy # kubectl get pod -A | grep auth
openshift-authentication-operator   authentication-operator-65cd7bc7d7-9j8bh   1/1   Terminating   7   2d22h
openshift-authentication-operator   authentication-operator-65cd7bc7d7-9k5hs   0/1   Pending       0   45h
openshift-authentication            oauth-openshift-58b9fc97ff-hhdp8           1/1   Terminating   0   2d22h
openshift-authentication            oauth-openshift-58b9fc97ff-kg2g8           1/1   Terminating   0   2d22h

I did check a bunch of the nodes with issues on Friday and did not see "missing row", and at 200 nodes did not see any nodes with errors.
Hi Tim,

Could you help test with the latest OVN, with ovn_monitor_all=false, and confirm whether it is fixed?

The latest ovn version: http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/2.13.0/37.el7fdp/x86_64/
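For anyone reproducing this outside the scale-test harness: ovn-monitor-all is a per-node Open vSwitch external-id that ovn-controller reads at startup. A minimal sketch of flipping it off and verifying (run on the node under test):

```shell
# With ovn-monitor-all=false, ovn-controller uses conditional monitoring and
# requests only the SB DB rows relevant to this chassis -- the code path this
# bug exercises.
ovs-vsctl set open . external-ids:ovn-monitor-all=false

# Confirm the setting took effect.
ovs-vsctl get open . external-ids:ovn-monitor-all
```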
Tested with the ovn scale test: https://github.com/dceara/ovn-heater.git

Set ovn_monitor_all=false:

[root@hp-dl380pg8-12 ovn-fake-multinode]# git diff .
diff --git a/ovn_cluster.sh b/ovn_cluster.sh
index 59cac33..c629f04 100755
--- a/ovn_cluster.sh
+++ b/ovn_cluster.sh
@@ -271,6 +271,7 @@ ovs-vsctl set open . external-ids:ovn-remote-probe-interval=180000
     if [ "\$ovn_monitor_all" = "yes" ]; then
         ovs-vsctl set open . external-ids:ovn-monitor-all=true
     fi
+    ovs-vsctl set open . external-ids:ovn-monitor-all=false
     ovs-vsctl --if-exists del-br br-ex
     ovs-vsctl add-br br-ex

Changed ovn-sbctl for ovn-scale-test:

[root@hp-dl380pg8-13 ovn-scale-test]# git diff .
diff --git a/rally_ovs/plugins/ovs/ovsclients_impl.py b/rally_ovs/plugins/ovs/ovsclients_impl.py
index b21efcf..96a6fc0 100644
--- a/rally_ovs/plugins/ovs/ovsclients_impl.py
+++ b/rally_ovs/plugins/ovs/ovsclients_impl.py
@@ -338,7 +338,7 @@ class OvnSbctl(OvsClient):
         self.sandbox = None
         self.batch_mode = False
         self.cmds = None
-        self.sbctl_cmd = "ovn-sbctl --no-leader-only"
+        self.sbctl_cmd = "ovn-sbctl --no-leader-only --db=ssl:localhost:6642 -p /opt/ovn/ovn-pri

     def enable_batch_mode(self, value=True):
         self.batch_mode = bool(value)

Reproduced on ovn2.13.0-34:

[root@dell-per740-12 ovn-fake-multinode]# for d in `seq 90 108`
> do
> docker exec ovn-scale-1-$d grep "missing row" /var/log/ovn/ovn-controller.log
> done
2020-07-09T05:08:55.686Z|00844|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:05:52.005Z|00666|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:07:51.376Z|00747|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:07:50.027Z|00622|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:45.381Z|00643|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:04:18.322Z|00626|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:43.663Z|00598|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:45.025Z|00433|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding

[root@dell-per740-12 ovn-fake-multinode]# docker exec -it ovn-scale-1-108 bash
[root@ovn-scale-1-108 /]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-30.el7fdp.x86_64
ovn2.13-host-2.13.0-34.el7fdp.x86_64
ovn2.13-2.13.0-34.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-2.13.0-34.el7fdp.x86_64

Verified on ovn2.13.0-37:

[root@dell-per740-12 ovn-fake-multinode]# docker exec -it ovn-scale-1-1 bash
[root@ovn-scale-1-1 /]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-30.el7fdp.x86_64
ovn2.13-host-2.13.0-37.el7fdp.x86_64
ovn2.13-2.13.0-37.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-2.13.0-37.el7fdp.x86_64

No "cannot modify missing row" in ovn-controller.log.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2941