Description of problem:

After scaling an OpenShift deployment up to 179 nodes, one of the new nodes is missing a large number of flows.

Working node:
[root@ip-10-0-169-105 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
72247

Broken node:
[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
28150

ovnkube-node eventually dies on the node because its readiness probes check for flows which are missing:

time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=114"
time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): stdout: \"\""
time="2020-02-28T18:46:34Z" level=debug msg="exec(1202): stderr: \"\""
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): stdout: \"\""
time="2020-02-28T18:46:35Z" level=debug msg="exec(1203): stderr: \"\""

There are a lot of errors in ovn-controller and high CPU usage in the SSL handler. If I kill the ovn-controller pod, the missing flows are programmed when it restarts:

[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | grep table=41 | grep 10\.128
 cookie=0xcadf2977, duration=55.227s, table=41, n_packets=247, n_bytes=39439, idle_age=0, priority=11,ip,metadata=0x1b9,nw_src=10.128.0.0/10 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=169.254.33.2))
[root@ip-10-0-136-224 ~]# ovs-ofctl -O openflow15 dump-flows br-int | wc -l
72257

See attachment for full logs and DBs.
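For context, the readiness probe shown in the logs above boils down to repeatedly dumping br-int and waiting until specific flows appear. A minimal sketch of that kind of poll loop (hypothetical helper names, not the actual ovn-kubernetes code; the `probe` parameter is injected so the logic can be exercised without a live bridge):

```python
import subprocess
import time

def dump_flows(match):
    """Return the non-empty lines of 'ovs-ofctl dump-flows br-int <match>'."""
    out = subprocess.check_output(
        ["ovs-ofctl", "--no-stats", "--no-names", "dump-flows", "br-int", match])
    return [line for line in out.decode().splitlines() if line.strip()]

def wait_gateway_ready(timeout=300.0, interval=1.0, probe=dump_flows):
    """Poll until both checked flows exist; give up after 'timeout' seconds.

    This mirrors the two checks visible in the ovnkube debug logs: the SNAT
    flow in table 41 and the output flow in table 65 (port number assumed).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe("table=41,ip,nw_src=10.128.0.0/10") and probe("table=65,out_port=114"):
            return True
        time.sleep(interval)
    return False
```

On the broken node both dumps return empty output forever, so the probe times out and the pod is killed.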
Created attachment 1666472 [details] logs, dbs
Hi Tim,

Could you give it a try with the scratch build below? It fixes the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1808125 and it would be nice to know if you're hitting the same problem.

Build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=26874857
Upstream patch: https://patchwork.ozlabs.org/patch/1246910/

Thanks,
Dumitru
(In reply to Tim Rozet from comment #0)
> see attachment for full logs, dbs

Could you also please provide the SB DB logs?

Thanks,
Dumitru
After loading the attached SB DB, it seems the SNAT Logical_Flows for the GW router that fails readiness checks are properly created:

$ ovn-sbctl lflow-list GR_ip-10-0-136-224.us-west-2.compute.internal | grep "ip4.src == 10.128.0.0"
  table=1 (lr_out_snat ), priority=11 , match=(ip && ip4.src == 10.128.0.0/10), action=(ct_snat(169.254.33.2);)

From the ovn-kube logs:

time="2020-02-28T18:43:19Z" level=info msg="Waiting for GatewayReady and ManagementPortReady on node ip-10-0-136-224.us-west-2.compute.internal"
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface k8s-ip-10-0-136 ofport"
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(27): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): stdout: \"114\\n\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(28): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=114"
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(29): stderr: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=41,ip,nw_src=10.128.0.0/10"
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): stdout: \"\""
time="2020-02-28T18:43:19Z" level=debug msg="exec(30): stderr: \"\""

This indicates that the problem is most likely in ovn-controller.
Saw this again today while scaling out the cluster. It looks like forcing recompute does not fix this particular issue, but interestingly restarting the ovn-controller pod does. Attaching all logs, dbs, output.
Created attachment 1670963 [details] Full logs and OVN DBs test 2
A temporary fix for now could be: when we see these healthchecks fail in ovn-k8s and the node is not ready, we kill ovn-controller ourselves.
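The restart policy proposed above can be sketched as a simple counter: after N consecutive readiness failures, kill ovn-controller and reset. This is only an illustrative model (the threshold and the function names are assumptions, not actual ovn-kubernetes code); here it is written as a pure function over a sequence of check results so the policy itself is easy to reason about:

```python
def restarts_needed(results, max_failures=3):
    """Count how many times the workaround would restart ovn-controller.

    'results' is a sequence of booleans, one per readiness check. After every
    run of 'max_failures' consecutive failures we would restart and reset the
    failure counter; a successful check also resets the counter.
    """
    failures = 0
    restarts = 0
    for ok in results:
        if ok:
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restarts += 1
                failures = 0
    return restarts
```

In a real agent the loop body would shell out to the readiness check and restart the ovn-controller container instead of counting.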
Created attachment 1671193 [details] sb dbs from all masters
Created attachment 1671194 [details] core of the screwed up ovn-controller
Hi Tim,

Thanks for all the information. It looks like some Southbound DB updates didn't get sent properly (or got lost).

From the OVN Southbound databases (all 3 are in sync), we see that the logical flow ovnkube checks for exists:

ovn-sbctl --uuid lflow-list GR_ip-10-0-180-140.us-west-2.compute.internal | grep 4c2ef062
  uuid=0x4c2ef062, table=1 (lr_out_snat ), priority=15 , match=(ip && ip4.src == 10.128.0.0/14), action=(ct_snat(169.254.33.2);)

From the ovn-controller core dump:

(gdb) bt
#0  0x00007ff352375c2d in parse_dollars () from /lib64/libc.so.6
#1  0x000055bd128b1170 in ?? ()
#2  0x000055bd128a3001 in ?? ()
#3  0x000055bd11f300cc in main (argc=12, argv=0x7ffcc5c6b088) at controller/ovn-controller.c:2197
(gdb) frame 3
#3  0x000055bd11f300cc in main (argc=12, argv=0x7ffcc5c6b088) at controller/ovn-controller.c:2197
2197            poll_block();
(gdb) p *ovnsb_idl_loop.idl->data.tables[14].class_
$51 = {
  name = 0x55bd120adec2 "Logical_Flow",
  is_root = true,
  is_singleton = false,
  columns = 0x55bd12357280 <sbrec_logical_flow_columns>,
  n_columns = 8,
  allocation_size = 296,
  row_init = 0x55bd11f722f0 <sbrec_logical_flow_init__>
}
(gdb) set $hmap=&ovnsb_idl_loop.idl->data.tables[14].rows
(gdb) set $hash=0x4c2ef062
(gdb) p $hmap->buckets[$hash & $hmap->mask]
$52 = (struct hmap_node *) 0x0   <<<< If the flow were in IDL memory, it would be in this bucket.

Looking at the original logs attached to the BZ, from the ovnkube logs, when the node gateway router is provisioned:

[...]
time="2020-02-28T18:43:19Z" level=info msg="Waiting for GatewayReady and ManagementPortReady on node ip-10-0-136-224.us-west-2.compute.internal"
[...]

From the ovn-controller logs (around the time when ovnkube created the node gateway router binding):

[...]
2020-02-28T15:39:41Z|00042|ovsdb_idl|WARN|Dropped 194 log messages in last 61 seconds (most recently, 13 seconds ago) due to excessive rate
2020-02-28T15:39:41Z|00043|ovsdb_idl|WARN|cannot modify missing row 58979090-c72a-41c3-9ea9-5853202ef037 in table Multicast_Group
2020-02-28T16:03:01Z|00044|ovsdb_idl|WARN|Dropped 410 log messages in last 1400 seconds (most recently, 1379 seconds ago) due to excessive rate
2020-02-28T16:03:01Z|00045|ovsdb_idl|WARN|cannot delete missing row 1e128d07-f57d-4ef1-8068-7d26423c00da from table Logical_Flow
[...]

These warnings indicate that ovn-controller received a "modify" update2 message from the SB ovsdb-server for a row that doesn't exist in local IDL memory. The fact that some rows don't exist suggests that previous updates were lost at some point. This might be a raft, ovsdb-server, or ovsdb protocol implementation bug.

In order to confirm and work around the bug, I prepared a private build that triggers an IDL reconnect (jsonrpc reconnect) in ovn-controller if updates are received for rows that don't exist in local memory. Could you please try it out and see if you still hit the issue of the missing openflow entries?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27360523

Thanks,
Dumitru
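The detection logic in the debug build can be sketched as follows: while applying a server update, treat a modify/delete for a row the client has never seen as evidence of lost updates, and flag the cache as inconsistent so the caller forces a reconnect. This is a simplified toy model of one IDL table, not the actual ovsdb-idl implementation in OVS:

```python
class IdlTable:
    """Toy model of one IDL table: applies server deltas, flags inconsistency."""

    def __init__(self):
        self.rows = {}              # uuid -> column dict
        self.inconsistent = False   # set -> caller should force a jsonrpc reconnect

    def apply_update(self, uuid, op, data=None):
        if op == "insert":
            self.rows[uuid] = dict(data)
        elif op == "modify":
            if uuid not in self.rows:
                # corresponds to "cannot modify missing row ... in table ..."
                self.inconsistent = True
                return
            self.rows[uuid].update(data)
        elif op == "delete":
            if self.rows.pop(uuid, None) is None:
                # corresponds to "cannot delete missing row ... from table ..."
                self.inconsistent = True
```

In stock ovn-controller these cases only produced the rate-limited WARN logs shown above; the debug build additionally tears the connection down so the client can resynchronize.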
I still see it after I scale up:

root@ip-172-31-73-134: ~/scale-ci-deploy # kubectl get node | grep -i notready
ip-10-0-173-79.us-west-2.compute.internal    NotReady   worker   20m   v1.17.1
ip-10-0-191-160.us-west-2.compute.internal   NotReady   worker   19m   v1.17.1
ip-10-0-191-229.us-west-2.compute.internal   NotReady   worker   20m   v1.17.1

2020-03-19T15:39:48Z|00369|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:39:56Z|00370|reconnect|INFO|ssl:10.0.152.100:9642: connected
2020-03-19T15:39:56Z|00371|ovsdb_idl|INFO|ssl:10.0.152.100:9642: clustered database server is disconnected from cluster; trying another server
2020-03-19T15:40:04Z|00372|reconnect|INFO|ssl:10.0.143.119:9642: connected
2020-03-19T15:40:04Z|00373|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:40:04Z|00374|main|INFO|User triggered force recompute.
2020-03-19T15:40:15Z|00375|reconnect|INFO|ssl:10.0.164.58:9642: connected
2020-03-19T15:40:16Z|00376|ovsdb_idl|WARN|Dropped 5 log messages in last 52 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-19T15:40:16Z|00377|ovsdb_idl|WARN|cannot modify missing row 90149052-5968-4e35-b5af-3b0e2df88487 in table Multicast_Group
2020-03-19T15:40:16Z|00378|ovsdb_idl|WARN|Dropped 6 log messages in last 60 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-19T15:40:16Z|00379|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-19T15:40:24Z|00380|reconnect|INFO|ssl:10.0.152.100:9642: connected
Created attachment 1671491 [details] ovn controller log from idl reconnect patch
Hi Tim,

Thanks for trying it out and for the logs. It seems that with this change ovn-controller is triggering a reconnect to the Southbound DB, but once reconnected it requests only the changes that happened since the last successful transaction performed by ovn-controller. If the missing updates happen to be in this interval (between the last successful txn and now), it will keep failing and reconnecting.

I built a new rpm where we now request the complete DB in such cases. It would be great if you could give it another try:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27364765

Thanks,
Dumitru
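The change described above amounts to one decision at (re)connect time: resume from the last known transaction ID normally, but fall back to requesting the complete database when the local cache is suspected to be inconsistent. A minimal sketch of that decision (simplified model; in the real protocol the all-zero UUID as last_txn_id makes the server send a full snapshot instead of a delta):

```python
ZERO_UUID = "00000000-0000-0000-0000-000000000000"

def monitor_request(last_txn_id, inconsistent):
    """Choose the monitor_cond_since parameters for a (re)connect.

    Normally resume from the last transaction we saw; if the local cache is
    flagged inconsistent, pass the all-zero txn id so the server must send
    the complete database contents instead of just the changes.
    """
    since = ZERO_UUID if inconsistent else last_txn_id
    return {"method": "monitor_cond_since", "last_txn_id": since}
```

This trades a full (expensive) resync for guaranteed consistency, which is why it was first tried only as a debug workaround.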
Hi Dumitru,
With your latest fix + recompute patch, I did:
1. scale to 176 nodes - works
2. scale to 324 nodes - works
3. scale to 2k nodes - works up to 671 nodes, then will not proceed any further. Many pods crashing. Need to investigate further tomorrow.

In every scenario above I see ovn-controller log the missing flows message many times, followed by an IDL reconnect.

Also, the definition of "works" here is that the cluster operators all eventually go to ready state (not failed) and all of the nodes report ready state. No other pods have been deployed.

Note that when scaling up large numbers of nodes it takes a pretty long time for the DNS pods (scheduled to every node). The nodes all report ready, but it seems like CNI takes a long time to bring up the DNS pods, and I'm usually waiting the longest for the DNS cluster operator to become ready.

I think with these workarounds we can at least deploy a reasonably large cluster. I will probably need to try a few more times to be sure. Then we can try pod scale tests.
(In reply to Tim Rozet from comment #15)
> Hi Dumitru,
> With your latest fix + recompute patch. I did:
> 1. scale to 176 nodes - works
> 2. scale to 324 nodes - works
> 3. scale to 2k nodes - works up to 671 nodes, then will not proceed any
> further. Many pods crashing. Need to investigate further tmrw.
>
> In every scenario above I see many ovn-controller logs the missing flows
> message and then IDL reconnect.
>
> Also definition of "works" here is cluster operators all eventually go to
> ready state and not failed, all of the nodes report ready state. No other
> pods have been deployed.

Thanks for trying it out. OK, so this confirms the suspicion that some SB updates are lost and that the last_txn_id set by ovsdb-servers in update3 messages includes them. This causes ovn-controller to not request again the updates that were lost.

> Note when scaling up large numbers of nodes it takes a pretty long time for
> DNS pods (scheduled to every node). The nodes are all reporting ready, but
> it seems like CNI takes a long time to bring up the DNS pods and im usually
> waiting the longest period of time for the DNS cluster operator to become
> ready.

This might be a completely different issue and we should probably track it with a different BZ. Wdyt?

> I think with these workarounds we can at least deploy a reasonably large
> cluster. I will probably need to try a few more times to be sure. Then we
> can try pod scale tests.

These workarounds are just for debugging for now, to get more information and to pinpoint the real problem. They're not sent for review upstream yet, and I'd like to spend a bit more time digging to find the original reason for the missing updates before adding workarounds like these that might hide the real problem.

Thanks,
Dumitru
Hi Tim,

Just to make sure we don't miss any critical fixes from upstream, could you please run one more test with the following private OVS rpm:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010

This is based on OVS FDN 2.12.0-23 + the most recent ovsdb-server raft fixes from upstream.

Also, let's use the pristine OVN FDN, i.e., without the workarounds I tried in my previous builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1126908

In the meantime I'll keep investigating on my side too.

Thanks,
Dumitru
(In reply to Dumitru Ceara from comment #16)
> (In reply to Tim Rozet from comment #15)
> > Hi Dumitru,
> > With your latest fix + recompute patch. I did:
> > 1. scale to 176 nodes - works
> > 2. scale to 324 nodes - works
> > 3. scale to 2k nodes - works up to 671 nodes, then will not proceed any
> > further. Many pods crashing. Need to investigate further tmrw.
> >
> > In every scenario above I see many ovn-controller logs the missing flows
> > message and then IDL reconnect.
> >
> > Also definition of "works" here is cluster operators all eventually go to
> > ready state and not failed, all of the nodes report ready state. No other
> > pods have been deployed.
>
> Thanks for trying it out. Ok, so this confirms the suspicion that some SB
> updates are lost and that the last_txn_id set by ovsdb-servers in update3
> messages includes those. This causes ovn-controller to not request again the
> updates that were lost.
>
> > Note when scaling up large numbers of nodes it takes a pretty long time for
> > DNS pods (scheduled to every node). The nodes are all reporting ready, but
> > it seems like CNI takes a long time to bring up the DNS pods and im usually
> > waiting the longest period of time for the DNS cluster operator to become
> > ready.
>
> This might be a completely different issue and we should probably track it
> with a different BZ. Wdyt?

Yeah, I think this is probably a different issue. I just wanted to make note of it here so I didn't forget. Once we get past these initial bugs we will need to look at that next.

> > I think with these workarounds we can at least deploy a reasonably large
> > cluster. I will probably need to try a few more times to be sure. Then we
> > can try pod scale tests.
>
> These workarounds are just for debugging for now to get more information and
> to pin point the real problem. They're not sent for review upstream yet and
> I'd like to spend a bit more time digging to find the original reason for
> missing updates before adding workarounds like these that might hide the
> real problem.
>
> Thanks,
> Dumitru
(In reply to Dumitru Ceara from comment #17)
> Hi Tim,
>
> Just to make sure we don't miss any critical fixes from upstream, could you
> please run one more test with the following private OVS rpm:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010
>
> This is based on OVS FDN 2.12.0-23 + most recent ovsdb-server raft fixes
> from upstream.
>
> Also, let's use the pristine OVN FDN, so without the workarounds I tried in
> my previous builds. I.e.:
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1126908

Sorry, the above link for OVN is wrong. The correct links are:

OVN: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1130418
OVS: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27379010

Thanks,
Dumitru
I tried it and I still see the issue:

2020-03-20T22:01:45Z|00106|main|INFO|User triggered force recompute.
2020-03-20T22:02:15Z|00107|main|INFO|User triggered force recompute.
2020-03-20T22:02:19Z|00108|ovsdb_idl|WARN|Dropped 19 log messages in last 42 seconds (most recently, 12 seconds ago) due to excessive rate
2020-03-20T22:02:19Z|00109|ovsdb_idl|WARN|cannot modify missing row 4bf77832-259d-4624-833c-3d4b6f3af03e in table Multicast_Group
2020-03-20T22:02:24Z|00110|ovsdb_idl|INFO|ssl:10.0.129.76:9642: clustered database server is disconnected from cluster; trying another server
2020-03-20T22:02:24Z|00111|main|INFO|OVNSB commit failed, force recompute next time.
2020-03-20T22:02:35Z|00112|reconnect|INFO|ssl:10.0.172.80:9642: connected
2020-03-20T22:02:45Z|00113|main|INFO|User triggered force recompute.
2020-03-20T22:03:15Z|00114|main|INFO|User triggered force recompute.
2020-03-20T22:03:17Z|00115|ovsdb_idl|WARN|Dropped 21 log messages in last 59 seconds (most recently, 13 seconds ago) due to excessive rate
2020-03-20T22:03:17Z|00116|ovsdb_idl|WARN|cannot modify missing row 42e5f4ff-bbe0-49b4-9c34-c57249403a82 in table Multicast_Group
2020-03-20T22:03:45Z|00117|main|INFO|User triggered force recompute.
2020-03-20T22:04:11Z|00118|ovsdb_idl|WARN|Dropped 21 log messages in last 54 seconds (most recently, 2 seconds ago) due to excessive rate
2020-03-20T22:04:11Z|00119|ovsdb_idl|WARN|cannot modify missing row a49c8fce-a1a9-4477-a5c5-cfa885fbe000 in table Multicast_Group
Patch to force ovn-controller to recover from such situations sent upstream: https://patchwork.ozlabs.org/patch/1262361/

However, the above patch still doesn't address the root cause; it only recovers from its effects.
I managed to reproduce the issue on my own setup with plain OVN:
- 3 central nodes running ovn-northd, OVN NB DB, OVN SB DB, databases clustered with RAFT, debug logs enabled for SB DBs.
- try to bring up 200 chassis nodes and provision them one by one with a similar config to what ovn-kube does.

At some point, some of the controllers start missing SB DB updates, e.g.:

2020-04-08T10:56:35.700Z|00059|ovsdb_idl|WARN|cannot modify missing row d0e2eb0e-2223-40da-ba54-de0d963cc393 in table Multicast_Group

The sequence of events is as follows. From the ovn-controller logs on the node that detects the missing update:

2020-04-08T10:54:22.770Z|00052|reconnect|INFO|ssl:192.16.0.2:6642: connected
2020-04-08T10:55:22.726Z|00053|ovsdb_idl|INFO|ssl:192.16.0.2:6642: clustered database server is disconnected from cluster; trying another server
2020-04-08T10:55:22.727Z|00054|main|INFO|OVNSB commit failed, force recompute next time.

The ovn-controller on the node was connected to 192.16.0.2 (node ovn-central-2) but ovn-central-1 is leader. Checking logs on ovn-central-1 (192.16.0.1), this is the update sent to ovn-northd after the Multicast_Group entry is added:

2020-04-08T10:55:29.479Z|3888377|jsonrpc|DBG|ssl:192.16.0.1:55754: send notification, method="update3", params=[["monid","OVN_Southbound"],"a5df72cf-1f74-4dc5-946a-ae285c3f19cb",{"Multicast_Group":{"c12b348d-736a-4975-885b-e35f8f36fcbc":{"insert":{"name":"_MC_flood","tunnel_key":32768,"ports":["uuid","f3a7898c-9c81-4514-bff3-d96b42f9684a"],"datapath":["uuid","220317b7-452e-4de5-b1f6-8d96aa458e81"]}},"d0e2eb0e-2223-40da-ba54-de0d963cc393":{"insert":....

The transaction ID that added the multicast group is a5df72cf-1f74-4dc5-946a-ae285c3f19cb. In the meantime, ovn-controller reconnects to ovn-central-2. From the ovn-controller logs:

2020-04-08T10:56:03.083Z|00055|reconnect|INFO|ssl:192.16.0.2:6642: connected

ovn-controller sends monitor_cond_since to ovn-central-2, and ovn-central-2 replies with:

2020-04-08T10:56:16.399Z|6130548|jsonrpc|DBG|ssl:192.16.0.4:58778: send reply, result=[true,"bbe2f86c-ad92-41ba-b706-9c86909e2fd3",{"Chassis":{"1e8ebd9b-7e49-4210-877c-d59f20d21f5b":{"insert":{"name":"ovn-netlab11-34","hostname":"ovn-netlab11-34","encaps":["uuid","aa7be811-dcb2-4697-bc36-07b4e3255476"....

The reply doesn't contain Multicast_Group d0e2eb0e-2223-40da-ba54-de0d963cc393 even though it was added before transaction bbe2f86c-ad92-41ba-b706-9c86909e2fd3, but that's expected because ovn-controller hadn't yet registered to monitor the datapath that contains the Multicast_Group. Nevertheless, ovn-controller records locally that the last known transaction ID is bbe2f86c-ad92-41ba-b706-9c86909e2fd3, which is *AFTER* transaction a5df72cf-1f74-4dc5-946a-ae285c3f19cb that created the Multicast_Group.

ovn-controller now sends monitor_cond_change and registers for updates to the above datapath. From the logs on ovn-central-2:

2020-04-08T10:56:21.954Z|6196920|jsonrpc|DBG|ssl:192.16.0.4:58778: received request, method="monitor_cond_change", params=[["monid","OVN_Southbound"],["monid","OVN_Southbound"],{"DNS":[{"where":[["datapaths","includes",["uuid","94f4d80b-82d7-4637-9c9b-edc808f16118"]],["datapaths","includes",["uuid","ecf1cbec-a752-40ac-9761-08e9be0dd963"]],["datapaths",....

However, ovn-controller closes the connection to the ovn-central-2 SB DB:

2020-04-08T10:56:21.954Z|00056|ovsdb_idl|INFO|ssl:192.16.0.2:6642: clustered database server is disconnected from cluster; trying another server

The SB DB on ovn-central-2 doesn't detect yet that ovn-controller has disconnected and replies to the monitor_cond_change, but this fails because the connection is already down:

2020-04-08T10:56:24.699Z|6229270|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 256 (5 bytes)
2020-04-08T10:56:24.699Z|6229271|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 257 (1 bytes)
2020-04-08T10:56:24.699Z|6229272|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 256 (5 bytes)
2020-04-08T10:56:24.699Z|6229273|stream_ssl|DBG|server244-->ssl:192.16.0.4:58778 type 257 (1 bytes)
2020-04-08T10:56:24.700Z|6229274|stream_ssl|WARN|SSL_write: system error (Broken pipe)

In the meantime, ovn-northd updates the Multicast_Group, transaction a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc, recorded on ovn-central-1:

2020-04-08T10:56:35.782Z|4201025|jsonrpc|DBG|ssl:192.16.0.1:55976: send notification, method="update3", params=[["monid","OVN_Southbound"],"a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc",{"Multicast_Group":{"b66bb6a0-5362-40eb-816a-069369024fa1":{"insert":{"name":"_MC_flood","tunnel_key":32768,"ports":["set",[["uuid","8d5a48cb-11c2-49a3-bb6b-28a1aa21d43d"],["uuid","cf379ca6-251e-4725-9999-2d8f1c124335"]]],"datapath":["uuid","79550407-032e-487d-818a-70315df6e6de"]}},"59d5d797-c36a-44cf-b5c6-773c96411

ovn-controller reconnects, but this time to ovn-central-1:

2020-04-08T10:56:29.982Z|00058|reconnect|INFO|ssl:192.16.0.1:6642: connected

ovn-controller sends a monitor_cond_since request using the last known transaction ID (bbe2f86c-ad92-41ba-b706-9c86909e2fd3). From the logs on ovn-central-1:

2020-04-08T10:56:29.991Z|4196560|jsonrpc|DBG|ssl:192.16.0.4:36564: received request, method="monitor_cond_since", .... last_txn=bbe2f86c-ad92-41ba-b706-9c86909e2fd3

ovn-central-1 replies with the updates that happened since bbe2f86c-ad92-41ba-b706-9c86909e2fd3, i.e., just the Multicast_Group "modify" operation, because the "insert" happened in an earlier transaction:

2020-04-08T10:56:35.700Z|4199825|jsonrpc|DBG|ssl:192.16.0.4:36564: send notification, method="update3", params=[["monid","OVN_Southbound"],"a57a0f0f-fe1a-42a6-ac9b-c94aee5a67bc",{"Multicast_Group":{"d0e2eb0e-2223-40da-ba54-de0d963cc393":{"modify"....

This issue can only be hit when monitor_cond_change is issued by ovn-controller, so enabling ovn-monitor-all=true should avoid it.

Tim, could you please try with the latest OVN2.12/2.13 FDN builds and with the latest openshift/ovn-kubernetes, which enables ovn-monitor-all=true:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1151819
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1157285

Thanks,
Dumitru
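The race described above can be reproduced in miniature. The key point: the client records a last_txn_id from a reply that did not cover a row (its monitor condition did not match that row's datapath yet), so after extending the condition and reconnecting with monitor_cond_since it only receives the later "modify" delta, never the "insert". This is a toy model of the protocol's snapshot/delta behavior, not the actual ovsdb-server implementation:

```python
class SbServer:
    """Toy SB server: an ordered transaction log, condition-filtered deltas."""

    def __init__(self):
        self.log = []   # list of (txn_id, row_uuid, op, datapath, data)

    def commit(self, txn_id, uuid, op, datapath, data):
        self.log.append((txn_id, uuid, op, datapath, data))

    def monitor_cond_since(self, last_txn_id, monitored_datapaths):
        """Return (latest_txn_id, deltas strictly after last_txn_id that match
        the monitor condition). last_txn_id=None means full history."""
        seen = last_txn_id is None
        deltas = []
        for txn_id, uuid, op, dp, data in self.log:
            if seen and dp in monitored_datapaths:
                deltas.append((uuid, op, data))
            if txn_id == last_txn_id:
                seen = True
        return self.log[-1][0], deltas

srv = SbServer()
srv.commit("txnA", "mcast1", "insert", "dp1", {"ports": 1})   # dp1 not monitored yet
srv.commit("txnB", "chassis1", "insert", "dp0", {"name": "n"})

# Client monitors only dp0 and syncs from scratch: it gets txnB's row but
# records last_txn_id = txnB, which is *after* the unseen insert of mcast1.
last, deltas = srv.monitor_cond_since(None, {"dp0"})

srv.commit("txnC", "mcast1", "modify", "dp1", {"ports": 2})

# Client extends its condition to dp1 and reconnects with last_txn_id=txnB:
# it receives only the "modify" for mcast1 -- the earlier "insert" is lost,
# producing exactly a "cannot modify missing row" situation.
last2, deltas2 = srv.monitor_cond_since(last, {"dp0", "dp1"})
```

With ovn-monitor-all=true the condition covers everything from the start, so the "insert" would have been delivered in the initial sync and the race cannot occur.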
Thanks, Dumitru, for the detailed investigation. I will try with monitor-all once we get 4.5 updated with the latest ovn-k8s.

Another issue here is not just that the transaction order got messed up, but that ovn-controller disconnected in the first place. From my discussion with you about this offline, it is due to the raft timers. I filed both of these to track them as well:
https://bugzilla.redhat.com/show_bug.cgi?id=1822290 - disable inactivity probe for raft
https://bugzilla.redhat.com/show_bug.cgi?id=1822296 - adjust election timer
Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1268457/
Hi Dumitru,
I tried with the latest OpenShift 4.5, which should have all the relevant fixes. I scaled first to 200 nodes, which worked OK and all the cluster operators were fine. I then scaled to 400 nodes, and ovnkube-nodes started failing the readiness check:

I0417 20:19:59.095515   35712 ovs.go:246] exec(1821): /usr/bin/ovs-ofctl --no-stats --no-names dump-flows br-int table=65,out_port=211
I0417 20:19:59.098366   35712 ovs.go:249] exec(1821): stdout: ""
I0417 20:19:59.098383   35712 ovs.go:250] exec(1821): stderr: ""
F0417 20:19:59.098407   35712 ovnkube.go:105] error waiting for node readiness: timed out waiting for the condition

Looking at the flows, there are no flows in table 65:

[root@ip-10-0-139-86 ~]# ovs-ofctl dump-flows br-int table=64
 cookie=0x0, duration=1606.960s, table=64, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,65)
[root@ip-10-0-139-86 ~]# ovs-ofctl dump-flows br-int table=65
[root@ip-10-0-139-86 ~]#

I then tried to force a recompute of the flows via ovn-controller:

2020-04-17T20:18:14Z|00158|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T20:18:40Z|00159|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T20:19:01Z|00160|main|INFO|User triggered force recompute.
2020-04-17T20:20:09Z|00161|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T20:20:09Z|00162|main|INFO|OVNSB commit failed, force recompute next time.
Verified monitor-all was enabled:

[root@ip-10-0-139-86 ~]# ovs-vsctl list open_vswitch
_uuid               : f25e2a18-c2fb-4769-8bea-24a5f0bc46aa
bridges             : [476e1773-be28-4d85-a081-60deb4ac2383, 70ffb876-3b1d-459b-a96f-a52707deddb8]
cur_cfg             : 22
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : none
external_ids        : {hostname=ip-10-0-139-86.us-west-2.compute.internal, ovn-bridge-mappings="physnet:br-local", ovn-encap-ip="10.0.139.86", ovn-encap-type=geneve, ovn-monitor-all="true", ovn-nb="ssl:10.0.136.25:9641,ssl:10.0.156.159:9641,ssl:10.0.162.250:9641", ovn-openflow-probe-interval="180", ovn-remote="ssl:10.0.136.25:9642,ssl:10.0.156.159:9642,ssl:10.0.162.250:9642", ovn-remote-probe-interval="100000", rundir="/var/run/openvswitch", system-id="b3cd14a1-e380-4038-bb59-653e557119c3"}

I'll leave the setup up and we can debug it further on Monday. Will try to collect some relevant logs/dbs and attach them here.
(In reply to Tim Rozet from comment #25)
> Hi Dumitru,
> I tried with latest openshift 4.5, which should have all the relevant fixes.
> I scaled first to 200 nodes, which worked OK and all the cluster operators
> were fine. I then scaled to 400 nodes, and ovnkube-nodes started failing
> readiness check:
> [...]
>
> I'll leave the setup and we can debug it further on Monday. Will try to
> collect some relevant logs/dbs and attach them here.

Hi Tim,

Thanks for trying it out. Just to confirm that we're hitting the same issue even with ovn-monitor-all=true, can you please check if you find the following logs in ovn-controller.log on the nodes failing checks:

"ovsdb_idl|WARN|cannot modify missing row"

Thanks!
I checked a handful of them and do not see that warning. I see:

2020-04-17T21:04:05Z|00204|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:04:32Z|00205|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:04:32Z|00206|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:04:41Z|00207|reconnect|INFO|ssl:10.0.136.25:9642: connected
2020-04-17T21:05:22Z|00208|ovsdb_idl|INFO|ssl:10.0.136.25:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:05:22Z|00209|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:05:36Z|00210|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T21:06:26Z|00211|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:06:26Z|00212|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:06:41Z|00213|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:07:38Z|00214|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:07:38Z|00215|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:07:53Z|00216|reconnect|INFO|ssl:10.0.136.25:9642: connected
2020-04-17T21:08:50Z|00217|ovsdb_idl|INFO|ssl:10.0.136.25:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:08:50Z|00218|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:09:02Z|00219|reconnect|INFO|ssl:10.0.162.250:9642: connected
2020-04-17T21:10:24Z|00220|ovsdb_idl|INFO|ssl:10.0.162.250:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:10:24Z|00221|main|INFO|OVNSB commit failed, force recompute next time.
2020-04-17T21:10:35Z|00222|reconnect|INFO|ssl:10.0.156.159:9642: connected
2020-04-17T21:10:35Z|00223|ovsdb_idl|INFO|ssl:10.0.156.159:9642: clustered database server is disconnected from cluster; trying another server
2020-04-17T21:10:43Z|00224|reconnect|INFO|ssl:10.0.136.25:9642: connected

and from the sbdb:

2020-04-17T21:09:44Z|08341|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:45Z|08342|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:45Z|08343|raft|INFO|rejected append_reply (not leader)
2020-04-17T21:09:53Z|08344|raft|INFO|server e7de is leader for term 957
2020-04-17T21:10:01Z|08345|reconnect|ERR|ssl:10.0.156.159:9644: no response to inactivity probe after 5 seconds, disconnecting
2020-04-17T21:10:01Z|08346|reconnect|INFO|ssl:10.0.156.159:9644: connection dropped
2020-04-17T21:10:02Z|08347|reconnect|ERR|ssl:10.0.156.159:36322: no response to inactivity probe after 5 seconds, disconnecting
2020-04-17T21:10:02Z|08348|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:03Z|08349|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:03Z|08350|reconnect|INFO|ssl:10.0.156.159:9644: waiting 2 seconds before reconnect
2020-04-17T21:10:05Z|08351|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:07Z|08352|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:07Z|08353|reconnect|INFO|ssl:10.0.156.159:9644: waiting 4 seconds before reconnect
2020-04-17T21:10:11Z|08354|reconnect|INFO|ssl:10.0.156.159:9644: connecting...
2020-04-17T21:10:15Z|08355|reconnect|INFO|ssl:10.0.156.159:9644: connection attempt timed out
2020-04-17T21:10:15Z|08356|reconnect|INFO|ssl:10.0.156.159:9644: continuing to reconnect in the background but suppressing further logging

Perhaps we need https://bugzilla.redhat.com/show_bug.cgi?id=1822290 ?
Would it be possible to check ovn-controller.log on all nodes just to make sure that indeed we don't hit the original issue due to missing SB DB updates with monitor_cond_change reply lost during reconnect? That should be fixed by https://patchwork.ozlabs.org/patch/1268457/ and should anyway not be hit with ovn-monitor-all=true. If we don't find any "ovsdb_idl|WARN|cannot modify missing row" logs we're probably hitting a different bug. Regarding the sbdb disconnects, yes, https://bugzilla.redhat.com/show_bug.cgi?id=1822290 might address that.
It isn't possible at this point; the auth on the cluster is no longer working and I can't access the logs via the kube client:

root@ip-172-31-73-134: ~/scale-ci-deploy # kubectl get pod -A | grep auth
openshift-authentication-operator   authentication-operator-65cd7bc7d7-9j8bh   1/1   Terminating   7   2d22h
openshift-authentication-operator   authentication-operator-65cd7bc7d7-9k5hs   0/1   Pending       0   45h
openshift-authentication            oauth-openshift-58b9fc97ff-hhdp8           1/1   Terminating   0   2d22h
openshift-authentication            oauth-openshift-58b9fc97ff-kg2g8           1/1   Terminating   0   2d22h

I did check a bunch of the nodes with issues on Friday and did not see "missing row", and at 200 nodes did not see any nodes with errors.
Hi Tim,

Could you help test with the latest OVN, with ovn_monitor_all=false, and confirm whether it is fixed?

The latest ovn version: http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/2.13.0/37.el7fdp/x86_64/
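For anyone reproducing this outside the scale-test harness: ovn-monitor-all is a per-node Open vSwitch external-id that ovn-controller reads at startup. A minimal sketch of flipping it off and verifying (run on the node under test):

```shell
# With ovn-monitor-all=false, ovn-controller uses conditional monitoring and
# requests only the SB DB rows relevant to this chassis -- the code path this
# bug exercises.
ovs-vsctl set open . external-ids:ovn-monitor-all=false

# Confirm the setting took effect.
ovs-vsctl get open . external-ids:ovn-monitor-all
```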
Tested with the ovn scale test: https://github.com/dceara/ovn-heater.git

Set ovn_monitor_all=false:

[root@hp-dl380pg8-12 ovn-fake-multinode]# git diff .
diff --git a/ovn_cluster.sh b/ovn_cluster.sh
index 59cac33..c629f04 100755
--- a/ovn_cluster.sh
+++ b/ovn_cluster.sh
@@ -271,6 +271,7 @@ ovs-vsctl set open . external-ids:ovn-remote-probe-interval=180000
     if [ "\$ovn_monitor_all" = "yes" ]; then
         ovs-vsctl set open . external-ids:ovn-monitor-all=true
     fi
+    ovs-vsctl set open . external-ids:ovn-monitor-all=false
     ovs-vsctl --if-exists del-br br-ex
     ovs-vsctl add-br br-ex

Changed ovn-sbctl for ovn-scale-test:

[root@hp-dl380pg8-13 ovn-scale-test]# git diff .
diff --git a/rally_ovs/plugins/ovs/ovsclients_impl.py b/rally_ovs/plugins/ovs/ovsclients_impl.py
index b21efcf..96a6fc0 100644
--- a/rally_ovs/plugins/ovs/ovsclients_impl.py
+++ b/rally_ovs/plugins/ovs/ovsclients_impl.py
@@ -338,7 +338,7 @@ class OvnSbctl(OvsClient):
         self.sandbox = None
         self.batch_mode = False
         self.cmds = None
-        self.sbctl_cmd = "ovn-sbctl --no-leader-only"
+        self.sbctl_cmd = "ovn-sbctl --no-leader-only --db=ssl:localhost:6642 -p /opt/ovn/ovn-pri

     def enable_batch_mode(self, value=True):
         self.batch_mode = bool(value)

Reproduced on ovn2.13.0-34:

[root@dell-per740-12 ovn-fake-multinode]# for d in `seq 90 108`
> do
> docker exec ovn-scale-1-$d grep "missing row" /var/log/ovn/ovn-controller.log
> done
2020-07-09T05:08:55.686Z|00844|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:05:52.005Z|00666|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:07:51.376Z|00747|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:07:50.027Z|00622|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:45.381Z|00643|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:04:18.322Z|00626|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:43.663Z|00598|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding
2020-07-09T05:09:45.025Z|00433|ovsdb_idl|WARN|cannot modify missing row ff9ce11e-3797-4c47-869b-5e7bc03b31ac in table Port_Binding

[root@dell-per740-12 ovn-fake-multinode]# docker exec -it ovn-scale-1-108 bash
[root@ovn-scale-1-108 /]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-30.el7fdp.x86_64
ovn2.13-host-2.13.0-34.el7fdp.x86_64
ovn2.13-2.13.0-34.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-2.13.0-34.el7fdp.x86_64

Verified on ovn2.13.0-37:

[root@dell-per740-12 ovn-fake-multinode]# docker exec -it ovn-scale-1-1 bash
[root@ovn-scale-1-1 /]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-30.el7fdp.x86_64
ovn2.13-host-2.13.0-37.el7fdp.x86_64
ovn2.13-2.13.0-37.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-2.13.0-37.el7fdp.x86_64

No "cannot modify missing row" in ovn-controller.log.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2941