Description of problem:

Deletion and re-creation of pods at scale (2000) across multiple namespaces gets stuck in ContainerCreating state, and events show a CNI request timeout with the errors below. This appears to happen because ports are not properly removed during pod deletion, leaving stale data in the database.

Warning ErrorAddingLogicalPort 39m controlplane error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 8b534f6b-23a8-4984-863f-7b391b301067, was inserted by this transaction. Second row, with UUID 25107606-b83c-420b-a903-ae7f6278ecaf, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

Warning FailedCreatePodSandBox 4m22s (x139 over 56m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-6-served-job-30_f5-served-ns-30_0825a91a-27ab-4a66-9712-5dfceff04331_0(d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07): error adding pod f5-served-ns-30_pod-served-1-6-served-job-30 to CNI network "multus-cni-network": [f5-served-ns-30/pod-served-1-6-served-job-30:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] [f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows

Version-Release number of selected component (if applicable): 4.7.24

How reproducible: always
Steps to Reproduce:
1. Deploy a healthy cluster with 110+ worker nodes
2. Create 2000+ pods across 35 namespaces
3. Delete and re-create (after several hours) the pods

Actual results: Deleted the pods and waited several hours to make sure everything was cleaned up; recreating the pods across multiple namespaces failed with a CNI request timeout error.

Expected results: Pods should be deleted and the DB cleaned without any stale ports. Recreation of pods should work as normal with the expected pod-ready latency.

Additional info:
openvswitch2.13-2.13.0-79.el8fdp.x86_64
ovn2.13-20.12.0-24.el8fdp.x86_64
This is a nasty bug. There are multiple issues:

1) During creation, we kept seeing the error below over and over, and pod creation got stuck (NOTE: we couldn't even spawn a must-gather pod). ~360 pods were stuck:

Warning ErrorAddingLogicalPort 39m controlplane error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 8b534f6b-23a8-4984-863f-7b391b301067, was inserted by this transaction. Second row, with UUID 25107606-b83c-420b-a903-ae7f6278ecaf, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

When looking into the nbdb, we can see an LSP already existing for that pod:

25107606-b83c-420b-a903-ae7f6278ecaf (f5-served-ns-30_pod-served-1-6-served-job-30)

So I tried to delete the LSP from the nbdb, and we see the pod's creation being tried again:

Warning FailedCreatePodSandBox 4m22s (x139 over 56m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-6-served-job-30_f5-served-ns-30_0825a91a-27ab-4a66-9712-5dfceff04331_0(d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07): error adding pod f5-served-ns-30_pod-served-1-6-served-job-30 to CNI network "multus-cni-network": [f5-served-ns-30/pod-served-1-6-served-job-30:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] [f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '

But almost as soon as it times out waiting for flows, we again see that error:

Warning ErrorAddingLogicalPort 4m9s (x35 over 38m) controlplane (combined from similar events): error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. Second row, with UUID 35b3d95a-41ae-42ce-b365-55fccd521b95, was inserted by this transaction. in committing transaction

Note that 46564154-61b2-4aed-8bf9-6cac17fcdc18 was the port that got newly created when I deleted the older port. For some reason the master believes it needs to create a new LSP (35b3d95a-41ae-42ce-b365-55fccd521b95) for that pod again and tries the transaction? Seems like a go-ovn portcache bug to me.

Workaround: Restarting the masters fixed the issue, probably because we start fresh and end up deleting all the stale ports.

Looking into the controller logs, I saw many port claims and releases since the port was created (this could very well be due to the timed-out waiting for flows as well):

[kni@f20-h21-000-r640 ~]$ cat ovnkube-node-sbjc5.logs | grep f5-served-ns-30_pod-served-1-6-served-job-30
2021-08-27T16:13:50Z|20708|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T16:13:50Z|20709|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112
2021-08-27T16:13:58Z|20710|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
2021-08-27T16:13:59Z|20711|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T16:13:59Z|20712|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112
2021-08-27T16:14:20Z|20719|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
2021-08-27T16:14:21Z|20720|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T16:14:21Z|20721|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112
<snipped>
2021-08-27T16:33:48Z|21868|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T16:33:48Z|21869|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112
2021-08-27T16:34:10Z|21882|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
2021-08-27T16:34:10Z|21883|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T16:34:10Z|21884|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112
2021-08-27T16:34:32Z|21897|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
<snipped>
2021-08-27T17:08:22Z|23781|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113
2021-08-27T17:09:05Z|23841|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
2021-08-27T17:09:05Z|23842|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T17:09:05Z|23843|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113
2021-08-27T17:09:48Z|23904|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.
2021-08-27T17:09:48Z|23907|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.
2021-08-27T17:09:48Z|23908|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113
2021-08-27T17:10:34Z|23994|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.

2) We deleted all the pods, both running and stuck in creation. We saw stale ports left behind that belonged to the pods that were encountering these errors. This makes no sense whatsoever, because ports are deleted using their "namespace/pod-name" id.

master logs:

W0827 16:55:17.156390 1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30
E0827 16:55:17.157750 1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 91b2b6c7-9ad5-441d-b446-1336ee17859e, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction
I0827 16:55:17.157846 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 91b2b6c7-9ad5-441d-b446-1336ee17859e, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
W0827 16:56:18.220720 1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30
E0827 16:56:18.221797 1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID a7d6e0d8-ee73-4a2a-a980-e677c7aa7f6f, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction
I0827 16:56:18.221887 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID a7d6e0d8-ee73-4a2a-a980-e677c7aa7f6f, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
W0827 17:08:32.425187 1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30
E0827 17:08:32.426429 1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 8f95413e-e0b1-4684-bbc8-e757078f279f, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction
I0827 17:08:32.426655 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 8f95413e-e0b1-4684-bbc8-e757078f279f, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
W0827 17:10:29.977723 1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30
E0827 17:10:29.978953 1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 6fbbefc4-e1e1-451c-afdf-f3f875e0fdee, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction

==== Here are the delete logs:

I0827 17:10:29.978971 1 pods.go:91] Deleting pod: f5-served-ns-30/pod-served-1-6-served-job-30
E0827 17:10:29.978979 1 pods.go:96] logical port f5-served-ns-30_pod-served-1-6-served-job-30 not found in cache
I0827 17:10:29.979041 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3288048", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name". First row, with UUID 6fbbefc4-e1e1-451c-afdf-f3f875e0fdee, was inserted by this transaction. Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

I did not see any error during deletion.

After master restart:

I0827 18:07:03.178445 1 pods.go:70] Stale logical port found: f5-served-ns-34_pod-served-1-53-served-job-34. This logical port will be deleted.
I0827 18:07:03.181342 1 pods.go:70] Stale logical port found: f5-served-ns-33_pod-served-1-25-served-job-33. This logical port will be deleted.
I0827 18:07:03.183793 1 pods.go:70] Stale logical port found: f5-served-ns-32_pod-served-1-3-served-job-32. This logical port will be deleted.
I0827 18:07:03.186315 1 pods.go:70] Stale logical port found: f5-served-ns-33_pod-served-1-26-served-job-33. This logical port will be deleted.
I0827 18:07:03.188913 1 pods.go:70] Stale logical port found: f5-served-ns-32_pod-served-1-8-served-job-32. This logical port will be deleted.

We had ~362 stale ports.
- let's say in the first run they had stale ports that did not get deleted,
- while trying to add a pod, it tries to create a new port and encounters that transaction error,
- the request gets added to the retry queue,
- from there on, it's a bomb of requests and retries,
- every time an errored request is added, it probably generates a new port UUID?

I am 90% sure this is what happened, because I saw 362 stale port entries and 360 stuck pods. However, it still doesn't explain how a brand new pod creation failed with that transaction error.
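The retry-bomb theory above can be illustrated with a toy model (hypothetical types, not the real go-ovn or ovsdb API): the LSP "name" column is a unique index, and every retried transaction proposes a brand-new row UUID, so each attempt fails against the same stale row while the retry queue keeps spinning.

```go
package main

import "fmt"

// row and db are a toy model (hypothetical types, not the real go-ovn
// API) of an nbdb table whose "name" column is a unique index, like
// Logical_Switch_Port.
type row struct {
	uuid string
	name string
}

type db struct {
	rows []row
}

// insert mimics one retried transaction: every attempt proposes a
// brand-new UUID, but the insert fails whenever any existing row, such
// as a stale port, already holds the same name.
func (d *db) insert(name, newUUID string) error {
	for _, r := range d.rows {
		if r.name == name {
			return fmt.Errorf("constraint violation: rows %s and %s share name %q", newUUID, r.uuid, name)
		}
	}
	d.rows = append(d.rows, row{uuid: newUUID, name: name})
	return nil
}

func main() {
	// One stale port survives from the first run.
	d := &db{rows: []row{{uuid: "46564154", name: "f5-served-ns-30_pod-served-1-6-served-job-30"}}}
	// Every retry proposes a different UUID, yet every attempt hits the
	// same stale row, so the errors repeat with a fresh "First row" UUID
	// each time, exactly as in the master logs above.
	for _, uuid := range []string{"91b2b6c7", "a7d6e0d8", "8f95413e"} {
		if err := d.insert("f5-served-ns-30_pod-served-1-6-served-job-30", uuid); err != nil {
			fmt.Println(err)
		}
	}
}
```

Restarting the master breaks the cycle only because the startup sync deletes the stale rows before any new adds are attempted.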
Looking at the log we have for nbdb, it looks like there is high CPU usage, long poll intervals, and backoff on jsonrpc, as well as rejection of append_requests:

2021-08-26T21:16:38Z|00198|poll_loop|INFO|Dropped 572 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2021-08-26T21:16:38Z|00199|poll_loop|INFO|wakeup due to 0-ms timeout at ../lib/stream-ssl.c:834 (98% CPU usage)
2021-08-26T21:16:39Z|00200|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-08-26T21:16:39Z|00201|jsonrpc|WARN|ssl:192.168.216.11:33646: receive error: Protocol error
2021-08-26T21:16:39Z|00202|reconnect|WARN|ssl:192.168.216.11:33646: connection dropped (Protocol error)
2021-08-26T21:16:40Z|00203|raft|INFO|ssl:192.168.216.11:33872: learned server ID a4f6
2021-08-26T21:16:40Z|00204|raft|INFO|ssl:192.168.216.11:33872: learned remote address ssl:192.168.216.11:9643
2021-08-26T21:16:44Z|00205|poll_loop|INFO|Dropped 575 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2021-08-26T21:16:44Z|00206|poll_loop|INFO|wakeup due to 0-ms timeout at ../lib/stream-ssl.c:834 (99% CPU usage)
2021-08-26T21:16:46Z|00207|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-08-27T16:03:04Z|05355|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 50, backlog: 9167550.
2021-08-27T16:03:04Z|05356|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 51, backlog: 9350901.
2021-08-27T16:03:04Z|05357|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 52, backlog: 9534252.
2021-08-27T16:03:04Z|05358|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 53, backlog: 9717603
2021-08-26T21:16:08Z|00108|raft|INFO|rejecting append_request because previous entry 5,741882 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00109|raft|INFO|rejecting append_request because previous entry 5,741883 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00110|raft|INFO|rejecting append_request because previous entry 5,741883 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00111|raft|INFO|rejecting append_request because previous entry 5,741884 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00112|raft|INFO|rejecting append_request because previous entry 5,741884 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00113|raft|INFO|rejecting append_request because previous entry 5,741885 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00114|raft|INFO|rejecting append_request because previous entry 5,741885 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00115|raft|INFO|rejecting append_request because previous entry 5,741886 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00116|raft|INFO|rejecting append_request because previous entry 5,741886 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00117|raft|INFO|rejecting append_request because previous entry 5,741887 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00118|raft|INFO|rejecting append_request because previous entry 5,741887 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00119|raft|INFO|rejecting append_request because previous entry 5,741888 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00120|raft|INFO|rejecting append_request because previous entry 5,741888 not in local log (mismatch past end of log)

We don't have the other nbdb logs, but I suspect the others are similar. It looks like nbdb cannot keep up with the number of changes coming from ovnkube.

One change that would help is https://bugzilla.redhat.com/show_bug.cgi?id=1960391 - transfer raft leadership during snapshot writes. This will relieve some pressure on the leader and reduce the time taken during poll intervals. The 4.7 ovnkube image has OVS version 2.13.0-79.el8fdp; we would need openvswitch2.13-2.13.0-114.el8fdp or openvswitch2.13-2.13.0-94.el7fdp to pick up this change.

Additionally, the logs show that go-ovn commands are failing while nbctl commands are succeeding. nbctl will connect to the leader, while the go-ovn client may connect to followers. We don't have full logs, so my best guess here is that this may be a consequence of go-ovn interacting with a follower that is in a bad state. Alternatively, we could be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1999650. It's impossible to tell without more information.
So we narrowed down this bug on 4.7.24. Thanks Tim, Numan, Murali, and Sai for helping.

The scale team ran the kube-burner tests in a loop, i.e. create all pods, delete all pods, wait for ~20 mins. We saw the transaction error happening:

I0831 18:15:39.788465 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", APIVersion:"v1", ResourceVersion:"4213537", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID e9b5fd13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction. Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction

sh-4.4# ovn-nbctl lsp-list worker012-r640
85c48bd1-8824-4b9c-bb5e-11a658be4bc1 (bar_client-on-ovn-worker)

Checking the master logs:

[kni@f20-h21-000-r640 ~]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-kn2k4 -c ovnkube-master -f | grep client-on-ovn-worker
I0831 18:15:39.733490 1 informer.go:294] Successfully synced 'bar/client-on-ovn-worker'
W0831 18:15:39.742016 1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker

The first request succeeds:

I0831 18:15:39.742629 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.130.17.32/23"],"mac_address":"0a:58:0a:82:11:20","gateway_ips":["10.130.16.1"],"ip_address":"10.130.17.32/23","gateway_ip":"10.130.16.1"}}] on pod bar/client-on-ovn-worker
I0831 18:15:39.771226 1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 32.411363ms

But the goovn cache doesn't get updated :( and hence LSPGet fails:

E0831 18:15:39.771254 1 ovn.go:635] failed to get the logical switch port: bar_client-on-ovn-worker from the ovn client, error: object not found
I0831 18:15:39.771420 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", APIVersion:"v1", ResourceVersion:"4213535", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to get the logical switch port: bar_client-on-ovn-worker from the ovn client, error: object not found
W0831 18:15:39.773443 1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker

We try the pod add again, and we hit that transaction bug because the port already exists in the db.
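This failure mode can be sketched with a minimal model (toy types, not the real goovn client API): the transaction commits in the nbdb, but the client-side cache is repopulated only by monitor update messages, so once the monitor is gone a successful create is immediately followed by a cache miss and a doomed retry.

```go
package main

import "fmt"

// client is a toy stand-in for the goovn client (hypothetical types).
// The nbdb commits rows, but the client cache is refreshed only when a
// monitor "update" message arrives.
type client struct {
	server  map[string]bool // rows committed in nbdb, keyed by LSP name
	cache   map[string]bool // client-side cache, fed by update messages
	monitor bool            // whether the ovsdb monitor still exists
}

func newClient(monitorAlive bool) *client {
	return &client{server: map[string]bool{}, cache: map[string]bool{}, monitor: monitorAlive}
}

// create commits the LSP server-side; the name column is a unique
// index, so a second insert for the same name fails the transaction.
func (c *client) create(name string) error {
	if c.server[name] {
		return fmt.Errorf("constraint violation: duplicate name %q", name)
	}
	c.server[name] = true
	if c.monitor {
		c.cache[name] = true // the update message refreshes the cache
	}
	return nil
}

// get answers from the cache only, the way LSPGet did.
func (c *client) get(name string) bool { return c.cache[name] }

func main() {
	c := newClient(false)                             // monitor silently dropped
	fmt.Println(c.create("bar_client-on-ovn-worker")) // first add commits in nbdb
	fmt.Println(c.get("bar_client-on-ovn-worker"))    // but the cache lookup misses
	fmt.Println(c.create("bar_client-on-ovn-worker")) // so the retry hits the duplicate
}
```

With a live monitor the cache would have been populated between the request and the reply, which is exactly what the appctl reconnect described below restored.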
Then it's a cycle of these errors:

I0831 18:15:39.774158 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.130.17.33/23"],"mac_address":"0a:58:0a:82:11:21","gateway_ips":["10.130.16.1"],"ip_address":"10.130.17.33/23","gateway_ip":"10.130.16.1"}}] on pod bar/client-on-ovn-worker
I0831 18:15:39.788344 1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 17.058603ms
E0831 18:15:39.788361 1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID e9b5fd13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction. Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
I0831 18:15:39.788465 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", APIVersion:"v1", ResourceVersion:"4213537", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID e9b5fd13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction. Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.
in committing transaction
W0831 18:15:39.790757 1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
I0831 18:15:39.791545 1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 3.15309ms
E0831 18:15:39.791566 1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. Second row, with UUID 1caad6da-785d-427c-b3f0-cb17ce814dac, was inserted by this transaction. in committing transaction
I0831 18:15:39.791729 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", APIVersion:"v1", ResourceVersion:"4213538", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. Second row, with UUID 1caad6da-785d-427c-b3f0-cb17ce814dac, was inserted by this transaction.
in committing transaction
W0831 18:15:39.794327 1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
E0831 18:15:39.795960 1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. Second row, with UUID 4a8cb4e2-1116-4a0e-a566-d57317ef7904, was inserted by this transaction. in committing transaction
I0831 18:15:39.796015 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", APIVersion:"v1", ResourceVersion:"4213541", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. Second row, with UUID 4a8cb4e2-1116-4a0e-a566-d57317ef7904, was inserted by this transaction.
in committing transaction
I0831 18:16:58.464661 1 ovn.go:584] [a04870f6-ca9e-4398-87b9-44688233b48a/bar/client-on-ovn-worker] retry pod setup
W0831 18:16:58.466613 1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
I0831 18:16:58.467577 1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 2.888758ms
E0831 18:16:58.467593 1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name". First row, with UUID a10eeaba-590f-4f89-852e-3f1f60d6fd3d, was inserted by this transaction. Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
I0831 18:16:58.467608 1 ovn.go:590] [a04870f6-ca9e-4398-87b9-44688233b48a/bar/client-on-ovn-worker] setup retry failed; will try again later

We enabled jsonrpc debug logging in northd and nbdb and saw the request and reply messages for the lsp-add, but we did not see any "update" messages being sent after the request and before the reply. Since the update message is what triggers the cacheLogicalSwitchPort to be repopulated from the nbdb, we don't get cache updates, the creation fails ^, we retry, and we hit the bug.

When we went searching for why update messages were not being sent by the ovsdb server, we saw:

[root@master-0 ~]# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show
cells:100055 monitors:0 raft-connections:4 raft-log:3650 sessions:3

There weren't any monitors running on the pod to which ovnkube-master was connected.

NOTE: We don't know why or when the monitor got cancelled, since it happened before we enabled the debug messaging.
Triggering an appctl reconnect recreates the monitor, and we started seeing the update messages and the cache getting populated. The transaction errors went away.

The plan is to implement a way forward where we trigger a reconnect in case we don't get an update between a request and a response. PoC: https://github.com/trozet/ovn-kubernetes/commit/a1defb0bafc68ea18e6781e926e2611020e7b101
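Roughly, that detection can be sketched as follows (a simplified, hypothetical structure; the actual change lives in the linked PoC commit): compare the count of update notifications before and after a mutating transaction, and if a committed change produced no update in between, assume the monitor was silently dropped and force a reconnect so it gets re-created.

```go
package main

import "fmt"

// conn is a toy stand-in for the ovsdb client connection (hypothetical
// fields; the real logic lives in the PoC commit referenced above).
type conn struct {
	updates    int // monitor update notifications received so far
	reconnects int // forced reconnects, each of which re-creates the monitor
}

func (c *conn) onUpdate()  { c.updates++ }
func (c *conn) reconnect() { c.reconnects++ }

// transactAndCheck runs a mutating transaction and compares the update
// counter before and after the reply. A committed change with no update
// in between suggests the monitor was silently dropped, so we reconnect.
func (c *conn) transactAndCheck(txn func()) {
	before := c.updates
	txn()
	if c.updates == before {
		c.reconnect()
	}
}

func main() {
	c := &conn{}
	// Healthy monitor: the server echoes an update for our own change.
	c.transactAndCheck(func() { c.onUpdate() })
	fmt.Println("reconnects after healthy txn:", c.reconnects) // 0
	// Dropped monitor: the commit succeeds but no update ever arrives.
	c.transactAndCheck(func() {})
	fmt.Println("reconnects after silent txn:", c.reconnects) // 1
}
```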
Was able to reproduce with jsonrpc logging, and we can clearly see nbdb does not send us a monitor_cancel. It drops the monitor, and the only indication that something went wrong is a transaction reply error with reason "canceled". Filed https://bugzilla.redhat.com/show_bug.cgi?id=2000375

Going to test the workaround here: https://github.com/openshift/ovn-kubernetes/pull/709, which will detect if we missed an update and force a reconnect.
Fix PR posted: https://github.com/openshift/ovn-kubernetes/pull/717
4.7.z backport created: https://bugzilla.redhat.com/show_bug.cgi?id=2003901

I was on PTO last week, but I see Tim has provided the new image on 4.7.24 with the fix: https://coreos.slack.com/archives/CU9HKBZKJ/p1631125442091400?thread_ts=1631024275.468200&cid=CU9HKBZKJ

Clearing needinfo on my part.
*** Bug 1997205 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days