Bug 1998614 - Pod creation failed with CNI request timeout due to stale data in cache.
Summary: Pod creation failed with CNI request timeout due to stale data in cache.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1997205 (view as bug list)
Depends On: 2000375
Blocks: 2001363
 
Reported: 2021-08-27 18:04 UTC by Murali Krishnasamy
Modified: 2023-09-15 01:14 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2001363 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:49:46 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 717 0 None None None 2021-09-04 00:46:29 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:49:58 UTC

Description Murali Krishnasamy 2021-08-27 18:04:04 UTC
Description of problem:
Deletion and re-creation of pods at scale (2000) across multiple namespaces gets stuck in ContainerCreating state, and events show a CNI request timeout with the errors below.
This appears to happen because ports are not properly removed during pod deletion, leaving stale data in the db.

  Warning  ErrorAddingLogicalPort  39m                    controlplane  error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 8b534f6b-23a8-4984-863f-7b391b301067, was inserted by this transaction.  Second row, with UUID 25107606-b83c-420b-a903-ae7f6278ecaf, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

  Warning  FailedCreatePodSandBox  4m22s (x139 over 56m)  kubelet       (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-6-served-job-30_f5-served-ns-30_0825a91a-27ab-4a66-9712-5dfceff04331_0(d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07): error adding pod f5-served-ns-30_pod-served-1-6-served-job-30 to CNI network "multus-cni-network": [f5-served-ns-30/pod-served-1-6-served-job-30:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] [f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows

Version-Release number of selected component (if applicable):
4.7.24

How reproducible:
always

Steps to Reproduce:
1. Deploy a healthy cluster with 110+ worker nodes
2. Create 2000+ pods across 35 namespaces
3. Delete the pods and re-create them after several hours

Actual results:
Deleted the pods and waited several hours to make sure everything was cleaned up; re-creating the pods across multiple namespaces then failed with a CNI request timeout error.

Expected results:
Pods should be deleted and the DB cleaned up without any stale ports. Re-creation of pods should work as normal with the expected pod-ready latency.

Additional info:
openvswitch2.13-2.13.0-79.el8fdp.x86_64
ovn2.13-20.12.0-24.el8fdp.x86_64

Comment 1 Surya Seetharaman 2021-08-27 19:06:36 UTC
This is a nasty bug.

There are multiple issues:

1) During creation, ~360 pods were stuck (NOTE: we couldn't even spawn a must-gather pod), and we keep seeing the following error:

  Warning  ErrorAddingLogicalPort  39m                    controlplane  error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 8b534f6b-23a8-4984-863f-7b391b301067, was inserted by this transaction.  Second row, with UUID 25107606-b83c-420b-a903-ae7f6278ecaf, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

This error repeats over and over, and pod creation stays stuck.

Looking into the nbdb, we can see an LSP already existing for that pod:


25107606-b83c-420b-a903-ae7f6278ecaf (f5-served-ns-30_pod-served-1-6-served-job-30)


So I deleted the LSP from nbdb, and we see the pod's creation being retried:


  Warning  FailedCreatePodSandBox  4m22s (x139 over 56m)  kubelet       (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-6-served-job-30_f5-served-ns-30_0825a91a-27ab-4a66-9712-5dfceff04331_0(d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07): error adding pod f5-served-ns-30_pod-served-1-6-served-job-30 to CNI network "multus-cni-network": [f5-served-ns-30/pod-served-1-6-served-job-30:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] [f5-served-ns-30/pod-served-1-6-served-job-30 d4e1ad6adfb4efa06f5e41faeb11037679fc7d5e85c5744d0a39951f31211c07] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
'

But almost immediately, as soon as it times out waiting for flows, we see that error again:


  Warning  ErrorAddingLogicalPort  4m9s (x35 over 38m)  controlplane  (combined from similar events): error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 35b3d95a-41ae-42ce-b365-55fccd521b95, was inserted by this transaction. in committing transaction


Note that 46564154-61b2-4aed-8bf9-6cac17fcdc18 is the port that was newly created after I deleted the older port.


For some reason the master believes it needs to create yet another LSP (35b3d95a-41ae-42ce-b365-55fccd521b95) for that pod and retries the transaction?

Seems like a go-ovn port-cache bug to me. Workaround: restarting the masters fixed the issue, probably because the startup sync refreshes the cache and ends up deleting all the stale ports.


Looking into the controller logs, I saw many port claims and releases since the port was created; this could very well be a consequence of the timeouts waiting for flows as well.

[kni@f20-h21-000-r640 ~]$ cat ovnkube-node-sbjc5.logs | grep  f5-served-ns-30_pod-served-1-6-served-job-30                                                                  
2021-08-27T16:13:50Z|20708|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T16:13:50Z|20709|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112                                              
2021-08-27T16:13:58Z|20710|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.                                                     
2021-08-27T16:13:59Z|20711|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T16:13:59Z|20712|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112                                              
2021-08-27T16:14:20Z|20719|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.                                                     
2021-08-27T16:14:21Z|20720|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T16:14:21Z|20721|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112   
<snipped>
2021-08-27T16:33:48Z|21868|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T16:33:48Z|21869|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112                                              
2021-08-27T16:34:10Z|21882|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.                                                     
2021-08-27T16:34:10Z|21883|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T16:34:10Z|21884|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:70 10.128.44.112                                              
2021-08-27T16:34:32Z|21897|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.   
<snipped>
2021-08-27T17:08:22Z|23781|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113                                              
2021-08-27T17:09:05Z|23841|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.                                                     
2021-08-27T17:09:05Z|23842|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T17:09:05Z|23843|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113                                              
2021-08-27T17:09:48Z|23904|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.                                                     
2021-08-27T17:09:48Z|23907|binding|INFO|Claiming lport f5-served-ns-30_pod-served-1-6-served-job-30 for this chassis.                                                       
2021-08-27T17:09:48Z|23908|binding|INFO|f5-served-ns-30_pod-served-1-6-served-job-30: Claiming 0a:58:0a:80:2c:71 10.128.44.113                                              
2021-08-27T17:10:34Z|23994|binding|INFO|Releasing lport f5-served-ns-30_pod-served-1-6-served-job-30 from this chassis.


F) We deleted all the pods, both running and stuck in creation. We saw stale ports left behind that belonged to the pods encountering these errors. This makes no sense whatsoever, because ports are deleted using their "namespace/pod-name" id.
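
A minimal Go sketch of the naming convention the delete path is keyed on, assuming the usual "<namespace>_<podName>" LSP name; composePortName is a hypothetical helper, not the actual ovn-kubernetes function:

// Sketch only: the LSP name is derived from the pod's namespace and name,
// so a delete keyed on this name should remove the row regardless of UUID.
package main

import "fmt"

func composePortName(namespace, podName string) string {
	return namespace + "_" + podName
}

func main() {
	// The pod from the logs above maps to this LSP name:
	fmt.Println(composePortName("f5-served-ns-30", "pod-served-1-6-served-job-30"))
	// Output: f5-served-ns-30_pod-served-1-6-served-job-30
	// That is why leftover stale rows for these exact names are surprising.
}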

master logs:

W0827 16:55:17.156390       1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30                                                     
E0827 16:55:17.157750       1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constrain
t violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on col
umn "name".  First row, with UUID 91b2b6c7-9ad5-441d-b446-1336ee17859e, was inserted by this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existe
d in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction                            
I0827 16:55:17.157846       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-97
12-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns
-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to
have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 91b2b6c7-9ad5-441d-b446-1336ee17859e, was inserted by
this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but
not any columns in this index. in committing transaction                                                                                                                    
W0827 16:56:18.220720       1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30                                                     
E0827 16:56:18.221797       1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constrain
t violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on col
umn "name".  First row, with UUID a7d6e0d8-ee73-4a2a-a980-e677c7aa7f6f, was inserted by this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existe
d in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction                            
I0827 16:56:18.221887       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-97
12-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns
-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to
have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID a7d6e0d8-ee73-4a2a-a980-e677c7aa7f6f, was inserted by
this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but
not any columns in this index. in committing transaction                                                                                                                  ​
W0827 17:08:32.425187       1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30                                                     
E0827 17:08:32.426429       1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constrain
t violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on col
umn "name".  First row, with UUID 8f95413e-e0b1-4684-bbc8-e757078f279f, was inserted by this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existe
d in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction    
I0827 17:08:32.426655       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3177954", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to
have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 8f95413e-e0b1-4684-bbc8-e757078f279f, was inserted by
this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but
not any columns in this index. in committing transaction
W0827 17:10:29.977723       1 pods.go:347] Failed to get options for port: f5-served-ns-30_pod-served-1-6-served-job-30
E0827 17:10:29.978953       1 ovn.go:635] error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 6fbbefc4-e1e1-451c-afdf-f3f875e0fdee, was inserted by this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction                        

====Here are the delete logs:

I0827 17:10:29.978971       1 pods.go:91] Deleting pod: f5-served-ns-30/pod-served-1-6-served-job-30                                                                        
E0827 17:10:29.978979       1 pods.go:96] logical port f5-served-ns-30_pod-served-1-6-served-job-30 not found in cache                                                      
I0827 17:10:29.979041       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"f5-served-ns-30", Name:"pod-served-1-6-served-job-30", UID:"0825a91a-27ab-4a66-9712-5dfceff04331", APIVersion:"v1", ResourceVersion:"3288048", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port f5-served-ns-30_pod-served-1-6-served-job-30 error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to
have identical values (f5-served-ns-30_pod-served-1-6-served-job-30) for index on column "name".  First row, with UUID 6fbbefc4-e1e1-451c-afdf-f3f875e0fdee, was inserted by
this transaction.  Second row, with UUID 46564154-61b2-4aed-8bf9-6cac17fcdc18, existed in the database before this transaction, which modified some of the row's columns but
not any columns in this index. in committing transaction

I did not see any error during deletion.


After master restart:


I0827 18:07:03.178445       1 pods.go:70] Stale logical port found: f5-served-ns-34_pod-served-1-53-served-job-34. This logical port will be deleted.                       
I0827 18:07:03.181342       1 pods.go:70] Stale logical port found: f5-served-ns-33_pod-served-1-25-served-job-33. This logical port will be deleted.                       
I0827 18:07:03.183793       1 pods.go:70] Stale logical port found: f5-served-ns-32_pod-served-1-3-served-job-32. This logical port will be deleted.                        
I0827 18:07:03.186315       1 pods.go:70] Stale logical port found: f5-served-ns-33_pod-served-1-26-served-job-33. This logical port will be deleted.                       
I0827 18:07:03.188913       1 pods.go:70] Stale logical port found: f5-served-ns-32_pod-served-1-8-served-job-32. This logical port will be deleted.

We had ~362 stale ports.
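
For context, a minimal Go sketch (assumed names, not the actual ovn-kubernetes code) of what the startup sync producing these "Stale logical port found" messages presumably does: list the pod LSPs in nbdb, compare them against the pods the master currently knows about, and delete the leftovers.

package ovnsketch

// syncStalePorts is an illustrative stand-in for the startup sync.
// expected holds LSP names ("namespace_podname") for pods that still exist;
// dbPorts holds the pod LSP names currently present in nbdb.
func syncStalePorts(expected map[string]bool, dbPorts []string, deletePort func(string) error) int {
	deleted := 0
	for _, name := range dbPorts {
		if expected[name] {
			continue // a live pod still owns this port
		}
		// Corresponds to "Stale logical port found: <name>. This logical port will be deleted."
		if err := deletePort(name); err == nil {
			deleted++
		}
	}
	return deleted
}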

Comment 4 Surya Seetharaman 2021-08-30 12:09:11 UTC
- let's say in the first run they had stale ports that did not get deleted,
- while trying to add a pod, it tries to create a new port and encounters that transaction error,
- the request gets added to the retry queue,
- from there on, it's a storm of requests and retries - every time an errored request is retried, it probably generates a new port UUID?

I am 90% sure this is what happened, because I saw 362 stale port entries and ~360 pods stuck. However, it still doesn't explain how a brand-new pod creation failed with that transaction error.
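
A minimal Go sketch of this hypothesis (assumed helper names, not the real retry-queue code): because the cache lookup wrongly reports the port as absent, every retry takes the create path and inserts a fresh row/UUID, which then collides with the row already in nbdb on the "name" index.

package ovnsketch

// retryAddLogicalPort illustrates the hypothesized loop. lspExistsInCache
// stands in for the stale go-ovn cache lookup; insertLSP stands in for the
// transaction that inserts a new row (new UUID) and fails on the name index.
func retryAddLogicalPort(portName string,
	lspExistsInCache func(string) bool,
	insertLSP func(string) error) error {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		if lspExistsInCache(portName) {
			return nil // cache knows the port; nothing to create
		}
		// Stale cache -> always take the create path. Each attempt carries a
		// brand-new UUID and collides with the existing row in nbdb, so the
		// request is re-queued and the cycle repeats.
		if lastErr = insertLSP(portName); lastErr == nil {
			return nil
		}
	}
	return lastErr
}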

Comment 5 Tim Rozet 2021-08-31 16:55:54 UTC
Looking at the log we have for nbdb, it looks like there is high CPU usage, long poll intervals, and an excessive jsonrpc sending backlog, as well as rejected append_requests:

2021-08-26T21:16:38Z|00198|poll_loop|INFO|Dropped 572 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2021-08-26T21:16:38Z|00199|poll_loop|INFO|wakeup due to 0-ms timeout at ../lib/stream-ssl.c:834 (98% CPU usage)
2021-08-26T21:16:39Z|00200|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2021-08-26T21:16:39Z|00201|jsonrpc|WARN|ssl:192.168.216.11:33646: receive error: Protocol error
2021-08-26T21:16:39Z|00202|reconnect|WARN|ssl:192.168.216.11:33646: connection dropped (Protocol error)
2021-08-26T21:16:40Z|00203|raft|INFO|ssl:192.168.216.11:33872: learned server ID a4f6
2021-08-26T21:16:40Z|00204|raft|INFO|ssl:192.168.216.11:33872: learned remote address ssl:192.168.216.11:9643
2021-08-26T21:16:44Z|00205|poll_loop|INFO|Dropped 575 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2021-08-26T21:16:44Z|00206|poll_loop|INFO|wakeup due to 0-ms timeout at ../lib/stream-ssl.c:834 (99% CPU usage)
2021-08-26T21:16:46Z|00207|stream_ssl|WARN|SSL_read: unexpected SSL connection close


2021-08-27T16:03:04Z|05355|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 50, backlog: 9167550.
2021-08-27T16:03:04Z|05356|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 51, backlog: 9350901.
2021-08-27T16:03:04Z|05357|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 52, backlog: 9534252.
2021-08-27T16:03:04Z|05358|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.168.216.11:9643, num of msgs: 53, backlog: 9717603

2021-08-26T21:16:08Z|00108|raft|INFO|rejecting append_request because previous entry 5,741882 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00109|raft|INFO|rejecting append_request because previous entry 5,741883 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00110|raft|INFO|rejecting append_request because previous entry 5,741883 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00111|raft|INFO|rejecting append_request because previous entry 5,741884 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00112|raft|INFO|rejecting append_request because previous entry 5,741884 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00113|raft|INFO|rejecting append_request because previous entry 5,741885 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00114|raft|INFO|rejecting append_request because previous entry 5,741885 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00115|raft|INFO|rejecting append_request because previous entry 5,741886 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00116|raft|INFO|rejecting append_request because previous entry 5,741886 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00117|raft|INFO|rejecting append_request because previous entry 5,741887 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00118|raft|INFO|rejecting append_request because previous entry 5,741887 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00119|raft|INFO|rejecting append_request because previous entry 5,741888 not in local log (mismatch past end of log)
2021-08-26T21:16:08Z|00120|raft|INFO|rejecting append_request because previous entry 5,741888 not in local log (mismatch past end of log)


We don't have the other nbdb logs, but I suspect the others are similar. It looks like nbdb cannot keep up with the number of changes coming from ovnkube. One change that would help is:
https://bugzilla.redhat.com/show_bug.cgi?id=1960391 - transfer raft leadership during snapshot write. This will relieve some pressure on the leader and reduce the time taken during poll intervals. The 4.7 ovnkube image has OVS version 2.13.0-79.el8fdp; we would need openvswitch2.13-2.13.0-114.el8fdp or openvswitch2.13-2.13.0-94.el7fdp to pick up this change.

Additionally, the logs show that go-ovn commands are failing while nbctl commands are succeeding. nbctl connects to the leader, while the go-ovn client may connect to followers. We don't have full logs, so my best guess is that this may be a consequence of go-ovn interacting with a follower that is in a bad state. Alternatively, we could be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1999650

It's impossible to tell without more information.

Comment 6 Surya Seetharaman 2021-08-31 21:04:12 UTC
So we narrowed down this bug on 4.7.24. Thanks to Tim, Numan, Murali, and Sai for helping.

The scale team ran the kube-burner tests in a loop, i.e. create all pods, delete all pods, wait for ~20 mins.

We saw the transaction error happening:

I0831 18:15:39.788465       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", AP
IVersion:"v1", ResourceVersion:"4213537", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error:
Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn
-worker) for index on column "name".  First row, with UUID e9b5fd13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction.  Second row, with UUID 85c48bd1-8824-4b9c-
bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction

sh-4.4# ovn-nbctl lsp-list worker012-r640                                                                                                                                    
85c48bd1-8824-4b9c-bb5e-11a658be4bc1 (bar_client-on-ovn-worker) 

Checking the master logs:

[kni@f20-h21-000-r640 ~]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-kn2k4 -c ovnkube-master -f | grep client-on-ovn-worker                                          
I0831 18:15:39.733490       1 informer.go:294] Successfully synced 'bar/client-on-ovn-worker'                                                                                
W0831 18:15:39.742016       1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker      

First request succeeds:
                                                                    
I0831 18:15:39.742629       1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.130.17.32/23"],"mac_address":"0a:58:0a:82:11:20","
gateway_ips":["10.130.16.1"],"ip_address":"10.130.17.32/23","gateway_ip":"10.130.16.1"}}] on pod bar/client-on-ovn-worker                                                    
I0831 18:15:39.771226       1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 32.411363ms                  


But the go-ovn cache doesn't get updated :( and hence LSPGet fails.

E0831 18:15:39.771254       1 ovn.go:635] failed to get the logical switch port: bar_client-on-ovn-worker from the ovn client, error: object not found                       
I0831 18:15:39.771420       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", AP
IVersion:"v1", ResourceVersion:"4213535", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to get the logical switch port: bar_client-on-ovn-worker fr
om the ovn client, error: object not found                                                                                                                                   
W0831 18:15:39.773443       1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker      

We try the pod add again and hit the transaction bug because the port already exists in the db. Then it's a cycle of these errors:
                                                                    
I0831 18:15:39.774158       1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.130.17.33/23"],"mac_address":"0a:58:0a:82:11:21","
gateway_ips":["10.130.16.1"],"ip_address":"10.130.17.33/23","gateway_ip":"10.130.16.1"}}] on pod bar/client-on-ovn-worker          
I0831 18:15:39.788344       1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 17.058603ms
E0831 18:15:39.788361       1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details:
 Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name".  First row, with UUID e9b5fd
13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction.  Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transactio
n, which modified some of the row's columns but not any columns in this index. in committing transaction
I0831 18:15:39.788465       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", AP
IVersion:"v1", ResourceVersion:"4213537", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error:
Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn
-worker) for index on column "name".  First row, with UUID e9b5fd13-664c-470f-90ee-cf0500ab4dc4, was inserted by this transaction.  Second row, with UUID 85c48bd1-8824-4b9c-
bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
W0831 18:15:39.790757       1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
I0831 18:15:39.791545       1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 3.15309ms
E0831 18:15:39.791566       1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details:
 Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name".  First row, with UUID 85c48b
d1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, wit
h UUID 1caad6da-785d-427c-b3f0-cb17ce814dac, was inserted by this transaction. in committing transaction
I0831 18:15:39.791729       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", AP
IVersion:"v1", ResourceVersion:"4213538", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error:
Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn
-worker) for index on column "name".  First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the
row's columns but not any columns in this index.  Second row, with UUID 1caad6da-785d-427c-b3f0-cb17ce814dac, was inserted by this transaction. in committing transaction
W0831 18:15:39.794327       1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
E0831 18:15:39.795960       1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details:
 Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name".  First row, with UUID 85c48b
d1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, wit
h UUID 4a8cb4e2-1116-4a0e-a566-d57317ef7904, was inserted by this transaction. in committing transaction
I0831 18:15:39.796015       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bar", Name:"client-on-ovn-worker", UID:"a04870f6-ca9e-4398-87b9-44688233b48a", AP
IVersion:"v1", ResourceVersion:"4213541", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port bar_client-on-ovn-worker error:
Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn
-worker) for index on column "name".  First row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transaction, which modified some of the
row's columns but not any columns in this index.  Second row, with UUID 4a8cb4e2-1116-4a0e-a566-d57317ef7904, was inserted by this transaction. in committing transaction
I0831 18:16:58.464661       1 ovn.go:584] [a04870f6-ca9e-4398-87b9-44688233b48a/bar/client-on-ovn-worker] retry pod setup
W0831 18:16:58.466613       1 pods.go:347] Failed to get options for port: bar_client-on-ovn-worker
I0831 18:16:58.467577       1 pods.go:302] [bar/client-on-ovn-worker] addLogicalPort took 2.888758ms
E0831 18:16:58.467593       1 ovn.go:635] error while creating logical port bar_client-on-ovn-worker error: Transaction Failed due to an error: constraint violation details:
 Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (bar_client-on-ovn-worker) for index on column "name".  First row, with UUID a10eea
ba-590f-4f89-852e-3f1f60d6fd3d, was inserted by this transaction.  Second row, with UUID 85c48bd1-8824-4b9c-bb5e-11a658be4bc1, existed in the database before this transactio
n, which modified some of the row's columns but not any columns in this index. in committing transaction
I0831 18:16:58.467608       1 ovn.go:590] [a04870f6-ca9e-4398-87b9-44688233b48a/bar/client-on-ovn-worker] setup retry failed; will try again later



We enabled jsonrpc debug logging in northd and nbdb and saw the request and reply messages for the lsp-add, but we did not see any "update" messages being sent after the request and before the reply. Since the update message is what triggers cacheLogicalSwitchPort to be repopulated from the nbdb, we get no cache updates, the creation fails as shown above, and the retries cause the bug.
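
Put differently, the add path depends on a monitor "update" notification arriving between the transact request and the follow-up cache lookup. A simplified Go sketch of that dependency (assumed names, not the actual go-ovn/ovnkube code):

package ovnsketch

import "fmt"

// addLogicalPort illustrates why a silently dropped monitor breaks pod adds.
// transactLSPAdd sends the insert; the transaction reply alone does NOT
// populate the client-side cache. cacheLookup only succeeds once a monitor
// "update" has been applied to cacheLogicalSwitchPort.
func addLogicalPort(name string,
	transactLSPAdd func(string) error,
	cacheLookup func(string) bool) error {
	if err := transactLSPAdd(name); err != nil {
		return err // e.g. the constraint-violation error seen in the logs
	}
	// With the monitor gone, no update ever arrives, so this lookup fails
	// ("object not found") even though the row is already in nbdb.
	if !cacheLookup(name) {
		return fmt.Errorf("failed to get the logical switch port: %s from the ovn client, error: object not found", name)
	}
	return nil
}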


When we went searching for why update messages were not being sent by ovsdb-server, we saw:
[root@master-0 ~]# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show
cells:100055 monitors:0 raft-connections:4 raft-log:3650 sessions:3

there were no monitors running on the nbdb instance that ovnkube-master was connected to.

NOTE: We don't know why or when the monitor got cancelled, since it happened before we enabled the debug messaging.

Triggering an appctl reconnect recreates the monitor; we then started seeing the update messages and the cache being populated, and the transaction errors went away.


The plan is to implement a way forward where we trigger a reconnect if we don't get an update between a request and its response.
PoC: https://github.com/trozet/ovn-kubernetes/commit/a1defb0bafc68ea18e6781e926e2611020e7b101
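
As a rough Go sketch of the idea behind that PoC (assumed interface and method names, not the actual patch): track whether any monitor update was applied between sending a transaction and receiving its reply, and force a reconnect when none was, which makes the server recreate the monitor.

package ovnsketch

// nbClient is an assumed interface for illustration purposes only.
type nbClient interface {
	UpdatesSeen() uint64      // incremented each time a monitor "update" is applied to the cache
	Transact(op string) error // sends the transaction and waits for the reply
	Reconnect() error         // drops the session; the server recreates the monitor on reconnect
}

// transactWithMonitorCheck forces a reconnect if a write completed without
// any monitor update arriving, which is the symptom observed above
// (memory/show reporting monitors:0 while transactions keep going).
func transactWithMonitorCheck(c nbClient, op string) error {
	before := c.UpdatesSeen()
	if err := c.Transact(op); err != nil {
		return err
	}
	if c.UpdatesSeen() == before {
		// Our own successful write should have been echoed back as an update;
		// since it was not, assume the monitor was silently dropped.
		return c.Reconnect()
	}
	return nil
}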

Comment 7 Tim Rozet 2021-09-02 02:59:53 UTC
Was able to reproduce with jsonrpc logging, and we can clearly see that nbdb does not send us a monitor_cancel. It drops the monitor, and the only indication that something went wrong is a transaction reply error with reason "canceled". Filed https://bugzilla.redhat.com/show_bug.cgi?id=2000375

Going to test the workaround here: https://github.com/openshift/ovn-kubernetes/pull/709

which will detect if we missed an update, and force a reconnect.

Comment 8 Tim Rozet 2021-09-04 00:45:49 UTC
Fix PR posted: https://github.com/openshift/ovn-kubernetes/pull/717

Comment 16 Surya Seetharaman 2021-09-14 06:30:22 UTC
4.7.z backport created: https://bugzilla.redhat.com/show_bug.cgi?id=2003901

I was on PTO last week, but I see Tim has provided the new image on 4.7.24 with the fix: https://coreos.slack.com/archives/CU9HKBZKJ/p1631125442091400?thread_ts=1631024275.468200&cid=CU9HKBZKJ

Clearing needinfo on my part.

Comment 17 Surya Seetharaman 2021-09-23 10:03:29 UTC
*** Bug 1997205 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2021-10-18 17:49:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 21 Red Hat Bugzilla 2023-09-15 01:14:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

