Bug 1959200 - failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: All
OS: All
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Tim Rozet
QA Contact: Kedar Kulkarni
URL:
Whiteboard: perfscale-ovn
Duplicates: 1968009
Depends On: 1952846
Blocks: 1996201
 
Reported: 2021-05-10 23:23 UTC by Joe Talerico
Modified: 2023-09-15 01:06 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1996201
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:
mmahmoud: needinfo-




Links:
GitHub openshift/ovn-kubernetes pull 609 (closed): Merge 2021-07-18 (last updated 2021-07-27 20:12:31 UTC)
GitHub ovn-org/ovn-kubernetes pull 2220 (closed): Adds back checking OF flows for CNI (last updated 2021-06-09 15:10:04 UTC)
GitHub ovn-org/ovn-kubernetes pull 2275 (closed): cni: cancel old pod sandbox add requests if the pod's UID or MAC changes (last updated 2021-07-27 20:04:02 UTC)
GitHub ovn-org/ovn-kubernetes pull 2345 (closed): cni: pass Kube API auth via cnishim response, not CNI config file (last updated 2021-07-27 20:04:02 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:31:28 UTC)

Description Joe Talerico 2021-05-10 23:23:47 UTC
Description of problem:
Running at 300 nodes and creating pods/builds/etc., we see some pods failing with:

 Warning  FailedCreatePodSandBox  34m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-3-1-build_cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971_4e1f7733-050c-4709-acb1-7a48cf0b5e41_0(e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6): [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding


Version-Release number of selected component (if applicable):
OCP 4.8


How reproducible:
100%

Steps to Reproduce:
1. AWS 4.8 OCP w/ OVNK
2. Scale to 300 nodes
3. Run kube-burner cluster-density w/ 2k iterations

Comment 1 Tim Rozet 2021-05-11 17:17:42 UTC
Note: on the latest scale run we can see that ovn-installed is set on the port at:
Tue May 11 16:59:05 UTC 2021 Tue May 11 16:59:05 UTC 2021 external_ids        : {attached_mac="0a:58:0a:87:00:43", iface-id=openshift-authentication_trozet1, ip_addresses="10.135.0.67/22", ovn-installed="true", sandbox="90ff6fa3b4b4b03fa22766d16e5a76f1fd5889d2972369690dc104deae4efaf2"}


But flows don't show up until 37 seconds later:
Tue May 11 16:59:42 UTC 2021 cookie=0x14e3f20b, duration=0.058s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x28,metadata=0x16,dl_src=0a:58:0a:87:00:43 actions=resubmit(,9)

Going to see if the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1958349 helps.
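
For reference, the two conditions being compared above can be checked by hand on the node hosting the pod (e.g. via oc debug node/<node> and chroot /host). The port name and MAC below are the ones from this run; substitute your own. This is just a quick manual check, not the exact code path the CNI uses:

$ ovs-vsctl --no-heading --columns=name,external_ids find Interface external-ids:iface-id=openshift-authentication_trozet1
$ ovs-ofctl dump-flows br-int "table=8,dl_src=0a:58:0a:87:00:43"

The first shows whether ovn-controller has set external_ids:ovn-installed on the pod's OVS interface; the second shows whether a table 8 flow matching the pod's MAC has actually been programmed on br-int.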

Comment 2 Tim Rozet 2021-05-11 17:19:37 UTC
Linked the wrong bz. Right one is: https://bugzilla.redhat.com/show_bug.cgi?id=1952846

Comment 3 Tim Rozet 2021-05-12 16:04:46 UTC
OVN fix isn't working. Will wait on update from OVN team:
https://bugzilla.redhat.com/show_bug.cgi?id=1952846#c5

Comment 4 Tim Rozet 2021-05-12 20:22:12 UTC
OK, it turns out this is because OVN is so slow: we issue a CNI DEL for a pod, then an ADD, and the old port's external ID still exists in OVN, so the port is considered up:

2021-05-12T17:56:00Z|11450|if_status|DBG|Interface (null) create.
2021-05-12T17:56:00Z|11451|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new CLAIMED
2021-05-12T17:56:00Z|11452|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:00Z|11453|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3d 10.151.13.61                     <----old IP from pod that was deleted, but still hanging around in OVN DB
2021-05-12T17:56:00Z|11454|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:00Z|11455|if_status|DBG|Seqno requested: 385
2021-05-12T17:56:00Z|11456|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:00Z|11457|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED
2021-05-12T17:56:37Z|11458|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALLED, new CLAIMED
2021-05-12T17:56:37Z|11459|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:37Z|11460|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3e 10.151.13.62                     <----new IP from pod, this is the correct mark
2021-05-12T17:56:37Z|11461|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:37Z|11462|if_status|DBG|Seqno requested: 386
2021-05-12T17:56:37Z|11463|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:56Z|11498|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED

The solution for now is to reintroduce checking table 8 for the correct OpenFlow flow with the pod's MAC address.
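
To see this race from the database side, one can compare what the NB logical switch port says against what the SB Port_Binding was claimed with: if the Port_Binding still carries the old MAC/IP while the pod already has new ones, ovn-installed is being set for the stale binding. A rough manual check (run wherever ovn-nbctl/ovn-sbctl can reach the NB/SB databases, typically from inside an ovnkube-master pod; the port name is the one from the log above):

$ ovn-nbctl lsp-get-addresses openshift-authentication_trozet1
$ ovn-sbctl --no-heading --columns=mac,chassis find Port_Binding logical_port=openshift-authentication_trozet1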

Comment 5 Amit Sagtani 2021-05-18 19:10:46 UTC
Got the same error when running 25 nodes on OpenShift 4.8.0-0.nightly-2021-05-15-022129 while creating pods/builds/etc. Some of the pods are stuck in the ContainerCreating state with the following error:

'
  Warning  FailedCreatePodSandBox  93m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_node-density-999_node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac_2d5301a6-ff90-4037-8f33-d9f848b321d4_0(87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff): [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox=87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff': exit status 1
  "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n"
'


Can be reproduced with:
AWS 4.8 OCP w/ OVNK --> scale to 25 nodes --> run kube-burner node-density-lite w/ 250 pods per node
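
Note the error pasted above is an ovsdb connection failure from the CNI's ovs-vsctl call rather than the port-binding timeout itself. A quick way to check whether ovsdb-server on that node is responsive at all (assuming node access via oc debug node/<node> followed by chroot /host):

$ ovs-vsctl --timeout=5 show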

Comment 8 Jose Castillo Lema 2021-07-23 11:54:13 UTC
We are getting the same errors in the 120-node bare-metal perf testing, both with node-density-light and node-density-heavy.
From a 120-node, 125 pods/worker node-density-light test, sampling 100 pods we see 98 events of this type:
$ for i in {9900..9999}; do oc describe po -n node-density-4bf963b7-770c-4afe-a66a-ea4ba5b292d2 node-density-$i | grep "timed out while waiting for OVS port binding"; done | wc -l
98

Env:
 - OCP 4.8.0-fc.9
 - local gateway
 - ovn2.13-20.12.0-25.el8fdp.x86_64

Comment 9 Sai Sindhur Malleni 2021-07-23 20:30:09 UTC
Alright, yet another data point, though you probably have enough: seeing the same behavior on 4.8.0 on an SNO node. Same OVN RPM (ovn2.13-20.12.0-25.el8fdp.x86_64) as Jose pointed out.

Comment 10 Tim Rozet 2021-07-27 20:12:35 UTC
Fixes that have gone in include:
1. Fix to OVN where it might not set ovn-installed correctly (dependent bug)
2. Add back checking for pod flows (to ensure the port that gets ovn-installed is the latest port for the pod)
3. Cancel the oldest sandbox request if the pod's MAC or UID changes. Basically, if ovnkube is serving a CNI request and the MAC was changed due to a pod delete/add event, cancel this request and move on to the newer pod.

This should address most of the functional issues around handling port binding. There may be more issues uncovered in a heavy/stress environment related to OVN not being able to keep up; we can handle those in a new bug if they arise, to avoid scope creep in this bug. I think it's fine to verify with the 25-node node-density-lite test with these fixes on 4.9 and ensure the bug is gone.
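
For that verification run, a couple of simple checks to confirm none of the test pods hit this again (adjust to the namespaces the kube-burner run creates):

$ oc get events -A | grep -c "timed out while waiting for OVS port binding"
$ oc get pods -A | grep -c ContainerCreating

The first counts sandbox-failure events carrying this bug's signature; the second counts pods still stuck in ContainerCreating once the run has settled.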

Comment 12 Kedar Kulkarni 2021-07-29 12:06:35 UTC
Hi,
I executed the node-density-light test with 25 workers as suggested in Comment 10.

There were 6439 pods run over the test period; we did not encounter failed pods, and "FailedCreatePodSandBox" events were not observed.

Build used: 4.9.0-0.nightly-2021-07-28-043044

Thanks,
KK.

Comment 15 Tim Rozet 2021-08-20 19:13:00 UTC
*** Bug 1968009 has been marked as a duplicate of this bug. ***

Comment 16 Tim Rozet 2021-08-20 19:56:54 UTC
4.8z backport: https://bugzilla.redhat.com/show_bug.cgi?id=1996201

Comment 25 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 26 Red Hat Bugzilla 2023-09-15 01:06:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

