Bug 1959200 - failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: All
OS: All
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Tim Rozet
QA Contact: Kedar Kulkarni
URL:
Whiteboard: perfscale-ovn
Duplicates: 1968009
Depends On: 1952846
Blocks: 1996201
 
Reported: 2021-05-10 23:23 UTC by Joe Talerico
Modified: 2023-09-15 01:06 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1996201
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:
mmahmoud: needinfo-




Links:
GitHub openshift/ovn-kubernetes pull 609 (closed): Merge 2021-07-18 (last updated 2021-07-27 20:12:31 UTC)
GitHub ovn-org/ovn-kubernetes pull 2220 (closed): Adds back checking OF flows for CNI (last updated 2021-06-09 15:10:04 UTC)
GitHub ovn-org/ovn-kubernetes pull 2275 (closed): cni: cancel old pod sandbox add requests if the pod's UID or MAC changes (last updated 2021-07-27 20:04:02 UTC)
GitHub ovn-org/ovn-kubernetes pull 2345 (closed): cni: pass Kube API auth via cnishim response, not CNI config file (last updated 2021-07-27 20:04:02 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:31:28 UTC)

Description Joe Talerico 2021-05-10 23:23:47 UTC
Description of problem:
Running at 300 nodes and creating pods/builds/etc., we see some pods failing with:

 Warning  FailedCreatePodSandBox  34m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-3-1-build_cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971_4e1f7733-050c-4709-acb1-7a48cf0b5e41_0(e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6): [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding


Version-Release number of selected component (if applicable):
OCP 4.8


How reproducible:
100%

Steps to Reproduce:
1. AWS 4.8 OCP w/ OVNK
2. Scale to 300 nodes
3. Run kube-burner cluster-density w/ 2k iterations

Comment 1 Tim Rozet 2021-05-11 17:17:42 UTC
Note: on the latest scale run we can see that ovn-installed is set on the port at:
Tue May 11 16:59:05 UTC 2021 Tue May 11 16:59:05 UTC 2021 external_ids        : {attached_mac="0a:58:0a:87:00:43", iface-id=openshift-authentication_trozet1, ip_addresses="10.135.0.67/22", ovn-installed="true", sandbox="90ff6fa3b4b4b03fa22766d16e5a76f1fd5889d2972369690dc104deae4efaf2"}


But flows don't show up until 37 seconds later:
Tue May 11 16:59:42 UTC 2021 cookie=0x14e3f20b, duration=0.058s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x28,metadata=0x16,dl_src=0a:58:0a:87:00:43 actions=resubmit(,9)

Going to see if the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1958349 helps.
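
For reference, the two conditions being compared above can be checked by hand on the node hosting the pod (e.g. via oc debug node/<node> and chroot /host). The port name and MAC below are the ones from this run; substitute your own. This is just a quick manual check, not the exact code path the CNI uses:

$ ovs-vsctl --no-heading --columns=name,external_ids find Interface external-ids:iface-id=openshift-authentication_trozet1
$ ovs-ofctl dump-flows br-int "table=8,dl_src=0a:58:0a:87:00:43"

The first shows whether ovn-controller has set external_ids:ovn-installed on the pod's OVS interface; the second shows whether a table 8 flow matching the pod's MAC has actually been programmed on br-int.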

Comment 2 Tim Rozet 2021-05-11 17:19:37 UTC
Linked the wrong bz. Right one is: https://bugzilla.redhat.com/show_bug.cgi?id=1952846

Comment 3 Tim Rozet 2021-05-12 16:04:46 UTC
OVN fix isn't working. Will wait on update from OVN team:
https://bugzilla.redhat.com/show_bug.cgi?id=1952846#c5

Comment 4 Tim Rozet 2021-05-12 20:22:12 UTC
OK, it turns out this is because OVN is so slow: we issue a CNI DEL for a pod, then an ADD, and the old port's external ID still exists in OVN, so the port is considered up:

2021-05-12T17:56:00Z|11450|if_status|DBG|Interface (null) create.
2021-05-12T17:56:00Z|11451|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new CLAIMED
2021-05-12T17:56:00Z|11452|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:00Z|11453|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3d 10.151.13.61                     <----old IP from pod that was deleted, but still hanging around in OVN DB
2021-05-12T17:56:00Z|11454|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:00Z|11455|if_status|DBG|Seqno requested: 385
2021-05-12T17:56:00Z|11456|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:00Z|11457|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED
2021-05-12T17:56:37Z|11458|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALLED, new CLAIMED
2021-05-12T17:56:37Z|11459|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:37Z|11460|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3e 10.151.13.62                     <----new IP from pod, this is the correct mark
2021-05-12T17:56:37Z|11461|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:37Z|11462|if_status|DBG|Seqno requested: 386
2021-05-12T17:56:37Z|11463|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:56Z|11498|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED

The solution for now is to reintroduce checking table 8 for the correct OpenFlow flow with the pod's MAC address.
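
To see this race from the database side, one can compare what the NB logical switch port says against what the SB Port_Binding was claimed with: if the Port_Binding still carries the old MAC/IP while the pod already has new ones, ovn-installed is being set for the stale binding. A rough manual check (run wherever ovn-nbctl/ovn-sbctl can reach the NB/SB databases, typically from inside an ovnkube-master pod; the port name is the one from the log above):

$ ovn-nbctl lsp-get-addresses openshift-authentication_trozet1
$ ovn-sbctl --no-heading --columns=mac,chassis find Port_Binding logical_port=openshift-authentication_trozet1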

Comment 5 Amit Sagtani 2021-05-18 19:10:46 UTC
Got the same error when running 25 nodes on OpenShift 4.8.0-0.nightly-2021-05-15-022129 while creating pods/builds/etc. Some of the pods are stuck in the ContainerCreating state with the following error:

'
  Warning  FailedCreatePodSandBox  93m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_node-density-999_node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac_2d5301a6-ff90-4037-8f33-d9f848b321d4_0(87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff): [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox=87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff': exit status 1
  "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n"
'


Can be reproduced with:
AWS 4.8 OCP w/ OVNK --> scale to 25 nodes --> run kube-burner node-density-lite w/ 250 pods per node
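
Note the error pasted above is an ovsdb connection failure from the CNI's ovs-vsctl call rather than the port-binding timeout itself. A quick way to check whether ovsdb-server on that node is responsive at all (assuming node access via oc debug node/<node> followed by chroot /host):

$ ovs-vsctl --timeout=5 show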

Comment 8 Jose Castillo Lema 2021-07-23 11:54:13 UTC
We are getting the same errors in the 120-node bare-metal perf testing, both with node-density-light and node-density-heavy.
From a 120-node, 125 pods/worker node-density-light test, sampling 100 pods we see 98 events of this type:
$ for i in {9900..9999}; do oc describe po -n node-density-4bf963b7-770c-4afe-a66a-ea4ba5b292d2 node-density-$i | grep "timed out while waiting for OVS port binding"; done | wc -l
98

Env:
 - OCP 4.8.0-fc.9
 - local gateway
 - ovn2.13-20.12.0-25.el8fdp.x86_64

Comment 9 Sai Sindhur Malleni 2021-07-23 20:30:09 UTC
Alright, yet another data point, though you probably have enough: seeing the same behavior on 4.8.0 on an SNO node. Same OVN RPM (ovn2.13-20.12.0-25.el8fdp.x86_64) as Jose pointed out.

Comment 10 Tim Rozet 2021-07-27 20:12:35 UTC
Fixes that have gone in include:
1. Fix to OVN where it might not set ovn-installed correctly (dependent bug)
2. Add back checking for pod flows (to ensure the port that gets ovn-installed is the latest port for the pod)
3. Cancel the oldest sandbox request if the pod's MAC or UID changes. Basically, if ovnkube is serving a CNI request and the MAC was changed due to a pod delete/add event, cancel this request and move on to the newer pod.

This should address most of the functional issues around handling port binding. There may be more issues uncovered in a heavy/stress environment related to OVN not being able to keep up; we can handle those in a new bug if they arise, to avoid scope creep in this bug. I think it's fine to verify with the 25-node node-density-lite test with these fixes on 4.9 and ensure the bug is gone.
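
For that verification run, a couple of simple checks to confirm none of the test pods hit this again (adjust to the namespaces the kube-burner run creates):

$ oc get events -A | grep -c "timed out while waiting for OVS port binding"
$ oc get pods -A | grep -c ContainerCreating

The first counts sandbox-failure events carrying this bug's signature; the second counts pods still stuck in ContainerCreating once the run has settled.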

Comment 12 Kedar Kulkarni 2021-07-29 12:06:35 UTC
Hi,
I executed the node-density-light test with 25 workers as suggested in Comment 10.

There were 6439 pods run over the test period; we did not encounter failed pods, and "FailedCreatePodSandBox" events were not observed.

Build used: 4.9.0-0.nightly-2021-07-28-043044

Thanks,
KK.

Comment 15 Tim Rozet 2021-08-20 19:13:00 UTC
*** Bug 1968009 has been marked as a duplicate of this bug. ***

Comment 16 Tim Rozet 2021-08-20 19:56:54 UTC
4.8z backport: https://bugzilla.redhat.com/show_bug.cgi?id=1996201

Comment 25 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 26 Red Hat Bugzilla 2023-09-15 01:06:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

