Bug 1959200 - failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Tim Rozet
QA Contact: Kedar Kulkarni
URL:
Whiteboard: perfscale-ovn
Duplicates: 1968009 (view as bug list)
Depends On: 1952846
Blocks: 1996201
 
Reported: 2021-05-10 23:23 UTC by Joe Talerico
Modified: 2021-11-29 18:31 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1996201 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
ealcaniz: needinfo? (trozet)
mmahmoud: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 609 0 None closed Merge 2021-07-18 2021-07-27 20:12:31 UTC
Github ovn-org ovn-kubernetes pull 2220 0 None closed Adds back checking OF flows for CNI 2021-06-09 15:10:04 UTC
Github ovn-org ovn-kubernetes pull 2275 0 None closed cni: cancel old pod sandbox add requests if the pod's UID or MAC changes 2021-07-27 20:04:02 UTC
Github ovn-org ovn-kubernetes pull 2345 0 None closed cni: pass Kube API auth via cnishim response, not CNI config file 2021-07-27 20:04:02 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:31:28 UTC

Description Joe Talerico 2021-05-10 23:23:47 UTC
Description of problem:
Running at 300 nodes and creating pods/builds/etc., we see some pods failing with:

 Warning  FailedCreatePodSandBox  34m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-3-1-build_cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971_4e1f7733-050c-4709-acb1-7a48cf0b5e41_0(e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6): [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
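
A hedged way to check this by hand on the affected node: look at whether ovn-installed ever gets set on the pod's OVS interface. The iface-id below is just the <namespace>_<pod> pattern ovn-kubernetes uses; substitute the real pod:

$ ovs-vsctl --no-heading --columns=name,external_ids find Interface external_ids:iface-id="<namespace>_<pod>"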


Version-Release number of selected component (if applicable):
OCP 4.8


How reproducible:
100%

Steps to Reproduce:
1. AWS 4.8 OCP w/ OVNK
2. Scale to 300 nodes
3. Run the kube-burner cluster-density workload w/ 2k iterations (a hedged invocation sketch follows below)
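
A hedged sketch of the kube-burner invocation for step 3 (the config file name and UUID handling are assumptions; the perf-scale wrappers may drive this differently):

$ kube-burner init -c cluster-density.yml --uuid=$(uuidgen)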

Comment 1 Tim Rozet 2021-05-11 17:17:42 UTC
Note that on the latest scale run we can see ovn-installed is set on the port at:
Tue May 11 16:59:05 UTC 2021 Tue May 11 16:59:05 UTC 2021 external_ids        : {attached_mac="0a:58:0a:87:00:43", iface-id=openshift-authentication_trozet1, ip_addresses="10.135.0.67/22", ovn-installed="true", sandbox="90ff6fa3b4b4b03fa22766d16e5a76f1fd5889d2972369690dc104deae4efaf2"}


But the flows don't show up until 37 seconds later:
Tue May 11 16:59:42 UTC 2021 cookie=0x14e3f20b, duration=0.058s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x28,metadata=0x16,dl_src=0a:58:0a:87:00:43 actions=resubmit(,9)
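
One way to measure that gap (a hedged sketch; bridge name and MAC taken from the example above) is to poll for the pod's table 8 flow until it appears:

$ while ! ovs-ofctl dump-flows br-int table=8 | grep -q 'dl_src=0a:58:0a:87:00:43'; do sleep 1; done; date -u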

Going to see if the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1958349 helps.

Comment 2 Tim Rozet 2021-05-11 17:19:37 UTC
Linked the wrong bz. Right one is: https://bugzilla.redhat.com/show_bug.cgi?id=1952846

Comment 3 Tim Rozet 2021-05-12 16:04:46 UTC
OVN fix isn't working. Will wait on update from OVN team:
https://bugzilla.redhat.com/show_bug.cgi?id=1952846#c5

Comment 4 Tim Rozet 2021-05-12 20:22:12 UTC
OK, it turns out this is because OVN is so slow: we CNI delete a pod, then ADD it again, and the old port external ID still exists in OVN, so the port is considered up:

2021-05-12T17:56:00Z|11450|if_status|DBG|Interface (null) create.
2021-05-12T17:56:00Z|11451|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new CLAIMED
2021-05-12T17:56:00Z|11452|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:00Z|11453|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3d 10.151.13.61                     <----old IP from pod that was deleted, but still hanging around in OVN DB
2021-05-12T17:56:00Z|11454|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:00Z|11455|if_status|DBG|Seqno requested: 385
2021-05-12T17:56:00Z|11456|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:00Z|11457|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED
2021-05-12T17:56:37Z|11458|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALLED, new CLAIMED
2021-05-12T17:56:37Z|11459|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:37Z|11460|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3e 10.151.13.62                     <----new IP from pod, this is the correct mark
2021-05-12T17:56:37Z|11461|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:37Z|11462|if_status|DBG|Seqno requested: 386
2021-05-12T17:56:37Z|11463|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:56Z|11498|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED

The solution for now is going to be to reintroduce checking table 8 for the correct OpenFlow flow matching the pod's MAC address.
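
Conceptually the CNI-side readiness check then has two conditions instead of one; a hedged manual equivalent (sandbox ID, bridge, and MAC below are placeholders):

$ ovs-vsctl --no-heading --columns=name,external_ids find Interface external_ids:sandbox=<sandbox-id>
$ ovs-ofctl dump-flows br-int table=8 | grep 'dl_src=<pod-mac>'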

Comment 5 Amit Sagtani 2021-05-18 19:10:46 UTC
Got the same error running 25 nodes on OpenShift 4.8.0-0.nightly-2021-05-15-022129 while creating pods/builds/etc. Some of the pods are stuck in the ContainerCreating state with the following error -

'
  Warning  FailedCreatePodSandBox  93m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_node-density-999_node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac_2d5301a6-ff90-4037-8f33-d9f848b321d4_0(87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff): [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox=87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff': exit status 1
  "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n"
'


Can be reproduced with:
AWS 4.8 OCP w/ OVNK --> scale to 25 nodes --> run kube-burner node-density-lite w/ 250 pods per node
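
The "database connection failed (Protocol error)" part looks like the node-local ovsdb-server connection rather than OVN itself; a hedged quick sanity check from the node (or the OVS pod on it):

$ ovs-vsctl --timeout=5 show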

Comment 8 Jose Castillo Lema 2021-07-23 11:54:13 UTC
We are getting the same errors in the 120-node bare-metal perf testing, both with node-density-light and node-density-heavy.
From a 120-node, 125 pods/worker node-density-light test, taking a sample of 100 pods, we see 98 events of this type:
$ for i in {9900..9999}; do oc describe po -n node-density-4bf963b7-770c-4afe-a66a-ea4ba5b292d2 node-density-$i | grep "timed out while waiting for OVS port binding"; done | wc -l
98

Env:
 - OCP 4.8.0-fc.9
 - local gateway
 - ovn2.13-20.12.0-25.el8fdp.x86_64

Comment 9 Sai Sindhur Malleni 2021-07-23 20:30:09 UTC
Alright, yet another data point (though you probably have enough): seeing the same behavior on 4.8.0 on an SNO node, with the same OVN RPM (ovn2.13-20.12.0-25.el8fdp.x86_64) that Jose pointed out.

Comment 10 Tim Rozet 2021-07-27 20:12:35 UTC
Fixes that have gone in include:
1. Fix to OVN where it might not set ovn-installed correctly (dependent bug)
2. Add back checking for pod flows (to ensure the port that gets ovn-installed is the latest port for the pod)
3. Cancel the old sandbox request if the pod's MAC or UID changes. Basically, if ovnkube is serving a CNI request and the MAC was changed due to a pod delete/add event, cancel this request and move on to the newer pod.

This should address most of the functional issues around handling port binding. There may be more issues uncovered in a heavy/stress environment related to OVN not being able to keep up; we can handle those in a new bug if they arise, to avoid scope creep in this one. I think it's fine to verify the 25-node node-density-lite test with these fixes on 4.9 and ensure the bug is gone.
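
For the verification run, a hedged way to confirm no sandbox failures remain (subject to event retention):

$ oc get events -A --field-selector reason=FailedCreatePodSandBox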

Comment 12 Kedar Kulkarni 2021-07-29 12:06:35 UTC
Hi,
I executed the Node Density Light test with 25 workers, as suggested in Comment 10.

There were 6439 pods run over the test period; we did not encounter failed pods, and "FailedCreatePodSandbox" events were not observed.

Build used: 4.9.0-0.nightly-2021-07-28-043044

Thanks,
KK.

Comment 15 Tim Rozet 2021-08-20 19:13:00 UTC
*** Bug 1968009 has been marked as a duplicate of this bug. ***

Comment 16 Tim Rozet 2021-08-20 19:56:54 UTC
4.8z backport: https://bugzilla.redhat.com/show_bug.cgi?id=1996201

Comment 25 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

