Description of problem:

Running at 300 nodes and creating pods/builds/etc., we see some pods failing with:

  Warning  FailedCreatePodSandBox  34m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-3-1-build_cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971_4e1f7733-050c-4709-acb1-7a48cf0b5e41_0(e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6): [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] [cluster-density-608dafbd-77a3-4122-a724-1ce9a5401499-971/cluster-density-3-1-build e946df0775d3dff89265fb31ceee49ddfc785746ce1bcfabf828688dc6ab0bb6] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

Version-Release number of selected component (if applicable):
OCP 4.8

How reproducible:
100%

Steps to Reproduce:
1. AWS 4.8 OCP w/ OVNK
2. Scale to 300 nodes
3. Run kube-burner cluster-density w/ 2k iterations
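A quick way to gauge how widespread the failures are is to filter events by the reason shown above (a rough sketch only; adjust the selectors to your environment):

$ # count FailedCreatePodSandBox events across all namespaces
$ oc get events -A --field-selector reason=FailedCreatePodSandBox --no-headers | wc -l
$ # list the affected pods
$ oc get events -A --field-selector reason=FailedCreatePodSandBox \
      -o jsonpath='{range .items[*]}{.involvedObject.namespace}/{.involvedObject.name}{"\n"}{end}' | sort -u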
Note we can see on the latest scale run that ovn-installed is set on the port at:

Tue May 11 16:59:05 UTC 2021
external_ids        : {attached_mac="0a:58:0a:87:00:43", iface-id=openshift-authentication_trozet1, ip_addresses="10.135.0.67/22", ovn-installed="true", sandbox="90ff6fa3b4b4b03fa22766d16e5a76f1fd5889d2972369690dc104deae4efaf2"}

but the flows don't show up until 37 seconds later:

Tue May 11 16:59:42 UTC 2021
cookie=0x14e3f20b, duration=0.058s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x28,metadata=0x16,dl_src=0a:58:0a:87:00:43 actions=resubmit(,9)

Going to see if the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1958349 helps.
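For reference, the two timestamps above can be captured on the node with something like the following (a rough sketch; the interface lookup, br-int, and table 8 mirror the output shown above, exact names may vary):

$ IFACE=$(ovs-vsctl --no-heading --format=csv --data=bare --columns=name find Interface external-ids:iface-id=openshift-authentication_trozet1)
$ MAC=0a:58:0a:87:00:43
$ # when does ovn-installed appear on the port?
$ until ovs-vsctl get Interface "$IFACE" external_ids:ovn-installed >/dev/null 2>&1; do sleep 1; done; date -u
$ # when does the table 8 flow for the pod's MAC appear?
$ until ovs-ofctl dump-flows br-int table=8 | grep -q "dl_src=$MAC"; do sleep 1; done; date -u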
Linked the wrong bz. Right one is: https://bugzilla.redhat.com/show_bug.cgi?id=1952846
OVN fix isn't working. Will wait on update from OVN team: https://bugzilla.redhat.com/show_bug.cgi?id=1952846#c5
OK, it turns out this is because OVN is so slow: we handle a CNI DEL for a pod, then an ADD, and the old port external ID still exists in OVN, so the port is considered up:

2021-05-12T17:56:00Z|11450|if_status|DBG|Interface (null) create.
2021-05-12T17:56:00Z|11451|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new CLAIMED
2021-05-12T17:56:00Z|11452|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:00Z|11453|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3d 10.151.13.61   <---- old IP from the pod that was deleted, but still hanging around in the OVN DB
2021-05-12T17:56:00Z|11454|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:00Z|11455|if_status|DBG|Seqno requested: 385
2021-05-12T17:56:00Z|11456|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:00Z|11457|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED
2021-05-12T17:56:37Z|11458|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALLED, new CLAIMED
2021-05-12T17:56:37Z|11459|binding|INFO|Claiming lport openshift-authentication_trozet1 for this chassis.
2021-05-12T17:56:37Z|11460|binding|INFO|openshift-authentication_trozet1: Claiming 0a:58:0a:97:0d:3e 10.151.13.62   <---- new IP from the pod, this is the correct MAC
2021-05-12T17:56:37Z|11461|if_status|DBG|Interface openshift-authentication_trozet1 set state: old CLAIMED, new INSTALL_FLOWS
2021-05-12T17:56:37Z|11462|if_status|DBG|Seqno requested: 386
2021-05-12T17:56:37Z|11463|if_status|DBG|Interface openshift-authentication_trozet1 set state: old INSTALL_FLOWS, new MARK_UP
2021-05-12T17:56:56Z|11498|if_status|DBG|Interface openshift-authentication_trozet1 set state: old MARK_UP, new INSTALLED

The solution for now is going to be to reintroduce checking table 8 for the right OpenFlow flow with the MAC address.
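For anyone debugging this, one way to spot the stale state is to check what the OVN southbound DB still has bound for the logical port and compare it with the MAC/IP in the "Claiming ..." lines above (a sketch; ovn-sbctl needs to run wherever the SB DB is reachable, e.g. inside an ovnkube-master or ovn-controller container, container names vary by release):

$ # a stale entry will still show the old MAC/IP of the deleted pod
$ ovn-sbctl --columns=logical_port,mac find Port_Binding logical_port=openshift-authentication_trozet1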
Got the same error running 25 nodes on OpenShift 4.8.0-0.nightly-2021-05-15-022129 while creating pods/builds/etc. Some of the pods are stuck in the ContainerCreating state with the following error:

  Warning  FailedCreatePodSandBox  93m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_node-density-999_node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac_2d5301a6-ff90-4037-8f33-d9f848b321d4_0(87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff): [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] [node-density-aaad5041-3e6e-4b4c-823e-2b0c0fe596ac/node-density-999 87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff] failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox=87132e25b5213c2a77bad694c7250a1e913364700abd5242ffc8911867536dff': exit status 1 "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n" '

Can be reproduced with: AWS 4.8 OCP w/ OVNK --> scale to 25 nodes --> run kube-burner node-density-lite w/ 250 pods per node.
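The "database connection failed (Protocol error)" part appears to be the node-local ovsdb-server not answering on /var/run/openvswitch/db.sock rather than the port-binding timeout itself; a quick health check on the affected node could look like this (a sketch; unit names may differ between releases/deployments):

$ ovs-vsctl --timeout=5 show                    # does the local ovsdb-server answer at all?
$ systemctl status ovsdb-server ovs-vswitchd    # host OVS units on RHCOS (names may differ)
$ journalctl -u ovsdb-server --since "1 hour ago" | tail -n 50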
We are getting the same errors in the 120-node baremetal perf testing, both with node-density-light and node-density-heavy.

From a 120-node, 125 pods/worker node-density-light run, taking a sample of 100 pods we see 98 events of this type:

$ for i in {9900..9999}; do oc describe po -n node-density-4bf963b7-770c-4afe-a66a-ea4ba5b292d2 node-density-$i | grep "timed out while waiting for OVS port binding"; done | wc -l
98

Env:
- OCP 4.8.0-fc.9
- local gateway
- ovn2.13-20.12.0-25.el8fdp.x86_64
Alright, yet another data point, though you probably have enough: seeing the same behavior on 4.8.0 and an SNO node. Same OVN RPM (ovn2.13-20.12.0-25.el8fdp.x86_64) as Jose pointed out.
Fixes that have gone in include:

1. Fix to OVN where it might not set ovn-installed correctly (dependent bug).
2. Add back checking for pod flows, to ensure the port that gets ovn-installed is the latest port for the pod (a rough manual version of this check is sketched below).
3. Cancel the oldest sandbox request if the pod's MAC or UUID changes. Basically, if ovnkube is serving a CNI request and the MAC was changed due to a pod delete/add event, cancel this request and move on to the newer pod.

This should address most of the functional issues around handling port binding. There may be more issues uncovered in a heavy/stress environment related to OVN not being able to keep up; we can handle those in a new bug if they arise, to avoid scope creep in this bug. I think it's fine to verify with the 25-node node-density-lite test with these fixes on 4.9 and ensure the bug is gone.
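Not the actual ovnkube-node code path (that lives in the Go CNI handler), but the condition fix 2 restores can be approximated by hand on a node, e.g. (a sketch; iface-id is namespace_podname as seen in the logs above, and br-int/table 8 match the flow dump shown earlier):

$ NS=openshift-authentication; POD=trozet1    # example pod from the logs above
$ IFACE=$(ovs-vsctl --no-heading --format=csv --data=bare --columns=name find Interface external-ids:iface-id="${NS}_${POD}")
$ MAC=$(ovs-vsctl get Interface "$IFACE" external_ids:attached_mac | tr -d '"')
$ ovs-vsctl get Interface "$IFACE" external_ids:ovn-installed >/dev/null \
      && ovs-ofctl dump-flows br-int table=8 | grep -q "dl_src=$MAC" \
      && echo "ovn-installed set and pod flows present for $MAC"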
Hi, I executed the Node Density Light test with 25 workers as suggested in Comment 10. 6439 pods were run over the test period; we did not encounter failed pods, and "FailedCreatePodSandBox" was not observed.

Build used: 4.9.0-0.nightly-2021-07-28-043044

Thanks, KK.
*** Bug 1968009 has been marked as a duplicate of this bug. ***
4.8z backport: https://bugzilla.redhat.com/show_bug.cgi?id=1996201
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days