Bug 2017909

Summary: [ICNI 2.0] ovnkube-masters stop processing add/del events for pods
Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Version: 4.10
Target Release: 4.10.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Jose Castillo Lema <jlema>
Assignee: Dan Williams <dcbw>
QA Contact: Mike Fiedler <mifiedle>
CC: akrzos, bpickard, dblack, dcbw, jlema, mifiedle, murali, smalleni, surya
Flags: jlema: needinfo-
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2022-03-10 16:22:54 UTC
Bug Blocks: 2022049, 2068619
Attachments: Logs from ovnkube-master (flags: none)

Description Jose Castillo Lema 2021-10-27 17:08:04 UTC
Created attachment 1837679 [details]
Logs from ovnkube-master

Description of problem:
After running a successful node-density test that created over 9k pods across 420 namespaces in a 500-node env, and then deleting all of the corresponding namespaces at once, we see a large number of stale logical switch port entries (hours after the deletion of the projects):
$ oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port| grep -i served | wc -l
8907
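
For reference, a hedged shell sketch of how the staleness can be cross-checked (assumptions: the ovnkube-master pods carry the app=ovnkube-master label, and pod logical switch ports are named <namespace>_<pod-name> as seen in the ovnkube-master logs):

# Hedged sketch: list pod logical switch ports whose namespace no longer exists.
# Non-pod ports without "_" in the name (router/node ports) are skipped.
POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
      -o jsonpath='{.items[0].metadata.name}')
oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
  ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port |
while IFS= read -r lsp; do
  case "$lsp" in *_*) ;; *) continue ;; esac   # only <namespace>_<pod> ports
  ns=${lsp%%_*}
  oc get namespace "$ns" >/dev/null 2>&1 || echo "stale: $lsp"
done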

Besides this, the environment cannot create new pods at this point; every pod creation fails with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors:
Warning  FailedCreatePodSandBox  50m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app4_jose_24421980-37aa-4a77-a626-a05d94062bcf_0(ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42): error adding pod jose_app4 to CNI network "multus-cni-network": [jose/app4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] [jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
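
A quick, hedged check on a stuck pod (using jose/app4 from the error above): the CNI waits for ovnkube-master to write the k8s.ovn.org/pod-networks annotation, so if the master has stopped processing add events this prints nothing.

# Hedged check: look for the pod-networks annotation on the stuck pod.
oc get pod app4 -n jose -o yaml | grep 'k8s.ovn.org/pod-networks'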


Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-21-105053 

How reproducible:
We have observed this behaviour on several occasions

Steps to Reproduce:
1. Create a large number of pods among several namespaces
2. Delete all namespaces at once
3. Wait for the namespaces to be completely deleted
4. Check for stale logical switch ports
5. Try to create new pods (see the rough shell sketch below)
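
A rough shell approximation of these steps (hedged: the report used a node-density test, so the namespace/pod counts, the pause image, and the app=ovnkube-master pod label below are placeholders/assumptions, not the exact workload):

# 1. Create many pods spread across several namespaces (placeholder counts/image).
for i in $(seq 1 50); do
  oc create namespace scale-test-$i
  for j in $(seq 1 20); do
    oc run pause-$j -n scale-test-$i --image=registry.k8s.io/pause:3.9 --restart=Never
  done
done
# 2./3. Delete all test namespaces at once and wait for them to be fully removed.
oc get ns -o name | grep '^namespace/scale-test-' | xargs oc delete --wait=true
# 4. Count logical switch ports left over for the deleted namespaces.
POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
      -o jsonpath='{.items[0].metadata.name}')
oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
  ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port | grep -c scale-test
# 5. Try to create a new pod; with the bug it stays in ContainerCreating.
oc run probe -n default --image=registry.k8s.io/pause:3.9 --restart=Never
oc get pod probe -n default -w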

Actual results:
 - 8907 stale logical switch ports
 - new pods stuck in Container Creating status with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors

Expected results:
 - 0 stale logical switch ports
 - new pods created successfully

Additional info:
We are uploading the corresponding ovnkube-master logs.
After restarting the ovnkube-master, we can see in the logs:
I1027 16:55:10.097825       1 pods.go:71] Stale logical port found: serving-ns-96_pod-serving-96-1-serving-job. This logical port will be deleted.           
I1027 16:55:10.100897       1 pods.go:71] Stale logical port found: serving-ns-101_pod-serving-101-4-serving-job. This logical port will be deleted.
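
As a rough way to size the problem after the restart (hedged; assumes the log message format shown above and the app=ovnkube-master pod label):

# Count how many stale logical ports the sync-up pass reports after restart.
POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
      -o jsonpath='{.items[0].metadata.name}')
oc logs -n openshift-ovn-kubernetes -c ovnkube-master "$POD" | grep -c 'Stale logical port found'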

Comment 1 Dan Williams 2021-10-28 00:41:46 UTC
We'll need to get the full logs from the master covering the pod additions and deletions; the attached logs start long after anything in the 'jose' namespace is done.

WRT the 'serving-ns/served-ns' pods and namespaces I do see:

I1027 14:06:40.932262       1 pods.go:96] [served-ns-2/pod-served-2-1-init-served-job] deleteLogicalPort took 32.560445474s, OVN Execute time 377.48305ms
I1027 14:06:40.932354       1 pods.go:96] [served-ns-5/pod-served-5-1-init-served-job] deleteLogicalPort took 32.662183173s, OVN Execute time 359.479932ms
I1027 14:06:40.932449       1 pods.go:96] [served-ns-1/pod-served-1-1-init-served-job] deleteLogicalPort took 32.984352481s, OVN Execute time 632.951578ms
I1027 14:06:40.940041       1 pods.go:96] [served-ns-8/pod-served-8-1-init-served-job] deleteLogicalPort took 31.881497241s, OVN Execute time 2.875127ms
I1027 14:06:40.940083       1 pods.go:96] [served-ns-9/pod-served-9-1-init-served-job] deleteLogicalPort took 32.362543687s, OVN Execute time 427.975246ms
I1027 14:06:40.941625       1 pods.go:96] [served-ns-3/pod-served-3-1-init-served-job] deleteLogicalPort took 32.706343451s, OVN Execute time 380.315936ms
I1027 14:06:41.419039       1 pods.go:96] [served-ns-4/pod-served-4-1-init-served-job] deleteLogicalPort took 33.07509725s, OVN Execute time 283.988979ms
I1027 14:06:43.006119       1 pods.go:96] [served-ns-11/pod-served-11-1-init-served-job] deleteLogicalPort took 33.804965501s, OVN Execute time 287.680675ms
I1027 14:06:43.006230       1 pods.go:96] [served-ns-13/pod-served-13-1-init-served-job] deleteLogicalPort took 33.503430781s, OVN Execute time 3.321276ms
I1027 14:06:43.006285       1 pods.go:96] [served-ns-6/pod-served-6-1-init-served-job] deleteLogicalPort took 33.671207635s, OVN Execute time 151.03851ms
I1027 14:06:43.429970       1 pods.go:96] [served-ns-16/pod-served-16-1-init-served-job] deleteLogicalPort took 33.845002332s, OVN Execute time 428.58365ms
I1027 14:06:43.887650       1 pods.go:96] [served-ns-17/pod-served-17-1-init-served-job] deleteLogicalPort took 33.755352474s, OVN Execute time 451.605843ms
I1027 14:06:43.887670       1 pods.go:96] [served-ns-18/pod-served-18-1-init-served-job] deleteLogicalPort took 33.449237815s, OVN Execute time 153.009803ms
I1027 14:06:43.887894       1 pods.go:96] [served-ns-15/pod-served-15-1-init-served-job] deleteLogicalPort took 33.60267793s, OVN Execute time 297.599038ms
I1027 14:06:44.322962       1 pods.go:96] [served-ns-19/pod-served-19-1-init-served-job] deleteLogicalPort took 33.560626928s, OVN Execute time 310.298051ms
I1027 14:07:07.510430       1 pods.go:96] [serving-ns-4/pod-serving-4-2-serving-job] deleteLogicalPort took 2.48213702s, OVN Execute time 5.253645ms
I1027 14:07:10.357940       1 pods.go:96] [serving-ns-1/pod-serving-1-4-serving-job] deleteLogicalPort took 2.839554895s, OVN Execute time 480.743845ms
I1027 14:07:14.773902       1 pods.go:96] [served-ns-30/pod-served-30-1-init-served-job] deleteLogicalPort took 33.833736608s, OVN Execute time 505.080316ms
I1027 14:07:14.774075       1 pods.go:96] [served-ns-20/pod-served-20-1-init-served-job] deleteLogicalPort took 33.841566519s, OVN Execute time 478.20836ms
I1027 14:07:14.774350       1 pods.go:96] [served-ns-12/pod-served-12-1-init-served-job] deleteLogicalPort took 33.834109857s, OVN Execute time 506.48338ms
I1027 14:07:14.774702       1 pods.go:96] [served-ns-25/pod-served-25-1-init-served-job] deleteLogicalPort took 33.842345105s, OVN Execute time 505.194812ms
I1027 14:07:14.785260       1 pods.go:96] [served-ns-26/pod-served-26-1-init-served-job] deleteLogicalPort took 33.852763179s, OVN Execute time 507.785652ms

Note that the OVN execution time is quite small, but the deleteLogicalPort() time is very high, which points to a delay in deleteLogicalPort() or the go-ovn bindings rather than in OVN itself. That may also explain the stale ports: if each deletion takes this long, then by the time the master is killed and restarted it will not have managed to delete many of them.
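
For anyone wanting to quantify that gap from the attached logs, a hedged one-liner (assumes the pods.go:96 field layout shown above; ovnkube-master.log is a placeholder name for the attached log file):

# Print pod, total deleteLogicalPort time, and OVN Execute time per deletion.
grep 'deleteLogicalPort took' ovnkube-master.log | awk '{print $5, $8, $12}'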

At a minimum, these commits can help the delete situation:
https://github.com/openshift/ovn-kubernetes/pull/701/commits/a29d416cc2e6b61250c14790aacad40c6f38535f
https://github.com/openshift/ovn-kubernetes/pull/701/commits/51a30ecadfd9f939780a98b989515603a0ef1533

but they are very go-ovn-centric, so they won't apply to 4.10 after the downstream merge lands.

Comment 6 Mike Fiedler 2021-11-16 15:52:02 UTC
@surya @jlema  From the comments in the PR it appears this fix is good.  Can I mark this bz as VERIFIED?   I do not have an environment to test it in.

Comment 7 Sai Sindhur Malleni 2021-11-16 17:38:52 UTC
It seems like there are more fixes needed to fully address this.

Comment 8 Mike Fiedler 2021-11-16 18:30:03 UTC
@smalleni  Should this move back to ASSIGNED, or will another bz address the remaining issues?

Comment 9 Sai Sindhur Malleni 2021-11-16 18:39:16 UTC
Moving back to assigned makes sense to me.

Comment 10 Mike Fiedler 2021-11-17 14:28:51 UTC
Moving to ASSIGNED for additional fixes per comment 7 and comment 9.  If it is preferable to continue the work in other bzs, please link them here.

Comment 12 Mike Fiedler 2022-01-13 19:29:21 UTC
@jlema Are you able to verify this fix?   I don't have an environment for it.   cc: @smalleni

Comment 15 Jose Castillo Lema 2022-01-28 13:00:50 UTC
Hi @mifiedle,

We were not able to hit the issue on a 120-node env on the latest OCP release.
We ran 3 iterations of node-density light, creating over 2k pods across 35 namespaces.

The average time for all ports to be cleaned up on the OVN side was 1m30s (measured with: oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-leader-only --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port | grep -i served).
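
(For reference, a hedged sketch of how such an average could be sampled, not necessarily the exact harness used here:)

# Poll until no "served/serving" logical switch ports remain and report elapsed time.
start=$(date +%s)
while oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
    ovn-nbctl --no-leader-only --no-heading --format=csv --data=bare --columns=name \
    find Logical_Switch_Port | grep -qi served; do
  sleep 5
done
echo "ports cleaned up after $(( $(date +%s) - start ))s"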

OCP version: 4.10.0-fc.2
Kubernetes Version: v1.23.0+60f5a1c

Comment 16 Surya Seetharaman 2022-01-31 19:20:38 UTC
Hey Mike,

Based on Jose's previous comment, shall we mark this bug as VERIFIED? The 4.9 backport is stuck on this bz.

Comment 18 Mike Fiedler 2022-02-03 12:54:42 UTC
Marking VERIFIED based on comment 15

Comment 20 errata-xmlrpc 2022-03-10 16:22:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 21 Red Hat Bugzilla 2023-09-15 01:16:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days