Created attachment 1837679 [details]
Logs from ovnkube-master

Description of problem:

After running a successful node density test that created over 9k pods in a 500 node env across several namespaces (420 in total), and then deleting all the corresponding namespaces at once, we see a lot of stale logical switch port entries hours after the deletion of the projects:

$ oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port | grep -i served | wc -l
8907

Besides this, the environment cannot create new pods at this point; every pod creation fails with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors:

Warning  FailedCreatePodSandBox  50m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app4_jose_24421980-37aa-4a77-a626-a05d94062bcf_0(ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42): error adding pod jose_app4 to CNI network "multus-cni-network": [jose/app4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] [jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-21-105053

How reproducible:
We have observed this behaviour on several occasions.

Steps to Reproduce:
1. Create a large number of pods across several namespaces
2. Delete all the namespaces at once
3. Wait for the namespaces to be completely deleted
4. Check for stale logical switch ports
5. Try to create new pods

Actual results:
- 8907 stale logical switch ports
- new pods stuck in ContainerCreating status with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors

Expected results:
- 0 stale logical switch ports
- new pods created successfully

Additional info:
We are uploading the logs corresponding to the ovnkube-master. After restarting the ovnkube-master, we can see in the logs:

I1027 16:55:10.097825 1 pods.go:71] Stale logical port found: serving-ns-96_pod-serving-96-1-serving-job. This logical port will be deleted.
I1027 16:55:10.100897 1 pods.go:71] Stale logical port found: serving-ns-101_pod-serving-101-4-serving-job. This logical port will be deleted.
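For reference, the check in step 4 can be scripted along these lines. This is only a rough sketch: it assumes the ovnkube-master pods carry the app=ovnkube-master label and that all the test namespaces/pods contain "served" in their names (both true in our environment, but worth double-checking elsewhere).

# Pick one ovnkube-master pod (label is an assumption; adjust if it differs in your cluster)
POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath='{.items[0].metadata.name}')

# Count the logical switch ports left behind by the test namespaces
oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
  /usr/bin/ovn-nbctl --no-heading --format=csv --data=bare --columns=name \
  find Logical_Switch_Port | grep -i served | wc -l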
We'll need to get the full logs from the master covering both the addition and the deletion; the attached logs start long after anything in the 'jose' namespace is done. WRT the 'serving-ns'/'served-ns' pods and namespaces, I do see:

I1027 14:06:40.932262 1 pods.go:96] [served-ns-2/pod-served-2-1-init-served-job] deleteLogicalPort took 32.560445474s, OVN Execute time 377.48305ms
I1027 14:06:40.932354 1 pods.go:96] [served-ns-5/pod-served-5-1-init-served-job] deleteLogicalPort took 32.662183173s, OVN Execute time 359.479932ms
I1027 14:06:40.932449 1 pods.go:96] [served-ns-1/pod-served-1-1-init-served-job] deleteLogicalPort took 32.984352481s, OVN Execute time 632.951578ms
I1027 14:06:40.940041 1 pods.go:96] [served-ns-8/pod-served-8-1-init-served-job] deleteLogicalPort took 31.881497241s, OVN Execute time 2.875127ms
I1027 14:06:40.940083 1 pods.go:96] [served-ns-9/pod-served-9-1-init-served-job] deleteLogicalPort took 32.362543687s, OVN Execute time 427.975246ms
I1027 14:06:40.941625 1 pods.go:96] [served-ns-3/pod-served-3-1-init-served-job] deleteLogicalPort took 32.706343451s, OVN Execute time 380.315936ms
I1027 14:06:41.419039 1 pods.go:96] [served-ns-4/pod-served-4-1-init-served-job] deleteLogicalPort took 33.07509725s, OVN Execute time 283.988979ms
I1027 14:06:43.006119 1 pods.go:96] [served-ns-11/pod-served-11-1-init-served-job] deleteLogicalPort took 33.804965501s, OVN Execute time 287.680675ms
I1027 14:06:43.006230 1 pods.go:96] [served-ns-13/pod-served-13-1-init-served-job] deleteLogicalPort took 33.503430781s, OVN Execute time 3.321276ms
I1027 14:06:43.006285 1 pods.go:96] [served-ns-6/pod-served-6-1-init-served-job] deleteLogicalPort took 33.671207635s, OVN Execute time 151.03851ms
I1027 14:06:43.429970 1 pods.go:96] [served-ns-16/pod-served-16-1-init-served-job] deleteLogicalPort took 33.845002332s, OVN Execute time 428.58365ms
I1027 14:06:43.887650 1 pods.go:96] [served-ns-17/pod-served-17-1-init-served-job] deleteLogicalPort took 33.755352474s, OVN Execute time 451.605843ms
I1027 14:06:43.887670 1 pods.go:96] [served-ns-18/pod-served-18-1-init-served-job] deleteLogicalPort took 33.449237815s, OVN Execute time 153.009803ms
I1027 14:06:43.887894 1 pods.go:96] [served-ns-15/pod-served-15-1-init-served-job] deleteLogicalPort took 33.60267793s, OVN Execute time 297.599038ms
I1027 14:06:44.322962 1 pods.go:96] [served-ns-19/pod-served-19-1-init-served-job] deleteLogicalPort took 33.560626928s, OVN Execute time 310.298051ms
I1027 14:07:07.510430 1 pods.go:96] [serving-ns-4/pod-serving-4-2-serving-job] deleteLogicalPort took 2.48213702s, OVN Execute time 5.253645ms
I1027 14:07:10.357940 1 pods.go:96] [serving-ns-1/pod-serving-1-4-serving-job] deleteLogicalPort took 2.839554895s, OVN Execute time 480.743845ms
I1027 14:07:14.773902 1 pods.go:96] [served-ns-30/pod-served-30-1-init-served-job] deleteLogicalPort took 33.833736608s, OVN Execute time 505.080316ms
I1027 14:07:14.774075 1 pods.go:96] [served-ns-20/pod-served-20-1-init-served-job] deleteLogicalPort took 33.841566519s, OVN Execute time 478.20836ms
I1027 14:07:14.774350 1 pods.go:96] [served-ns-12/pod-served-12-1-init-served-job] deleteLogicalPort took 33.834109857s, OVN Execute time 506.48338ms
I1027 14:07:14.774702 1 pods.go:96] [served-ns-25/pod-served-25-1-init-served-job] deleteLogicalPort took 33.842345105s, OVN Execute time 505.194812ms
I1027 14:07:14.785260 1 pods.go:96] [served-ns-26/pod-served-26-1-init-served-job] deleteLogicalPort took 33.852763179s, OVN Execute time 507.785652ms

Note that the OVN Execute time is pretty small, but the deleteLogicalPort() time is very high.
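To look at that gap across all the deletions rather than eyeballing individual lines, something like the following rough sketch can be run against the ovnkube-master log. The awk field positions assume exactly the log format quoted above, durations are assumed to be printed in seconds, and $POD is the same ovnkube-master pod as in the earlier commands.

# Print the slowest deleteLogicalPort calls: duration first, then the namespace/pod key
oc -n openshift-ovn-kubernetes logs "$POD" -c ovnkube-master \
  | grep 'deleteLogicalPort took' \
  | awk '{print $8, $5}' \
  | sort -rn | head -20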
That gap could indicate a delay problem in deleteLogicalPort() or the go-ovn bindings rather than in OVN itself. It may also explain the stale ports: if each port takes this long to delete, then by the time the master is killed and restarted it won't have managed to delete many of them. At a minimum, these commits should help the delete path:

https://github.com/openshift/ovn-kubernetes/pull/701/commits/a29d416cc2e6b61250c14790aacad40c6f38535f
https://github.com/openshift/ovn-kubernetes/pull/701/commits/51a30ecadfd9f939780a98b989515603a0ef1533

However, they are very go-ovn centric, so they won't apply to 4.10 once the downstream merge lands.
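As a stop-gap on builds without those commits, the stale ports could in principle be removed by hand. This is only a sketch under the same assumptions as above (the "served" name filter and $POD come from this environment); it only deletes the Logical_Switch_Port rows and may leave other per-pod OVN state behind, so restarting ovnkube-master and letting its own stale-port cleanup run is the safer option.

# Delete each stale port; --if-exists avoids an error if ovnkube-master
# removes the port before we get to it.
oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
  /usr/bin/ovn-nbctl --no-heading --format=csv --data=bare --columns=name \
  find Logical_Switch_Port | grep -i served \
| while read -r lsp; do
    oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
      /usr/bin/ovn-nbctl --if-exists lsp-del "$lsp"
  done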
@surya @jlema From the comments in the PR it appears this fix is good. Can I mark this bz as VERIFIED? I do not have an environment to test it in.
It seems like there are more fixes needed to fully address this.
@smalleni Should this move back to ASSIGNED, or will another bz address the remaining issues?
Moving back to assigned makes sense to me.
Moving to ASSIGNED for additional fixes per comment 7 and comment 9. If it is preferable to continue the work in other bzs, please link them here.
@jlema Are you able to verify this fix? I don't have an environment for it. cc: @smalleni
Hi @mifiedle, we were not able to hit the issue on a 120 node env on the latest OCP release. We ran 3 iterations of node-density light, creating over 2k pods across 35 namespaces. The average time for all ports to be cleaned up on the OVN side was 1m30s, as measured with:

oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-leader-only --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port | grep -i served

OCP version: 4.10.0-fc.2
Kubernetes version: v1.23.0+60f5a1c
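A polling loop along these lines is one way to take that measurement; this is just a sketch, reusing the same $POD and "served" name filter from the earlier comments, and the 5s poll interval is arbitrary.

# After deleting the namespaces, poll the stale-port count and report how
# long it takes to drain to zero.
start=$(date +%s)
while true; do
  count=$(oc exec -c ovnkube-master -n openshift-ovn-kubernetes "$POD" -- \
    /usr/bin/ovn-nbctl --no-leader-only --no-heading --format=csv --data=bare \
    --columns=name find Logical_Switch_Port | grep -ci served)
  [ "$count" -eq 0 ] && break
  sleep 5
done
echo "all ports cleaned up after $(( $(date +%s) - start ))s"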
Hey Mike, based on Jose's previous comment, shall we mark this bug as VERIFIED? The 4.9 backport is stuck behind this bz.
Marking VERIFIED based on comment 15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days