Created attachment 1837679 [details]
Logs from ovnkube-master
Description of problem:
After running a successful node-density test that created over 9k pods across 420 namespaces in a 500-node environment, and then deleting all of those namespaces at once, we see a large number of stale logical switch port entries hours after the projects were deleted:
$ oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port| grep -i served | wc -l
Besides this, the environment cannot create new pods at this point; every pod creation fails with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors:
Warning FailedCreatePodSandBox 50m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app4_jose_24421980-37aa-4a77-a626-a05d94062bcf_0(ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42): error adding pod jose_app4 to CNI network "multus-cni-network": [jose/app4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] [jose/app4 ef4392b10bdbf271d27948a7ff32ffdf2eb322248f12278a785391a092fdee42] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
Version-Release number of selected component (if applicable):
We have observed this behaviour on several occasions.
Steps to Reproduce:
1. Create a large number of pods among several namespaces
2. Delete all namespaces at once
3. Wait for the namespaces to be completely deleted
4. Check for stale logical switch ports
5. Try to create new pods
Actual results:
- 8907 stale logical switch ports
- new pods stuck in ContainerCreating status with "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded" errors
Expected results:
- 0 stale logical switch ports
- new pods created successfully
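For reference, the stale-port check in step 4 can be scripted over the ovn-nbctl listing; a minimal sketch (the sample output below is illustrative — on a live cluster the names come from the oc exec / ovn-nbctl query shown above):

```python
# Count logical switch ports whose names match the deleted test namespaces.
# On a cluster, this listing would come from:
#   ovn-nbctl --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port
sample_output = """\
served-ns-2_pod-served-2-1-init-served-job
served-ns-5_pod-served-5-1-init-served-job
openshift-monitoring_prometheus-k8s-0
"""

# Mirrors the `grep -i served | wc -l` filter used in the report.
stale = [name for name in sample_output.splitlines() if "served" in name.lower()]
print(len(stale))  # only the test-namespace ports are counted
```

Any nonzero count long after the namespaces are gone indicates ports that ovnkube-master failed to clean up.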
We are uploading the logs corresponding to the ovnkube-master.
After restarting the ovnkube-master, we can see in the logs:
I1027 16:55:10.097825 1 pods.go:71] Stale logical port found: serving-ns-96_pod-serving-96-1-serving-job. This logical port will be deleted.
I1027 16:55:10.100897 1 pods.go:71] Stale logical port found: serving-ns-101_pod-serving-101-4-serving-job. This logical port will be deleted.
We'll need to get the full logs from the master over the addition and deletion; the attached logs are long after anything in 'jose' namespace is done.
WRT the 'serving-ns/served-ns' pods and namespaces I do see:
I1027 14:06:40.932262 1 pods.go:96] [served-ns-2/pod-served-2-1-init-served-job] deleteLogicalPort took 32.560445474s, OVN Execute time 377.48305ms
I1027 14:06:40.932354 1 pods.go:96] [served-ns-5/pod-served-5-1-init-served-job] deleteLogicalPort took 32.662183173s, OVN Execute time 359.479932ms
I1027 14:06:40.932449 1 pods.go:96] [served-ns-1/pod-served-1-1-init-served-job] deleteLogicalPort took 32.984352481s, OVN Execute time 632.951578ms
I1027 14:06:40.940041 1 pods.go:96] [served-ns-8/pod-served-8-1-init-served-job] deleteLogicalPort took 31.881497241s, OVN Execute time 2.875127ms
I1027 14:06:40.940083 1 pods.go:96] [served-ns-9/pod-served-9-1-init-served-job] deleteLogicalPort took 32.362543687s, OVN Execute time 427.975246ms
I1027 14:06:40.941625 1 pods.go:96] [served-ns-3/pod-served-3-1-init-served-job] deleteLogicalPort took 32.706343451s, OVN Execute time 380.315936ms
I1027 14:06:41.419039 1 pods.go:96] [served-ns-4/pod-served-4-1-init-served-job] deleteLogicalPort took 33.07509725s, OVN Execute time 283.988979ms
I1027 14:06:43.006119 1 pods.go:96] [served-ns-11/pod-served-11-1-init-served-job] deleteLogicalPort took 33.804965501s, OVN Execute time 287.680675ms
I1027 14:06:43.006230 1 pods.go:96] [served-ns-13/pod-served-13-1-init-served-job] deleteLogicalPort took 33.503430781s, OVN Execute time 3.321276ms
I1027 14:06:43.006285 1 pods.go:96] [served-ns-6/pod-served-6-1-init-served-job] deleteLogicalPort took 33.671207635s, OVN Execute time 151.03851ms
I1027 14:06:43.429970 1 pods.go:96] [served-ns-16/pod-served-16-1-init-served-job] deleteLogicalPort took 33.845002332s, OVN Execute time 428.58365ms
I1027 14:06:43.887650 1 pods.go:96] [served-ns-17/pod-served-17-1-init-served-job] deleteLogicalPort took 33.755352474s, OVN Execute time 451.605843ms
I1027 14:06:43.887670 1 pods.go:96] [served-ns-18/pod-served-18-1-init-served-job] deleteLogicalPort took 33.449237815s, OVN Execute time 153.009803ms
I1027 14:06:43.887894 1 pods.go:96] [served-ns-15/pod-served-15-1-init-served-job] deleteLogicalPort took 33.60267793s, OVN Execute time 297.599038ms
I1027 14:06:44.322962 1 pods.go:96] [served-ns-19/pod-served-19-1-init-served-job] deleteLogicalPort took 33.560626928s, OVN Execute time 310.298051ms
I1027 14:07:07.510430 1 pods.go:96] [serving-ns-4/pod-serving-4-2-serving-job] deleteLogicalPort took 2.48213702s, OVN Execute time 5.253645ms
I1027 14:07:10.357940 1 pods.go:96] [serving-ns-1/pod-serving-1-4-serving-job] deleteLogicalPort took 2.839554895s, OVN Execute time 480.743845ms
I1027 14:07:14.773902 1 pods.go:96] [served-ns-30/pod-served-30-1-init-served-job] deleteLogicalPort took 33.833736608s, OVN Execute time 505.080316ms
I1027 14:07:14.774075 1 pods.go:96] [served-ns-20/pod-served-20-1-init-served-job] deleteLogicalPort took 33.841566519s, OVN Execute time 478.20836ms
I1027 14:07:14.774350 1 pods.go:96] [served-ns-12/pod-served-12-1-init-served-job] deleteLogicalPort took 33.834109857s, OVN Execute time 506.48338ms
I1027 14:07:14.774702 1 pods.go:96] [served-ns-25/pod-served-25-1-init-served-job] deleteLogicalPort took 33.842345105s, OVN Execute time 505.194812ms
I1027 14:07:14.785260 1 pods.go:96] [served-ns-26/pod-served-26-1-init-served-job] deleteLogicalPort took 33.852763179s, OVN Execute time 507.785652ms
Note that the OVN execution time is quite small, while the overall deleteLogicalPort() time is very high. This points to a delay in deleteLogicalPort() or the go-ovn bindings rather than in OVN itself. It may also explain the stale ports: if each deletion takes this long, then by the time the master is killed and restarted it will not have managed to delete many of them.
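To make the gap concrete, the timing lines above can be parsed to show how little of each deletion is spent inside OVN; a small sketch (the sample lines are copied from the log excerpt, and the regex is an assumption about the log format):

```python
import re

# Two deleteLogicalPort timing lines from the ovnkube-master log above.
logs = """\
I1027 14:06:40.932262 1 pods.go:96] [served-ns-2/pod-served-2-1-init-served-job] deleteLogicalPort took 32.560445474s, OVN Execute time 377.48305ms
I1027 14:07:07.510430 1 pods.go:96] [serving-ns-4/pod-serving-4-2-serving-job] deleteLogicalPort took 2.48213702s, OVN Execute time 5.253645ms
"""

pattern = re.compile(
    r"\[(\S+)\] deleteLogicalPort took ([\d.]+)s, OVN Execute time ([\d.]+)ms"
)

for line in logs.splitlines():
    m = pattern.search(line)
    if not m:
        continue
    pod = m.group(1)
    total_s = float(m.group(2))
    ovn_s = float(m.group(3)) / 1000.0
    # Everything outside the OVN transaction: go-ovn bindings, locking, etc.
    overhead_s = total_s - ovn_s
    print(f"{pod}: total={total_s:.1f}s ovn={ovn_s:.3f}s non-OVN={overhead_s:.1f}s")
```

For the first line this shows roughly 32.2s spent outside the 0.38s OVN transaction, which is the delay the comment above is pointing at.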
At a minimum, these commits can help the delete situation:
but they are very go-ovn-centric, so they won't apply to 4.10 after the downstream merge lands.
@surya @jlema From the comments in the PR it appears this fix is good. Can I mark this bz as VERIFIED? I do not have an environment to test it in.
It seems like there are more fixes needed to fully address this.
@smalleni Should this move back to ASSIGNED? or another bz will address the remaining issues?
Moving back to assigned makes sense to me.
Moving to ASSIGNED for additional fixes per comment 7 and comment 9. If it is preferable to continue work in other bz, please link them here.
@jlema Are you able to verify this fix? I don't have an environment for it. cc: @smalleni
We were not able to hit the issue on a 120-node environment on the latest OCP release.
We ran 3 iterations of node-density light, creating over 2k pods across 35 namespaces.
Average time for all ports to be cleaned up from the OVN side was 1m30s (oc exec -c ovnkube-master -n openshift-ovn-kubernetes $POD -- /usr/bin/ovn-nbctl --no-leader-only --no-heading --format=csv --data=bare --columns=name find Logical_Switch_Port | grep -i served).
OCP version: 4.10.0-fc.2
Kubernetes Version: v1.23.0+60f5a1c
Based on Jose's previous comment, shall we mark this bug as VERIFIED? The 4.9 backport is blocked on this bz.
Marking VERIFIED based on comment 15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.