Description of problem:
The following test fails:
kuryr_tempest_plugin.tests.scenario.test_namespace.TestNamespaceScenario.test_recreate_pod_in_namespace

Traceback (most recent call last):
  File "/home/stack/plugins/kuryr/kuryr_tempest_plugin/tests/scenario/test_namespace.py", line 400, in test_recreate_pod_in_namespace
    " be deleted" % ns_name)
  File "/usr/lib64/python3.6/unittest/case.py", line 855, in assertNotEqual
    raise self.failureException(msg)
AssertionError: 0 == 0 : Timed out waiting for namespace kuryr-ns-1955505019 to be deleted

Version-Release number of selected component (if applicable):
OCP 4.9.0-0.nightly-2021-08-23-224104
OSP RHOS-16.1-RHEL-8-20210604.n.0

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This seems to be caused by a behavior change in cri-o. Specifically, the workflow is now:

1. The pod gets created and scheduled in the API.
2. cri-o issues a CNI ADD request to kuryr-cni; kuryr-daemon receives it.
3. kuryr-daemon waits for the KuryrPort to appear in the API.
4. Step 3 never happens, because the pod got deleted before the KuryrPort was created, so the CNI ADD request hangs.
5. cri-o issues a CNI DEL request, which is served fine, but cri-o won't allow deleting the pod until the CNI ADD times out on its side.
6. CNI ADD eventually times out on the kuryr-daemon side too, but that no longer matters, since it has already timed out in cri-o.

The behavior of waiting for the CNI ADD to finish or time out is most likely new in the 4.9 cri-o.
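For illustration, here is a rough, hypothetical sketch of the polling done in step 3, showing why the CNI ADD appears to hang when the pod is deleted before the KuryrPort is ever created. The function name, the CRD group/version/plural and the timeout value are assumptions made for this sketch; this is not the actual kuryr-daemon code.

# Hypothetical sketch of step 3: kuryr-daemon polling for the KuryrPort CR
# before wiring the pod. If the pod was deleted before kuryr-controller
# created the KuryrPort, this loop only ends when the timeout expires,
# which is what makes the CNI ADD request hang.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def wait_for_kuryrport(namespace, pod_name, timeout=180, interval=1):
    config.load_incluster_config()
    crd_api = client.CustomObjectsApi()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # Assumed CRD coordinates for KuryrPort objects.
            return crd_api.get_namespaced_custom_object(
                group='openstack.org', version='v1',
                namespace=namespace, plural='kuryrports', name=pod_name)
        except ApiException as exc:
            if exc.status != 404:
                raise
        time.sleep(interval)
    raise TimeoutError('KuryrPort for pod %s/%s never appeared'
                       % (namespace, pod_name))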
In terms of a test blocker: we don't consider this very high priority, and we're currently waiting for changes in Multus that would allow us to fix this. In the meantime I suggest ignoring the test results or extending the timeout in the test to more than 5 minutes (see the sketch below).
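A minimal sketch of the kind of wait loop the tempest test uses when checking that the namespace is gone. The retry count and sleep interval below are assumptions, not the values used in the plugin; the point is simply that retries * sleep needs to exceed the roughly 5-minute CNI ADD timeout for the check to pass with the new cri-o behavior.

import time


def wait_for_namespace_deletion(list_namespaces, ns_name,
                                retries=72, sleep=5):
    # 72 retries * 5s = 360s, i.e. more than the ~5 minute CNI ADD timeout.
    # list_namespaces is assumed to be a callable returning current namespaces.
    for _ in range(retries):
        if ns_name not in list_namespaces():
            return True
        time.sleep(sleep)
    return False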
Test passed
OCP 4.11.0-0.nightly-2022-04-24-135651
OSP RHOS-16.2-RHEL-8-20220311.n.1
(In reply to Itzik Brown from comment #8)
> Test passed
> OCP 4.11.0-0.nightly-2022-04-24-135651
> OSP RHOS-16.2-RHEL-8-20220311.n.1

Did it pass after decreasing the timeout [1]?

[1] https://github.com/openstack/kuryr-tempest-plugin/blob/86423cc26cd44234948b0a544a51df5aa835289b/kuryr_tempest_plugin/tests/scenario/test_namespace.py#L393-L394
@Michal, thanks for the reminder. When the retries are reduced to 24, it fails with the same error.
I have just run the updated test 3 times against my 4.11 cluster. It does seem to work without any issues. Are you sure you've tested against a Kuryr build that includes the fix?
I checked with:
4.11.0-0.nightly-2022-05-11-054135
RHOS-16.2-RHEL-8-20220311.n.1

The following tests passed:
"[sig-apps][Feature:DeploymentConfig] deploymentconfigs when run iteratively should immediately start a new deployment [Suite:openshift/conformance/parallel]"
"[sig-builds][Feature:Builds] remove all builds when build configuration is removed oc delete buildconfig should start builds and delete the buildconfig [Suite:openshift/conformance/parallel]"
"[sig-node] Pods Extended Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"

test_recreate_pod_in_namespace fails with 24 retries but succeeds with a higher number.
Yup, I'd mark this as verified. Even with test_recreate_pod_in_namespace failing, it's not Kuryr blocking the deletion, and it also fails with OpenShiftSDN, so that would be a separate new issue that should be handled by the node team. The tests from openshift/conformance/parallel were the ones affected by the Kuryr problem, and they work now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069