Description of problem:
The following test fails:
kuryr_tempest_plugin.tests.scenario.test_namespace.TestNamespaceScenario.test_recreate_pod_in_namespace

Traceback (most recent call last):
  File "/home/stack/plugins/kuryr/kuryr_tempest_plugin/tests/scenario/test_namespace.py", line 400, in test_recreate_pod_in_namespace
    " be deleted" % ns_name)
  File "/usr/lib64/python3.6/unittest/case.py", line 855, in assertNotEqual
    raise self.failureException(msg)
AssertionError: 0 == 0 : Timed out waiting for namespace kuryr-ns-1955505019 to be deleted

Version-Release number of selected component (if applicable):
OCP 4.9.0-0.nightly-2021-08-23-224104
OSP RHOS-16.1-RHEL-8-20210604.n.0

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This seems to be caused by a behavior change in cri-o. Specifically, the workflow is now:

1. The pod gets created and scheduled in the API.
2. cri-o issues a CNI ADD request to kuryr-cni; kuryr-daemon receives it.
3. kuryr-daemon waits for the KuryrPort to appear in the API.
4. Step 3 never happens, because the pod got deleted before the KuryrPort was created, so the CNI ADD request hangs.
5. cri-o issues a CNI DEL request, which is served fine, but cri-o won't allow deleting the pod until the CNI ADD times out on its side.
6. CNI ADD eventually times out on the kuryr-daemon side too, but that no longer matters, since it has already timed out in cri-o.

The behavior of waiting for the CNI ADD to finish or time out is most likely new in the 4.9 cri-o.
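For illustration, here is a rough, hypothetical sketch of the polling done in step 3, showing why the CNI ADD appears to hang when the pod is deleted before the KuryrPort is ever created. The function name, the CRD group/version/plural and the timeout value are assumptions made for this sketch; this is not the actual kuryr-daemon code.

# Hypothetical sketch of step 3: kuryr-daemon polling for the KuryrPort CR
# before wiring the pod. If the pod was deleted before kuryr-controller
# created the KuryrPort, this loop only ends when the timeout expires,
# which is what makes the CNI ADD request hang.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def wait_for_kuryrport(namespace, pod_name, timeout=180, interval=1):
    config.load_incluster_config()
    crd_api = client.CustomObjectsApi()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # Assumed CRD coordinates for KuryrPort objects.
            return crd_api.get_namespaced_custom_object(
                group='openstack.org', version='v1',
                namespace=namespace, plural='kuryrports', name=pod_name)
        except ApiException as exc:
            if exc.status != 404:
                raise
        time.sleep(interval)
    raise TimeoutError('KuryrPort for pod %s/%s never appeared'
                       % (namespace, pod_name))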
In terms of a test blocker: we don't consider this very high priority, and we're currently waiting for changes in Multus that would allow us to fix this. In the meantime I suggest ignoring the test results or extending the timeout in the test to more than 5 minutes (see the sketch below).
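A minimal sketch of the kind of wait loop the tempest test uses when checking that the namespace is gone. The retry count and sleep interval below are assumptions, not the values used in the plugin; the point is simply that retries * sleep needs to exceed the roughly 5-minute CNI ADD timeout for the check to pass with the new cri-o behavior.

import time


def wait_for_namespace_deletion(list_namespaces, ns_name,
                                retries=72, sleep=5):
    # 72 retries * 5s = 360s, i.e. more than the ~5 minute CNI ADD timeout.
    # list_namespaces is assumed to be a callable returning current namespaces.
    for _ in range(retries):
        if ns_name not in list_namespaces():
            return True
        time.sleep(sleep)
    return False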
Test passed
OCP 4.11.0-0.nightly-2022-04-24-135651
OSP RHOS-16.2-RHEL-8-20220311.n.1
(In reply to Itzik Brown from comment #8)
> Test passed
> OCP 4.11.0-0.nightly-2022-04-24-135651
> OSP RHOS-16.2-RHEL-8-20220311.n.1

Did it pass after decreasing the timeout [1]?

[1] https://github.com/openstack/kuryr-tempest-plugin/blob/86423cc26cd44234948b0a544a51df5aa835289b/kuryr_tempest_plugin/tests/scenario/test_namespace.py#L393-L394
@Michal, thanks for the reminder. When the retries are reduced to 24, it fails with the same error.
I have just run the updated test 3 times against my 4.11 cluster. It does seem to work without any issues. Are you sure you've tested against a Kuryr build that includes the fix?
I checked with:
4.11.0-0.nightly-2022-05-11-054135
RHOS-16.2-RHEL-8-20220311.n.1

The following tests passed:
"[sig-apps][Feature:DeploymentConfig] deploymentconfigs when run iteratively should immediately start a new deployment [Suite:openshift/conformance/parallel]"
"[sig-builds][Feature:Builds] remove all builds when build configuration is removed oc delete buildconfig should start builds and delete the buildconfig [Suite:openshift/conformance/parallel]"
"[sig-node] Pods Extended Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"

test_recreate_pod_in_namespace fails with 24 retries but succeeds with a higher number.
Yup, I'd mark this as verified. Even with test_recreate_pod_in_namespace failing, it's not Kuryr blocking the deletion, and it also fails with OpenShiftSDN, so that would be a separate new issue that should be handled by the node team. The tests from openshift/conformance/parallel were the ones affected by the Kuryr problem, and they work now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069