Bug 1825372 - e2e: Failing test - Pod Container Status should never report success for a pending container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.z
Assignee: Ted Yu
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1844376
Depends On:
Blocks:
 
Reported: 2020-04-17 19:50 UTC by Cesar Wong
Modified: 2020-07-01 16:08 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-01 16:08:20 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2589 0 None None None 2020-07-01 16:08:47 UTC

Description Cesar Wong 2020-04-17 19:50:28 UTC
Description of problem:
Test 'Pod Container Status should never report success for a pending container' is failing on IBM ROKS clusters


How reproducible:
Always

Steps to Reproduce:
1. Run e2e conformance test on IBM 

Actual results:
Test 'Pod Container Status should never report success for a pending container' fails

Expected results:
Test 'Pod Container Status should never report success for a pending container' should succeed

Additional info:
The delay between the time a pod is created and the time it is deleted can be 0 or very close to 0. That is not enough time for the pod to be set up, volumes to be mounted, etc. The test should be fixed to allow enough time for this:
https://github.com/openshift/origin/blob/4d0922fb92f85f566cb22bbaaedf587e8a50aca4/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L293-L295

Comment 1 Dan Winship 2020-06-11 17:33:28 UTC
Why was this filed as an ibmcloud bug? The analysis above suggests that the test just has a race condition, not that there is something specific to ibmcloud that makes it fail.

Now that the fix/test has been backported, it is completely breaking e2e-aws-4.2 as well: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2

Comment 2 Cesar Wong 2020-06-11 18:09:30 UTC
The only reason is that this test was continuously failing in e2e conformance on IBM Cloud, so we added an explicit skip for it. However, I agree that this is not specific to ibmcloud; it could flake on any platform.

Comment 3 Ryan Phillips 2020-06-11 18:57:07 UTC
I learned from Cesar that IBM is running on RHEL7. I suspect they need to update their crio and runc binaries to the latest released versions.

Comment 4 Ryan Phillips 2020-06-11 19:06:09 UTC
cri-o-1.16.6-14.dev.rhaos4.3.git24e5f4e.el7.x86_64
runc-1.0.0-67.rc10.el7_8.x86_64

These look correct.

Comment 5 Ryan Phillips 2020-06-11 19:17:12 UTC
Related to https://github.com/kubernetes/kubernetes/issues/88766

Comment 6 Ryan Phillips 2020-06-11 20:05:19 UTC
Cesar tested against the IBM release once more and did not see this issue again. We have had lots of runc and crio fixes recently, so I suspect some upgrades might have happened.

I found a minor issue with the log line, which is patched here: https://github.com/kubernetes/kubernetes/pull/92051  (I'll create a separate BZ to backport the change, since it is not related to this issue).

If the issue comes back, certainly please reopen.

Comment 7 Dan Winship 2020-06-12 13:02:21 UTC
OK, well whether or not this is a problem on ibmcloud, it's a problem on e2e-aws-4.2 (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2)

(Also, if this is not a problem on ibmcloud, then the skip rule in test/extended/util/annotate/rules.go needs to be removed.)

Comment 11 Ted Yu 2020-06-16 21:40:34 UTC
While backporting, I found a logistical issue:

https://github.com/openshift/origin/pull/25129#issuecomment-645024258

Comment 12 W. Trevor King 2020-06-18 04:14:05 UTC
TestGrid [1] shows this failing in every release-openshift-origin-installer-e2e-aws-4.2 job that has run the test since it landed via [2,3].  It is blocking 4.2 release acceptance [4].  To cut a new 4.2.z, we need to either revert [3] or find and fix the product bug it is exposing (which is presumably what this ticket is about).  Thoughts on whether this is a product issue that is serious enough to block on?  Do we think the product issue is a regression, or is it a long-running issue that's just being exposed by the new test?  Brackets from "e2e-aws-4.2 mostly passed" to "e2e-aws-4.2 hardly ever passes":

  $ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5550/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
  --- /dev/fd/63	2020-06-17 21:00:26.414902729 -0700
  +++ /dev/fd/62	2020-06-17 21:00:26.415902741 -0700
  @@ -46 +46 @@
  -hyperkube https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
  +hyperkube https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989
  @@ -99 +99 @@
  -tests https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
  +tests https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989

Example failed job [5]:

  [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]
  fail [k8s.io/kubernetes/test/e2e/node/pods.go:445]: Jun  4 22:54:56.278: 30 errors:
  pod pod-submit-status-2-0 on node ip-10-0-133-218.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated
  ...
  pod pod-submit-status-2-14 on node ip-10-0-131-232.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated

stdout for that test has a bunch of:

  Jun  4 22:54:56.335: INFO: At 2020-06-04 22:54:53 +0000 UTC - event for pod-submit-status-2-14: {kubelet ip-10-0-131-232.ec2.internal} FailedCreatePodSandBox: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-submit-status-2-14_e2e-pods-6554_61101339-a6b6-11ea-b6cf-0af56c18fe99_0(01a33bece46dfc4d6cf329b4e2b5f5500d98ca36fc3bc6f912ece9c4cc471233): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/261822/ns/net": no such file or directory

and similar.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-origin-installer-e2e-aws-4.2
[2]: https://github.com/openshift/origin/pull/25036 (this PR was closed, but the same commit landed via [3])
[3]: https://github.com/openshift/origin/pull/25038
[4]: https://openshift-release.svc.ci.openshift.org/#4.2.0-0.nightly
[5]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558

Comment 14 W. Trevor King 2020-06-18 04:32:16 UTC
Breaking 4.2.z for almost two weeks on a new test is bad.  I've filed [1] to revert and drop the test to get back to green CI.  Folks can take another run at this with #25129 and a fixed test without holding up the rest of 4.2.z.  On the other hand, I don't have power to approve origin changes or twiddle the labels to get the revert PR landed, so maybe #25129 will land first anyway ;).

[1]: https://github.com/openshift/origin/pull/251492

Comment 15 Ben Parees 2020-06-18 05:12:57 UTC
correct reversion pr link: https://github.com/openshift/origin/pull/25149

Comment 16 W. Trevor King 2020-06-18 14:13:00 UTC
*** Bug 1844376 has been marked as a duplicate of this bug. ***

Comment 18 Ben Parees 2020-06-18 15:03:26 UTC
Moving this back to NEW. We reverted the introduction of the test in the 4.2.z stream (which is the only place where we saw it breaking):
https://github.com/openshift/origin/pull/25149

But it needs to be unreverted in 4.2.z, and any 4.3/4.4/4.5/4.6 work (this bug was originally filed against 4.5 and targeted to 4.6) needs to be resolved.

Comment 19 Ryan Phillips 2020-06-18 15:15:19 UTC
After some discussion in slack, we reverted the 4.2.z PR and will not be backporting a fix into 4.2.z for a few reasons:

1. 4.2.z is largely for critical and CVE fixes
2. The patch that was reverted works in later releases (4.3, 4.4, and 4.5)
3. The patch relies on a 'fixed' version of CRI-O in later releases

Moving this bug to MODIFIED to validate the patch is reverted.

Comment 22 Sunil Choudhary 2020-06-22 13:44:22 UTC
https://github.com/openshift/origin/pull/25149 is merged, marking it verified.

Comment 24 errata-xmlrpc 2020-07-01 16:08:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2589

