Bug 1825372 - e2e: Failing test - Pod Container Status should never report success for a pending container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.z
Assignee: Ted Yu
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1844376
Depends On:
Blocks:
 
Reported: 2020-04-17 19:50 UTC by Cesar Wong
Modified: 2020-07-01 16:08 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-01 16:08:20 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2589 0 None None None 2020-07-01 16:08:47 UTC

Description Cesar Wong 2020-04-17 19:50:28 UTC
Description of problem:
Test 'Pod Container Status should never report success for a pending container' is failing on IBM ROKS clusters


How reproducible:
Always

Steps to Reproduce:
1. Run e2e conformance test on IBM 

Actual results:
Test 'Pod Container Status should never report success for a pending container' fails

Expected results:
Test 'Pod Container Status should never report success for a pending container' should succeed

Additional info:
The delay between the time a pod is created and the time it is deleted can be 0 or very close to 0. That is not enough time for the pod to be set up, volumes to be mounted, etc. The test should be fixed to allow enough time for this:
https://github.com/openshift/origin/blob/4d0922fb92f85f566cb22bbaaedf587e8a50aca4/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L293-L295

Comment 1 Dan Winship 2020-06-11 17:33:28 UTC
Why was this filed as an ibmcloud bug? The analysis above suggests that the test just has a race condition, not that there is something specific to ibmcloud that makes it fail.

Now that the fix/test has been backported, it is completely breaking e2e-aws-4.2 as well: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2

Comment 2 Cesar Wong 2020-06-11 18:09:30 UTC
The only reason is that this test was continuously failing in e2e conformance on IBM Cloud, so we added an explicit skip for it. However, I agree that this is not specific to ibmcloud; it could flake on any platform.

Comment 3 Ryan Phillips 2020-06-11 18:57:07 UTC
I learned from Cesar that IBM is running on RHEL7. I suspect they need to update their crio and runc binaries to the latest released versions.

Comment 4 Ryan Phillips 2020-06-11 19:06:09 UTC
cri-o-1.16.6-14.dev.rhaos4.3.git24e5f4e.el7.x86_64
runc-1.0.0-67.rc10.el7_8.x86_64

These look correct.

Comment 5 Ryan Phillips 2020-06-11 19:17:12 UTC
Related to https://github.com/kubernetes/kubernetes/issues/88766

Comment 6 Ryan Phillips 2020-06-11 20:05:19 UTC
Cesar tested against the IBM release once more and did not see this issue again. We have had lots of runc and crio fixes recently, so I suspect some upgrades might have happened.

I found a minor issue with the log line, which is patched here: https://github.com/kubernetes/kubernetes/pull/92051  (I'll create a separate BZ to backport the change, since it is not related to this issue).

If the issue comes back, certainly please reopen.

Comment 7 Dan Winship 2020-06-12 13:02:21 UTC
OK, well whether or not this is a problem on ibmcloud, it's a problem on e2e-aws-4.2 (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2)

(Also, if this is not a problem on ibmcloud, then the skip rule in test/extended/util/annotate/rules.go needs to be removed.)

Comment 11 Ted Yu 2020-06-16 21:40:34 UTC
While backporting, I found a logistical issue:

https://github.com/openshift/origin/pull/25129#issuecomment-645024258

Comment 12 W. Trevor King 2020-06-18 04:14:05 UTC
TestGrid [1] shows this failing in every release-openshift-origin-installer-e2e-aws-4.2 job that has run the test since it landed via [2,3].  It is blocking 4.2 release acceptance [4].  To cut a new 4.2.z, we need to either revert [3] or find and fix the product bug it is exposing (which is presumably what this ticket is about).  Thoughts on whether this is a product issue that is serious enough to block on?  Do we think the product issue is a regression, or is it a long-running issue that's just being exposed by the new test?  Brackets from "e2e-aws-4.2 mostly passed" to "e2e-aws-4.2 hardly ever passes":

  $ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5550/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
  --- /dev/fd/63	2020-06-17 21:00:26.414902729 -0700
  +++ /dev/fd/62	2020-06-17 21:00:26.415902741 -0700
  @@ -46 +46 @@
  -hyperkube https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
  +hyperkube https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989
  @@ -99 +99 @@
  -tests https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
  +tests https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989

Example failed job [5]:

  [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]
  fail [k8s.io/kubernetes/test/e2e/node/pods.go:445]: Jun  4 22:54:56.278: 30 errors:
  pod pod-submit-status-2-0 on node ip-10-0-133-218.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated
  ...
  pod pod-submit-status-2-14 on node ip-10-0-131-232.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated

stdout for that test has a bunch of:

  Jun  4 22:54:56.335: INFO: At 2020-06-04 22:54:53 +0000 UTC - event for pod-submit-status-2-14: {kubelet ip-10-0-131-232.ec2.internal} FailedCreatePodSandBox: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-submit-status-2-14_e2e-pods-6554_61101339-a6b6-11ea-b6cf-0af56c18fe99_0(01a33bece46dfc4d6cf329b4e2b5f5500d98ca36fc3bc6f912ece9c4cc471233): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/261822/ns/net": no such file or directory

and similar.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-origin-installer-e2e-aws-4.2
[2]: https://github.com/openshift/origin/pull/25036 (this PR was closed, but the same commit landed via [3])
[3]: https://github.com/openshift/origin/pull/25038
[4]: https://openshift-release.svc.ci.openshift.org/#4.2.0-0.nightly
[5]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558

Comment 14 W. Trevor King 2020-06-18 04:32:16 UTC
Breaking 4.2.z for almost two weeks on a new test is bad.  I've filed [1] to revert and drop the test to get back to green CI.  Folks can take another run at this with #25129 and a fixed test without holding up the rest of 4.2.z.  On the other hand, I don't have power to approve origin changes or twiddle the labels to get the revert PR landed, so maybe #25129 will land first anyway ;).

[1]: https://github.com/openshift/origin/pull/251492

Comment 15 Ben Parees 2020-06-18 05:12:57 UTC
correct reversion pr link: https://github.com/openshift/origin/pull/25149

Comment 16 W. Trevor King 2020-06-18 14:13:00 UTC
*** Bug 1844376 has been marked as a duplicate of this bug. ***

Comment 18 Ben Parees 2020-06-18 15:03:26 UTC
Moving this back to NEW. We reverted the introduction of the test in the 4.2.z stream (which is the only place where we saw it breaking):
https://github.com/openshift/origin/pull/25149

But it needs to be unreverted in 4.2.z, and any 4.3/4.4/4.5/4.6 work (this bug was originally filed against 4.5 and targeted to 4.6) needs to be resolved.

Comment 19 Ryan Phillips 2020-06-18 15:15:19 UTC
After some discussion in slack, we reverted the 4.2.z PR and will not be backporting a fix into 4.2.z for a few reasons:

1. 4.2.z is largely for critical and CVE fixes
2. The patch that was reverted works in later releases (4.3, 4.4, and 4.5)
3. The patch relies on a 'fixed' version of CRI-O in later releases

Moving this bug to MODIFIED to validate the patch is reverted.

Comment 22 Sunil Choudhary 2020-06-22 13:44:22 UTC
https://github.com/openshift/origin/pull/25149 is merged, marking it verified.

Comment 24 errata-xmlrpc 2020-07-01 16:08:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2589

