Description of problem:
Test 'Pod Container Status should never report success for a pending container' is failing on IBM ROKS clusters.

How reproducible:
Always

Steps to Reproduce:
1. Run the e2e conformance test suite on IBM ROKS

Actual results:
Test 'Pod Container Status should never report success for a pending container' fails

Expected results:
Test 'Pod Container Status should never report success for a pending container' should succeed

Additional info:
The delay between the time a pod is created and the time it is deleted can be 0 or very close to 0. That is not enough time for the pod to be set up, volumes mounted, etc. The test should be fixed to allow enough time for this:
https://github.com/openshift/origin/blob/4d0922fb92f85f566cb22bbaaedf587e8a50aca4/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L293-L295
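To make the timing concern concrete, here is a minimal Go sketch (not the actual pods.go code) of the kind of floor the test could put on the delay before deleting a pod. The names minPodLifetime and deleteDelay and the 2-second value are illustrative assumptions, not taken from the test:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed floor (illustrative value) for how long a pod must exist before
// the test deletes it, so the sandbox can be created and volumes mounted.
const minPodLifetime = 2 * time.Second

// deleteDelay picks a random delay, as the test conceptually does today,
// but never returns a value below the floor.
func deleteDelay(r *rand.Rand) time.Duration {
	d := time.Duration(r.Int63n(int64(5 * time.Second)))
	if d < minPodLifetime {
		d = minPodLifetime
	}
	return d
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for i := 0; i < 3; i++ {
		fmt.Println(deleteDelay(r))
	}
}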
Why was this filed as an ibmcloud bug? The analysis above suggests that the test just has a race condition, not that there is something specific to ibmcloud that makes it fail. Now that the fix/test has been backported, it is completely breaking e2e-aws-4.2 as well: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2
The only reason is that this test was continuously failing for e2e conformance on IBM Cloud, and we added a skip explicitly for it. However, I agree that this is not specific to ibmcloud; it could flake on any platform.
I learned from Cesar that IBM is running on RHEL7. I suspect they need to update their crio and runc binaries to the latest released versions.
cri-o-1.16.6-14.dev.rhaos4.3.git24e5f4e.el7.x86_64
runc-1.0.0-67.rc10.el7_8.x86_64

These look correct.
Related to https://github.com/kubernetes/kubernetes/issues/88766
Cesar tested against the IBM release once more and did not see this issue again. We have had lots of runc and cri-o fixes recently, so I suspect some upgrades might have happened. I found a minor issue with the log line, which is patched here: https://github.com/kubernetes/kubernetes/pull/92051 (I'll create a separate BZ to backport the change, since it is not related to this issue). If the issue comes back, please do reopen.
OK, well, whether or not this is a problem on ibmcloud, it's a problem on e2e-aws-4.2 (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-ocp-installer-e2e-aws-4.2). (Also, if this is not a problem on ibmcloud, then the skip rule in test/extended/util/annotate/rules.go needs to be removed.)
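For reference, the ibmcloud skip is expressed as an annotation rule in that file. A rough, hypothetical sketch of what such an entry looks like follows; the map name testMaps and the matcher text are assumptions, not copied from rules.go:

package main

import "fmt"

// Hypothetical shape of a skip annotation rule; the real rules.go in
// openshift/origin may use a different map name and structure.
var testMaps = map[string][]string{
	"[Skipped:ibmcloud]": {
		// Pattern matched against the test name; removing this entry
		// would re-enable the test on ibmcloud.
		`should never report success for a pending container`,
	},
}

func main() {
	for label, patterns := range testMaps {
		fmt.Println(label, patterns)
	}
}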
Got a passing test on 4.2 branch PR: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/25129/pull-ci-openshift-origin-release-4.2-e2e-aws/1272916797271576576
During backporting, I found a logistical issue: https://github.com/openshift/origin/pull/25129#issuecomment-645024258
TestGrid [1] shows this failing in every release-openshift-origin-installer-e2e-aws-4.2 job that has run the test since it landed via [2,3]. It is blocking 4.2 release acceptance [4]. To cut a new 4.2.z, we need to either revert [3] or find and fix the product bug it is exposing (which is presumably what this ticket is about). Thoughts on whether this is a product issue that is serious enough to block on? Do we think the product issue is a regression, or is it a long-running issue that's just being exposed by the new test?

Brackets from "e2e-aws-4.2 mostly passed" to "e2e-aws-4.2 hardly ever passes":

$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5550/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
--- /dev/fd/63	2020-06-17 21:00:26.414902729 -0700
+++ /dev/fd/62	2020-06-17 21:00:26.415902741 -0700
@@ -46 +46 @@
-hyperkube https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
+hyperkube https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989
@@ -99 +99 @@
-tests https://github.com/openshift/origin/commit/bf8cb33e0df6eb4847a28f23ddde8ce306151cf2
+tests https://github.com/openshift/origin/commit/77869edca67678ff9f6575094bae0ede1314f989

Example failed job [5]:

[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]
fail [k8s.io/kubernetes/test/e2e/node/pods.go:445]: Jun 4 22:54:56.278: 30 errors:
pod pod-submit-status-2-0 on node ip-10-0-133-218.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated
...
pod pod-submit-status-2-14 on node ip-10-0-131-232.ec2.internal container unexpected exit code 137: start=0001-01-01 00:00:00 +0000 UTC end=0001-01-01 00:00:00 +0000 UTC reason=ContainerStatusUnknown message=The container could not be located when the pod was terminated

stdout for that test has a bunch of:

Jun 4 22:54:56.335: INFO: At 2020-06-04 22:54:53 +0000 UTC - event for pod-submit-status-2-14: {kubelet ip-10-0-131-232.ec2.internal} FailedCreatePodSandBox: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-submit-status-2-14_e2e-pods-6554_61101339-a6b6-11ea-b6cf-0af56c18fe99_0(01a33bece46dfc4d6cf329b4e2b5f5500d98ca36fc3bc6f912ece9c4cc471233): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/261822/ns/net": no such file or directory

and similar.
[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-blocking#release-openshift-origin-installer-e2e-aws-4.2
[2]: https://github.com/openshift/origin/pull/25036 (this PR was closed, but the same commit landed via [3])
[3]: https://github.com/openshift/origin/pull/25038
[4]: https://openshift-release.svc.ci.openshift.org/#4.2.0-0.nightly
[5]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/5558
Breaking 4.2.z for almost two weeks on a new test is bad. I've filed [1] to revert and drop the test to get back to green CI. Folks can take another run at this with #25129 and a fixed test without holding up the rest of 4.2.z. On the other hand, I don't have power to approve origin changes or twiddle the labels to get the revert PR landed, so maybe #25129 will land first anyway ;).

[1]: https://github.com/openshift/origin/pull/251492
Correct revert PR link: https://github.com/openshift/origin/pull/25149
*** Bug 1844376 has been marked as a duplicate of this bug. ***
Moving this back to NEW. We reverted the introduction of the test in the 4.2.z stream (which is the only place where we saw it breaking): https://github.com/openshift/origin/pull/25149

But it needs to be unreverted in 4.2.z, and any 4.3/4.4/4.5/4.6 work (this bug was originally filed against 4.5 and targeted to 4.6) needs to be resolved.
After some discussion in Slack, we reverted the 4.2.z PR and will not be backporting a fix into 4.2.z, for a few reasons:

1. 4.2.z is largely for critical and CVE fixes
2. The patch that was reverted works in later releases (4.3, 4.4, and 4.5)
3. The patch relies on a 'fixed' version of CRI-O in later releases

Moving this bug to MODIFIED to validate that the patch is reverted.
https://github.com/openshift/origin/pull/25149 is merged, marking it verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2589