Bug 1929685

Summary:	[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Product:	OpenShift Container Platform	Reporter:	Federico Paolinelli <fpaoline>
Component:	Node	Assignee:	Harshal Patil <harpatil>
Node sub component:	Kubelet	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	adam.kaplan, aos-bugs, ehashman, harpatil, jsafrane, nagrawal, obulatov, tsmetana, wking
Version:	4.8
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Last Closed:	2021-04-29 09:31:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Federico Paolinelli 2021-02-17 12:48:27 UTC

test:
[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bk8s%5C.io%5C%5D+%5C%5Bsig-node%5C%5D+Pods+Extended+%5C%5Bk8s%5C.io%5C%5D+Pod+Container+Status+should+never+report+success+for+a+pending+container


FIXME: Replace this paragraph with a particular job URI from the search results to ground discussion.  A given test may fail for several reasons, and this bug should be scoped to one of those reasons.  Ideally you'd pick a job showing the most-common reason, but since that's hard to determine, you may also choose to pick a job at random.  Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log

Comment 2 Oleg Bulatov 2021-02-24 13:58:33 UTC

The job often fails with a message like:

fail [runtime/asm_amd64.s:1374]: Feb 24 08:51:11.751: timed out waiting for watch events for pod-submit-status-0-12

I found it at [1]. The last message in the node journal about pod-submit-status-0-12 is:

Feb 24 08:46:22.476184 ip-10-0-189-71 hyperkube[1480]: E0224 08:46:22.473971    1480 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "pod-submit-status-2-14_e2e-pods-9926(52e49bc7-0394-4421-8ed5-55d3435e7a88)" failed: rpc error: code = Unknown desc = exit status 1: write child: broken pipe

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.8/1364484752576352256

Comment 3 Harshal Patil 2021-02-25 06:27:49 UTC

*** Bug 1931628 has been marked as a duplicate of this bug. ***

Comment 5 Tomas Smetana 2021-03-31 11:58:56 UTC

This really doesn't seem to be a storage problem: secrets and configmaps don't fall under the Storage area and the pods in question don't use any persistent volumes. However... It's interesting:

The test in question attempts to start several pods that run /bin/false (i.e. fail on purpose) and are deleted after a delay that differs in every test run and might be just a few milliseconds. The test collects the pod events in a goroutine to check them after the pod is deleted. Very often the pod actually fails to start because the delete request comes almost immediately after the pod creation: I assume this is the source of the failed mount errors, where the cleanup and startup procedures interleave. In the failed cases the pod seems to actually manage to get started but gets stuck in the "Terminated" state, and the test fails because there is a 5-minute timeout for the events collection. This is the actual reason for the test failure: the delete API call is in the logs, but the delete event is either missed or never arrives.

Comment 6 Tomas Smetana 2021-04-01 11:46:57 UTC

I'm sorry but I have to bounce this one back to the Node team: there's nothing much storage related in the logs and the failure seems more like a race in kubelet or perhaps the test itself.

Comment 7 Harshal Patil 2021-04-09 11:21:34 UTC

*** Bug 1941726 has been marked as a duplicate of this bug. ***

Comment 8 W. Trevor King 2021-04-22 03:56:57 UTC

Raising severity based on CI frequency:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Pod+Container+Status+should+never+report+success+for+a+pending+container' | grep 'failures match' | sort
endurance-e2e-aws-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 10 runs, 10% failed, 100% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 16 runs, 56% failed, 33% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 15 runs, 93% failed, 21% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-aws-workers-rhel7 (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-fips (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp-rt (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 14 runs, 43% failed, 100% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 8 runs, 63% failed, 20% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-workers-rhel7 (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt (all) - 8 runs, 100% failed, 63% of failures match = 63% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi (all) - 11 runs, 45% failed, 60% of failures match = 27% impact
periodic-ci-openshift-release-master-okd-4.6-e2e-aws (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.7-e2e-vsphere (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
promote-release-openshift-machine-os-content-e2e-aws-4.8 (all) - 58 runs, 5% failed, 67% of failures match = 3% impact
pull-ci-cri-o-cri-o-master-e2e-gcp (all) - 13 runs, 38% failed, 20% of failures match = 8% impact
...
pull-ci-openshift-vmware-vsphere-csi-driver-operator-master-e2e-vsphere (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-16861-pull-ci-openshift-cluster-network-operator-release-4.8-e2e-gcp-network-migration (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
...
rehearse-17804-pull-ci-openshift-ovn-kubernetes-release-4.9-e2e-ovn-hybrid-step-registry (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-aws-upi-4.8 (all) - 8 runs, 63% failed, 40% of failures match = 25% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.8 (all) - 8 runs, 38% failed, 100% of failures match = 38% impact
release-openshift-ocp-installer-e2e-openstack-4.6 (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
release-openshift-ocp-installer-e2e-openstack-4.7 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-e2e-openstack-4.8 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.7 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
test_pull_request_crio_e2e_crun_fedora_cgroupv2 (all) - 17 runs, 59% failed, 50% of failures match = 29% impact
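
For readers who want to check the arithmetic in the listing above, the sketch below parses one line and re-derives the impact figure. It rests on an assumption inferred from the numbers shown, not from the search tool's source: that impact is the number of matching runs divided by the total runs, with the counts reconstructed from the rounded percentages. The names jobStat, parseLine, and estimateImpact are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// lineRE matches one summary line of the listing above.
var lineRE = regexp.MustCompile(`^(\S+) \(all\) - (\d+) runs, (\d+)% failed, (\d+)% of failures match = (\d+)% impact$`)

// jobStat holds the numbers printed for one job.
type jobStat struct {
	Job       string
	Runs      int
	FailedPct int
	MatchPct  int
	ImpactPct int
}

// parseLine extracts a jobStat from a summary line; ok is false on mismatch.
func parseLine(line string) (jobStat, bool) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return jobStat{}, false
	}
	n := make([]int, 4)
	for i := range n {
		v, err := strconv.Atoi(m[i+2])
		if err != nil {
			return jobStat{}, false
		}
		n[i] = v
	}
	return jobStat{Job: m[1], Runs: n[0], FailedPct: n[1], MatchPct: n[2], ImpactPct: n[3]}, true
}

// estimateImpact reconstructs run counts from the rounded percentages and
// returns matching runs as a percentage of all runs (an assumed formula).
func estimateImpact(s jobStat) int {
	if s.Runs == 0 {
		return 0
	}
	failures := math.Round(float64(s.Runs) * float64(s.FailedPct) / 100)
	matches := math.Round(failures * float64(s.MatchPct) / 100)
	return int(math.Round(100 * matches / float64(s.Runs)))
}

func main() {
	line := "periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 16 runs, 56% failed, 33% of failures match = 19% impact"
	if s, ok := parseLine(line); ok {
		fmt.Printf("%s: reported %d%%, estimated %d%%\n", s.Job, s.ImpactPct, estimateImpact(s))
	}
}
```

Under this assumption a match rate above 100%, as in the "200% of failures match" lines, simply means more than one match was counted per failed run.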