Bug 1929685
| Summary: | [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Federico Paolinelli <fpaoline> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | adam.kaplan, aos-bugs, ehashman, harpatil, jsafrane, nagrawal, obulatov, tsmetana, wking |
| Version: | 4.8 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container |
| Last Closed: | 2021-04-29 09:31:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Federico Paolinelli
2021-02-17 12:48:27 UTC
The job often fails with a message like:

fail [runtime/asm_amd64.s:1374]: Feb 24 08:51:11.751: timed out waiting for watch events for pod-submit-status-0-12

I found it at [1]. The last message in the node journal about pod-submit-status-0-12 is:

Feb 24 08:46:22.476184 ip-10-0-189-71 hyperkube[1480]: E0224 08:46:22.473971 1480 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "pod-submit-status-2-14_e2e-pods-9926(52e49bc7-0394-4421-8ed5-55d3435e7a88)" failed: rpc error: code = Unknown desc = exit status 1: write child: broken pipe

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.8/1364484752576352256

*** Bug 1931628 has been marked as a duplicate of this bug. ***

This really doesn't seem to be a storage problem: secrets and configmaps don't fall under the Storage area, and the pods in question don't use any persistent volumes. However...

It's interesting: the test in question attempts to start several pods that run /bin/false (i.e. fail on purpose) and are deleted after a delay that is different in every test run and might be just a few milliseconds. The test collects the pod events in a goroutine so it can check them after the pod is deleted.

Very often the pod actually fails to start because the delete request comes almost immediately after the pod creation. I assume this is the source of the failed mount errors, where the cleanup and startup procedures interleave.

In the failed cases the pod seems to actually manage to get started, but then gets stuck in the "Terminated" state, and the test fails because there is a 5-minute timeout on the event collection. This is the actual reason for the test failure: the delete API call is in the logs, but the delete event is either missed or never arrives.

I'm sorry, but I have to bounce this one back to the Node team: there's not much that is storage-related in the logs, and the failure looks more like a race in the kubelet or perhaps in the test itself.
*** Bug 1941726 has been marked as a duplicate of this bug. ***

Raising severity based on CI frequency:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Pod+Container+Status+should+never+report+success+for+a+pending+container' | grep 'failures match' | sort
endurance-e2e-aws-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 10 runs, 10% failed, 100% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 16 runs, 56% failed, 33% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 15 runs, 93% failed, 21% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-aws-workers-rhel7 (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-fips (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp-rt (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 14 runs, 43% failed, 100% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 8 runs, 63% failed, 20% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-workers-rhel7 (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt (all) - 8 runs, 100% failed, 63% of failures match = 63% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi (all) - 11 runs, 45% failed, 60% of failures match = 27% impact
periodic-ci-openshift-release-master-okd-4.6-e2e-aws (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.7-e2e-vsphere (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
promote-release-openshift-machine-os-content-e2e-aws-4.8 (all) - 58 runs, 5% failed, 67% of failures match = 3% impact
pull-ci-cri-o-cri-o-master-e2e-gcp (all) - 13 runs, 38% failed, 20% of failures match = 8% impact
...
pull-ci-openshift-vmware-vsphere-csi-driver-operator-master-e2e-vsphere (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-16861-pull-ci-openshift-cluster-network-operator-release-4.8-e2e-gcp-network-migration (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
...
rehearse-17804-pull-ci-openshift-ovn-kubernetes-release-4.9-e2e-ovn-hybrid-step-registry (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-aws-upi-4.8 (all) - 8 runs, 63% failed, 40% of failures match = 25% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.8 (all) - 8 runs, 38% failed, 100% of failures match = 38% impact
release-openshift-ocp-installer-e2e-openstack-4.6 (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
release-openshift-ocp-installer-e2e-openstack-4.7 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-e2e-openstack-4.8 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.7 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
test_pull_request_crio_e2e_crun_fedora_cgroupv2 (all) - 17 runs, 59% failed, 50% of failures match = 29% impact
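
For anyone who wants to sort or filter the listing above, the following Go sketch parses one line of that output and recomputes the impact as "failed %" times "match %". The type and function names (`jobImpact`, `parseImpact`, `approxImpact`) are hypothetical, and the assumption that impact is roughly the product of the two percentages is inferred from the numbers shown, not from the search tool's source. The recomputed value can differ from the reported one by a point, presumably because the tool works from unrounded counts.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// lineRE matches one line of the search output quoted above.
var lineRE = regexp.MustCompile(
	`^(\S+) \(all\) - (\d+) runs, (\d+)% failed, (\d+)% of failures match = (\d+)% impact$`)

// jobImpact holds the fields of a single output line.
type jobImpact struct {
	Job       string
	Runs      int
	FailedPct int
	MatchPct  int
	ImpactPct int
}

// parseImpact extracts the job name and the four numbers from a line.
func parseImpact(line string) (jobImpact, error) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return jobImpact{}, fmt.Errorf("unrecognized line: %q", line)
	}
	nums := make([]int, 4)
	for i := range nums {
		n, err := strconv.Atoi(m[i+2])
		if err != nil {
			return jobImpact{}, err
		}
		nums[i] = n
	}
	return jobImpact{
		Job:       m[1],
		Runs:      nums[0],
		FailedPct: nums[1],
		MatchPct:  nums[2],
		ImpactPct: nums[3],
	}, nil
}

// approxImpact recomputes impact from the rounded percentages.
// Assumption: impact is approximately FailedPct * MatchPct / 100.
func approxImpact(j jobImpact) int {
	return int(math.Round(float64(j.FailedPct) * float64(j.MatchPct) / 100))
}

func main() {
	line := "periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 14 runs, 43% failed, 100% of failures match = 43% impact"
	j, err := parseImpact(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: reported %d%%, recomputed %d%%\n", j.Job, j.ImpactPct, approxImpact(j))
}
```

Lines consisting only of `...` in the listing are elisions and are rejected by `parseImpact` with an error, so a caller would skip them.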