Bug 1965545
| Summary: | Pod stuck in ContainerCreating: Unit ...slice already exists | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Node | Assignee: | Kir Kolyshkin <kir> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, harpatil, kir, nagrawal, rphillips, weinliu |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:10:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (W. Trevor King, 2021-05-28 00:00:19 UTC)
Rough upper bound on frequency:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubePodNotReady+fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 33 runs, 27% failed, 22% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 32 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 32 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 31 runs, 100% failed, 10% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 38 runs, 32% failed, 17% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 38 runs, 100% failed, 34% of failures match = 34% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 15 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 16 runs, 100% failed, 44% of failures match = 44% impact
release-openshift-ocp-installer-e2e-aws-ovn-4.8 (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.8 (all) - 16 runs, 31% failed, 40% of failures match = 13% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 16 runs, 38% failed, 17% of failures match = 6% impact
release-openshift-origin-installer-launch-aws (all) - 116 runs, 65% failed, 1% of failures match = 1% impact
```

This certainly feels like a 4.8 regression. Stretching the search out to maxAge=336h doesn't change the impact percentages much, so this has probably been broken for at least a week or two. Raising severity to high based on these rates, but feel free to push back if that seems off base.

> Kir, could this be related to the recent runc bump to rc95?
Yes, looks like it. Looking...
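[Editor's note, not from the bug thread: the failure mode is easy to see outside of Kubernetes. systemd refuses to start a transient unit whose name is already loaded, and that refusal is exactly the "Unit ... already exists" error the kubelet surfaces above. A minimal sketch on any systemd host; the unit name `demo` is arbitrary and the error text is approximate:]

```
# First start succeeds; --remain-after-exit keeps the transient
# unit loaded even after /usr/bin/true exits.
$ sudo systemd-run --unit=demo --remain-after-exit true
Running as unit: demo.service

# A second start under the same name is rejected; this is the same
# dbus-level error seen in the kubelet journal for pod slices.
$ sudo systemd-run --unit=demo --remain-after-exit true
Failed to start transient service unit: Unit demo.service already exists.

# Clean up the demo unit.
$ sudo systemctl stop demo.service
```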
Filed https://github.com/opencontainers/runc/issues/2996, reproduced.

Backport notes:

1. Created a branch for runc: https://github.com/openshift/opencontainers-runc/tree/openshift-4.8
2. Created a backport of the runc fix and merged it: https://github.com/openshift/opencontainers-runc/pull/9
3. Created a bump PR for 4.8: https://github.com/openshift/kubernetes/pull/790

Upstream issue (just created by me): https://github.com/kubernetes/kubernetes/issues/102676

[1] is an upgrade from 4.8.0-0.ci-2021-06-07-101955 to 4.8.0-0.ci-2021-06-07-155134, with:

```
...
alert KubeNodeNotReady fired for 2100 seconds with labels: {condition="Ready", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", node="ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn", service="kube-state-metrics", severity="warning", status="true"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-multus", pod="network-metrics-daemon-cd52c", severity="warning"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-network-diagnostics", pod="network-check-target-59phj", severity="warning"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-sdn", pod="sdn-68n94", severity="warning"}
...
```

Looks the same as this bug:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1401930066266427392/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn/journal | gunzip | grep 'Unit .*slice already exists' | tail -n2
Jun 07 17:07:23.648904 ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn hyperkube[1440]: E0607 17:07:23.648854 1440 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 37965ca8-72fb-4fd4-83e6-51142c02904d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod37965ca8-72fb-4fd4-83e6-51142c02904d] : Unit kubepods-burstable-pod37965ca8_72fb_4fd4_83e6_51142c02904d.slice already exists." pod="openshift-sdn/sdn-68n94" podUID=37965ca8-72fb-4fd4-83e6-51142c02904d
Jun 07 17:07:36.649006 ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn hyperkube[1440]: E0607 17:07:36.648938 1440 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 37965ca8-72fb-4fd4-83e6-51142c02904d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod37965ca8-72fb-4fd4-83e6-51142c02904d] : Unit kubepods-burstable-pod37965ca8_72fb_4fd4_83e6_51142c02904d.slice already exists." pod="openshift-sdn/sdn-68n94" podUID=37965ca8-72fb-4fd4-83e6-51142c02904d
```

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1401930066266427392

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
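[Editor's note, not from the bug thread: a rough sketch of how one might check a node after updating, assuming `oc` access to the cluster; `<node>` is a placeholder for a real node name:]

```
# Confirm the runc build shipped on the node; the regression was
# introduced in the 1.0.0-rc95 bump referenced above.
$ oc debug node/<node> -- chroot /host runc --version

# Look for the symptom in the kubelet journal; a node running the
# fixed build should produce no matches for new pods.
$ oc adm node-logs <node> -u kubelet | grep 'slice already exists'
```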