This bug was initially created as a copy of Bug #1965545

I am copying this bug because: I think we missed the PR https://github.com/openshift/kubernetes/pull/790 during the rebase to 1.22.

[1] failed a recent 4.8 update, with one of the compute nodes stuck coming back after reboot. From the JUnit:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade  success  2h35m9s
May 27 06:34:41.899: Unexpected alerts fired or pending during the upgrade:

alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
alert ClusterOperatorDegraded fired for 2712 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="MachineConfigDaemonFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDegraded fired for 2742 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDown fired for 3942 seconds with labels: {endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-05-26-225946"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="machine-config-daemon", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="oauth-proxy", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 360 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-5d96b55df8-ldqdp", severity="warning"}
alert KubePodNotReady fired for 4345 seconds with labels: {namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 4525 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeStatefulSetReplicasMismatch fired for 4525 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Digging into that machine-config daemon pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").status.containerStatuses[]'
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:26b5b45670764e9270780891b2142602ebaa8b364d6917ef28e3f898e28725d9",
  "imageID": "",
  "lastState": {},
  "name": "machine-config-daemon",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:b3177a6ad870f49f3bb6ce9a53344b14a4150c8c3a0711f20943501014b22f67",
  "imageID": "",
  "lastState": {},
  "name": "oauth-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").spec.nodeName'
ip-10-0-220-230.ec2.internal

Which is not really helpful. But we did gather kubelet logs for that node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-220-230.ec2.internal/journal | gunzip | grep -1 machine-config-daemon-mm7gt | tail -n3
May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:33.962303 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:33.962251 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:47.962911 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:47.962851 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

That error message was mentioned way back in bug 1466636, but I don't see anything since, so I'm opening a new bug about this new instance. Stuck-creating is pretty serious, but going with medium severity until we have a better handle on frequency.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080
This is the same as https://github.com/kubernetes/kubernetes/issues/102676. It is caused by a regression in runc, introduced by yours truly in runc 1.0.0-rc94 (commit bacfc2c) and fixed in runc 1.0.0 (commit b2d28c5df2c). It should be fixed by a runc bump, e.g. https://github.com/kubernetes/kubernetes/pull/104528 and its backports (1.22: https://github.com/kubernetes/kubernetes/pull/104529, 1.21: https://github.com/kubernetes/kubernetes/pull/104530).
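For anyone checking whether a given branch already carries the bump, a minimal sketch (assuming a local clone of kubernetes or openshift/kubernetes): look at the runc version pinned in go.mod, since the kubelet's cgroup handling comes from the vendored runc libcontainer code.

# A minimal sketch, assuming a local clone of kubernetes or openshift/kubernetes:
# show which runc release is vendored on the current branch.
grep 'github.com/opencontainers/runc' go.mod
# Per the comment above, v1.0.0-rc94 introduced the regression and v1.0.0
# contains the fix, so pinned versions older than v1.0.0 should be treated
# as still affected.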
*** This bug has been marked as a duplicate of bug 1993980 ***