Bug 1996755 - Pod stuck in ContainerCreating: Unit ...slice already exists
Summary: Pod stuck in ContainerCreating: Unit ...slice already exists
Keywords:
Status: CLOSED DUPLICATE of bug 1993980
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-23 15:19 UTC by Artyom
Modified: 2021-08-25 14:40 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 14:40:11 UTC
Target Upstream Version:
Embargoed:



Description Artyom 2021-08-23 15:19:10 UTC
This bug was initially created as a copy of Bug #1965545

I am copying this bug because: 
I think we missed carrying https://github.com/openshift/kubernetes/pull/790 during the rebase to Kubernetes 1.22.
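
One quick way to confirm whether that carry made it into the rebase (a rough sketch only; the branch name and grep patterns are assumptions, adjust them to the actual rebase branch and carry commit message):

$ # assumes a local clone of openshift/kubernetes; branch name assumed to be release-4.9
$ git log --oneline origin/release-4.9 | grep -iE 'slice already exists|#790'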


[1] failed a recent 4.8 update, with one of the compute nodes stuck coming back after reboot.  From the JUnit:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h35m9s
May 27 06:34:41.899: Unexpected alerts fired or pending during the upgrade:

alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
alert ClusterOperatorDegraded fired for 2712 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="MachineConfigDaemonFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDegraded fired for 2742 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDown fired for 3942 seconds with labels: {endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-05-26-225946"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="machine-config-daemon", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="oauth-proxy", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 360 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-5d96b55df8-ldqdp", severity="warning"}
alert KubePodNotReady fired for 4345 seconds with labels: {namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 4525 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeStatefulSetReplicasMismatch fired for 4525 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Digging into that machine-config daemon pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").status.containerStatuses[]'
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:26b5b45670764e9270780891b2142602ebaa8b364d6917ef28e3f898e28725d9",
  "imageID": "",
  "lastState": {},
  "name": "machine-config-daemon",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:b3177a6ad870f49f3bb6ce9a53344b14a4150c8c3a0711f20943501014b22f67",
  "imageID": "",
  "lastState": {},
  "name": "oauth-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").spec.nodeName'
ip-10-0-220-230.ec2.internal

Which is not really helpful.  But we did gather kubelet logs for that node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-220-230.ec2.internal/journal | gunzip | grep -1 machine-config-daemon-mm7gt | tail -n3
May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361    1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:33.962303 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:33.962251    1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:47.962911 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:47.962851    1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

That error message was mentioned way back in bug 1466636, but I don't see anything more recent, so I'm opening a new bug for this new instance. Getting stuck in ContainerCreating is pretty serious, but I'm going with medium severity until we have a better handle on the frequency.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080
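
For anyone debugging this on a live node, a small diagnostic sketch (not from this CI run; the unit name is copied from the kubelet error above, substitute the UID of your own stuck pod):

$ # on the affected node; slice name taken from the kubelet error
$ systemctl list-units --all 'kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice'
$ systemctl status 'kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice' --no-pager

If the slice still shows up there while the pod's containers never start, that matches the "already exists" failure in the kubelet log.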

Comment 1 Kir Kolyshkin 2021-08-24 02:47:05 UTC
This is the same as https://github.com/kubernetes/kubernetes/issues/102676

It is caused by a regression in runc, introduced by yours truly in runc 1.0.0-rc94 (commit bacfc2c) and fixed in runc-1.0.0 (commit b2d28c5df2c).

Should be fixed by a runc bump, e.g. https://github.com/kubernetes/kubernetes/pull/104528 and its backports (1.22: https://github.com/kubernetes/kubernetes/pull/104529, 1.21: https://github.com/kubernetes/kubernetes/pull/104530).
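
A quick way to check which runc is in play (a sketch; the kubelet hits this through the vendored libcontainer code, so the go.mod pin in the kubernetes tree is the relevant one, while the node's runc binary is only a secondary sanity check and the RPM name may differ):

$ # in a kubernetes/kubernetes (or openshift/kubernetes) checkout
$ grep 'github.com/opencontainers/runc' go.mod
$ # on the node itself
$ runc --version
$ rpm -q runc

Anything pinned at 1.0.0-rc94 or rc95 would still carry the regression; 1.0.0 or later has the fix.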

Comment 2 Ryan Phillips 2021-08-25 14:40:11 UTC

*** This bug has been marked as a duplicate of bug 1993980 ***

