Description of problem:

Multiple pods are stuck in ContainerCreating status. While checking the journal log from the master node, the following messages (and other similar messages) were found:

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.135395 2146 status_manager.go:429] Ignoring same status for pod "cluster-monitoring-operator-6dc9f9d7fb-9d7wb_openshift-monitoring(670c80ba-a6c5-4c51-bb67-9d030b483e6c)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [cluster-monitoring-operator kube-rbac-proxy]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [cluster-monitoring-operator kube-rbac-proxy]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.128.0.15 PodIP: PodIPs:[] StartTime:2021-03-16 17:43:27 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:cluster-monitoring-operator State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61ca3ebd394f1acd4cbca4a9ddfb7088524c89c4a72a7d87561829239caddeab ImageID: ContainerID: Started:0xc000ebc54a} {Name:kube-rbac-proxy State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83618ed7f04eb61c8cbf3bed789f78f9a6101f95c3fae80fd445858f4f353d9b ImageID: ContainerID: Started:0xc000ebc54b}] QOSClass:Burstable EphemeralContainerStatuses:[]}

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.135591 2146 status_manager.go:429] Ignoring same status for pod "oauth-openshift-59db975dbb-n7plm_openshift-authentication(7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [oauth-openshift]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [oauth-openshift]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.128.0.15 PodIP: PodIPs:[] StartTime:2021-03-16 17:43:27 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:oauth-openshift State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9afc8b46b65e99ea01e07ffd3eba0a9dd5605730092a5194ddc76485e1071676 ImageID: ContainerID: Started:0xc0032316b9}] QOSClass:Burstable EphemeralContainerStatuses:[]}

Mar 22 00:48:16 master01 hyperkube[2146]: E0322 00:48:16.136195 2146 pod_workers.go:191] Error syncing pod 670c80ba-a6c5-4c51-bb67-9d030b483e6c ("cluster-monitoring-operator-6dc9f9d7fb-9d7wb_openshift-monitoring(670c80ba-a6c5-4c51-bb67-9d030b483e6c)"), skipping: failed to ensure that the pod: 670c80ba-a6c5-4c51-bb67-9d030b483e6c cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod670c80ba-a6c5-4c51-bb67-9d030b483e6c] : dbus: connection closed by user

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.136227 2146 event.go:291] "Event occurred" object="openshift-monitoring/cluster-monitoring-operator-6dc9f9d7fb-9d7wb" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodContainer" message="unable to ensure pod container exists: failed to create container for [kubepods burstable pod670c80ba-a6c5-4c51-bb67-9d030b483e6c] : dbus: connection closed by user"

Mar 22 00:48:16 master01 hyperkube[2146]: E0322 00:48:16.136559 2146 pod_workers.go:191] Error syncing pod 7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d ("oauth-openshift-59db975dbb-n7plm_openshift-authentication(7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d)"), skipping: failed to ensure that the pod: 7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d] : dbus: connection closed by user

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.136597 2146 event.go:291] "Event occurred" object="openshift-authentication/oauth-openshift-59db975dbb-n7plm" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodContainer" message="unable to ensure pod container exists: failed to create container for [kubepods burstable pod7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d] : dbus: connection closed by user"

A similar issue was reported against OCP 3.9 in Bug #1634092; not sure yet whether the two are related. Will provide more information later.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Run the following command:
   $ oc get pods -A -o wide | grep -v Running | grep -v Completed
2. Check the result

Actual results:
Multiple pods are stuck in ContainerCreating status.

Expected results:
Pods get created correctly.

Additional info:
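The key error in the journal is "dbus: connection closed by user", raised from the kubelet's systemd cgroup driver. A quick way to confirm it on a suspect node (the node name here is an example; on OCP 4 the kubelet logs to kubelet.service even though the process shows up as hyperkube):

$ oc debug node/master01 -- chroot /host \
    journalctl -u kubelet --since -1h | grep -cF 'dbus: connection closed by user'

A non-zero count, together with pods stuck in ContainerCreating on that node, matches this bug.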
I believe these errors are coming from runc. Kir, can you PTAL?
This is caused by https://github.com/opencontainers/runc/pull/2203 and allegedly fixed by https://github.com/opencontainers/runc/pull/2862. I am working on finishing the latter (will carry it).
Proposed upstream fix: https://github.com/opencontainers/runc/pull/2923
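For context, here is roughly what the fix does. runc caches a single D-Bus connection to systemd for cgroup operations; after #2203, a connection that died (e.g. when dbus/systemd was restarted) was never re-dialed, so every subsequent cgroup call failed with godbus's ErrClosed, whose message is exactly the "dbus: connection closed by user" seen above. As I understand it, #2923 wraps the systemd calls in a reconnect-and-retry loop. A minimal Go sketch of that pattern using go-systemd and godbus (an illustration of the approach, not the actual runc patch; the type and function names are mine):

package cgroupdbus

import (
	"context"
	"errors"
	"sync"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// connManager caches a single D-Bus connection to systemd, the way
// runc's libcontainer does, so connection setup is not paid on every
// cgroup operation.
type connManager struct {
	mu   sync.Mutex
	conn *systemdDbus.Conn
}

// get returns the cached connection, dialing a new one if needed.
func (m *connManager) get() (*systemdDbus.Conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conn == nil {
		conn, err := systemdDbus.NewWithContext(context.Background())
		if err != nil {
			return nil, err
		}
		m.conn = conn
	}
	return m.conn, nil
}

// reset drops a dead cached connection so the next get() dials a new one.
func (m *connManager) reset(conn *systemdDbus.Conn) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conn == conn {
		conn.Close()
		m.conn = nil
	}
}

// retryOnDisconnect runs op; if it failed only because the cached
// connection had been closed (godbus.ErrClosed is the source of the
// "dbus: connection closed by user" message), it reconnects and retries
// instead of propagating the error forever, which is what this bug did.
func (m *connManager) retryOnDisconnect(op func(*systemdDbus.Conn) error) error {
	for {
		conn, err := m.get()
		if err != nil {
			return err
		}
		err = op(conn)
		if !errors.Is(err, godbus.ErrClosed) {
			return err
		}
		m.reset(conn)
	}
}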
The upstream fix is now merged. Backports:

4.6: https://github.com/projectatomic/runc/pull/49 (very manual, needs a careful review)
4.8: https://github.com/projectatomic/runc/pull/50 (mostly smooth, only a few merge conflicts)
Scratch the above backports; they were done against the wrong repo.

4.6: https://github.com/openshift/opencontainers-runc/pull/7 (if we want it)
4.7: https://github.com/openshift/opencontainers-runc/pull/8 (exactly the same commits)

Actually these two branches are the same, thus the PRs are the same, too. Once these are in, we need a PR against cri-o to vendor the updated runc/libcontainer (see the sketch below).
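For that cri-o vendoring step, the usual mechanism would be a go.mod replace directive pointing the runc module at the patched fork, roughly like this (the pseudo-version is a made-up placeholder, not the real commit):

// in cri-o's go.mod
replace github.com/opencontainers/runc => github.com/openshift/opencontainers-runc v0.0.0-20210601000000-123456abcdef

followed by go mod vendor to refresh vendor/github.com/opencontainers/runc/libcontainer.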
> 4.6: https://github.com/openshift/opencontainers-runc/pull/7 (if we want it)
> 4.7: https://github.com/openshift/opencontainers-runc/pull/8 (exactly the same commits)

Updated to include the backport of https://github.com/opencontainers/runc/pull/2937.

Fixed upstream by:
* master: https://github.com/kubernetes/kubernetes/pull/102147
* kubernetes 1.21: https://github.com/kubernetes/kubernetes/pull/102196

Will be picked up.
OK, this is not yet fixed; it also requires https://github.com/opencontainers/runc/pull/2997. I am not sure about the ETA, though.
I think this is fixed by the attached PR.
$ oc get pods -A -o wide | grep -v Running | grep -v Completed
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-10-071057   True        False         4h34m   Cluster version is 4.8.0-0.nightly-2021-06-10-071057

Only the header is printed, i.e. no pods are stuck outside Running/Completed. Verified as fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
I do not know if there's a plan to backport the fix to 4.7. Since 4.7 uses runc-1.0.0-rc92, the fix would involve either vendoring runc 1.0.2 or backporting the fix to rc92. Both options are somewhat complicated and risk introducing more regressions.
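If anyone needs to check which runc a given 4.7 node is actually carrying before deciding, something like this should work (the node name is an example):

$ oc debug node/master01 -- chroot /host rpm -q runc
$ oc debug node/master01 -- chroot /host runc --version

Note that this shows the standalone runc binary shipped on the host; the libcontainer code vendored into the kubelet/cri-o can only be confirmed from the corresponding component's go.mod.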
As per comment 21.