Bug 1941456 - Multiple pods stuck in ContainerCreating status with the message "failed to create container for [kubepods burstable podxxx] : dbus: connection closed by user" being seen in the journal log
Summary: Multiple pods stuck in ContainerCreating status with the message "failed to create container for [kubepods burstable podxxx] : dbus: connection closed by user" being seen in the journal log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kir Kolyshkin
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-22 07:49 UTC by yhe
Modified: 2021-12-15 06:22 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:54:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Status | Summary | Last Updated
GitHub openshift/kubernetes pull 790 | closed | Bug 1965545: vendor: bump runc to rc95 + "unit exists" fix | 2021-06-09 14:34:26 UTC
Red Hat Knowledge Base (Solution) 6096651 | None | None | 2021-06-03 10:48:38 UTC
Red Hat Product Errata RHSA-2021:2438 | None | None | 2021-07-27 22:54:57 UTC

Description yhe 2021-03-22 07:49:55 UTC
Description of problem:
Multiple pods are stuck in ContainerCreating status. While checking the journal log on the master node, the following messages (and other similar ones) were found.

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.135395    2146 status_manager.go:429] Ignoring same status for pod "cluster-monitoring-operator-6dc9f9d7fb-9d7wb_openshift-monitoring(670c80ba-a6c5-4c51-bb67-9d030b483e6c)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [cluster-monitoring-operator kube-rbac-proxy]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [cluster-monitoring-operator kube-rbac-proxy]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.128.0.15 PodIP: PodIPs:[] StartTime:2021-03-16 17:43:27 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:cluster-monitoring-operator State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61ca3ebd394f1acd4cbca4a9ddfb7088524c89c4a72a7d87561829239caddeab ImageID: ContainerID: Started:0xc000ebc54a} {Name:kube-rbac-proxy State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83618ed7f04eb61c8cbf3bed789f78f9a6101f95c3fae80fd445858f4f353d9b ImageID: ContainerID: Started:0xc000ebc54b}] QOSClass:Burstable EphemeralContainerStatuses:[]}

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.135591    2146 status_manager.go:429] Ignoring same status for pod "oauth-openshift-59db975dbb-n7plm_openshift-authentication(7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [oauth-openshift]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [oauth-openshift]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-03-16 17:43:27 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.128.0.15 PodIP: PodIPs:[] StartTime:2021-03-16 17:43:27 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:oauth-openshift State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9afc8b46b65e99ea01e07ffd3eba0a9dd5605730092a5194ddc76485e1071676 ImageID: ContainerID: Started:0xc0032316b9}] QOSClass:Burstable EphemeralContainerStatuses:[]}

Mar 22 00:48:16 master01 hyperkube[2146]: E0322 00:48:16.136195    2146 pod_workers.go:191] Error syncing pod 670c80ba-a6c5-4c51-bb67-9d030b483e6c ("cluster-monitoring-operator-6dc9f9d7fb-9d7wb_openshift-monitoring(670c80ba-a6c5-4c51-bb67-9d030b483e6c)"), skipping: failed to ensure that the pod: 670c80ba-a6c5-4c51-bb67-9d030b483e6c cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod670c80ba-a6c5-4c51-bb67-9d030b483e6c] : dbus: connection closed by user

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.136227    2146 event.go:291] "Event occurred" object="openshift-monitoring/cluster-monitoring-operator-6dc9f9d7fb-9d7wb" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodContainer" message="unable to ensure pod container exists: failed to create container for [kubepods burstable pod670c80ba-a6c5-4c51-bb67-9d030b483e6c] : dbus: connection closed by user"

Mar 22 00:48:16 master01 hyperkube[2146]: E0322 00:48:16.136559    2146 pod_workers.go:191] Error syncing pod 7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d ("oauth-openshift-59db975dbb-n7plm_openshift-authentication(7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d)"), skipping: failed to ensure that the pod: 7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d] : dbus: connection closed by user

Mar 22 00:48:16 master01 hyperkube[2146]: I0322 00:48:16.136597    2146 event.go:291] "Event occurred" object="openshift-authentication/oauth-openshift-59db975dbb-n7plm" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodContainer" message="unable to ensure pod container exists: failed to create container for [kubepods burstable pod7ef7bb79-8a84-4f6d-8c16-df4a6af65a3d] : dbus: connection closed by user"

A similar issue was reported against OCP 3.9 in Bug #1634092; it is not yet clear whether the two are related.

Will provide more information later.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run the following command 

$ oc get pods -A -o wide | grep -v Running | grep -v Completed

2. Check the result
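
As an aside, the same check can be done programmatically; the following is a minimal client-go sketch (illustrative only, not part of the original report) that lists pods in all namespaces and prints the ones that are neither Running nor Succeeded, which is where pods stuck in ContainerCreating show up. It assumes a kubeconfig at the default location; the output format is made up for this example.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (an assumption for this sketch).
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List pods in all namespaces and report those not in Running/Succeeded phase
	// (a pod stuck in ContainerCreating is reported here with phase Pending).
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("%s/%s\t%s\n", p.Namespace, p.Name, p.Status.Phase)
		}
	}
}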

Actual results:
Multiple pods are stuck in ContainerCreating status

Expected results:
Pods get created correctly

Additional info:

Comment 3 Peter Hunt 2021-03-22 15:57:23 UTC
I believe these errors are coming from runc. Kir, can you PTAL?

Comment 5 Kir Kolyshkin 2021-04-27 18:35:18 UTC
This is caused by https://github.com/opencontainers/runc/pull/2203, and allegedly fixed by https://github.com/opencontainers/runc/pull/2862.

I am working on fixing the latter (will carry).
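
For context: the literal text "dbus: connection closed by user" is the ErrClosed sentinel from the godbus library that runc's systemd cgroup driver (via go-systemd) uses to talk to systemd. The failure mode is a cached, long-lived D-Bus connection getting closed underneath the caller (for example when dbus-daemon or systemd restarts), after which every subsequent cgroup operation fails with that error until a new connection is dialed. The sketch below is illustrative only and is not the actual runc patch; the names getConn, resetConn and callWithRetry are invented for the example. It shows the general pattern such a fix follows: detect ErrClosed, drop the cached connection, and retry once on a fresh one.

package dbusretry

import (
	"errors"
	"sync"

	dbus "github.com/godbus/dbus/v5"
)

var (
	connMu     sync.Mutex
	cachedConn *dbus.Conn // long-lived connection reused across cgroup operations
)

// getConn returns the cached connection, dialing a new one via newConn if needed.
func getConn(newConn func() (*dbus.Conn, error)) (*dbus.Conn, error) {
	connMu.Lock()
	defer connMu.Unlock()
	if cachedConn == nil {
		c, err := newConn()
		if err != nil {
			return nil, err
		}
		cachedConn = c
	}
	return cachedConn, nil
}

// resetConn forgets a connection that turned out to be dead, so the next
// caller dials a fresh one.
func resetConn(c *dbus.Conn) {
	connMu.Lock()
	defer connMu.Unlock()
	if cachedConn == c {
		cachedConn = nil
	}
}

// callWithRetry runs op against the cached connection and, if the call fails
// with dbus.ErrClosed ("dbus: connection closed by user"), retries exactly
// once on a freshly dialed connection.
func callWithRetry(newConn func() (*dbus.Conn, error), op func(*dbus.Conn) error) error {
	c, err := getConn(newConn)
	if err != nil {
		return err
	}
	if err = op(c); err != nil && errors.Is(err, dbus.ErrClosed) {
		resetConn(c)
		if c, err = getConn(newConn); err != nil {
			return err
		}
		err = op(c)
	}
	return err
}

A caller would pass something like dbus.SystemBusPrivate (plus the usual Auth/Hello handshake) as newConn; the real fix lives in runc's cgroups/systemd package and is more involved than this sketch.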

Comment 6 Kir Kolyshkin 2021-04-27 20:54:01 UTC
Proposed upstream fix: https://github.com/opencontainers/runc/pull/2923

Comment 7 Kir Kolyshkin 2021-04-30 00:05:59 UTC
The upstream fix is now merged.

Backports:
4.6: https://github.com/projectatomic/runc/pull/49 (very manual, needs a careful review)
4.8: https://github.com/projectatomic/runc/pull/50 (mostly smooth, only a few merge conflicts)

Comment 8 Kir Kolyshkin 2021-05-01 01:05:14 UTC
Scratch the above backports; they were done against the wrong repo.

4.6: https://github.com/openshift/opencontainers-runc/pull/7 (if we want it)
4.7: https://github.com/openshift/opencontainers-runc/pull/8 (exactly the same commits)

Actually, these two branches are the same, so the PRs are identical, too.

Once these are in, we need a PR against cri-o to vendor the updated runc/libcontainer.
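
Purely as an illustration (not the actual cri-o change): vendoring the fixed libcontainer usually comes down to pointing cri-o's runc requirement at the fork with a go.mod replace directive and re-vendoring. The pseudo-version below is a placeholder, not a real pinned commit.

// go.mod fragment, illustrative only; the real PR would pin a concrete commit of
// the fork and regenerate vendor/ with `go mod tidy && go mod vendor`.
replace github.com/opencontainers/runc => github.com/openshift/opencontainers-runc v0.0.0-00010101000000-000000000000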

Comment 10 Kir Kolyshkin 2021-05-21 19:21:19 UTC
> 4.6: https://github.com/openshift/opencontainers-runc/pull/7 (if we want it)
> 4.7: https://github.com/openshift/opencontainers-runc/pull/8 (exactly the same commits)

Updated to include the backport of https://github.com/opencontainers/runc/pull/2937

Fixed upstream by
* master: https://github.com/kubernetes/kubernetes/pull/102147
* kubernetes 1.21: https://github.com/kubernetes/kubernetes/pull/102196

Will be picked up

Comment 12 Kir Kolyshkin 2021-06-03 20:36:15 UTC
OK, this is not yet fixed; it also requires https://github.com/opencontainers/runc/pull/2997

I am not sure about the ETA though.

Comment 13 Peter Hunt 2021-06-09 14:34:26 UTC
I think this is fixed by the attached PR

Comment 15 Weinan Liu 2021-06-11 07:46:14 UTC
$ oc get pods -A -o wide | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                                  READY   STATUS      RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-10-071057   True        False         4h34m   Cluster version is 4.8.0-0.nightly-2021-06-10-071057

Verified as fixed.

Comment 18 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 21 Kir Kolyshkin 2021-08-30 21:31:49 UTC
I do not know if there's a plan to backport the fix to 4.7. Since 4.7 uses runc-1.0.0-rc92, the fix would involve either vendoring runc 1.0.2 or backporting the fix to rc92. Both options are somewhat complicated and risk introducing more regressions.

Comment 22 Weinan Liu 2021-12-15 06:22:41 UTC
As per comment 21.

