1957726 – Pod stuck in ContainerCreating - Failed to start transient scope unit: Connection timed out

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Micah Abbott
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1957713
Depends On:	1819868
Blocks:

Reported:	2021-05-06 10:58 UTC by KOSAL RAJ I
Modified:	2024-12-20 20:01 UTC
CC List:	27 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: systemd was excessively reading mountinfo and consuming all of the CPU resources. Consequence: Containers failed to start. Fix: Enable rate-limiting when systemd is reading the mountinfo. Result: Containers are able to start successfully.
Clone Of:
Environment:
Last Closed:	2021-11-30 14:35:44 UTC
Target Upstream Version:
Embargoed:
Flags:	miabbott: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1819868	1	unspecified	CLOSED	systemd excessively reads mountinfo and udev in dense OpenShift environments	2024-12-20 19:02:04 UTC
Red Hat Knowledge Base (Solution)	6208722	0	None	None	None	2021-07-22 12:17:06 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:07:14 UTC

Description KOSAL RAJ I 2021-05-06 10:58:18 UTC

Description of problem:
Pod stuck in ContainerCreating - Output: Failed to start transient scope unit: Connection timed out

Version-Release number of selected component (if applicable):
OCP 4.6.26

Additional info:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  163m (x93 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "siptls-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/siptls-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/siptls-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  134m (x105 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "client-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/client-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/client-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  43m (x142 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "etcd-bro-client-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/etcd-bro-client-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/etcd-bro-client-cert
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  33m (x146 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "server-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/server-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/server-cert
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  23m (x150 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "bootstrap-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/bootstrap-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/bootstrap-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  18m (x152 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "pmca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/pmca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/pmca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  4m8s (x158 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "peer-client-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/peer-client-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/peer-client-cert
Output: Failed to start transient scope unit: Connection timed out
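
The event dump above repeats the same failure for every secret volume of the pod. As a triage aid, a minimal Python sketch that condenses such output into one entry per volume might look like the following. The function names and the regular expression are illustrative assumptions written against the event layout shown in this report; they are not part of `oc` or the kubelet.

```python
import re

# Matches the first line of a FailedMount event as printed in the description above.
# The layout is assumed from this report, not taken from any kubelet specification.
EVENT_RE = re.compile(
    r'FailedMount\s+\S+\s+\(x(?P<count>\d+) over \S+?\)\s+kubelet\s+'
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)"'
)
TIMEOUT_MSG = "Failed to start transient scope unit: Connection timed out"


def summarize_failed_mounts(events_text):
    """Return {volume name: highest repeat count seen} for FailedMount events."""
    counts = {}
    for match in EVENT_RE.finditer(events_text):
        volume = match.group("volume")
        counts[volume] = max(counts.get(volume, 0), int(match.group("count")))
    return counts


def count_scope_timeouts(events_text):
    """Count output lines that report the transient scope unit timeout."""
    return sum(1 for line in events_text.splitlines() if TIMEOUT_MSG in line)


# Excerpt of the events quoted in the description (mount arguments omitted).
SAMPLE = """\
  Warning  FailedMount  163m (x93 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "siptls-ca" : mount failed: exit status 1
Mounting command: systemd-run
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  134m (x105 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "client-ca" : mount failed: exit status 1
Mounting command: systemd-run
Output: Failed to start transient scope unit: Connection timed out
"""

if __name__ == "__main__":
    for volume, repeats in sorted(summarize_failed_mounts(SAMPLE).items()):
        print(f"{volume}: {repeats} failed attempts")
    print("scope timeouts:", count_scope_timeouts(SAMPLE))
```

If every FailedMount entry carries the same `systemd-run` timeout, as it does here, the problem is more likely on the systemd side of the node than in any single volume.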

Comment 3 Peter Hunt 2021-05-06 14:25:31 UTC

*** Bug 1957713 has been marked as a duplicate of this bug. ***

Comment 5 Peter Hunt 2021-05-06 14:29:27 UTC

What is the output of `systemctl list-units --failed`?

Comment 7 Peter Hunt 2021-05-06 17:01:32 UTC

Is there NFS involved here? I would not normally expect a mount or systemd-run to fail with `Connection timed out`.

Comment 8 David Juran 2021-05-07 04:57:16 UTC

No, the only storage available is OCS and Ceph in "internal" mode, i.e. with OSDs on the worker nodes

Comment 37 Micah Abbott 2021-05-19 18:58:29 UTC

This problem is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1819868 which was fixed with `systemd-239-45.el8` in the RHEL 8.4 GA release.

RHCOS 48.84.202105182219-0 was built using RHEL 8.4 GA content and is now available via OCP 4.8 nightly payloads.

Moving this to MODIFIED.


Note:  the `systemd` fix is not currently backported to RHEL 8.3 or RHEL 8.2 EUS, so OCP versions 4.5/4.6/4.7 are currently vulnerable to this problem.
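
The linked bug 1819868 attributes the problem to systemd re-reading mountinfo in dense environments, and every secret volume in the description is a separate tmpfs mount. To gauge how many mounts a node carries, a rough Python sketch could tally `/proc/self/mountinfo` entries by filesystem type. The helper names are hypothetical, and this report gives no threshold at which the problem starts, so the numbers are only useful for comparing nodes.

```python
from collections import Counter


def count_mounts_by_fstype(mountinfo_text):
    """Tally mountinfo entries by filesystem type.

    Each mountinfo line ends its optional fields with a lone "-" separator;
    the filesystem type is the first field after that separator.
    """
    counts = Counter()
    for line in mountinfo_text.splitlines():
        _, separator, tail = line.partition(" - ")
        if not separator:
            continue  # not a well-formed mountinfo line
        fields = tail.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

On a node (for example inside `chroot /host` from a debug pod, if Python is available there), `count_mounts_by_fstype(read_mountinfo()).most_common()` lists the filesystem types with the most mounts first.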

Comment 44 Michael Nguyen 2021-05-24 12:50:58 UTC

Verified on 4.8.0-0.nightly-2021-05-21-200728. The systemd version with the fix has landed in RHCOS.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-200728   True        False         105s    Cluster version is 4.8.0-0.nightly-2021-05-21-200728

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx   Ready    worker   13m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc   Ready    worker   12m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b   Ready    worker   13m   v1.21.0-rc.0+c656d63

$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa systemd
systemd-239-45.el8.x86_64
sh-4.4# systemctl --version
systemd 239 (239-45.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)

sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
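
The check above relies on reading the `rpm -qa systemd` output by eye. For scripted checks across many nodes, a simplified Python sketch could parse the package string and compare it with the build named in Comment 37 (`systemd-239-45.el8`). Treat that threshold as an assumption: later comments in this bug point to an 8.4.z backport as the source of the fix. The comparison below only looks at the leading version and release numbers and is not a full RPM version comparison.

```python
import re

# Simplified pattern for strings such as "systemd-239-45.el8.x86_64".
# Anything after the leading release number (dist tag, arch) is ignored.
NVR_RE = re.compile(r"^systemd-(?P<version>\d+)-(?P<release>\d+)(?:\..*)?$")


def parse_systemd_nvr(package):
    """Return (version, release) as integers from an rpm package string."""
    match = NVR_RE.match(package.strip())
    if match is None:
        raise ValueError(f"unrecognized systemd package string: {package!r}")
    return int(match.group("version")), int(match.group("release"))


def has_assumed_fix(package, min_version=239, min_release=45):
    """True if the package is at or above the assumed fixed build.

    The defaults mirror systemd-239-45.el8 from Comment 37; they are an
    assumption, not an authoritative statement of where the fix landed.
    """
    return parse_systemd_nvr(package) >= (min_version, min_release)


if __name__ == "__main__":
    # Package string taken from the verification transcript above.
    print(has_assumed_fix("systemd-239-45.el8.x86_64"))
```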

Comment 52 errata-xmlrpc 2021-07-27 23:06:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 54 Scott Dodson 2021-09-09 16:19:52 UTC

Micah has corrected me: the fix came in via systemd bug https://bugzilla.redhat.com/show_bug.cgi?id=1984406, which made it into 4.7.25. I am marking comment 53 as private since it contains incorrect information.

Comment 55 Arkady Kanevsky 2021-11-29 16:45:02 UTC

Which 4.6 z-stream did the fix land in?
The customer is waiting for a 4.6 z-stream with the fix.

Comment 56 Micah Abbott 2021-11-30 14:35:44 UTC

(In reply to Arkady Kanevsky from comment #55)
> Which 4.6 z-stream did the fix land in?
> The customer is waiting for a 4.6 z-stream with the fix.

The breadcrumbs of the fix look like this:

Fixed in RHEL 8.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1968528
Backported to 8.4.z - https://bugzilla.redhat.com/show_bug.cgi?id=1984406

OCP 4.7/4.8/4.9 are all using RHCOS based on RHEL 8.4 content, so all those OCP versions have received the fix.

OCP 4.6 is using an RHCOS based on RHEL 8.2 EUS content.

During discussions about the problem, I recall that the systemd team declined to backport the fix to RHEL 8.2 due to the complexity of the backport and the major changes it would introduce in the EUS stream. Please engage with the systemd team if you want to ask for the backport to RHEL 8.2 EUS.

Thus this problem will not be fixed as part of any OCP 4.6.z release.  Customers are encouraged to upgrade to a newer y-stream version of OCP if they need a fix for this issue.
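
For anyone scripting an inventory of clusters, the statements in this comment can be captured in a small lookup. This is only a sketch of what is said in this bug: the status labels and the function name are made up for illustration, and any release not mentioned here is reported as unknown rather than guessed.

```python
# Fix status per OCP y-stream, as stated in Comment 56 of this bug.
# The labels "fixed" / "wont-fix" / "unknown" are illustrative, not Bugzilla terms.
FIX_STATUS = {
    (4, 6): "wont-fix",  # RHCOS based on RHEL 8.2 EUS; backport declined
    (4, 7): "fixed",     # RHCOS based on RHEL 8.4 content
    (4, 8): "fixed",
    (4, 9): "fixed",
}


def fix_status(ocp_version):
    """Map an OCP version string such as "4.6.26" to its fix status."""
    parts = ocp_version.split(".")
    try:
        key = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        raise ValueError(f"unrecognized OCP version: {ocp_version!r}")
    return FIX_STATUS.get(key, "unknown")


if __name__ == "__main__":
    for version in ("4.6.26", "4.8.0-0.nightly-2021-05-21-200728"):
        print(version, "->", fix_status(version))
```

Note that "fixed" for a y-stream does not mean every z-stream of it contains the fix; Comment 54 names 4.7.25 for 4.7.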

Comment 59 HuijingHei 2022-04-12 03:49:07 UTC

Per Comment 54, "the fix came in via systemd bug https://bugzilla.redhat.com/show_bug.cgi?id=1984406", and BZ#1984406 is flagged qe_test_coverage+ (meaning it is covered by an automated test). I will therefore set the qe_test_coverage flag of this bug to '+' as well (refer to BZ#1984406).
