Description of problem:

Pod stuck in ContainerCreating - Output: Failed to start transient scope unit: Connection timed out

Version-Release number of selected component (if applicable):

OCP 4.6.26

Additional info:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  163m (x93 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "siptls-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/siptls-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/siptls-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  134m (x105 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "client-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/client-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/client-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  43m (x142 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "etcd-bro-client-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/etcd-bro-client-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/etcd-bro-client-cert
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  33m (x146 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "server-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/server-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/server-cert
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  23m (x150 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "bootstrap-ca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/bootstrap-ca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/bootstrap-ca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  18m (x152 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "pmca" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/pmca --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/pmca
Output: Failed to start transient scope unit: Connection timed out
  Warning  FailedMount  4m8s (x158 over 6h14m)  kubelet  MountVolume.SetUp failed for volume "peer-client-cert" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/peer-client-cert --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/peer-client-cert
Output: Failed to start transient scope unit: Connection timed out
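For reference, the same transient mount that the kubelet attempts can be run by hand on the affected node to confirm that systemd itself, rather than the kubelet, is refusing to create the scope unit. This is only a diagnostic sketch: the node name is a placeholder and the pod/volume path is copied from the events above.

$ oc debug node/<affected-node>
sh-4.2# chroot /host
sh-4.4# systemctl status dbus
sh-4.4# systemd-run --description="manual transient mount test" --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/0b94f3d8-4449-451a-9147-e0e64230ee1f/volumes/kubernetes.io~secret/siptls-ca
sh-4.4# systemctl list-units --failed

If the manual systemd-run call also fails with "Failed to start transient scope unit: Connection timed out", the node's systemd/D-Bus is no longer answering requests and the kubelet mount failures are a symptom of that.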
*** Bug 1957713 has been marked as a duplicate of this bug. ***
What's the output of `systemctl list-units --failed`?
Is there NFS involved here? I would not normally expect a mount or systemd-run to fail with `Connection timed out`.
No, the only storage available is OCS and Ceph in "internal" mode, i.e. with OSDs on the worker nodes
This problem is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1819868, which was fixed with `systemd-239-45.el8` in the RHEL 8.4 GA release. RHCOS 48.84.202105182219-0 was built with RHEL 8.4 GA content and is now available via OCP 4.8 nightly payloads. Moving this to MODIFIED.

Note: the `systemd` fix is not currently backported to RHEL 8.3 or RHEL 8.2 EUS, so OCP 4.5/4.6/4.7 are still affected by this problem.
Verified on 4.8.0-0.nightly-2021-05-21-200728. The systemd version with the fix has landed in RHCOS.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-200728   True        False         105s    Cluster version is 4.8.0-0.nightly-2021-05-21-200728

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2         Ready    master   20m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx   Ready    worker   13m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc   Ready    worker   12m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b   Ready    worker   13m   v1.21.0-rc.0+c656d63

$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa systemd
systemd-239-45.el8.x86_64
sh-4.4# systemctl --version
systemd 239 (239-45.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
Micah has corrected me: the fix came in via systemd bug https://bugzilla.redhat.com/show_bug.cgi?id=1984406, which made it into 4.7.25. Marking comment 53 as private since it has incorrect information.
Which z-stream of 4.6 did it land in? The customer is waiting on a 4.6 z-stream with the fix.
(In reply to Arkady Kanevsky from comment #55)
> Which z-stream of 4.6 did it land in? The customer is waiting on a 4.6 z-stream with the fix.

The breadcrumbs of the fix look like this:

Fixed in RHEL 8.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1968528
Backported to 8.4.z - https://bugzilla.redhat.com/show_bug.cgi?id=1984406

OCP 4.7/4.8/4.9 all use RHCOS based on RHEL 8.4 content, so all of those OCP versions have received the fix.

OCP 4.6 uses RHCOS based on RHEL 8.2 EUS content. During discussions about this problem, I recall that the systemd team declined to backport the fix to RHEL 8.2 due to the complexity of the backport and the major changes it would introduce in the EUS stream. Please engage with the systemd team if you want to request the backport to RHEL 8.2 EUS.

Thus this problem will not be fixed in any OCP 4.6.z release. Customers are encouraged to upgrade to a newer y-stream version of OCP if they need a fix for this issue.
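As a rough check of whether a given cluster's nodes already carry the fix (a sketch, not part of the original verification; the node name is a placeholder), the systemd build on each node can be queried and compared against the RHEL 8.4.z build that contains the BZ#1984406 backport:

$ oc debug node/<node-name> -- chroot /host rpm -q systemd

Nodes on OCP 4.7.25 or later (RHCOS based on RHEL 8.4 content) report a systemd build that includes the backport; OCP 4.6 nodes (RHCOS based on RHEL 8.2 EUS content) do not.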
According to comment 54, "the fix came in via systemd bug https://bugzilla.redhat.com/show_bug.cgi?id=1984406". BZ#1984406 is set to qe_test_coverage+ (meaning it is covered by an automated test), so the qe_test_coverage flag of this bug will also be set to '+' (refer to BZ#1984406).