Description of problem:
The infra nodes are unable to mount some volumes (Secrets/ConfigMaps) of a pod. Below are some logs:

~~~~~~~~~~~~~~~~~~~~~
Pod events:
```
MountVolume.SetUp failed for volume "kibana-token-2m2z9" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9
Output: Failed to start transient scope unit: Argument list too long
unable to ensure pod container exists: failed to create container for [kubepods burstable podc6c5064c-b87c-4c4d-b1ad-0ff42642392a] : Argument list too long
```

Journalctl:
```
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.351330 1541 mount_linux.go:140] Mount failed: exit status 1
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-proxy --scope -- mount -t tmpfs tm>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.351378 1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana-proxy and>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.351482 1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana-proxy\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.351552 1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-proxy --scope -- mount -t tmpfs tm>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.356985 1541 mount_linux.go:140] Mount failed: exit status 1
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /v>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.357022 1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana and fsGro>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.357107 1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")" fail>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.357275 1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /v>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711323 1541 reconciler.go:254] operationExecutor.MountVolume started for volume "kibana-token-2m2z9" (UniqueName: "kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kib>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711539 1541 secret.go:183] Setting up volume kibana-token-2m2z9 for pod c6c5064c-b87c-4c4d-b1ad-0ff42642392a at /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volum>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711874 1541 secret.go:207] Received secret openshift-logging/kibana-token-2m2z9 containing (4) pieces of data, 11868 total bytes
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.712059 1541 empty_dir.go:275] pod c6c5064c-b87c-4c4d-b1ad-0ff42642392a: mounting tmpfs for volume wrapped_kibana-token-2m2z9
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.720482 1541 mount_linux.go:140] Mount failed: exit status 1
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tm>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.720521 1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana-token-2m2>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.720619 1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana-token-2m2z9\" (\"c6c5064c-b87c-4c4d-b1ad-0ff426423>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.720666 1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tm>
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.923617 1541 kubelet_volumes.go:154] orphaned pod "02587d53-807b-4f9d-b40d-ad9128d86e1d" found, but volume paths are still present on disk : There were a total of 37 errors simi>
Nov 10 13:20:30 lxinfra02 hyperkube[1541]: I1110 13:20:30.088767 1541 prober.go:129] Liveness probe for "dns-default-6jlt2_openshift-dns(c40b3d0f-af65-4fd7-868f-a79efba73da8):dns" succeeded
Nov 10 13:20:30 lxinfra02 hyperkube[1541]: I1110 13:20:30.111764 1541 prober.go:129] Liveness probe for "ovs-gbnp8_openshift-sdn(2be7cfd7-c35d-439b-ba7f-8c44c80a1991):openvswitch" succeeded
```

node-logs:
```
Nov 10 13:22:31.410507 lxinfra02 hyperkube[1541]: E1110 13:22:31.410495 1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")" failed. No retries permitted until 2020-11-10 13:24:33.410462342 +0000 UTC m=+798640.005634250 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kibana\" (UniqueName: \"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\") pod \"kibana-8474d8f7d-c7zjt\" (UID: \"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\") : mount failed: exit status 1\nMounting command: systemd-run\nMounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana\nOutput: Failed to start transient scope unit: Argument list too long\n"
```
~~~~~~~~~~~~~~~~~~~~

The sos-report of this node is attached in the case.

Version-Release number of selected component (if applicable):
The node reports the following version information:
~~~~~~~~~~~~~~~~~~~~~
$ cat os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"

$ cat system-release
Red Hat Enterprise Linux CoreOS release 4.4
~~~~~~~~~~~~~~~~~~~~~

Additional info:
The same issue was also reported for OCP 4.4 in https://bugzilla.redhat.com/show_bug.cgi?id=1787148#c49, but that bug was closed, so this bug is being opened for RHCOS nodes.
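For context on what is actually failing: the kubelet wraps each Secret/ConfigMap tmpfs mount in a systemd transient scope via `systemd-run`, and it is that call which returns "Argument list too long". A minimal sketch of the same invocation, run by hand from a debug shell on the affected node, can confirm whether systemd on the node can still create transient units; the pod UID and volume path below are copied from the logs above and are examples only.

```
# Sketch: reproduce the kubelet's mount invocation manually on the affected node.
# Run from `oc debug node/<node>` followed by `chroot /host`.
# The pod UID and volume path are taken from the logs above (examples only).
POD_UID=c6c5064c-b87c-4c4d-b1ad-0ff42642392a
VOL=/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~secret/kibana-token-2m2z9

# Mirrors the command the kubelet issues; on a node in the broken state this
# fails with "Failed to start transient scope unit: Argument list too long".
systemd-run --description="Kubernetes transient mount for ${VOL}" --scope -- \
  mount -t tmpfs tmpfs "${VOL}"
```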
Looking at the original issue in BZ#1787148, it looks like it was ultimately resolved via a patch to `systemd` (see https://bugzilla.redhat.com/show_bug.cgi?id=1817576) and may also be addressable via a change to `runc` (see https://bugzilla.redhat.com/show_bug.cgi?id=1787148#c40). The linked `systemd` PR (https://github.com/systemd/systemd/pull/7314) shows the commits landed in systemd 236, and RHCOS 4.4 includes systemd 239, so I would start to suspect that something needs to be done on the `runc` side of things. Sending this over to the Containers team to see if something can be done in `runc`.
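As a quick cross-check of the systemd version claim above, the build shipped on a node can be confirmed from a debug pod; the node name below is only an example.

```
# Confirm the systemd build on an RHCOS node (node name is an example).
oc debug node/lxinfra02 -- chroot /host rpm -q systemd
```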
Looks like this is being tracked over here: https://bugzilla.redhat.com/show_bug.cgi?id=1787148
I have proposed a fix upstream to backport the 4.6 fix to 4.5. If it is accepted, I'll pull it back to 4.4 as well.
This was originally fixed in 4.6.0. The 4.5.z version has merged, so I will clone this bug back.
Verified on 4.6.0-0.nightly-2021-01-13-215839. I do not see any mount failure in events or node journal logs.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-13-215839   True        False         8h      Cluster version is 4.6.0-0.nightly-2021-01-13-215839

$ oc get nodes -o wide
NAME                                         STATUS   ROLES          AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-129-225.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.129.225   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-144-228.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.144.228   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-162-51.us-east-2.compute.internal    Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.162.51    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-174-80.us-east-2.compute.internal    Ready    master         8h    v1.19.0+9c69bdc   10.0.174.80    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-212-102.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.212.102   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-216-180.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.216.180   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8

$ oc get events -A | grep -i "Argument list too long"
$
$ oc get events -A | grep -i "mount failed: exit status 1"
$
$ oc debug node/ip-10-0-129-225.us-east-2.compute.internal
Starting pod/ip-10-0-129-225us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
...
sh-4.4# journalctl | grep -i "Failed to set up mount unit: Invalid argument"
sh-4.4#
sh-4.4# journalctl | grep -i "Failed to set up mount unit"
sh-4.4#
sh-4.4# journalctl | grep -i "Argument list too long"
sh-4.4#
```
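For completeness, the same journal greps can be run across every node in one pass; a rough sketch, assuming `oc debug` is permitted against all nodes:

```
# Sketch: scan every node's journal for the transient-scope / mount-unit failures.
for node in $(oc get nodes -o name); do
  echo "== ${node} =="
  oc debug "${node}" -- chroot /host journalctl --no-pager \
    | grep -iE "Argument list too long|Failed to set up mount unit" || true
done
```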
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0037
Hi Team, in Case 02798983 the customer reported they are hitting this issue again in 4.6.17. See attached screenshots.
Are the affected nodes RHEL 7 workers?
Hi Peter, all nodes are RHCOS for case 02798983.
Thank you, it has since been fixed for that case. For any new cases, the fixes should be in. If the issue pops up on RHEL 7, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1924502. If it pops up on RHCOS, it would be worth opening a clone of this bug for that OpenShift version.
I see this bug has been cloned and a new one opened: https://bugzilla.redhat.com/show_bug.cgi?id=1939416. Marking this one as closed since it was already released as part of the errata.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days