Bug 1897337

Summary:	Mounts failing with error "Failed to start transient scope unit: Argument list too long"
Product:	OpenShift Container Platform	Reporter:	Shubhag Saxena <shsaxena>
Component:	Node	Assignee:	Peter Hunt <pehunt>
Node sub component:	CRI-O	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	high
Priority:	high	CC:	adeshpan, anisal, aos-bugs, bbreard, dtardon, dwalsh, hgomes, imcleod, jligon, jokerman, mbetti, miabbott, nagrawal, naygupta, nchoudhu, ngirard, npaez, nstielau, palshure, pehunt, rphillips, shsaxena, systemd-maint-list, tsweeney
Version:	4.4	Keywords:	Reopened, UpcomingSprint
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1939416 (view as bug list)		Environment:
Last Closed:	2021-03-17 14:12:28 UTC	Type:	Bug
Bug Depends On:	1787148
Bug Blocks:	1915520, 1939416

Description Shubhag Saxena 2020-11-12 19:55:29 UTC

Description of problem:

The infra nodes are unable to mount some of a pod's volumes (secrets and ConfigMaps).

Below are some logs:
~~~~~~~~~~~~~~~~~~~~~
Pod events:
```
MountVolume.SetUp failed for volume "kibana-token-2m2z9" : mount failed: exit status 1 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 Output: Failed to start transient scope unit: Argument list too long
unable to ensure pod container exists: failed to create container for [kubepods burstable podc6c5064c-b87c-4c4d-b1ad-0ff42642392a] : Argument list too long
```

Journalctl:
```
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 systemd[1]: Failed to set up mount unit: Invalid argument
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.351330    1541 mount_linux.go:140] Mount failed: exit status 1
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-proxy --scope -- mount -t tmpfs tm>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.351378    1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana-proxy and>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.351482    1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana-proxy\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.351552    1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-proxy --scope -- mount -t tmpfs tm>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.356985    1541 mount_linux.go:140] Mount failed: exit status 1
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /v>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.357022    1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana and fsGro>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.357107    1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")" fail>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.357275    1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /v>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711323    1541 reconciler.go:254] operationExecutor.MountVolume started for volume "kibana-token-2m2z9" (UniqueName: "kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kib>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711539    1541 secret.go:183] Setting up volume kibana-token-2m2z9 for pod c6c5064c-b87c-4c4d-b1ad-0ff42642392a at /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volum>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.711874    1541 secret.go:207] Received secret openshift-logging/kibana-token-2m2z9 containing (4) pieces of data, 11868 total bytes
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.712059    1541 empty_dir.go:275] pod c6c5064c-b87c-4c4d-b1ad-0ff42642392a: mounting tmpfs for volume wrapped_kibana-token-2m2z9
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.720482    1541 mount_linux.go:140] Mount failed: exit status 1
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tm>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: W1110 13:20:29.720521    1541 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~empty-dir/wrapped_kibana-token-2m2>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.720619    1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana-token-2m2z9\" (\"c6c5064c-b87c-4c4d-b1ad-0ff426423>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: I1110 13:20:29.720666    1541 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-logging", Name:"kibana-8474d8f7d-c7zjt", UID:"c6c5064c-b87c-4c4d-b1ad-0ff42642392a", APIVers>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting command: systemd-run
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana-token-2m2z9 --scope -- mount -t tm>
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: Output: Failed to start transient scope unit: Argument list too long
 Nov 10 13:20:29 lxinfra02 hyperkube[1541]: E1110 13:20:29.923617    1541 kubelet_volumes.go:154] orphaned pod "02587d53-807b-4f9d-b40d-ad9128d86e1d" found, but volume paths are still present on disk : There were a total of 37 errors simi>
 Nov 10 13:20:30 lxinfra02 hyperkube[1541]: I1110 13:20:30.088767    1541 prober.go:129] Liveness probe for "dns-default-6jlt2_openshift-dns(c40b3d0f-af65-4fd7-868f-a79efba73da8):dns" succeeded
 Nov 10 13:20:30 lxinfra02 hyperkube[1541]: I1110 13:20:30.111764    1541 prober.go:129] Liveness probe for "ovs-gbnp8_openshift-sdn(2be7cfd7-c35d-439b-ba7f-8c44c80a1991):openvswitch" succeeded
```

node-logs:
```
Nov 10 13:22:31.410507 lxinfra02 hyperkube[1541]: E1110 13:22:31.410495    1541 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\" (\"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\")" failed. No retries permitted until 2020-11-10 13:24:33.410462342 +0000 UTC m=+798640.005634250 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kibana\" (UniqueName: \"kubernetes.io/secret/c6c5064c-b87c-4c4d-b1ad-0ff42642392a-kibana\") pod \"kibana-8474d8f7d-c7zjt\" (UID: \"c6c5064c-b87c-4c4d-b1ad-0ff42642392a\") : mount failed: exit status 1\nMounting command: systemd-run\nMounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/c6c5064c-b87c-4c4d-b1ad-0ff42642392a/volumes/kubernetes.io~secret/kibana\nOutput: Failed to start transient scope unit: Argument list too long\n"
```
~~~~~~~~~~~~~~~~~~~~

sos-report of this node is attached in the case.
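
To gauge how widespread the failure is on a node, the journal excerpts above can be summarized with a small filter. This is only a rough sketch: the function names `count_scope_failures` and `list_affected_volumes` are hypothetical helpers (not part of any tool), and they assume the journal text is supplied on stdin in the same format as the excerpts in this report.

```shell
# Sketch only: both helpers read journal text on stdin.
# The function names are illustrative and not part of any existing tool.

# Count the lines that report the systemd-run failure seen in this bug.
count_scope_failures() {
  # grep -c prints 0 and exits non-zero when nothing matches,
  # so "|| true" keeps the helper's exit status at 0.
  grep -c -F 'Failed to start transient scope unit: Argument list too long' || true
}

# List the distinct volume paths named in "Mounting arguments" lines.
# Paths that the journal truncated are listed as they appear in the input.
list_affected_volumes() {
  grep -o 'transient mount for /var/lib/kubelet/pods/[^ ]*' \
    | sed 's/^transient mount for //' \
    | sort -u
}

# Possible usage on the affected node:
#   journalctl --no-pager | count_scope_failures
#   journalctl --no-pager | list_affected_volumes
```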

Version-Release number of selected component (if applicable):

The node reports the following version information:
~~~~~~~~~~~~~~~~~~~~~
$ cat os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"

$ cat system-release 
Red Hat Enterprise Linux CoreOS release 4.4
~~~~~~~~~~~~~~~~~~~~~

Additional info:

The same issue was also reported for OCP 4.4 in https://bugzilla.redhat.com/show_bug.cgi?id=1787148#c49, but that bug was closed, so I am opening this bug for RHCOS nodes.

Comment 2 Micah Abbott 2020-11-12 22:23:09 UTC

Looking at the original issue in BZ#1787148, it looks like it was ultimately resolved via a patch to `systemd` (see https://bugzilla.redhat.com/show_bug.cgi?id=1817576) and may be able to be addressed via a change to `runc` (see https://bugzilla.redhat.com/show_bug.cgi?id=1787148#c40)

The linked `systemd` PR (https://github.com/systemd/systemd/pull/7314) shows the commits landed in systemd 236 and RHCOS 4.4 includes systemd 239, so I would start to suspect that something needs to be done on the `runc` side of things.

Going to send this over to the Containers team to see if something can be done in `runc`.
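
The version reasoning in this comment (fix landed in systemd 236, RHCOS 4.4 ships 239) can be checked on a node with something like the sketch below. The helper names `systemd_major` and `systemd_at_least` are hypothetical, and the sketch assumes the first line of `systemctl --version` starts with `systemd <number>`.

```shell
# Sketch only: helper names are illustrative.
# Both helpers read the output of `systemctl --version` on stdin.

# Print the major version from the first line, e.g. "239".
systemd_major() {
  sed -n '1s/^systemd \([0-9][0-9]*\).*/\1/p'
}

# Succeed if the version on stdin is at least the minimum given as $1.
systemd_at_least() {
  _have=$(systemd_major)
  [ -n "$_have" ] && [ "$_have" -ge "$1" ]
}

# Possible usage on a node:
#   systemctl --version | systemd_at_least 236 && echo "systemd already includes the linked commits"
```

A passing check only says the node's systemd is new enough to include the linked commits; it says nothing about whether the `runc` side still needs a change.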

Comment 7 Ryan Phillips 2020-12-03 14:39:26 UTC

Looks like this is being tracked over here: https://bugzilla.redhat.com/show_bug.cgi?id=1787148

Comment 16 Peter Hunt 2021-01-04 20:59:41 UTC

I have proposed an upstream backport of the 4.6 fix to 4.5. If accepted, I'll backport it to 4.4 as well.

Comment 18 Peter Hunt 2021-01-12 19:43:23 UTC

This was originally fixed in 4.6.0. The 4.5.z version has merged, so I will clone this bug back.

Comment 21 Sunil Choudhary 2021-01-14 11:30:07 UTC

Verified on 4.6.0-0.nightly-2021-01-13-215839. I do not see any mount failure in events or node journal logs.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-13-215839   True        False         8h      Cluster version is 4.6.0-0.nightly-2021-01-13-215839

$ oc get nodes -o wide
NAME                                         STATUS   ROLES          AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-129-225.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.129.225   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-144-228.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.144.228   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-162-51.us-east-2.compute.internal    Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.162.51    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-174-80.us-east-2.compute.internal    Ready    master         8h    v1.19.0+9c69bdc   10.0.174.80    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-212-102.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.212.102   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-216-180.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.216.180   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8

$ oc get events -A | grep -i "Argument list too long"
$ 

$ oc get events -A | grep -i "mount failed: exit status 1"
$


$ oc debug node/ip-10-0-129-225.us-east-2.compute.internal
Starting pod/ip-10-0-129-225us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
...

sh-4.4# journalctl | grep -i "Failed to set up mount unit: Invalid argument"
sh-4.4#
 
sh-4.4# journalctl | grep -i "Failed to set up mount unit"
sh-4.4# 

sh-4.4# journalctl | grep -i "Argument list too long"
sh-4.4#
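
The manual checks above could be folded into a single pass/fail filter. This is a rough sketch rather than the procedure actually used for verification: `check_no_mount_failures` is a hypothetical helper name, and the three signatures it looks for are the strings grepped for in this comment.

```shell
# Sketch only: the function name is illustrative.
# Reads event or journal text on stdin, prints any lines that match a known
# failure signature from this bug, and returns non-zero if one was found.
check_no_mount_failures() {
  if grep -i -F \
       -e 'Argument list too long' \
       -e 'mount failed: exit status 1' \
       -e 'Failed to set up mount unit'; then
    return 1
  fi
  return 0
}

# Possible usage:
#   oc get events -A | check_no_mount_failures && echo "no mount failures in events"
#   journalctl --no-pager | check_no_mount_failures   # inside the node debug shell
```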

Comment 24 errata-xmlrpc 2021-01-18 17:59:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0037

Comment 30 Shubhag Saxena 2021-03-10 01:06:26 UTC

Hi Team,

In case 02798983, the customer reported that they are hitting this issue again in 4.6.17. See the attached screenshots.

Comment 34 Peter Hunt 2021-03-10 14:25:29 UTC

Are the affected nodes RHEL 7 workers?

Comment 36 Neil Girard 2021-03-11 16:13:01 UTC

Hi Peter, all nodes are RHCOS for case 02798983.

Comment 37 Peter Hunt 2021-03-11 17:12:32 UTC

Thank you, it has since been fixed for that case. For any new cases, the fixes should be in.
If the issues pop up on RHEL 7, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1924502.
If they pop up on RHCOS, it would be worth opening a clone of this for that OpenShift version.

Comment 38 Sunil Choudhary 2021-03-17 14:12:28 UTC

I see this bug has been cloned and a new one opened at https://bugzilla.redhat.com/show_bug.cgi?id=1939416. Marking this one as closed, as the fix was already released as part of an erratum.

Comment 39 Red Hat Bugzilla 2023-09-15 00:51:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days