Bug 1884800

Summary:	Failed to set up mount unit: Invalid argument
Product:	OpenShift Container Platform	Reporter:	dtarabor
Component:	Storage	Assignee:	Hemant Kumar <hekumar>
Storage sub component:	Kubernetes	QA Contact:	Wei Duan <wduan>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	aos-bugs, darya, gvillani, hekumar, jcrumple, jsafrane, mgugino, ngirard, openshift-bugs-escalate, smulje, spasquie, sreber, ssonigra
Version:	4.4	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:33:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1915520

Description dtarabor 2020-10-02 19:57:04 UTC

Description of problem:

* fluentd pods are causing "Failed to set up mount unit: Invalid argument" errors multiple times per second.  

If the logging operator (fluentd) is removed, the problem resolves itself.

[root@au1-ocpinf-d01 ~]# journalctl --since "1 days ago" | grep "Invalid argument"
Sep 25 19:33:12 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:12 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:13 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:13 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
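
To put a number on "multiple times per second", the journal output can be bucketed by timestamp. This is only a rough Python sketch: it assumes the syslog-style 15-character timestamp prefix seen in the output above, and the function name `messages_per_second` and the shortened hostname in the sample are made up for illustration.

```python
from collections import Counter

def messages_per_second(lines, needle="Failed to set up mount unit"):
    """Count matching journal lines per one-second timestamp bucket."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # "Sep 25 19:33:12" is 15 characters long and has 1 s resolution.
            counts[line[:15]] += 1
    return counts

# Sample lines modelled on the journal output above; "node1" is a placeholder host.
SAMPLE = [
    "Sep 25 19:33:12 node1 systemd[1]: Failed to set up mount unit: Invalid argument",
    "Sep 25 19:33:12 node1 systemd[1]: Failed to set up mount unit: Invalid argument",
    "Sep 25 19:33:13 node1 systemd[1]: Failed to set up mount unit: Invalid argument",
    "Sep 25 19:33:13 node1 systemd[1]: Failed to set up mount unit: Invalid argument",
]

for second, count in sorted(messages_per_second(SAMPLE).items()):
    print(second, count)
```

In practice the lines would come from the journalctl output rather than a hard-coded list.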


Version-Release number of selected component (if applicable):

OCP 4.4
Logging Operator: 4.4.0-202008210157.p0 provided by Red Hat, Inc

How reproducible:

I was unable to reproduce the issue, but the customer has been able to on 3 of his 4.4 clusters.

Steps to Reproduce:
1. install logging operator
2. allow data to populate
3. check journal

Actual results:

The journal is flooded with the above error message.

Comment 3 Jeff Cantrill 2020-10-07 13:43:46 UTC

Setting priority to low. Investigation of the must-gather shows the logging system in a healthy state.

Comment 4 Jeff Cantrill 2020-10-07 14:08:01 UTC

While working with the storage team, I was pointed to:

https://access.redhat.com/solutions/5038151
https://bugzilla.redhat.com/show_bug.cgi?id=1779813

There is nothing that can be done from the logging perspective to explicitly resolve the issue.

Comment 5 Jan Safranek 2020-10-07 16:14:15 UTC

We can perhaps do something on the storage side. I can see elasticsearch-cdm-7fc52t3q-2-5dd6cf7dbc-bfnvj.yaml pod running on node au1-ocpinf-d02.ocp4-lab.sarc.samsung.com. It uses PVC elasticsearch-elasticsearch-cdm-7fc52t3q-2, which is mounted on the node as:

/dev/sdb on /var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk type ext4 (rw,relatime,seclabel)

$ systemd-escape /var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk | wc -c
258

So the escaped name is over the systemd limit for unit names, and systemd spams the log. The directory name must be shorter.
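
For reference, the length check above can be approximated without systemd installed. The following is a minimal Python sketch, assuming systemd's path escaping works as documented for `systemd-escape --path`: leading and trailing slashes are dropped, "/" becomes "-", and anything other than ASCII letters, digits, ":", "_" and "." becomes a 4-character "\xXX" sequence. The function name `systemd_escape_path` is made up for illustration. The count it prints need not equal the 258 above, since that command was run without --path, with the path unquoted, and `wc -c` also counts the trailing newline.

```python
import re

# Characters assumed to pass through systemd escaping unchanged.
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_.")

def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for an absolute path."""
    trimmed = re.sub(r"/+", "/", path).strip("/")
    if not trimmed:
        return "-"  # the root directory maps to a single dash
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch in SAFE and not (i == 0 and ch == "."):
            out.append(ch)  # a leading "." is escaped, other safe chars are kept
        else:
            out.append(f"\\x{byte:02x}")
    return "".join(out)

# Mount point reported in this bug.
MOUNT_PATH = (
    "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/"
    "[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/"
    "ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk"
)

escaped = systemd_escape_path(MOUNT_PATH)
print("raw length:    ", len(MOUNT_PATH))
print("escaped length:", len(escaped))
print("with .mount:   ", len(escaped + ".mount"))
```

The escaped length is what counts against the limit, and the ".mount" suffix of the generated unit name adds a few more characters on top of it.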

"ocp4-lab-t82zt" is cluster prefix, dunno if the customer can make it shorter.
"dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk" is hardcoded in Kubernetes.
"11bc26b0-694e-4917-9e80-f9919c8df059" is UUID of the volume (or the datastore?) and is hardcoded in Kubernetes.
"[NIM-ESX-VVOL-OCP-LAB] rfc4122" comes from data store + folder name. Can the customer use one with a shorter name / fewer dashes? Systemd escapes every "-" with 4 characters ("\x2d"). They need to save only a few characters to get under the limit.

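To see where the extra characters come from, the escaping overhead of each component listed above can be tallied separately. This is a rough Python sketch under the same assumption as the `systemd-escape` check: every byte other than ASCII letters, digits, ":", "_", "." and "/" turns into a 4-character "\xXX" sequence, so it costs 3 extra characters. The function name `escape_overhead` is hypothetical, and the special case of a leading "." is ignored.

```python
# Characters assumed to survive systemd escaping without growing.
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_./")

def escape_overhead(component: str) -> int:
    """Extra characters added when each unsafe byte becomes a 4-char escape."""
    unsafe = sum(1 for b in component.encode("utf-8") if chr(b) not in SAFE)
    return 3 * unsafe

# Path components discussed above.
COMPONENTS = [
    "ocp4-lab-t82zt",
    "dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk",
    "11bc26b0-694e-4917-9e80-f9919c8df059",
    "[NIM-ESX-VVOL-OCP-LAB] rfc4122",
]

for name in COMPONENTS:
    print(f"{len(name):3d} chars, +{escape_overhead(name):2d} when escaped: {name}")
```

Dropping a single dash, bracket, or space from the datastore or folder name saves 4 characters in the escaped unit name (1 for the character itself plus 3 of overhead).
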
On the OCP / Kubernetes side, we will try to fix vSphere code not to depend on datastore name and always produce shorter directory names. This will take some time though.

Just to note: all pods are actually running, and Elasticsearch should work. systemd just spams the log in the background.

Comment 13 Hemant Kumar 2020-11-12 21:24:32 UTC

I have given up on trying to drop UUID of folder from volume path. That is too risky and can break all over the place. I am going for a simpler approach of reducing the prefix size - https://github.com/kubernetes/kubernetes/pull/96533

This should *somewhat* help with longer volume names which are on the boundary of 255 chars (like the one reported in this bug). For other cases, we will have to document the limitation and suggest recommendations to the customer.

Comment 24 Jan Safranek 2021-03-19 14:35:34 UTC

*** Bug 1939416 has been marked as a duplicate of this bug. ***

Comment 26 Hemant Kumar 2021-03-19 16:57:32 UTC

I also filed a related systemd issue for this - https://bugzilla.redhat.com/show_bug.cgi?id=1940973

Comment 27 Hemant Kumar 2021-03-22 18:57:16 UTC

*** Bug 1940898 has been marked as a duplicate of this bug. ***

Comment 35 errata-xmlrpc 2021-07-27 22:33:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 40 Red Hat Bugzilla 2023-09-15 00:49:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days