Description of problem:
The fluentd pods are causing "Failed to set up mount unit: Invalid argument" errors multiple times per second. If the logging operator (fluentd) is removed, the problem resolves itself.

[root@au1-ocpinf-d01 ~]# journalctl --since "1 days ago" | grep "Invalid argument"
Sep 25 19:33:12 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:12 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:13 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument
Sep 25 19:33:13 au1-ocpinf-d01.ocp4-lab.sarc.samsung.com systemd[1]: Failed to set up mount unit: Invalid argument

Version-Release number of selected component (if applicable):
OCP 4.4
Logging Operator: 4.4.0-202008210157.p0 provided by Red Hat, Inc.

How reproducible:
I was unable to reproduce the issue, but the customer has been able to on 3 of his 4.4 clusters.

Steps to Reproduce:
1. Install the logging operator.
2. Allow data to populate.
3. Check the journal.

Actual results:
The journal is flooded with the above error message.
Setting priority to low. Investigation of the must-gather shows the logging system in a healthy state.
Working with the storage team, we were pointed to:

https://access.redhat.com/solutions/5038151
https://bugzilla.redhat.com/show_bug.cgi?id=1779813

There is nothing that can be done from the logging perspective to explicitly resolve the issue.
We can perhaps do something on the storage side. I can see the elasticsearch-cdm-7fc52t3q-2-5dd6cf7dbc-bfnvj pod running on node au1-ocpinf-d02.ocp4-lab.sarc.samsung.com. It uses PVC elasticsearch-elasticsearch-cdm-7fc52t3q-2, which is mounted on the node as:

/dev/sdb on /var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk type ext4 (rw,relatime,seclabel)

$ systemd-escape /var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk | wc -c
258

So it is over the systemd limit, and systemd spams the log. The directory name must be shorter. Breaking the path down:

- "ocp4-lab-t82zt" is the cluster prefix; I don't know whether the customer can make it shorter.
- "dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk" is hardcoded in Kubernetes.
- "11bc26b0-694e-4917-9e80-f9919c8df059" is the UUID of the volume (or the datastore?) and is hardcoded in Kubernetes.
- "[NIM-ESX-VVOL-OCP-LAB] rfc4122" comes from the datastore + folder name. Can the customer use one with a shorter name / fewer dashes? systemd escapes every "-" as 4 characters ("\x2d"). They only need to save a few characters to get under the limit.

On the OCP / Kubernetes side, we will try to fix the vSphere code not to depend on the datastore name and to always produce shorter directory names. This will take some time, though.

Just to note: all pods are actually running, so Elasticsearch should work. systemd just spams the log in the background.
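The length check above can be sketched in Python. This is a rough approximation of systemd's escaping rules (see systemd-escape(1)), not the real implementation: "/" maps to "-", and every character outside [A-Za-z0-9:_.] is hex-escaped as "\xNN", so each "-" in the path costs 4 characters in the escaped unit name. The path and the 255-character unit-name limit are from this bug; the numbers this sketch prints are an estimate, not the exact wc output above.

```python
# Rough sketch of systemd's unit-name escaping; an approximation,
# not systemd's actual implementation.
def systemd_escape(path: str) -> str:
    out = []
    for ch in path:
        if ch == "/":
            out.append("-")                  # "/" becomes "-"
        elif ch.isalnum() or ch in ":_.":
            out.append(ch)                   # allowed characters pass through
        else:
            out.append("\\x%02x" % ord(ch))  # e.g. "-" -> "\x2d", " " -> "\x20"
    return "".join(out)

UNIT_NAME_LIMIT = 255  # systemd rejects longer unit names with "Invalid argument"

# The mount path from this bug's must-gather.
path = ("/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/"
        "[NIM-ESX-VVOL-OCP-LAB] rfc4122.11bc26b0-694e-4917-9e80-f9919c8df059/"
        "ocp4-lab-t82zt-dynamic-pvc-0f13e3ad-97f8-41ab-9392-84562ef40d17.vmdk")

escaped = systemd_escape(path)
print(len(escaped), len(escaped) > UNIT_NAME_LIMIT)

# Replacing a "-" in the datastore/folder name with an alphanumeric
# character saves 3 escaped characters; dropping it outright saves 4.
# A rename with fewer dashes can bring the name back under the limit.
```

This also shows why renaming the datastore helps: the raw path is well under 255 characters, and it is only the 4-character "\x2d" escapes that push the mount unit name over the limit.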
I have given up on trying to drop the UUID of the folder from the volume path; that is too risky and can break things all over the place. I am going for a simpler approach of reducing the prefix size: https://github.com/kubernetes/kubernetes/pull/96533

This should *somewhat* help with longer volume names that are on the boundary of 255 characters (like the one reported in this bug). For other cases, we will have to document and suggest recommendations to the customer.
*** Bug 1939416 has been marked as a duplicate of this bug. ***
I also filed a related systemd issue for this - https://bugzilla.redhat.com/show_bug.cgi?id=1940973
*** Bug 1940898 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days