Bug 1897337
Summary: | Mounts failing with error "Failed to start transient scope unit: Argument list too long" | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Shubhag Saxena <shsaxena> | |
Component: | Node | Assignee: | Peter Hunt <pehunt> | |
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> | |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | adeshpan, anisal, aos-bugs, bbreard, dtardon, dwalsh, hgomes, imcleod, jligon, jokerman, mbetti, miabbott, nagrawal, naygupta, nchoudhu, ngirard, npaez, nstielau, palshure, pehunt, rphillips, shsaxena, systemd-maint-list, tsweeney | |
Version: | 4.4 | Keywords: | Reopened, UpcomingSprint | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1939416 | Environment: | |
Last Closed: | 2021-03-17 14:12:28 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1787148 | |||
Bug Blocks: | 1915520, 1939416 |
Description
Shubhag Saxena
2020-11-12 19:55:29 UTC
Looking at the original issue in BZ#1787148, it appears to have been ultimately resolved via a patch to `systemd` (see https://bugzilla.redhat.com/show_bug.cgi?id=1817576) and may also be addressable via a change to `runc` (see https://bugzilla.redhat.com/show_bug.cgi?id=1787148#c40). The linked `systemd` PR (https://github.com/systemd/systemd/pull/7314) shows the commits landed in systemd 236, and RHCOS 4.4 includes systemd 239, so I suspect that something needs to be done on the `runc` side of things. Going to send this over to the Containers team to see if something can be done in `runc`.

Looks like this is being tracked over here: https://bugzilla.redhat.com/show_bug.cgi?id=1787148

I have proposed a fix upstream to backport the 4.6 fix to 4.5. If accepted, I'll pull it back to 4.4 as well.

This was fixed originally in 4.6.0. The 4.5.z version has merged, so I will clone this bug back.

Verified on 4.6.0-0.nightly-2021-01-13-215839. I do not see any mount failures in events or node journal logs.

    $ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.6.0-0.nightly-2021-01-13-215839   True        False         8h      Cluster version is 4.6.0-0.nightly-2021-01-13-215839

    $ oc get nodes -o wide
    NAME                                         STATUS   ROLES          AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
    ip-10-0-129-225.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.129.225   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
    ip-10-0-144-228.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.144.228   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
    ip-10-0-162-51.us-east-2.compute.internal    Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.162.51    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
    ip-10-0-174-80.us-east-2.compute.internal    Ready    master         8h    v1.19.0+9c69bdc   10.0.174.80    <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
    ip-10-0-212-102.us-east-2.compute.internal   Ready    master         8h    v1.19.0+9c69bdc   10.0.212.102   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
    ip-10-0-216-180.us-east-2.compute.internal   Ready    worker,wscan   8h    v1.19.0+9c69bdc   10.0.216.180   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101131942-0 (Ootpa)   4.18.0-193.40.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8

    $ oc get events -A | grep -i "Argument list too long"
    $
    $ oc get events -A | grep -i "mount failed: exit status 1"
    $
    $ oc debug node/ip-10-0-129-225.us-east-2.compute.internal
    Starting pod/ip-10-0-129-225us-east-2computeinternal-debug ...
    To use host binaries, run `chroot /host`
    ...
    sh-4.4# journalctl | grep -i "Failed to set up mount unit: Invalid argument"
    sh-4.4#
    sh-4.4# journalctl | grep -i "Failed to set up mount unit"
    sh-4.4#
    sh-4.4# journalctl | grep -i "Argument list too long"
    sh-4.4#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0037

Hi Team, regarding case 02798983: the customer reported they are hitting this issue again in 4.6.17. See attached screenshots.

Are the affected nodes RHEL 7 workers?

Hi Peter, all nodes are RHCOS for case 02798983.

Thank you, it has since been fixed for that case. For any new cases, the fixes should be in. If the issues pop up on RHEL 7, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1924502. If they pop up on RHCOS, it would be worth opening a clone of this bug for that OpenShift version.

I see this bug has been cloned and a new one opened (https://bugzilla.redhat.com/show_bug.cgi?id=1939416), so I am marking this one as closed, as the fix was already released as part of the errata.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
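
The verification comment earlier in this bug greps cluster events and the node journal for a handful of error strings, one at a time. A minimal sketch that bundles those same strings into one reusable check might look like the following. The function name `count_mount_errors` is hypothetical (it is not part of `oc` or `journalctl`), and the pattern list is taken only from the commands quoted in this bug, so it may not cover every variant of the failure.

```shell
# Hypothetical helper: reads log or event text on stdin and prints how many
# lines contain any of the error strings grepped for in this bug.
# Matching is case-insensitive, as in the original commands.
count_mount_errors() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function's exit status clean.
  grep -i -c \
    -e "Argument list too long" \
    -e "Failed to set up mount unit" \
    -e "mount failed: exit status 1" || true
}

# Usage (assumes a logged-in `oc` session, or a shell on the node):
#   oc get events -A | count_mount_errors
#   journalctl | count_mount_errors
```

A result of `0` corresponds to the empty grep output shown in the verification; any other value means at least one of the signatures is still present and the matching lines are worth inspecting directly.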