Bug 1820717
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [sig-storage] FailMount Unable to attach or mount volumes | | |
| Product | OpenShift Container Platform | Reporter | Carlos Eduardo Arango Gutierrez <carangog> |
| Component | Node | Assignee | Kir Kolyshkin <kir> |
| Status | CLOSED DUPLICATE | QA Contact | Sunil Choudhary <schoudha> |
| Severity | urgent | Docs Contact | |
| Priority | unspecified | | |
| Version | unspecified | CC | aos-bugs, hekumar, jlebon, jokerman, jsafrane, mpatel, rteague, sdodson, zyu |
| Target Milestone | --- | | |
| Target Release | 4.5.0 | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 1823597 (view as bug list) | Environment | |
| Last Closed | 2020-04-22 20:08:49 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1823597 | | |
Description (Carlos Eduardo Arango Gutierrez, 2020-04-03 16:38:10 UTC)
This FailMount error is also seen in https://bugzilla.redhat.com/show_bug.cgi?id=1819495

*** Bug 1821347 has been marked as a duplicate of this bug. ***

I tried to re-run the job rehearsal myself. I can see a lot of "device or resource busy" errors there; however, some of the pod directories are removed after a (sometimes large) number of retries.

It looks like there is something wrong with mount propagation:

1. Pod A uses a volume.
2. The user / test starts a (privileged?) pod B using completely different volumes (I used "oc debug node/xxx": a privileged pod, but with no mount propagation set anywhere).
3. The user / test deletes pod A.
4. The kubelet unmounts the volume.
5. The kubelet tries to delete the directory where the volume was mounted -> error.

Pod B is still running, sees the volume of pod A mounted, and blocks removal of the directory. Is there an unmount hook present in the container runtime?

When pod B dies, it most probably allows the kubelet to finally remove the volume directory. It is hard to debug while the e2e tests are running, since pods come and go, but I saw some "busy" directories removed after a while.

Moving to the Node team for investigation.

One note: the nodes run RHEL 7:

- VERSION="7.7 (Maipo)"
- cri-o-1.14.12-19.dev.rhaos4.2.git313d784.el7.x86_64
- kernel-3.10.0-1127.el7.x86_64

Any update here? All rhel7 worker jobs are blocked by multiple test failures.
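One way to inspect the mount-propagation behavior described in the steps above is to read the propagation tags in `/proc/self/mountinfo` from inside the suspect pod (e.g. the `oc debug node/xxx` shell). The sketch below is a read-only illustration, not part of this bug's official reproduction steps; the `/var/lib/kubelet` path filter is just the usual location of kubelet volume directories.

```shell
#!/bin/sh
# Read-only sketch: print the propagation tag of the root mount and of any
# mounts under /var/lib/kubelet. In /proc/self/mountinfo the optional fields
# between the mount options and the " - " separator carry propagation info:
# "shared:N" means the mount propagates to peer namespaces, "master:N" means
# it receives propagation, and no tag at all means private propagation
# (i.e. unmounts on the host will NOT propagate into this namespace).
awk '{
    tags = ""
    for (i = 7; i <= NF && $i != "-"; i++) tags = tags " " $i
    if ($5 == "/" || index($5, "/var/lib/kubelet") == 1)
        printf "%-50s%s\n", $5, (tags == "" ? " (private)" : tags)
}' /proc/self/mountinfo
```

If a privileged pod shows pod A's volume mount with no propagation tag, the unmount done by the kubelet on the host never reaches that pod, and the pod's copy of the mount keeps the directory busy.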
Top 15 failures (tests started between 2020-04-06T15:01:06 and 2020-04-08T09:29:55 UTC):

| Failed | % of 132 | Test |
|---|---|---|
| 66 | 50 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 65 | 49 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 64 | 48 | [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 64 | 48 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 63 | 47 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] |
| 60 | 45 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 60 | 45 | [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 59 | 44 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 57 | 43 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 57 | 43 | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 56 | 42 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 56 | 42 | [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 55 | 41 | [sig-storage] PersistentVolumes-local [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 53 | 40 | [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s] |
| 53 | 40 | [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] One pod requesting one prebound PVC should be able to mount volume and write from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s] |

(In reply to Jan Safranek from comment #6)
> It looks like there is something wrong with mount propagation:
> [...]
> Pod B is still running and it sees the volume of pod A mounted and blocks
> removal of the directory.

Jan, were you able to reproduce this from the command line? Can you provide the list of commands to run (ideally, with an environment in which this can be reproduced)? I have seen similar things happen when some containers mount /var/lib/kubernetes (or the like, i.e. directories containing other containers' mounts). Could something like this be the case here?

I've moved the severity up to urgent, as this is preventing us from getting any passing e2e tests when RHEL7 workers are involved. This should be backported to 4.4 as well once it is fixed.

It started around April 01 in the 4.2 RHEL-7 jobs:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-informing#release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2&sort-by-flakiness=

The last OK job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/344

The first flaky job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/347

Do you remember anything interesting in RHCOS / CRI-O / OCP 4.2 around that time?

Can you get the version of runc in use here? Also, if possible, can you provide the result of `cat /proc/sys/fs/may_detach_mounts` (as root, I guess)? If it's 0, try doing `echo 1 > /proc/sys/fs/may_detach_mounts` and try to reproduce again (without rebooting the kernel).
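The requested check can be wrapped in a small script. This is a sketch based only on the command above: reading the sysctl is safe anywhere, while flipping it to 1 requires root and is left as a comment. `fs.may_detach_mounts` only exists on RHEL 7-era kernels, so the script also handles the file being absent.

```shell
#!/bin/sh
# Sketch of the diagnostic requested above (read-only unless you opt in as
# root). When fs.may_detach_mounts is 0, an in-use mount cannot be lazily
# detached, which matches the "device or resource busy" failures in the
# e2e runs.
SYSCTL=/proc/sys/fs/may_detach_mounts
if [ -r "$SYSCTL" ]; then
    echo "fs.may_detach_mounts = $(cat "$SYSCTL")"
else
    echo "fs.may_detach_mounts not present on this kernel"
fi
# To test the hypothesis (as root, no reboot needed):
#   echo 1 > /proc/sys/fs/may_detach_mounts
```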
A detailed explanation, and how to check whether my hypothesis is correct, is here: https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17

So, yes, this is a runc packaging bug.

(In reply to Jan Safranek from comment #11)
> It started in around April 01 in 4.2 RHEL-7 jobs:
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-informing#release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2&sort-by-flakiness=
>
> The last OK job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/344

runc-1.0.0-66.rc8.el7_7.x86_64

> The first flaky job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/347

runc-1.0.0-67.rc10.el7_8.x86_64

The regression is the one I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17 (http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=4250b1f539bfde82ac7e33d9a1385286975d9915)

*** This bug has been marked as a duplicate of bug 1823374 ***