Description of problem: CI tests continuously failing on release-openshift-ocp-installer-e2e-aws-rhel7-workers test : *[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Pre-provisioned PV (block volmode)] volumeMode should fail to create pod by failing to mount volume [Suite:openshift/conformance/parallel] [Suite:k8s]* ``` /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:307 +0x4d github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.deleteNS(0x65741e0, 0xc002c6c280, 0x63a1f00, 0xc000d6a638, 0xc002b49280, 0x13, 0x8bb2c97000, 0xed618d43a, 0xa6bac80) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/util.go:1194 +0x189 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*Framework).AfterEach.func1(0xc000843040) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:311 +0x222 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*Framework).AfterEach(0xc000843040) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:397 +0x400 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.RunCleanupActions() /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/cleanup.go:59 +0x66 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e.glob..func3() /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/e2e.go:164 +0x62 github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc0009a2c00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:109 +0x9c github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc0009a2c00, 0x90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) 
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:63 +0xcf github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*synchronizedAfterSuiteNode).Run(0xc0009dc6e0, 0x1, 0x1, 0x0, 0x0, 0x4e6b9e) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes/synchronized_after_suite_node.go:29 +0x64 github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runAfterSuite(0xc002ddaa00, 0xc000010020) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:136 +0xd3 github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).registerForInterrupts(0xc002ddaa00) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:234 +0x1d9 created by github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:59 +0x5e goroutine 199 [IO wait]: internal/poll.runtime_pollWait(0x7fdc8a88e000, 0x72, 0xffffffffffffffff) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/netpoll.go:182 +0x56 internal/poll.(*pollDesc).wait(0xc003380198, 0x72, 0xdb00, 0xdb94, 0xffffffffffffffff) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/internal/poll/fd_poll_runtime.go:87 +0x9b internal/poll.(*pollDesc).waitRead(...) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/internal/poll/fd_poll_runtime.go:92 internal/poll.(*FD).Read(0xc003380180, 0xc003ab0000, 0xdb94, 0xdb94, 0x0, 0x0, 0x0) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/internal/poll/fd_unix.go:169 +0x19b net.(*netFD).Read(0xc003380180, 0xc003ab0000, 0xdb94, 0xdb94, 0x203000, 0x0, 0xdb28) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/fd_unix.go:202 +0x4f net.(*conn).Read(0xc003931600, 0xc003ab0000, 0xdb94, 0xdb94, 0x0, 0x0, 0x0) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/net.go:177 +0x69 crypto/tls.(*atLeastReader).Read(0xc002f0a8e0, 0xc003ab0000, 0xdb94, 0xdb94, 0x60806d4ff88f3934, 0xc00009da30, 0xc00009d9e0) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/crypto/tls/conn.go:761 +0x60 bytes.(*Buffer).ReadFrom(0xc00317b758, 0x639d5c0, 0xc002f0a8e0, 0x40f965, 0x4ee8260, 0x56df840) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/bytes/buffer.go:207 +0xbd crypto/tls.(*Conn).readFromUntil(0xc00317b500, 0x63a3200, 0xc003931600, 0x5, 0xc003931600, 0x60e) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/crypto/tls/conn.go:783 +0xf8 crypto/tls.(*Conn).readRecordOrCCS(0xc00317b500, 0x5b01800, 0xc00317b638, 0xc00009dd58) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/crypto/tls/conn.go:590 +0x125 crypto/tls.(*Conn).readRecord(...) 
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/crypto/tls/conn.go:558 crypto/tls.(*Conn).Read(0xc00317b500, 0xc003482000, 0x1000, 0x1000, 0x0, 0x0, 0x0) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/crypto/tls/conn.go:1236 +0x137 bufio.(*Reader).Read(0xc00339de00, 0xc002db2f18, 0x9, 0x9, 0x40be04, 0xc002d96060, 0xc00009dd58) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/bufio/bufio.go:223 +0x23e io.ReadAtLeast(0x639d320, 0xc00339de00, 0xc002db2f18, 0x9, 0x9, 0x9, 0x639d800, 0xc003a6e2c0, 0xc0000a2040) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/io/io.go:310 +0x88 io.ReadFull(...) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/io/io.go:329 github.com/openshift/origin/vendor/golang.org/x/net/http2.readFrameHeader(0xc002db2f18, 0x9, 0x9, 0x639d320, 0xc00339de00, 0x0, 0x0, 0x0, 0x0) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/golang.org/x/net/http2/frame.go:237 +0x88 github.com/openshift/origin/vendor/golang.org/x/net/http2.(*Framer).ReadFrame(0xc002db2ee0, 0xc003929830, 0x0, 0x0, 0x0) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/golang.org/x/net/http2/frame.go:492 +0xa1 github.com/openshift/origin/vendor/golang.org/x/net/http2.(*clientConnReadLoop).run(0xc00009dfb8, 0x5af8c38, 0xc00344f7b8) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/golang.org/x/net/http2/transport.go:1679 +0x8d github.com/openshift/origin/vendor/golang.org/x/net/http2.(*ClientConn).readLoop(0xc00155b380) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/golang.org/x/net/http2/transport.go:1607 +0x76 created by github.com/openshift/origin/vendor/golang.org/x/net/http2.(*Transport).newClientConn /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/golang.org/x/net/http2/transport.go:670 +0x637 goroutine 740 [chan receive]: github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).registerForHardInterrupts(0xc002ddaa00) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:244 +0xc0 created by github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).registerForInterrupts /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:223 +0x104 goroutine 743 [select]: github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.poller.func1.1(0xc003971140, 0x77359400, 0x8bb2c97000, 0xc0039710e0) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:450 +0x153 created by github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.poller.func1 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:433 +0x8c goroutine 734 [select]: github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.poller.func1.1(0xc002dccb40, 0x77359400, 0x45d964b800, 0xc002dccae0) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:450 +0x153 created by github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.poller.func1 
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:433 +0x8c rax 0xca rbx 0xa6bd2a0 rcx 0xffffffffffffffff rdx 0x0 rdi 0xa6bd3e8 rsi 0x80 rbp 0x7ffed1f155f0 rsp 0x7ffed1f155a8 r8 0x0 r9 0x0 r10 0x0 r11 0x286 r12 0x55 r13 0x1 r14 0x1 r15 0x0 rip 0x464bf1 rflags 0x286 cs 0x33 fs 0x0 gs 0x0 ``` Also *[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] volumes should be mountable [Suite:openshift/conformance/parallel] [Suite:k8s]* ``` fail [k8s.io/kubernetes/test/e2e/storage/utils/local.go:251]: Unexpected error: <exec.CodeExitError>: { Err: { s: "error running &{/usr/bin/kubectl [kubectl --server=https://api.ci-op-45dwdn11-253c5.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-volume-9990 hostexec-ip-10-0-135-178.ec2.internal -- nsenter --mount=/rootfs/proc/1/ns/mnt -- sh -c sudo umount /tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30 && rm -r /tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30] [] <nil> rm: cannot remove '/tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30': Device or resource busy\ncommand terminated with exit code 1\n [] <nil> 0xc0040815f0 exit status 1 <nil> <nil> true [0xc0038264b0 0xc0038264c8 0xc0038264e0] [0xc0038264b0 0xc0038264c8 0xc0038264e0] [0xc0038264c0 0xc0038264d8] [0x95bef0 0x95bef0] 0xc001266d80 <nil>}:\nCommand stdout:\n\nstderr:\nrm: cannot remove '/tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30': Device or resource busy\ncommand terminated with exit code 1\n\nerror:\nexit status 1\n", }, Code: 1, } error running &{/usr/bin/kubectl [kubectl --server=https://api.ci-op-45dwdn11-253c5.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-volume-9990 hostexec-ip-10-0-135-178.ec2.internal -- nsenter --mount=/rootfs/proc/1/ns/mnt -- sh -c sudo umount /tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30 && rm -r /tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30] [] <nil> rm: cannot remove '/tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30': Device or resource busy command terminated with exit code 1 [] <nil> 0xc0040815f0 exit status 1 <nil> <nil> true [0xc0038264b0 0xc0038264c8 0xc0038264e0] [0xc0038264b0 0xc0038264c8 0xc0038264e0] [0xc0038264c0 0xc0038264d8] [0x95bef0 0x95bef0] 0xc001266d80 <nil>}: Command stdout: stderr: rm: cannot remove '/tmp/local-driver-cf680694-7577-11ea-b330-0a58ac107e30': Device or resource busy command terminated with exit code 1 error: exit status 1 ``` Additional info:
This failed-mount error is also seen in https://bugzilla.redhat.com/show_bug.cgi?id=1819495.
*** Bug 1821347 has been marked as a duplicate of this bug. ***
I tried to re-run the job rehearsal myself. I can see a lot of "device or resource busy" errors there; however, some of the pod directories are removed after (a lot of) retries.

It looks like there is something wrong with mount propagation:

1. Pod A uses a volume.
2. The user / test starts a (privileged?) pod B using completely different volumes (I used "oc debug node/xxx" - a privileged pod, but with no mount propagation set anywhere).
3. The user / test deletes pod A.
4. Kubelet unmounts the volume.
5. Kubelet tries to delete the directory where the volume was mounted -> error.

Pod B is still running, it still sees pod A's volume mounted, and that blocks removal of the directory. Is there an unmount hook present in the container runtime?

When pod B dies, it most probably allows kubelet to finally remove the volume directory - it's hard to debug while the e2e tests are running, pods come and go... But I saw some "busy" directories removed after a while.

Moving to the Node team for investigation.
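A minimal sketch of the suspected interaction, reproduced outside Kubernetes (assumptions: run as root on an affected RHEL 7 node; the paths and the `sleep` process standing in for pod B are made up for illustration):

```
# "Pod A's volume": create and mount a scratch tmpfs.
mkdir -p /tmp/volA
mount -t tmpfs tmpfs /tmp/volA

# "Pod B": a long-lived process in its own private mount namespace.
# It gets a copy of the current mount table, including /tmp/volA.
unshare --mount --propagation private sleep 600 &

# "Kubelet cleanup": unmount the volume and try to remove its directory.
umount /tmp/volA
rmdir /tmp/volA
# On a RHEL 7 node with fs.may_detach_mounts=0 the rmdir can fail with
# "Device or resource busy", because the directory is still a mount point
# inside "pod B's" namespace; once the background sleep exits, removal
# succeeds again.
```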
One note: the nodes run RHEL-7:

VERSION="7.7 (Maipo)"
cri-o-1.14.12-19.dev.rhaos4.2.git313d784.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
Any update here? All rhel7 worker jobs are blocked by multiple test failures.

Top 15 failures (tests started between 2020-04-06T15:01:06 and 2020-04-08T09:29:55 UTC):

Failed  % of 132  Test
66      50        [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
65      49        [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]
64      48        [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
64      48        [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
63      47        [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
60      45        [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
60      45        [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
59      44        [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
57      43        [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
57      43        [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]
56      42        [sig-storage] PersistentVolumes-local [Volume type: tmpfs] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Suite:openshift/conformance/parallel] [Suite:k8s]
56      42        [sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
55      41        [sig-storage] PersistentVolumes-local [Volume type: tmpfs] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
53      40        [sig-storage] PersistentVolumes-local [Volume type: dir-link-bindmounted] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
53      40        [sig-storage] PersistentVolumes-local [Volume type: dir-bindmounted] One pod requesting one prebound PVC should be able to mount volume and write from pod1 [Suite:openshift/conformance/parallel] [Suite:k8s]
(In reply to Jan Safranek from comment #6)
> I tried to re-run the job rehearsal myself. I can see a lot of "device or
> resource busy" errors there; however, some of the pod directories are
> removed after (a lot of) retries.
>
> It looks like there is something wrong with mount propagation:
>
> 1. Pod A uses a volume.
> 2. The user / test starts a (privileged?) pod B using completely different
> volumes (I used "oc debug node/xxx" - a privileged pod, but with no mount
> propagation set anywhere).
> 3. The user / test deletes pod A.
> 4. Kubelet unmounts the volume.
> 5. Kubelet tries to delete the directory where the volume was mounted -> error.
>
> Pod B is still running, it still sees pod A's volume mounted, and that
> blocks removal of the directory. Is there an unmount hook present in the
> container runtime?
>
> When pod B dies, it most probably allows kubelet to finally remove the
> volume directory - it's hard to debug while the e2e tests are running,
> pods come and go... But I saw some "busy" directories removed after a while.
>
> Moving to the Node team for investigation.

Jan, were you able to reproduce this from the command line? Can you provide the list of commands to run (ideally, with an environment in which this can be reproduced)?

I have seen similar things happen when some containers mount /var/lib/kubernetes (or the like, i.e. directories containing other containers' mounts). Can something like this be the case here?
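If it helps, here is a hedged way to look for that on a node; the `crictl` invocation and the `/var/lib/kubelet` path are my assumptions, not something confirmed in this bug:

```
# Flag any running container whose mounts include kubelet's pod-volume tree
# (adjust the grep pattern if the suspicion is a different host directory).
for c in $(crictl ps -q); do
  if crictl inspect "$c" | grep -q '/var/lib/kubelet'; then
    echo "container $c bind-mounts a kubelet directory"
  fi
done
```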
I've moved severity up to urgent as this is preventing us from getting any passing e2e tests when RHEL7 workers are involved. This should be backported to 4.4 as well when it's fixed.
It started around April 01 in the 4.2 RHEL-7 jobs:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-informing#release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2&sort-by-flakiness=

The last OK job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/344

The first flaky job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/347

Do you remember anything interesting in RHCOS / CRI-O / OCP 4.2 around that time?
Can you get the version of runc in use here?
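For reference, a quick way to pull that from a RHEL 7 worker (the exact entry point is my assumption, e.g. an SSH or `oc debug node/<node>` shell with `chroot /host`):

```
rpm -q runc        # the installed RPM build, e.g. runc-1.0.0-6x...
runc --version     # the upstream version/commit the binary reports
```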
Also, if possible, can you provide the result of cat /proc/sys/fs/may_detach_mounts (as root I guess)? If it's 0, try doing echo 1 > /proc/sys/fs/may_detach_mounts and try to reproduce again (without rebooting the kernel).
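Spelled out as commands (the sysctl itself is as quoted above; running it from a root shell on the worker is my assumption):

```
# Check the current value; 0 means the old "device or resource busy"
# behaviour is in effect.
cat /proc/sys/fs/may_detach_mounts

# If it is 0, enable mount detaching at runtime (no reboot required)
# and re-run the failing tests:
echo 1 > /proc/sys/fs/may_detach_mounts
# equivalently:
sysctl -w fs.may_detach_mounts=1
```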
A detailed explanation, and how to check whether my hypothesis is correct, is here: https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17
So, yes, this is a runc packaging bug.

(In reply to Jan Safranek from comment #11)
> It started around April 01 in the 4.2 RHEL-7 jobs:
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.2-informing#release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2&sort-by-flakiness=
>
> The last OK job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/344

runc-1.0.0-66.rc8.el7_7.x86_64

> The first flaky job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/347

runc-1.0.0-67.rc10.el7_8.x86_64

The regression is the one I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17 (http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=4250b1f539bfde82ac7e33d9a1385286975d9915).
*** This bug has been marked as a duplicate of bug 1823374 ***