Bug 1745773
Summary: Delay in detecting a successful CSI mount or incorrect report of failed mount, by kubelet for Ceph CSI based volume (due to the recursive fsGroup setting)

Product: OpenShift Container Platform
Component: Storage
Version: 4.2.0
Target Release: 4.3.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Keywords: Performance
Whiteboard: ocs-monkey
Reporter: Shyamsundar <srangana>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Chao Yang <chaoyang>
CC: aos-bugs, aos-storage-staff, asakala, bbennett, chaoyang, fbertina, hchiramm, jstrunk, mloriedo, mrajanna, ykaul
Doc Type: If docs needed, set a value
Story Points: ---
Clones: 1762658 (view as bug list)
Bug Blocks: 1762658
Last Closed: 2020-01-23 11:05:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Description: Shyamsundar, 2019-08-26 21:23:05 UTC
It seems like this is where it's taking the most time:

Aug 26 15:47:33 ip-10-0-159-243 hyperkube[1171]: I0826 15:47:33.584277 1171 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-hf2h7" (UID: "bad60295-c818-11e9-9022-02f7b213f410") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount"
Aug 26 15:49:36 ip-10-0-159-243 hyperkube[1171]: I0826 15:49:36.607093 1171 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-hf2h7" (UID: "bad60295-c818-11e9-9022-02f7b213f410")

There's a 2-minute delay during the MountVolume.SetUp call. Assuming the NodePublishVolume call returns quickly (as confirmed by the logs), the delay seems to be happening when the volume ownership is applied to the files: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_mounter.go#L267

It might happen that a pod starts and creates many files in the volume; then a second pod starts and tries to change the ownership of those files, taking some time to finish. Does the application create many files in the volume?

(In reply to Fabio Bertinatto from comment #2)
> There's a 2-minute delay during the MountVolume.SetUp call.
> Assuming the NodePublishVolume call returns quickly (as confirmed by the logs), the delay seems to be happening when the volume ownership is applied to the files:
> https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_mounter.go#L267
> It might happen that a pod starts and creates many files in the volume; then a second pod starts and tries to change the ownership of those files, taking some time to finish.
> Does the application create many files in the volume?

The application untars a Linux kernel source tree, maintaining a maximum of 4 untarred copies on the volume at any given time. It also periodically removes an entire untarred directory. This is a mock application to test the functionality of a particular workload. More details here: https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.py

So I would assume that at any given time there may be 1 to 4 copies of the Linux source tree, and that can be a lot of files in the volume.

Are there any prescribed ways to avoid walking the tree and setting the fsGroup? Or ways in which this issue can be circumvented?

(In reply to Fabio Bertinatto from comment #2)
> It seems like this is where it's taking the most time:
> [MountVolume.MountDevice and MountVolume.SetUp log lines trimmed; see above]
> There's a 2-minute delay during the MountVolume.SetUp call.
> Assuming the NodePublishVolume call returns quickly (as confirmed by the logs), the delay seems to be happening when the volume ownership is applied to the files:
> https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_mounter.go#L267
> It might happen that a pod starts and creates many files in the volume; then a second pod starts and tries to change the ownership of those files, taking some time to finish.
> Does the application create many files in the volume?

Nice catch, Fabio! I do think this is where the delay is introduced.

Looking at the fsGroup apply process and this workload case, I see more than one pod accessing this volume, and a good amount of volume/file operations happen in this workload test.

...
accessModes := c.spec.PersistentVolume.Spec.AccessModes
if c.spec.PersistentVolume.Spec.AccessModes == nil {
    klog.V(4).Info(log("mounter.SetupAt WARNING: skipping fsGroup, access modes not provided"))
    return nil
}
if !hasReadWriteOnce(accessModes) {
    klog.V(4).Info(log("mounter.SetupAt WARNING: skipping fsGroup, only support ReadWriteOnce access mode"))
    return nil
}
...

This PVC is defined as "ReadWriteOnce" in this setup:

--snip--
  volumeMode: Filesystem
  volumeName: pvc-4df1f65f-c818-11e9-9022-02f7b213f410
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  phase: Bound

PV Dump:

$ oc get pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2019-08-26T15:43:55Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-4df1f65f-c818-11e9-9022-02f7b213f410
  resourceVersion: "24217"
  selfLink: /api/v1/persistentvolumes/pvc-4df1f65f-c818-11e9-9022-02f7b213f410
  uid: 4fa6d706-c818-11e9-b67a-0e9b9a4cfc94
spec:
  accessModes:
  - ReadWriteOnce
--/snip--

Considering this is anyway a multi-access volume, can we define this PVC as "RWX" and perform this test? This mode change could avoid the volume ownership process and could work around/fix this issue.

Fabio, any thoughts on my suggestion?

(In reply to Shyamsundar from comment #3)
> The application untars a Linux kernel source tree, maintaining a maximum of 4 untarred copies on the volume at any given time. It also periodically removes an entire untarred directory. This is a mock application to test the functionality of a particular workload. More details here:
> https://github.com/JohnStrunk/ocs-monkey/blob/master/osio-workload/workload.py
> So I would assume that at any given time there may be 1 to 4 copies of the Linux source tree, and that can be a lot of files in the volume.
> Are there any prescribed ways to avoid walking the tree and setting the fsGroup? Or ways in which this issue can be circumvented?

Yes, this is indeed a heavy operation, and one fix/workaround/isolation I can think of is mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1745773#c4 . Can we give it a try?
(In reply to Humble Chirammal from comment #4)
> Nice catch, Fabio! I do think this is where the delay is introduced.
> Looking at the fsGroup apply process and this workload case, I see more than one pod accessing this volume, and a good amount of volume/file operations happen in this workload test.
> This PVC is defined as "ReadWriteOnce" in this setup.
> Considering this is anyway a multi-access volume, can we define this PVC as "RWX" and perform this test? This mode change could avoid the volume ownership process and could work around/fix this issue.
> Fabio, any thoughts on my suggestion?

Yes, using a different access mode would circumvent this. Does the CSI driver support RWX?
It seems like RBD doesn't support it: https://kubernetes.io/docs/concepts/storage/persistent-volumes/

Also, there's an ongoing effort upstream to set file ownership only after formatting: https://github.com/kubernetes/kubernetes/issues/69699

Other than that, the pod is reported unhealthy because the readiness probe is failing. The volume is eventually mounted and the pod starts, even with the 2-minute delay mounting the volume. From the node logs:

Aug 26 15:52:14 ip-10-0-159-243 hyperkube[1171]: I0826 15:52:14.608187 1171 prober.go:125] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" succeeded
Aug 26 15:52:20 ip-10-0-159-243 hyperkube[1171]: I0826 15:52:20.053897 1171 prober.go:125] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" succeeded
Aug 26 15:52:33 ip-10-0-159-243 hyperkube[1171]: I0826 15:52:33.263416 1171 prober.go:107] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
Aug 26 15:52:44 ip-10-0-159-243 hyperkube[1171]: I0826 15:52:44.262419 1171 prober.go:107] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
Aug 26 15:52:50 ip-10-0-159-243 hyperkube[1171]: I0826 15:52:50.001484 1171 prober.go:125] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" succeeded
Aug 26 15:53:01 ip-10-0-159-243 hyperkube[1171]: I0826 15:53:01.969155 1171 prober.go:125] Readiness probe for "osio-worker-524728073-848b9c5dd7-hf2h7_testws(bad60295-c818-11e9-9022-02f7b213f410):osio-workload" succeeded

This is the readiness command:

docker run --rm -it --entrypoint=/bin/sh quay.io/johnstrunk/osio-workload
sh-4.2$ cat /health.sh
#! /bin/bash
touch /data/writable && \
stat /status/running

Another option is to re-evaluate whether this is the best way to assess pod readiness in this situation.

> Considering this is anyway a multi-access volume, can we define this PVC as "RWX" and perform this test?
> Fabio, any thoughts on my suggestion?
> Yes, using a different access mode would circumvent this. Does the CSI driver support RWX? It seems like RBD doesn't support it:
> https://kubernetes.io/docs/concepts/storage/persistent-volumes/

We support RWX mode for RBD when operating in 'filesystem mode', which is the case here: https://github.com/ceph/ceph-csi/tree/master/examples#how-to-test-rbd-multi_node_multi_writer-block-feature

> Also, there's an ongoing effort upstream to set file ownership only after formatting:
> https://github.com/kubernetes/kubernetes/issues/69699

Thanks for the pointer, raising the priority there too.

(In reply to Humble Chirammal from comment #4)
> Considering this is anyway a multi-access volume, can we define this PVC as "RWX" and perform this test? This mode change could avoid the volume ownership process and could work around/fix this issue.

This is not a multi-access volume, or at least not as I understand the term: this volume is used by a single running application pod. Further, RBD use is what is desired, and RBD supports RWX mode only when the application can coordinate access to the device, not for a non-clustered filesystem on top of the image, so it is not an option anyway.

> This is not a multi-access volume, or at least not as I understand the term: this volume is used by a single running application pod.
I'm not sure that is the case here; I found two different pods from two different nodes trying to mount the volume at roughly the same time.
$ ag "MountVolume\.[M|S].+pvc-4df1f65f-c818-11e9-9022-02f7b213f410" ip-10-0-162-206.ec2.internal.node.log ip-10-0-159-243.ec2.internal.node.log
ip-10-0-162-206.ec2.internal.node.log
18198:Aug 26 15:44:45 ip-10-0-162-206 hyperkube[1156]: I0826 15:44:45.645142 1156 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-9n8kw" (UID: "5f7de71a-c818-11e9-9022-02f7b213f410") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount"
18214:Aug 26 15:44:45 ip-10-0-162-206 hyperkube[1156]: I0826 15:44:45.688119 1156 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-9n8kw" (UID: "5f7de71a-c818-11e9-9022-02f7b213f410")
19798:Aug 26 15:45:49 ip-10-0-162-206 hyperkube[1156]: I0826 15:45:49.153107 1156 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-7nwrq" (UID: "936c319e-c818-11e9-9022-02f7b213f410") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount"
19814:Aug 26 15:45:49 ip-10-0-162-206 hyperkube[1156]: I0826 15:45:49.206486 1156 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-7nwrq" (UID: "936c319e-c818-11e9-9022-02f7b213f410")
ip-10-0-159-243.ec2.internal.node.log
23020:Aug 26 15:47:33 ip-10-0-159-243 hyperkube[1171]: I0826 15:47:33.584277 1171 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-hf2h7" (UID: "bad60295-c818-11e9-9022-02f7b213f410") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount"
25577:Aug 26 15:49:36 ip-10-0-159-243 hyperkube[1171]: I0826 15:49:36.607093 1171 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-4df1f65f-c818-11e9-9022-02f7b213f410" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010") pod "osio-worker-524728073-848b9c5dd7-hf2h7" (UID: "bad60295-c818-11e9-9022-02f7b213f410")
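On ip-10-0-159-243 the gap can be quantified directly by subtracting the MountDevice timestamp from the SetUp timestamp in the two log lines above. A small helper for that arithmetic (a sketch written for this report, not part of any existing tooling; `mountGap` is a name invented here):

```go
package main

import (
	"fmt"
	"time"
)

// mountGap parses two kubelet journal timestamps (time-of-day only, as
// they appear in the log lines) and returns the elapsed time between them.
func mountGap(mountDevice, setUp string) (time.Duration, error) {
	const layout = "15:04:05.000000" // matches e.g. "15:47:33.584277"
	a, err := time.Parse(layout, mountDevice)
	if err != nil {
		return 0, err
	}
	b, err := time.Parse(layout, setUp)
	if err != nil {
		return 0, err
	}
	return b.Sub(a), nil
}

func main() {
	// Timestamps copied from the ip-10-0-159-243 lines above.
	gap, err := mountGap("15:47:33.584277", "15:49:36.607093")
	if err != nil {
		panic(err)
	}
	fmt.Println(gap) // prints 2m3.022816s, the window spent in MountVolume.SetUp
}
```

The same two lines on ip-10-0-162-206 are only tens of milliseconds apart, which is what makes the 2-minute gap on the other node stand out.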
*** Bug 1744958 has been marked as a duplicate of this bug. ***

Reading through the comments, I think I need to clarify a few points:

1) This is a single-access volume (RWO). The workload generator frequently scales the pod's Deployment between 0 and 1 replicas, causing the logs seen in c#10. This is by design, to stress exactly the type of issues this BZ and the other couple of BZs have been exposing. Moving to RWX would imply switching to cephfs instead of rbd... at this time, the requirement is for rbd-based volumes.

2) Regarding readiness from c#7: this checks (1) whether the PV is writable and (2) whether the workload script in the pod is active (I don't check via `ps` for other reasons). The specific requirement is for pod startup to succeed within a fixed time, as the application will have user-interaction components, making any startup delay directly visible as an unresponsive UI. Based on that, the workload generator has a timeout to achieve READY, starting from the moment the Deployment scale is set to 1.

I understand that there are 2 components to this: (1) CSI doing the ownership change, and (2) that ownership change taking too long in RBD given the number of files. An open question is whether (2) can be fixed or whether we need to wait for (1).

(In reply to John Strunk from comment #13)
> 1) This is a single-access volume (RWO). The workload generator frequently scales the pod's Deployment between 0 and 1 replicas, causing the logs seen in c#10. This is by design, to stress exactly the type of issues this BZ and the other couple of BZs have been exposing. Moving to RWX would imply switching to cephfs instead of rbd... at this time, the requirement is for rbd-based volumes.

This should clarify the NEEDINFO on comment #10. @fabio, do get back if there are still any gotchas that I missed.

(In reply to Fabio Bertinatto from comment #10)
> I'm not sure that is the case here; I found two different pods from two different nodes trying to mount the volume at roughly the same time.
> [ag output from comment #10 trimmed: MountVolume.MountDevice and MountVolume.SetUp succeeded for the same volume on both ip-10-0-162-206 and ip-10-0-159-243]

I believe the above (a mount call on 2 nodes at the same time) has to be root-caused or clarified. Considering this is an RWO volume, this shouldn't have happened.

Again, from the above it looks to me that it is actually the 'staging' operation that happens on both nodes in this kind of scenario, and not really the "publish". If that's the case, isn't it the same case reported at https://bugzilla.redhat.com/show_bug.cgi?id=1745776 ? I would like to point out one of the recent analyses here: https://bugzilla.redhat.com/show_bug.cgi?id=1745776#c4 which could be the reason for an RWO pod to be 'staged' on more than one node at a time.
Removed other noise from the test case (IOW, pod rescheduling etc.) and ran a simple pod restart on an otherwise idle cluster, with a PVC filled with data like before.

The time taken for CSI Stage and Publish to complete was longer than last time due to an internal failure; it took 2:06 minutes. The time spent after the CSI operations was still significant, standing at 5:06 minutes.

There is a new log message in the code that helps demarcate the CSI operations from the post-CSI operations, which helped get better timings. The message is: "operation_generator.go:623] MountVolume.MountDevice succeeded for volume ..." Version: "OpenShift Version: 4.2.0-0.ci-2019-09-05-105207"

So the issue does happen due to the recursive fsGroup setting, as diagnosed earlier. Following is the log analysis, interspersed with CSI logs, to show a possible linear event history.

<relevant kube journal from the host>

Sep 06 13:41:00 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:00.057906 1634 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:00 ip-10-0-131-197 hyperkube[1634]: E0906 13:41:00.058043 1634 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\"" failed. No retries permitted until 2019-09-06 13:41:00.557995296 +0000 UTC m=+62487.070977791 (durationBeforeRetry 500ms).
Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\") pod \"osio-worker-1\" (UID: \"f6286169-d0ab-11e9-8b8c-0a41e590bc84\") "
Sep 06 13:41:00 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:00.565264 1634 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:00 ip-10-0-131-197 hyperkube[1634]: E0906 13:41:00.565444 1634 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\"" failed. No retries permitted until 2019-09-06 13:41:01.565388905 +0000 UTC m=+62488.078371474 (durationBeforeRetry 1s).
Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\") pod \"osio-worker-1\" (UID: \"f6286169-d0ab-11e9-8b8c-0a41e590bc84\") "
Sep 06 13:41:01 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:01.572642 1634 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:01 ip-10-0-131-197 hyperkube[1634]: E0906 13:41:01.572805 1634 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\"" failed. No retries permitted until 2019-09-06 13:41:03.572756716 +0000 UTC m=+62490.085739296 (durationBeforeRetry 2s).
Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\") pod \"osio-worker-1\" (UID: \"f6286169-d0ab-11e9-8b8c-0a41e590bc84\") "
Sep 06 13:41:03 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:03.586890 1634 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:03 ip-10-0-131-197 hyperkube[1634]: E0906 13:41:03.587053 1634 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\"" failed. No retries permitted until 2019-09-06 13:41:07.587001727 +0000 UTC m=+62494.099984522 (durationBeforeRetry 4s).
Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\") pod \"osio-worker-1\" (UID: \"f6286169-d0ab-11e9-8b8c-0a41e590bc84\") "
Sep 06 13:41:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:07.616236 1634 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:07.623086 1634 operation_generator.go:1339] Controller attach succeeded for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84") device path: "csi-6d4834d91c3e9daa7486f6b0c1c11f4028eb932be56d946c78c87723718d6965"
Sep 06 13:41:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:07.717075 1634 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:41:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:07.719954 1634 operation_generator.go:587] MountVolume.WaitForAttach entering for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID:
"f6286169-d0ab-11e9-8b8c-0a41e590bc84") DevicePath "csi-6d4834d91c3e9daa7486f6b0c1c11f4028eb932be56d946c78c87723718d6965"
Sep 06 13:41:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:41:07.725615 1634 operation_generator.go:596] MountVolume.WaitForAttach succeeded for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84") DevicePath "csi-6d4834d91c3e9daa7486f6b0c1c11f4028eb932be56d946c78c87723718d6965"

---> <start> interspersed CSI call logs <---
===> request to stage (this request times out and also fails within ceph-csi, but that is not of concern for this bug; track the ID: 23 response for clarity)
I0906 13:41:07.745318 99360 utils.go:119] ID: 23 GRPC call: /csi.v1.Node/NodeStageVolume
I0906 13:41:07.745342 99360 utils.go:120] ID: 23 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1567714858772-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230"}
---> <end> interspersed CSI calls <---

===> back to kube logs, where the first request fails
Sep 06 13:43:07 ip-10-0-131-197 hyperkube[1634]: E0906 13:43:07.729137 1634 csi_mounter.go:433] kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount: directory not empty
Sep 06 13:43:07 ip-10-0-131-197 hyperkube[1634]: E0906 13:43:07.729153 1634 csi_attacher.go:322]
kubernetes.io/csi: attacher.MountDevice failed to remove mount dir after errir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount: directory not empty
Sep 06 13:43:07 ip-10-0-131-197 hyperkube[1634]: E0906 13:43:07.729285 1634 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\"" failed. No retries permitted until 2019-09-06 13:43:08.229236489 +0000 UTC m=+62614.742219063 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230\") pod \"osio-worker-1\" (UID: \"f6286169-d0ab-11e9-8b8c-0a41e590bc84\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Sep 06 13:43:07 ip-10-0-131-197 hyperkube[1634]: I0906 13:43:07.729736 1634 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ocs-monkey", Name:"osio-worker-1", UID:"f6286169-d0ab-11e9-8b8c-0a41e590bc84", APIVersion:"v1", ResourceVersion:"562895", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.MountDevice failed for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

===> Second request starting from kube
Sep 06 13:43:08 ip-10-0-131-197 hyperkube[1634]: I0906 13:43:08.321644 1634 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")
Sep 06 13:43:08 ip-10-0-131-197 hyperkube[1634]:
I0906 13:43:08.322177 1634 operation_generator.go:587] MountVolume.WaitForAttach entering for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84") DevicePath "csi-6d4834d91c3e9daa7486f6b0c1c11f4028eb932be56d946c78c87723718d6965"
Sep 06 13:43:08 ip-10-0-131-197 hyperkube[1634]: I0906 13:43:08.326371 1634 operation_generator.go:596] MountVolume.WaitForAttach succeeded for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84") DevicePath "csi-6d4834d91c3e9daa7486f6b0c1c11f4028eb932be56d946c78c87723718d6965"

---> <start> interspersed CSI call logs <---
I0906 13:43:08.337859 99360 utils.go:119] ID: 29 GRPC call: /csi.v1.Node/NodeStageVolume
I0906 13:43:08.337882 99360 utils.go:120] ID: 29 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1567714858772-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230"}

===> Blocked on a lock that is still held by the earlier ID: 23 request for the same volume
E0906 13:43:13.283242 99360 utils.go:123] ID: 23 GRPC error: rpc error: code = Internal desc = rbd: map failed exit status 110, rbd output: rbd: wait failed In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (110) Connection timed out

===> ID: 23 request finally fails and releases the lock
===> ID: 29 succeeds
I0906 13:43:13.628974 99360 utils.go:125] ID: 29 GRPC response: {}
I0906 13:43:13.642323 99360 utils.go:119] ID: 31 GRPC call: /csi.v1.Node/NodePublishVolume
I0906 13:43:13.642341 99360 utils.go:120] ID: 31 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount","target_path":"/var/lib/kubelet/pods/f6286169-d0ab-11e9-8b8c-0a41e590bc84/volumes/kubernetes.io~csi/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1567714858772-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230"}
I0906 13:43:13.678720 99360 utils.go:125] ID: 31 GRPC response: {}
---> <end> interspersed CSI calls <---

===> So CSI took 2:06 minutes to complete its operation (due to an internal failure, but irrelevant to this bug)
Sep 06 13:43:13 ip-10-0-131-197 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount/0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230.
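The serialization visible above — the retried NodeStageVolume (ID: 29) waiting until the in-flight ID: 23 call for the same volume released its lock — can be modeled as a per-volume-ID lock table. This is an illustrative sketch only, not the actual ceph-csi code; the class and method names are hypothetical:

```python
import threading
from collections import defaultdict

class VolumeLocks:
    """Illustrative per-volume serialization: a NodeStageVolume call
    takes the lock for its volume ID, so a retried call for the same
    volume must wait until the first in-flight call fails or succeeds
    and releases it."""

    def __init__(self):
        self._table_lock = threading.Lock()       # guards the lock table
        self._locks = defaultdict(threading.Lock)  # one lock per volume ID

    def try_acquire(self, volume_id):
        # Non-blocking variant: returns False when another operation on
        # this volume is already in flight (ID: 29 above hit the
        # blocking equivalent and simply waited).
        with self._table_lock:
            lock = self._locks[volume_id]
        return lock.acquire(blocking=False)

    def release(self, volume_id):
        with self._table_lock:
            self._locks[volume_id].release()
```

In the logs above this is why ID: 29 only returned (13:43:13.628974) immediately after ID: 23 errored out (13:43:13.283242): both operated on volume 0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230, so the retry was queued behind the timed-out rbd map.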
===> kube detects successful mount (this is a new message compared to the last version under test, just an observation)
Sep 06 13:43:13 ip-10-0-131-197 hyperkube[1634]: I0906 13:43:13.629848 1634 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a/globalmount"

===> Final success message below took 5:06 minutes, possibly (from code reading) due to the recursive fsGroup setting
Sep 06 13:48:19 ip-10-0-131-197 hyperkube[1634]: I0906 13:48:19.525830 1634 operation_generator.go:667] MountVolume.SetUp succeeded for volume "pvc-929d9440-d01c-11e9-84fe-02b79d8eaf4a" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-92b4f24c-d01c-11e9-9996-0a580a810230") pod "osio-worker-1" (UID: "f6286169-d0ab-11e9-8b8c-0a41e590bc84")

Thanks @srangana

Update the bug to Verified according to https://bugzilla.redhat.com/show_bug.cgi?id=1745773#c28

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
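For reference, the recursive fsGroup behavior suspected in the log analysis amounts to walking every entry under the mounted volume and changing its group ownership; the cost grows with the number of files, which is consistent with a multi-minute MountVolume.SetUp on a well-populated volume. A rough sketch under that assumption (the helper name is hypothetical; this is not the kubelet implementation):

```python
import os

def apply_fsgroup(root, fsgroup):
    """Hypothetical sketch of recursive fsGroup application: chgrp
    every entry under the volume root while keeping the owning uid.
    The walk is O(number of files), so a volume already full of files
    can make this step take minutes."""
    touched = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            os.lchown(path, -1, fsgroup)  # -1 leaves the uid unchanged
            touched += 1
    return touched
```

The kubelet of this era also adjusted permission bits during the walk and re-applied the ownership change on every SetUp, regardless of whether the files already had the right group, which is why a second pod mounting a volume full of files created by the first pod pays the full cost again.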