Description of problem (please be as detailed as possible and provide log snippets):

While running the pod re-attach test, it took more than 3 minutes to re-attach a pod to an existing PVC.

Version of all relevant components (if applicable):

OCP versions
==============
clientVersion:
  buildDate: "2021-02-08T03:07:40Z"
  compiler: gc
  gitCommit: c66c03f3012a10f16eb86fdce6330433adf6c9ee
  gitTreeState: clean
  gitVersion: 4.7.0-202102080214.p0-c66c03f
  goVersion: go1.15.5
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.7.0
releaseClientVersion: 4.7.0-0.nightly-2021-02-09-224509
serverVersion:
  buildDate: "2021-02-11T23:05:38Z"
  compiler: gc
  gitCommit: bd9e4421804c212e6ac7ee074050096f08dda543
  gitTreeState: clean
  gitVersion: v1.20.0+bd9e442
  goVersion: go1.15.5
  major: "1"
  minor: "20"
  platform: linux/amd64

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0     True        False         22h     Cluster version is 4.7.0

OCS versions
==============
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-278.ci   OpenShift Container Storage   4.7.0-278.ci              Succeeded

Rook versions
===============
rook: 4.7-103.a0622de60.release_4.7
go: go1.15.5

Ceph versions
===============
ceph version 14.2.11-123.el8cp (f02fa4f00c2417b1bc86e6ec7711756454e70716) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes.

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
Yes, on the previous version (4.6) it took less than 60 seconds.

Steps to Reproduce:
1. Deploy OCS 4.7.
2. Create a PVC and attach it to a pod.
3. Write some data to the PVC.
4. Delete the pod.
5. Create a new pod and attach it to the same PVC.
6. Measure the time until the pod reaches the Running state.

Actual results:
The time to re-attach is more than 3 minutes.

Expected results:
The pod reaches the Running state in less than 60 seconds.

Additional info:
From the pod describe output:

Events:
  Type     Reason       Age                From     Message
  ----     ------       ----               ----     -------
  Warning  FailedMount  54s                kubelet  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc default-token-97qhx]: timed out waiting for the condition
  Warning  FailedMount  37s (x7 over 69s)  kubelet  MountVolume.SetUp failed for volume "pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d already exists
  Warning  FailedMount  5s                 kubelet  MountVolume.SetUp failed for volume "pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118" : rpc error: code = Internal desc = mount failed: exit status 32
    Mounting command: mount
    Mounting arguments: -t ext4 -o bind,_netdev /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount
    Output: mount: /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d does not exist.
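The reproduction steps above can be sketched as a minimal manifest pair. This is a hypothetical illustration, not the exact test artifacts: the StorageClass name (`ocs-storagecluster-ceph-rbd` is the default OCS RBD class) and the container image are assumptions.

```yaml
# Hypothetical minimal manifests for the re-attach test.
# StorageClass name and image are assumptions, not taken from the report.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: mypvc
          mountPath: /mnt/data
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: mypvc
```

After the pod is Running, write some data to the volume (e.g. `oc exec pod-test-rbd -- dd if=/dev/zero of=/mnt/data/f bs=1M count=4096`), delete the pod, re-create it against the same PVC, and measure the time until it reaches Running again.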
Looks like we spent a lot of time in unstage (~3 minutes):

>> The unstage request came at 12:03:25:

2021-03-08T12:03:25.421849243Z I0308 12:03:25.421834 13690 utils.go:132] ID: 1450 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-03-08T12:03:25.421883806Z I0308 12:03:25.421872 13690 utils.go:133] ID: 1450 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d"}

>> But it was stuck somewhere, and the response came back only at 12:06:01:

2021-03-08T12:06:01.647562695Z I0308 12:06:01.647552 13690 utils.go:138] ID: 1467 GRPC response: {}
2021-03-08T12:06:22.816643460Z I0308 12:06:22.816609 13690 cephcmds.go:59] ID: 1450 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d command succeeded: rbd [unmap ocs-storagecluster-cephblockpool/csi-vol-d6e58758-8005-11eb-8383-0a580a80020d --device-type krbd --options noudev]
2021-03-08T12:06:22.816643460Z I0308 12:06:22.816634 13690 nodeserver.go:690] ID: 1450 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d successfully unmounted volume (0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d)
2021-03-08T12:06:22.816752618Z I0308 12:06:22.816740 13690 utils.go:138] ID: 1450 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC response: {}

>> For all that time, NodePublish (for the new pod) was complaining that the volume was already busy with some other operation (which is true, as it was busy unstaging):

2021-03-08T12:05:25.445361777Z I0308 12:05:25.445342 13690 utils.go:132] ID: 1454 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC call: /csi.v1.Node/NodePublishVolume
2021-03-08T12:05:25.445461133Z I0308 12:05:25.445437 13690 utils.go:133] ID: 1454 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount","target_path":"/var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118","csi.storage.k8s.io/pvc/name":"pvc-test-8243dcaf4fae4e578bfa113b044f5f9","csi.storage.k8s.io/pvc/namespace":"namespace-test-f4d54062089b438aa2d20ce83","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-d6e58758-8005-11eb-8383-0a580a80020d","journalPool":"ocs-storagecluster-cephblockpool","pool":"ocs-storagecluster-cephblockpool","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1615131414196-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d"}
2021-03-08T12:05:25.445468481Z E0308 12:05:25.445458 13690 nodeserver.go:405] ID: 1454 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d already exists

>> Once unstaging for the previous pod finished, Kubernetes tried to run NodePublish again, which this time failed because the mount failed:
2021-03-08T12:06:29.508535032Z I0308 12:06:29.508482 13690 utils.go:132] ID: 1469 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC call: /csi.v1.Node/NodePublishVolume
2021-03-08T12:06:29.508617542Z I0308 12:06:29.508601 13690 utils.go:133] ID: 1469 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount","target_path":"/var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118","csi.storage.k8s.io/pvc/name":"pvc-test-8243dcaf4fae4e578bfa113b044f5f9","csi.storage.k8s.io/pvc/namespace":"namespace-test-f4d54062089b438aa2d20ce83","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-d6e58758-8005-11eb-8383-0a580a80020d","journalPool":"ocs-storagecluster-cephblockpool","pool":"ocs-storagecluster-cephblockpool","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1615131414196-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d"}
2021-03-08T12:06:29.508727433Z I0308 12:06:29.508711 13690 nodeserver.go:518] ID: 1469 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d target /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount
2021-03-08T12:06:29.508727433Z isBlock false
2021-03-08T12:06:29.508727433Z fstype ext4
2021-03-08T12:06:29.508727433Z stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d
2021-03-08T12:06:29.508727433Z readonly false
2021-03-08T12:06:29.508727433Z mountflags [bind _netdev]
2021-03-08T12:06:29.515325138Z E0308 12:06:29.515280 13690 mount_linux.go:150] Mount failed: exit status 32
2021-03-08T12:06:29.515325138Z Mounting command: mount
2021-03-08T12:06:29.515325138Z Mounting arguments: -t ext4 -o bind,_netdev /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount
2021-03-08T12:06:29.515325138Z Output: mount: /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d does not exist.
2021-03-08T12:06:29.515356177Z E0308 12:06:29.515321 13690 utils.go:136] ID: 1469 Req-ID: 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d GRPC error: rpc error: code = Internal desc = mount failed: exit status 32
2021-03-08T12:06:29.515356177Z Mounting command: mount
2021-03-08T12:06:29.515356177Z Mounting arguments: -t ext4 -o bind,_netdev /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount
2021-03-08T12:06:29.515356177Z Output: mount: /var/lib/kubelet/pods/4ba47c20-4bdf-4738-9f25-8c49c51aeb16/volumes/kubernetes.io~csi/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d does not exist.

>> The mount failed saying /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118/globalmount/0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d does not exist.

So for the first pod, NodeStageVolume and NodePublishVolume were called; later, because we deleted the pod, NodeUnpublish and NodeUnstageVolume were called. For the second pod, NodePublishVolume was called directly. Shouldn't NodeStageVolume also have been called before it, or am I missing something here?

Madhu, can you please take a quick look?
Mudit, the kubelet receiving the response to NodeUnstage (>2021-03-08T12:06:22) and issuing NodePublishVolume (<2021-03-08T12:06:29) would have happened close together in time. In other words, staging is a global (node-level) mount operation, so if the volume is already staged at the time publish is issued, only publish is required. So I don't think (at least at first glance) there is an issue there. The real issue is that unstage took so long (most likely because communication from CSI or the backend cluster was slow for this operation).
Yeah, so it's a kind of timing issue. Unstage was issued but didn't complete, hence Publish was issued instead of StageVolume. If Unstage had completed before Publish, we would have issued StageVolume before PublishVolume, right? If that is correct, this can happen at any time; ideally, before issuing the command to attach the PVC to another pod, we must make sure that unstaging for the previous one has completed.

>> The real issue is that unstage took so long (most likely because communication from CSI or the backend cluster was slow for this operation)

This is difficult to debug, for obvious reasons.
Is this reproducible? If yes, can we please get a live setup.
(In reply to Mudit Agarwal from comment #8)
> Is this reproducible? If yes, can we please get a live setup.

Reproduced. Cluster connection details were sent to @Mudit.
Thanks for the setup Avi, please keep it around for some more time if possible.

Here is the dmesg output, which shows that the umount was blocked:

[148549.098223] INFO: task umount:2709303 blocked for more than 120 seconds.
[148549.098373]       Not tainted 4.18.0-240.10.1.el8_3.x86_64 #1
[148549.098492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[148549.098651] umount          D    0 2709303 2484925 0x00004080
[148549.098771] Call Trace:
[148549.098831]  __schedule+0x2a6/0x700
[148549.098910]  schedule+0x38/0xa0
[148549.098999]  wb_wait_for_completion+0x52/0x80
[148549.099094]  ? finish_wait+0x80/0x80
[148549.099173]  __writeback_inodes_sb_nr+0xa1/0xc0
[148549.099272]  sync_filesystem+0x26/0x90
[148549.099357]  generic_shutdown_super+0x22/0x100
[148549.099452]  kill_block_super+0x21/0x50
[148549.099535]  deactivate_locked_super+0x34/0x70
[148549.099630]  cleanup_mnt+0x3b/0x70
[148549.099705]  task_work_run+0x8a/0xb0
[148549.099784]  exit_to_usermode_loop+0xeb/0xf0
[148549.099876]  do_syscall_64+0x198/0x1a0
[148549.099970]  entry_SYSCALL_64_after_hwframe+0x65/0xca

It looks like it was doing a filesystem sync, which is expected if there is a lot of dirty data to flush.

Has something changed recently in this test case? Have we changed the amount of data written before deleting the pod?
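The stack trace above shows umount blocked in sync_filesystem()/writeback. A quick way to confirm the dirty-data theory on the node (a diagnostic sketch, assuming SSH or debug-shell access to a Linux node) is to check how much page cache is pending writeback just before the pod is deleted:

```shell
# Inspect pending page-cache writeback on the node (values in kB).
# A large "Dirty" value just before the pod is deleted means the
# umount will block in sync_filesystem() until the data is flushed,
# matching the hung-task trace above.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

If Dirty is in the gigabyte range while the RBD-backed filesystem is being unmounted, a multi-minute NodeUnstage is expected rather than a driver bug.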
(In reply to Mudit Agarwal from comment #10)
> Thanks for the setup Avi, please keep it around for some more time if
> possible.
>
> Here is the dmesg output which says that the umount was blocked:
>
> [148549.098223] INFO: task umount:2709303 blocked for more than 120 seconds.
> [148549.098373] Not tainted 4.18.0-240.10.1.el8_3.x86_64 #1
> [148549.098492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [148549.098651] umount D 0 2709303 2484925 0x00004080
> [148549.098771] Call Trace:
> [148549.098831] __schedule+0x2a6/0x700
> [148549.098910] schedule+0x38/0xa0
> [148549.098999] wb_wait_for_completion+0x52/0x80
> [148549.099094] ? finish_wait+0x80/0x80
> [148549.099173] __writeback_inodes_sb_nr+0xa1/0xc0
> [148549.099272] sync_filesystem+0x26/0x90
> [148549.099357] generic_shutdown_super+0x22/0x100
> [148549.099452] kill_block_super+0x21/0x50
> [148549.099535] deactivate_locked_super+0x34/0x70
> [148549.099630] cleanup_mnt+0x3b/0x70
> [148549.099705] task_work_run+0x8a/0xb0
> [148549.099784] exit_to_usermode_loop+0xeb/0xf0
> [148549.099876] do_syscall_64+0x198/0x1a0
> [148549.099970] entry_SYSCALL_64_after_hwframe+0x65/0xca
>
> But looks like it was doing a filesystem sync which is possible if we have a
> lot of dirty data to clean.
>
> Has something changed recently in this test case, have we changed the amount
> of data we were writing before?

The test case did not change; we write ~4 GiB of data to the PVC.
This looks like an OCP/kubelet bug. It should not call NodePublish while NodeUnstage is in progress.

From the kubelet logs at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1936408/node-logs/compute-0.logs:

Mar 08 12:05:25.420858 compute-0 hyperkube[1885]: E0308 12:05:25.420831 1885 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d podName: nodeName:}" failed. No retries permitted until 2021-03-08 12:05:25.920804185 +0000 UTC m=+74244.368749443 (durationBeforeRetry 500ms). Error: "UnmountDevice failed for volume \"pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d\") on node \"compute-0\" : kubernetes.io/csi: attacher.UnmountDevice failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Mar 08 12:05:25.420858 compute-0 hyperkube[1885]: I0308 12:05:25.420852 1885 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
Mar 08 12:05:25.432995 compute-0 hyperkube[1885]: I0308 12:05:25.432975 1885 reconciler.go:269] operationExecutor.MountVolume started for volume "pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d") pod "pod-test-rbd-8965d0062c7f42bb80d9b3d0502" (UID: "4ba47c20-4bdf-4738-9f25-8c49c51aeb16")
Mar 08 12:05:25.433289 compute-0 hyperkube[1885]: I0308 12:05:25.433274 1885 operation_generator.go:556] MountVolume.WaitForAttach entering for volume "pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d") pod "pod-test-rbd-8965d0062c7f42bb80d9b3d0502" (UID: "4ba47c20-4bdf-4738-9f25-8c49c51aeb16") DevicePath "csi-23ce19fe9cdc5c034bb1b4b68c40cdbfd98e40db72f100d01b30d7f9f0a0369f"
Mar 08 12:05:25.442269 compute-0 hyperkube[1885]: I0308 12:05:25.442253 1885 operation_generator.go:565] MountVolume.WaitForAttach succeeded for volume "pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118" (UniqueName: "kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d") pod "pod-test-rbd-8965d0062c7f42bb80d9b3d0502" (UID: "4ba47c20-4bdf-4738-9f25-8c49c51aeb16") DevicePath "csi-23ce19fe9cdc5c034bb1b4b68c40cdbfd98e40db72f100d01b30d7f9f0a0369f"
...
Mar 08 12:05:25.445712 compute-0 hyperkube[1885]: E0308 12:05:25.445630 1885 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d podName: nodeName:}" failed. No retries permitted until 2021-03-08 12:05:25.945604174 +0000 UTC m=+74244.393549424 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-cccf1173-91f6-4184-9ad1-d68cef4b0118\" (UniqueName: \"kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com^0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d\") pod \"pod-test-rbd-8965d0062c7f42bb80d9b3d0502\" (UID: \"4ba47c20-4bdf-4738-9f25-8c49c51aeb16\") : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-d6e58758-8005-11eb-8383-0a580a80020d already exists"

We can see that the MountVolume operation is started immediately after UnmountDevice times out, and it considers the volume's device as already mounted!?

Version: v1.20.0+ba45583

Please set the severity from your POV.
I just ran a new test on the same hardware but with OCP 4.6 (latest GA) and OCS 4.6 (latest GA), as suggested in comment 14.

I hit the same issue.

My suspicion is that in the tests that ran in the past, data was not actually written to the PVC, and now it is.
(In reply to Avi Liani from comment #19)
> i just ran a new test on the same hardware but with OCP 4.6 (Latest GA) and
> OCS 4.6 (Latest GA) as suggested on comment 14.
>
> I hit the same issue.
>
> my suspicious is that in the tests that ran in the past data was not
> actually was written to the PVC, and now it does

The kubelet timing/call-ordering issue described in earlier comments will be tracked with this bug. However, the unmount delay (the unstage call) is a different problem. Mudit, do we need a separate bug to track the second case?
There are two problems here:
[1] Unstage taking a long time.
[2] NodePublish getting called while NodeUnstage has not finished.

For [1] we should add a sync to the test before deleting the pod, because if https://bugzilla.redhat.com/show_bug.cgi?id=1936408#c19 is correct, then these tests never really passed, and it is not a regression: we have to wait, before deletion completes, until all the data is synced. That is expected behaviour.

For [2] we have this BZ, and here too it does not look like a regression, since we can see the issue on both OCP 4.6 and OCP 4.7; but we still have a bug, which is being worked on by the OCP team.

Thanks Avi for helping continuously with the reproduction on different setups.
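One way to implement the suggested sync before pod deletion, without changing the test driver, is a preStop hook in the pod spec. This is a sketch under the assumption that the test pod's image ships a `sync` binary; the container name and image are hypothetical:

```yaml
# Hypothetical container snippet: flush dirty page-cache data before the
# container stops, so the subsequent unmount/unstage completes quickly.
containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    lifecycle:
      preStop:
        exec:
          command: ["sync"]
```

The pod's terminationGracePeriodSeconds must be large enough for the sync of ~4 GiB of dirty data to finish, otherwise the kubelet kills the container before the flush completes and the umount still blocks in writeback.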
@muagarwa, what is the reasoning behind "urgent" severity? This is not a regression and it has no customer ticket. It has been like this for ages and nobody cared. OCP recovers in a minute or two. For me it looks like "medium" at best.
Agree. This was a blocker for OCS initially, hence the severity was urgent. Moving it back to medium.
Filed issue upstream: https://github.com/kubernetes/kubernetes/issues/100182
Upstream PR: https://github.com/kubernetes/kubernetes/pull/100183
Is there a workaround for this issue? I see that upstream the fix is scheduled for k8s 1.21, but the issue still reproduces in k8s 1.20.

Thanks,
Adrian
I managed to reproduce this quite often in my environment during deployment rolling updates. Sometimes the pod recovers, but in some cases it remains indefinitely in a failing state. In the example below, my pod did not recover in 9 hours (and probably never will):

Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  48m (x28 over 9h)     kubelet  Unable to attach or mount volumes: unmounted volumes=[datadir], unattached volumes=[datadir config default-token-2rjsd snapshots userconfig certificates]: timed out waiting for the condition
  Warning  FailedMount  23m (x35 over 8h)     kubelet  Unable to attach or mount volumes: unmounted volumes=[datadir], unattached volumes=[userconfig certificates datadir config default-token-2rjsd snapshots]: timed out waiting for the condition
  Warning  FailedMount  7m49s (x274 over 9h)  kubelet  MountVolume.SetUp failed for volume "pvc-f59bf6bf-2c6a-4862-9f14-4f6242c4a628" : rpc error: code = Internal desc = mount failed: exit status 32
    Mounting command: mount
    Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f59bf6bf-2c6a-4862-9f14-4f6242c4a628/globalmount/0001-000b-kube-system-0000000000000001-f91330ea-a6d1-11eb-bd04-2ac100172dbc /var/lib/kubelet/pods/acd20598-13ab-40c1-b896-eede4932b49f/volumes/kubernetes.io~csi/pvc-f59bf6bf-2c6a-4862-9f14-4f6242c4a628/mount
    Output: mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f59bf6bf-2c6a-4862-9f14-4f6242c4a628/globalmount/0001-000b-kube-system-0000000000000001-f91330ea-a6d1-11eb-bd04-2ac100172dbc does not exist
  Warning  FailedMount  2m45s (x59 over 9h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[datadir], unattached volumes=[default-token-2rjsd snapshots userconfig certificates datadir config]: timed out waiting for the condition

What should I do in this case? Is there a workaround?

Thanks,
Adrian
After looking at the csi-rbdplugin logs, it seems the driver tried NodeUnstageVolume three times, but all of them failed with "rbd: sysfs write failed / rbd: unmap failed: (16) Device or resource busy". Afterwards it tried only NodePublishVolume, which failed with the issue described in this ticket:

I0330 08:29:52.014056 29552 utils.go:119] ID: 159 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0330 08:29:52.014086 29552 utils.go:120] ID: 159 GRPC request: {"target_path":"/var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount","volume_id":"0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92"}
I0330 08:29:52.017840 29552 mount_linux.go:209] Unmounting /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount
I0330 08:29:52.039340 29552 nodeserver.go:454] ID: 159 rbd: successfully unbound volume 0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 from /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount
I0330 08:29:52.039395 29552 utils.go:125] ID: 159 GRPC response: {}
I0330 08:29:52.122056 29552 utils.go:119] ID: 160 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0330 08:29:52.122072 29552 utils.go:120] ID: 160 GRPC request: {}
I0330 08:29:52.122559 29552 utils.go:125] ID: 160 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0330 08:29:52.127363 29552 utils.go:119] ID: 161 GRPC call: /csi.v1.Node/NodeUnstageVolume
I0330 08:29:52.127386 29552 utils.go:120] ID: 161 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount","volume_id":"0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92"}
I0330 08:29:52.128746 29552 mount_linux.go:209] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92
E0330 08:29:52.531817 29552 nodeserver.go:526] ID: 161 error unmapping volume (0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92): (rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy))
E0330 08:29:52.535070 29552 utils.go:123] ID: 161 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy)
I0330 08:29:53.121450 29552 utils.go:119] ID: 162 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0330 08:29:53.121495 29552 utils.go:120] ID: 162 GRPC request: {}
I0330 08:29:53.123497 29552 utils.go:125] ID: 162 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0330 08:29:53.129785 29552 utils.go:119] ID: 163 GRPC call: /csi.v1.Node/NodeUnstageVolume
I0330 08:29:53.129801 29552 utils.go:120] ID: 163 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount","volume_id":"0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92"}
E0330 08:29:53.448344 29552 nodeserver.go:526] ID: 163 error unmapping volume (0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92): (rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy))
E0330 08:29:53.448386 29552 utils.go:123] ID: 163 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy)
I0330 08:29:54.527037 29552 utils.go:119] ID: 164 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0330 08:29:54.527071 29552 utils.go:120] ID: 164 GRPC request: {}
I0330 08:29:54.527733 29552 utils.go:125] ID: 164 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0330 08:29:54.531367 29552 utils.go:119] ID: 165 GRPC call: /csi.v1.Node/NodeUnstageVolume
I0330 08:29:54.531384 29552 utils.go:120] ID: 165 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount","volume_id":"0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92"}
E0330 08:29:54.836841 29552 nodeserver.go:526] ID: 165 error unmapping volume (0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92): (rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy))
E0330 08:29:54.836907 29552 utils.go:123] ID: 165 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (paas-hostpath-ceph-pool/csi-vol-eed72b89-90bc-11eb-b34b-22bafeaa9e92) failed (exit status 16): (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy)
I0330 08:29:56.314054 29552 utils.go:119] ID: 166 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0330 08:29:56.314076 29552 utils.go:120] ID: 166 GRPC request: {}
I0330 08:29:56.314804 29552 utils.go:125] ID: 166 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0330 08:29:56.720227 29552 utils.go:119] ID: 167 GRPC call: /csi.v1.Node/NodePublishVolume
I0330 08:29:56.720249 29552 utils.go:120] ID: 167 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount","target_path":"/var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"kube-system","pool":"paas-hostpath-ceph-pool","storage.kubernetes.io/csiProvisionerIdentity":"1617042280095-8081-kube-system.rbd.csi.ceph.com"},"volume_id":"0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92"}
I0330 08:29:56.722590 29552 nodeserver.go:370] ID: 167 target /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount isBlock false fstype ext4 stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 readonly false mountflags []
I0330 08:29:56.726525 29552 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0330 08:29:56.726539 29552 mount_linux.go:171] systemd-run failed with: exit status 1
I0330 08:29:56.726549 29552 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I0330 08:29:56.726574 29552 mount_linux.go:142] Mounting cmd (mount) with arguments ([-t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount])
E0330 08:29:56.729191 29552 mount_linux.go:147] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount
Output: mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 does not exist
E0330 08:29:56.729226 29552 utils.go:123] ID: 167 GRPC error: rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 /var/lib/kubelet/pods/5c8b60b3-9473-4e8a-8d7e-58dd059741f4/volumes/kubernetes.io~csi/pvc-85de0c54-3860-401b-9583-8a92a603a010/mount
Output: mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-85de0c54-3860-401b-9583-8a92a603a010/globalmount/0001-000b-kube-system-0000000000000001-eed72b89-90bc-11eb-b34b-22bafeaa9e92 does not exist
The upstream PR has been merged; we are now waiting for the Kubernetes rebase in OCP. This bug should be fixed in OCP 4.9.

@dabuleanu.adrian, please open a support ticket to get the fix backported to older OCP releases.
Kubernetes rebase has landed.
Thanks krishnaram. From the test results, it is fixed in 4.9 and 4.8. Changing the status to "Verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759