The PVC was created successfully, but while attaching it to the pod the ceph command got stuck and never came back:

===================
2021-02-23T11:56:02.429823495Z I0223 11:56:02.429804  138175 utils.go:159] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
2021-02-23T11:56:02.431001275Z I0223 11:56:02.430983  138175 utils.go:160] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
2021-02-23T11:56:02.431228335Z I0223 11:56:02.431215  138175 rbd_util.go:771] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b setting disableInUseChecks on rbd volume to: false
2021-02-23T11:56:08.789108175Z I0223 11:56:08.789058  138175 omap.go:84] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b got omap values: (pool="ocs-cbp-46", namespace="", name="csi.volume.04dd2255-75ce-11eb-9cce-0a580a82001b"): map[csi.imageid:3b10758c09cd7f csi.imagename:csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b csi.volname:pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044]
2021-02-23T11:56:08.953645982Z I0223 11:56:08.953550  138175 cephcmds.go:59] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b command succeeded: rbd [device list --format=json --device-type krbd]
2021-02-23T11:56:14.399672171Z I0223 11:56:14.399625  138175 rbd_attach.go:215] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b rbd: map mon 10.1.8.70:6789
===================

The last message above comes from https://github.com/openshift/ceph-csi/blob/49cf5efdd5663986c1c23c66163576bf77fdccb4/internal/rbd/rbd_attach.go#L215, where we executed the rbd map command; it apparently never returned, which is how we ended up in this situation. The bad part is that we lost the cluster before we could go into this container and check where the map command was still stuck. This can happen because of a connectivity issue at that point in time, so we should see whether it is reproducible; otherwise there is not much with which we can debug it further.
Thanks Sidhant for all the help. As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1931875#c6, the attach did not progress because the "map" command was stuck. Before I could look into the pod to see where it was stuck, the plugin pod was restarted manually; the new plugin pod is csi-rbdplugin-r2jsg. While looking at the plugin container of the new pod I saw that the original mapping request completed successfully. Please note that the command was issued inside the old container two hours earlier and it completed after two hours:

===================
I0223 13:03:14.680894  436846 utils.go:159] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
I0223 13:03:14.682159  436846 utils.go:160] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
I0223 13:03:14.682349  436846 rbd_util.go:771] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b setting disableInUseChecks on rbd volume to: false
I0223 13:03:16.755465  436846 utils.go:159] ID: 6 GRPC call: /csi.v1.Identity/Probe
I0223 13:03:16.755927  436846 utils.go:160] ID: 6 GRPC request: {}
I0223 13:03:16.756260  436846 utils.go:165] ID: 6 GRPC response: {}
I0223 13:03:20.806436  436846 omap.go:84] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b got omap values: (pool="ocs-cbp-46", namespace="", name="csi.volume.04dd2255-75ce-11eb-9cce-0a580a82001b"): map[csi.imageid:3b10758c09cd7f csi.imagename:csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b csi.volname:pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044]
I0223 13:03:20.862349  436846 cephcmds.go:59] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b command succeeded: rbd [device list --format=json --device-type krbd]
I0223 13:03:20.862418  436846 nodeserver.go:288] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b rbd image: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b/ocs-cbp-46 was successfully mapped at /dev/rbd0
====================

Now comes the strange part: although the PVC is bound and even the attach request completed in the messages above, we can still see further NodeStage calls failing for the same PVC, complaining that an operation already exists for it:

====================
I0223 13:07:16.691854  436846 utils.go:159] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
I0223 13:07:16.693092  436846 utils.go:160] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
E0223 13:07:16.693904  436846 nodeserver.go:141] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b already exists
E0223 13:07:16.693974  436846 utils.go:163] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b already exists
===================

I am not sure whether this is expected or whether there is a real bug because of which we never sent the acknowledgement back and kept retrying. This needs more analysis.
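For context, the "already exists" Aborted error is the driver's per-volume in-flight operation guard: each NodeStage acquires a lock keyed on the Volume ID, and a retry that arrives while an earlier operation is still (or permanently) holding the lock is rejected. A minimal sketch of that pattern, with illustrative names rather than the real ceph-csi types:

```go
// Sketch of a per-volume-ID in-flight operation guard, similar in
// spirit to the check behind "an operation with the given Volume ID
// ... already exists". Names are illustrative, not ceph-csi's own.
package main

import (
	"fmt"
	"sync"
)

// VolumeLocks tracks volume IDs that have an operation in flight.
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]struct{})}
}

// TryAcquire returns false if an operation for volumeID is already
// running; the caller would then return a gRPC Aborted error.
func (v *VolumeLocks) TryAcquire(volumeID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, busy := v.locks[volumeID]; busy {
		return false
	}
	v.locks[volumeID] = struct{}{}
	return true
}

// Release marks the operation for volumeID as finished.
func (v *VolumeLocks) Release(volumeID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.locks, volumeID)
}

func main() {
	vl := NewVolumeLocks()
	id := "csi-vol-04dd2255" // abbreviated stand-in for the volume ID
	fmt.Println(vl.TryAcquire(id)) // true: first NodeStageVolume proceeds
	fmt.Println(vl.TryAcquire(id)) // false: concurrent retry gets Aborted
	vl.Release(id)
	fmt.Println(vl.TryAcquire(id)) // true: lock free again after release
}
```

If the first operation never completes and never releases the lock, every kubelet retry keeps hitting the `false` branch, which would explain repeated Aborted responses like the ones in the log excerpt.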
Had an offline discussion with Niels about this issue; thanks Niels for explaining the scenario.

1. Initially the map command was stuck. We don't know the reasons, because a) we don't have enough logging around the command [1], and b) the pod was restarted.
2. Later the pod was restarted and the command succeeded, but the filesystem was probably never created because the process was aborted abruptly after the long delay.
3. Because the process was aborted, the journal still contains a note that there is an operation in progress, and we don't handle that journal cleanup. [2]
4. That is the reason we never sent the ack back and Kubernetes kept retrying the operation.

This can be improved by fixing [1] and [2], hence moving this BZ out of 4.7.

Steps to recover from this are:
1. Unmap the rbd image.
2. Remove the in-progress entry from the journal.
3. Kubernetes will retry again.

OR

1. Unmap the rbd image.
2. Delete the PVC.

[1] https://github.com/ceph/ceph-csi/issues/1619
[2] https://github.com/ceph/ceph-csi/issues/1881#issuecomment-782116902
No plans to do these improvements in 4.8, moving it out.
We should not see this issue after the fix for BZ #1910790, marking it as a duplicate. *** This bug has been marked as a duplicate of bug 1910790 ***
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Blocking ODF deployment.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?