Bug 1931875

Summary: [4.6] [External Mode] [BareMetal] [Assisted] ocs-operator stuck in Installing phase and noobaa-db-0 pod in Init:0/1

Product: Red Hat OpenShift Data Foundation (Red Hat Storage)
Component: csi-driver
Version: 4.6
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Type: Bug
Reporter: Sidhant Agrawal <sagrawal>
Assignee: Mudit Agarwal <muagarwa>
QA Contact: Elad <ebenahar>
CC: alisauxbpn53, dzaken, madam, muagarwa, ndevos, ocs-bugs, odf-bz-bot, rar
Keywords: AutomationBackLog
Bug Blocks: 1910199
Last Closed: 2021-08-20 07:40:42 UTC

Comment 6 Mudit Agarwal 2021-02-24 13:51:02 UTC
The PVC was created successfully, but it looks like the ceph command issued while attaching it to the pod got stuck and never returned:

2021-02-23T11:56:02.429823495Z I0223 11:56:02.429804  138175 utils.go:159] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
2021-02-23T11:56:02.431001275Z I0223 11:56:02.430983  138175 utils.go:160] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
2021-02-23T11:56:02.431228335Z I0223 11:56:02.431215  138175 rbd_util.go:771] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b setting disableInUseChecks on rbd volume to: false
2021-02-23T11:56:08.789108175Z I0223 11:56:08.789058  138175 omap.go:84] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b got omap values: (pool="ocs-cbp-46", namespace="", name="csi.volume.04dd2255-75ce-11eb-9cce-0a580a82001b"): map[csi.imageid:3b10758c09cd7f csi.imagename:csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b csi.volname:pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044]
2021-02-23T11:56:08.953645982Z I0223 11:56:08.953550  138175 cephcmds.go:59] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b command succeeded: rbd [device list --format=json --device-type krbd]
2021-02-23T11:56:14.399672171Z I0223 11:56:14.399625  138175 rbd_attach.go:215] ID: 4 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b rbd: map mon 10.1.8.70:6789

The last line above comes from https://github.com/openshift/ceph-csi/blob/49cf5efdd5663986c1c23c66163576bf77fdccb4/internal/rbd/rbd_attach.go#L215

which is where the rbd map command is executed. It looks like that command never returned, and that is why we are in this situation.


Now, the bad part is that we lost the cluster before we could get into this container and check where the map command was still stuck.
Then again, this can happen because of a connectivity issue at that point in time, so we should see if it is reproducible; otherwise there is not much we can use to debug it further.
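
For illustration only (this is not the actual ceph-csi code; the function name and timeout value are made up): an external "rbd map" invocation that runs without any deadline blocks the whole NodeStageVolume RPC for as long as the kernel map request hangs, which matches what the log above shows. A context timeout would at least turn such a hang into a visible error:

====================
// Illustrative sketch only, not the ceph-csi implementation.
package rbdattach

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// mapRBDImage runs "rbd map <pool>/<image> -m <mon>" with a deadline.
func mapRBDImage(parent context.Context, pool, image, mon string) (string, error) {
	// Without this deadline, the call below blocks for as long as the
	// kernel map request is stuck, and so does the NodeStageVolume RPC.
	ctx, cancel := context.WithTimeout(parent, 2*time.Minute) // illustrative value
	defer cancel()

	out, err := exec.CommandContext(ctx, "rbd", "map", pool+"/"+image, "-m", mon).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("rbd map %s/%s failed: %w (output: %q)", pool, image, err, out)
	}
	return string(out), nil // e.g. "/dev/rbd0"
}
====================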

Comment 8 Mudit Agarwal 2021-02-25 06:30:19 UTC
Thanks Sidhant for all the help.

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1931875#c6, the attach didn't progress because the "map" command was stuck.

Before I could look into the pod to see where it was stuck, the plugin pod was restarted manually; the new plugin pod is csi-rbdplugin-r2jsg.

While looking at the plugin container of the new pod I saw that the original mapping request eventually completed successfully.
Please note that the command was issued inside the old container 2 hours back and it only completed after those 2 hours:

===================
I0223 13:03:14.680894  436846 utils.go:159] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
I0223 13:03:14.682159  436846 utils.go:160] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
I0223 13:03:14.682349  436846 rbd_util.go:771] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b setting disableInUseChecks on rbd volume to: false
I0223 13:03:16.755465  436846 utils.go:159] ID: 6 GRPC call: /csi.v1.Identity/Probe
I0223 13:03:16.755927  436846 utils.go:160] ID: 6 GRPC request: {}
I0223 13:03:16.756260  436846 utils.go:165] ID: 6 GRPC response: {}
I0223 13:03:20.806436  436846 omap.go:84] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b got omap values: (pool="ocs-cbp-46", namespace="", name="csi.volume.04dd2255-75ce-11eb-9cce-0a580a82001b"): map[csi.imageid:3b10758c09cd7f csi.imagename:csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b csi.volname:pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044]
I0223 13:03:20.862349  436846 cephcmds.go:59] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b command succeeded: rbd [device list --format=json --device-type krbd]
I0223 13:03:20.862418  436846 nodeserver.go:288] ID: 5 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b rbd image: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b/ocs-cbp-46 was successfully mapped at /dev/rbd0
====================

Now comes the strange part: even though the PVC is bound and the attach request completed in the messages above, we can still see further NodeStage calls for the same PVC failing, complaining that an operation already exists for it:

====================
I0223 13:07:16.691854  436846 utils.go:159] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC call: /csi.v1.Node/NodeStageVolume
I0223 13:07:16.693092  436846 utils.go:160] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3ffeb68e-17cb-4d0b-8a09-45acb7c08044/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-04dd2255-75ce-11eb-9cce-0a580a82001b","journalPool":"ocs-cbp-46","pool":"ocs-cbp-46","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1614081315035-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b"}
E0223 13:07:16.693904  436846 nodeserver.go:141] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b already exists
E0223 13:07:16.693974  436846 utils.go:163] ID: 11 Req-ID: 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000007-04dd2255-75ce-11eb-9cce-0a580a82001b already exists
===================

Not sure whether this is expected, or whether there is a real bug because of which we never sent the acknowledgement back and kubernetes kept retrying.
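
For context, returning "operation already exists" with code Aborted is the standard way for a CSI driver to reject a request while a previous operation on the same volume ID is still in flight. A minimal sketch of such a guard (illustrative names only, not the actual ceph-csi implementation) looks like this:

====================
// Illustrative sketch only, not the ceph-csi implementation.
package nodeserver

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// volumeLocks tracks volume IDs that have an operation in flight.
type volumeLocks struct {
	mu     sync.Mutex
	active map[string]struct{}
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{active: make(map[string]struct{})}
}

// tryAcquire returns false if an operation for volID is already running.
func (v *volumeLocks) tryAcquire(volID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, busy := v.active[volID]; busy {
		return false
	}
	v.active[volID] = struct{}{}
	return true
}

func (v *volumeLocks) release(volID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.active, volID)
}

func nodeStageVolume(locks *volumeLocks, volID string) error {
	if !locks.tryAcquire(volID) {
		// This is the response kubelet keeps getting in the log above,
		// because the earlier NodeStageVolume for this ID never finished.
		return status.Errorf(codes.Aborted,
			"an operation with the given Volume ID %s already exists", volID)
	}
	defer locks.release(volID)
	// ... map the image, create the filesystem, mount it, update the journal ...
	return nil
}
====================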

Need more analysis.

Comment 9 Mudit Agarwal 2021-02-25 13:15:18 UTC
Had an offline discussion with Niels about this issue; thanks, Niels, for explaining the scenario.

1. Initially the map command was stuck; we don't know the reason because
   a) we don't have enough logging for it [1], and
   b) the pod was restarted.

2. Later the pod was restarted and the command succeeded, but the filesystem probably did not get created because the process was aborted abruptly after the long delay.

3. Because the process was aborted, the journal still contains a note that an operation is in progress; we don't handle the journal cleanup. [2]

4. That is why we never sent the ack back and kubernetes kept retrying the operation.

This can be improved by fixing [1] and [2]; hence, moving this BZ out of 4.7.

Steps to recover from this are:

1. Unmap the rbd image.
2. Remove the entry from the journal.
3. Kubernetes will then retry the operation.

OR

1. Unmap the rbd image.
2. Delete the PVC.

[1] https://github.com/ceph/ceph-csi/issues/1619
[2] https://github.com/ceph/ceph-csi/issues/1881#issuecomment-782116902

Comment 10 Mudit Agarwal 2021-05-27 03:06:28 UTC
No plans to make these improvements in 4.8; moving it out.

Comment 11 Mudit Agarwal 2021-08-20 07:40:42 UTC
We should not see this issue after the fix for BZ #1910790, marking it as a duplicate.

*** This bug has been marked as a duplicate of bug 1910790 ***
