Bug 2164617 - Unable to expand ocs-storagecluster-ceph-rbd PVCs provisioned in Filesystem mode
Summary: Unable to expand ocs-storagecluster-ceph-rbd PVCs provisioned in Filesystem mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Nobody
QA Contact: Yuli Persky
URL:
Whiteboard:
Depends On:
Blocks: 2154341
 
Reported: 2023-01-25 20:16 UTC by bmcmurra
Modified: 2023-08-09 16:37 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.RBD Filesystem PVC expands even when the StagingTargetPath is missing
Previously, the RBD Filesystem PVC expansion was not successful when the `StagingTargetPath` was missing in the NodeExpandVolume RPC call, and Ceph CSI was not able to get the device details needed to expand. With this fix, Ceph CSI goes through all the mount references to identify the `StagingTargetPath` where the RBD image is mounted. As a result, the RBD Filesystem PVC expands successfully even when the `StagingTargetPath` is missing.
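(Illustrative only, not part of the official doc text: the mount-reference walk the fix performs can be approximated from the node. The device path /dev/rbd0 below is a placeholder for whichever device backs the volume.)

$ rbd device list --format=json --device-type krbd   # find the mapped RBD device for the image
$ grep /dev/rbd0 /proc/self/mountinfo                # list all mount references for that device;
                                                     # the .../globalmount entry is the StagingTargetPath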
Clone Of:
Environment:
Last Closed: 2023-06-21 15:23:08 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-csi pull 3624 0 None Merged rbd: discover if StagingTargetPath in NodeExpandVolume exists 2023-03-29 12:04:41 UTC
Github red-hat-storage ocs-ci pull 7594 0 None Merged [GSS]Unable to expand ocs-storagecluster-ceph-rbd PVCs provisioned in Filesystem mode 2023-05-19 11:21:54 UTC

Description bmcmurra 2023-01-25 20:16:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Expanding a PVC backed by SC/ocs-storagecluster-ceph-rbd in Filesystem mode fails with the following error:

"Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC."

[ANALYSIS]

Ceph backend is healthy:

$ ceph -s
cluster:
  id:     3c97ec89-a4a9-4619-b4d8-59bd0c13dc33
  health: HEALTH_OK

All PVCs in the openshift-storage namespace are active and Bound:

[bmcmurra@supportshell-1 03417581]$ omg get pvc
NAME                                               STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                 AGE
db-noobaa-db-pg-0                                  Bound   pvc-297e4d1f-a352-452f-aadf-18efbc212728  50Gi      RWO           ocs-storagecluster-ceph-rbd  64d
ocs-deviceset-ocs-local-volume-set-0-data-0b76rl   Bound   local-pv-c076fe97                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-10qcz6k  Bound   local-pv-4b6d6f0d                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-11fkr7z  Bound   local-pv-8c57f7c                          3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-1256lgh  Bound   local-pv-722d547c                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-13rd4fg  Bound   local-pv-6dddb05f                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-14pddrm  Bound   local-pv-19eeaea6                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-1gp2s4   Bound   local-pv-f613332c                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-22jv78   Bound   local-pv-17352a70                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-3mg2pt   Bound   local-pv-8487cbe0                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-4wcsqm   Bound   local-pv-e9d291b2                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-5xcjgn   Bound   local-pv-3c087778                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-6mn4kr   Bound   local-pv-56140414                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-7njrz7   Bound   local-pv-f972ff25                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-8srxmp   Bound   local-pv-81f1396c                         3576Gi    RWO           ocs-local-volume-set         64d
ocs-deviceset-ocs-local-volume-set-0-data-962lx2   Bound   local-pv-d2c211a0                         3576Gi    RWO           ocs-local-volume-set         64d

- All ODF pods/operators are up and healthy


Version of all relevant components (if applicable):

OCP 4.10.39
ODF v4.10.9

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

Possibly, but I don't have a cluster prior to 4.10 to test with

Steps to Reproduce:
1. Create a PVC with the ocs-storagecluster-ceph-rbd storage class in Filesystem mode
2. Try to expand the PVC using the ODF console or by editing its YAML (see the example command after this list)
3. The PVC fails to expand and the event stream of the resource shows the following error: "Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC."
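For step 2, one way to request the expansion from the CLI (PVC name, namespace, and target size are placeholders):

$ oc patch pvc <pvc-name> -n <namespace> --type merge \
    -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'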

Actual results:

PVC fails to expand.

Expected results:

PVC succeeds in expanding.

Additional info:

Comment 10 khover 2023-01-27 19:46:57 UTC
Thanks Hemant,

That makes sense now regarding the behavior on my test cluster.

It seemed strange at first, as I was able to expand PVCs on the cephfs SC and on rbd in Block mode without issue, using the same "test-X" PVCs unattached to any pod workload.

Comment 15 Humble Chirammal 2023-01-31 06:54:43 UTC
> * Is this fix intended to be backported?
> 

>> Not sure about it, as this case exists on all ODF versions and we have a workaround. If there is an ask for it, this needs to be decided by the program.

IMO, this does not qualify for a backport request, so it is very unlikely to be considered.

Comment 24 Yuli Persky 2023-04-18 09:59:23 UTC
I tried the scenario described in Comment#22 (expanded the PVC a number of times, then changed the pod count on the deployment from 0 to 1, also a number of times) and verified that the expansion is successful.
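(For reference, the pod-count toggling can be done as follows; deployment name and namespace are placeholders:)

$ oc scale deployment <deployment-name> -n <namespace> --replicas=0
$ oc scale deployment <deployment-name> -n <namespace> --replicas=1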

However, I did not manage to reproduce a state in which the staging_target_path is missing in the NodeExpandVolume RPC call.

From the logs:

[ypersky@ypersky ocs-ci]$ oc logs csi-rbdplugin-n4xl8 -c csi-rbdplugin | grep NodeExpandVolume
I0418 05:34:46.538397   15996 utils.go:195] ID: 22 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
I0418 08:27:47.455884   15996 utils.go:195] ID: 53 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
I0418 08:27:47.629071   15996 utils.go:195] ID: 56 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
I0418 09:27:27.334022   15996 utils.go:195] ID: 170 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
I0418 09:27:27.529704   15996 utils.go:195] ID: 173 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
[ypersky@ypersky ocs-ci]$ oc logs csi-rbdplugin-n4xl8 -c csi-rbdplugin | grep "ID: 173 "
I0418 09:27:27.529704   15996 utils.go:195] ID: 173 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC call: /csi.v1.Node/NodeExpandVolume
I0418 09:27:27.529896   15996 utils.go:206] ID: 173 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC request: {"capacity_range":{"required_bytes":12884901888},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.rbd.csi.ceph.com/d299fc5020760f39dbd52b7b0bf3d33f93036abba4c82e664b28257c5c71ab52/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}},"volume_id":"0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e","volume_path":"/var/lib/kubelet/pods/293d17fb-5196-41de-a694-8a2034959b20/volumes/kubernetes.io~csi/pvc-639692f7-45ec-4e83-99bc-c3a5bf66c461/mount"}
I0418 09:27:27.592903   15996 cephcmds.go:105] ID: 173 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e command succeeded: rbd [device list --format=json --device-type krbd]
I0418 09:27:27.614409   15996 utils.go:212] ID: 173 Req-ID: 0001-0011-openshift-storage-0000000000000001-8a2bf746-5abe-4120-95f4-7ad0de0a855e GRPC response: {}
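For a given request ID (e.g. ID 173 above), the presence of the staging path in the request JSON can be checked with:

$ oc logs csi-rbdplugin-n4xl8 -c csi-rbdplugin | grep "ID: 173 " | grep -o '"staging_target_path":"[^"]*"'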

Comment 27 errata-xmlrpc 2023-06-21 15:23:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

