Bug 1961647 - [RBD] Deleting RBD provisioner leader pod breaks thick provisioning
Summary: [RBD] Deleting RBD provisioner leader pod breaks thick provisioning
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: csi-driver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Niels de Vos
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-18 12:06 UTC by Jilju Joy
Modified: 2021-08-03 18:16 UTC
CC: 3 users

Fixed In Version: 4.8.0-407.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-03 18:16:11 UTC
Embargoed:




Links
Github ceph/ceph-csi pull 2101 (closed): Continue thick-provisioning of RBD images on CreateVolume restart - last updated 2021-06-02 08:55:10 UTC
Github openshift/ceph-csi pull 52 (closed): BUG 1961647: Continue thick-provisioning of RBD images on CreateVolume restart - last updated 2021-06-02 08:55:11 UTC
Red Hat Product Errata RHBA-2021:3003 - last updated 2021-08-03 18:16:26 UTC

Description Jilju Joy 2021-05-18 12:06:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
If the RBD provisioner leader pod is deleted while a thick-provisioned PVC is being created, the resulting volume will not be fully thick provisioned.
This was tested with a 20 GiB PVC. The PVC reached the Bound state, but the used size of the RBD image is only 984 MiB. This 984 MiB is likely the amount that had been thick provisioned before the csi-rbdplugin-provisioner leader pod was deleted. The provisioned size is 20 GiB.


Test case error showing the rbd du output for the image and the PV describe output:

E       AssertionError: PVC pvc-test-f854452cb4a64712beb2ce4a0ffd691 is not thick provisioned. Rbd image csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148 expected used size: 20GiB. Actual used size 984MiB.
E          Rbd du out: NAME                                         PROVISIONED USED csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148      20 GiB 984 MiB
E          PV describe :
E          Name:            pvc-487fb831-03b5-4bc6-8f67-72b5eee94a7c
E         Labels:          <none>
E         Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
E         Finalizers:      [kubernetes.io/pv-protection]
E         StorageClass:    ocs-storagecluster-ceph-rbd-thick
E         Status:          Bound
E         Claim:           namespace-test-d57cab19c5f14cd5a75ae33eb/pvc-test-f854452cb4a64712beb2ce4a0ffd691
E         Reclaim Policy:  Delete
E         Access Modes:    RWO
E         VolumeMode:      Filesystem
E         Capacity:        20Gi
E         Node Affinity:   <none>
E         Message:         
E         Source:
E             Type:              CSI (a Container Storage Interface (CSI) volume source)
E             Driver:            openshift-storage.rbd.csi.ceph.com
E             FSType:            ext4
E             VolumeHandle:      0001-0011-openshift-storage-0000000000000001-74a77a0a-b7c1-11eb-9543-0a580a830148
E             ReadOnly:          false
E             VolumeAttributes:      clusterID=openshift-storage
E                                    csi.storage.k8s.io/pv/name=pvc-487fb831-03b5-4bc6-8f67-72b5eee94a7c
E                                    csi.storage.k8s.io/pvc/name=pvc-test-f854452cb4a64712beb2ce4a0ffd691
E                                    csi.storage.k8s.io/pvc/namespace=namespace-test-d57cab19c5f14cd5a75ae33eb
E                                    imageFeatures=layering
E                                    imageFormat=2
E                                    imageName=csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148
E                                    journalPool=ocs-storagecluster-cephblockpool
E                                    pool=ocs-storagecluster-cephblockpool
E                                    storage.kubernetes.io/csiProvisionerIdentity=1621299316893-8081-openshift-storage.rbd.csi.ceph.com
E                                    thickProvision=true
E         Events:                <none>
E         
E       assert '984MiB' == '20GiB'
E         - 984MiB
E         + 20GiB

tests/manage/pv_services/test_delete_provisioner_pod_while_thick_provisioning.py:90: AssertionError


Test case:
tests/manage/pv_services/test_delete_provisioner_pod_while_thick_provisioning.py::TestDeleteProvisionerPodWhileThickProvisioning::test_delete_provisioner_pod_while_thick_provisioning

ocs-ci PR: https://github.com/red-hat-storage/ocs-ci/pull/4300


OCS and OCP must-gather logs : 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/failed_testcase_ocs_logs_1621332632/test_delete_provisioner_pod_while_thick_provisioning_ocs_logs/


Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/ocs-ci-logs-1621332632/by_outcome/failed/tests/manage/pv_services/test_delete_provisioner_pod_while_thick_provisioning.py/TestDeleteProvisionerPodWhileThickProvisioning/test_delete_provisioner_pod_while_thick_provisioning


===========================================================================

Version of all relevant components (if applicable):
OCS 4.8.0-394.ci
OCP 4.8.0-0.nightly-2021-05-15-141455
Ceph 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

rook_csi_provisioner	ose-csi-external-provisioner@sha256:0d1cab421c433c213d37043dd0dbaa6a2942ccf1d21d35afc32e35ce8216ddec


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the volume will not be thick provisioned.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
RBD thick provisioning is a new feature in OCS 4.8

=======================================================================

Steps to Reproduce:
1. Start creating a PVC of size 20 GiB. Use the "ocs-storagecluster-ceph-rbd-thick" storage class, which has thick provisioning enabled.
2. While step 1 is in progress, delete the csi-rbdplugin-provisioner leader pod (a sketch for identifying the current leader follows these steps).
3. Wait for the new csi-rbdplugin-provisioner pod to spin up.
4. Wait for the PVC to reach Bound state and check the used size of the corresponding rbd image.
Command:
rbd du -p ocs-storagecluster-cephblockpool csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148

where "ocs-storagecluster-cephblockpool" is the pool name and "csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148" is the image name


OR

Run the ocs-ci test case
tests/manage/pv_services/test_delete_provisioner_pod_while_thick_provisioning.py::TestDeleteProvisionerPodWhileThickProvisioning::test_delete_provisioner_pod_while_thick_provisioning

====================================================================

Actual results:
The used size of the rbd image is not equal to the provisioned size.

Expected results:
The used size of the rbd image should be equal to the provisioned size.

Additional info:

Comment 2 Niels de Vos 2021-05-26 07:09:47 UTC
If thick provisioning is aborted or restarted, the RBD image metadata will not contain the thick-provisioned key/value. Because the CreateVolume request has the thick-provision option set, the (missing) value in the RBD image metadata can be used to detect an incomplete allocation and continue it.
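
For illustration only, one way to inspect that metadata from the rook-ceph tools pod (pod, pool and image names taken from this report). The key name shown is an assumption based on the upstream fix and may differ between ceph-csi versions, so check the actual keys with "rbd image-meta list" first:

# List all metadata keys set on the image.
oc -n openshift-storage rsh rook-ceph-tools-56dc89c8c9-dw2tz \
  rbd image-meta list ocs-storagecluster-cephblockpool/csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148

# Assumed key name; its presence would mark a completed allocation, while a
# missing key on a thick-provisioned volume after a provisioner restart would
# indicate that the zero-fill still has to be resumed by the next CreateVolume.
oc -n openshift-storage rsh rook-ceph-tools-56dc89c8c9-dw2tz \
  rbd image-meta get ocs-storagecluster-cephblockpool/csi-vol-74a77a0a-b7c1-11eb-9543-0a580a830148 \
  rbd.csi.ceph.com/thick-provisioned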

Comment 3 Niels de Vos 2021-05-26 09:42:40 UTC
Upstream change posted at https://github.com/ceph/ceph-csi/pull/2101

Comment 5 Niels de Vos 2021-06-02 08:56:04 UTC
https://github.com/openshift/ceph-csi/pull/52 has been merged and will be included in the next build.

Comment 6 Jilju Joy 2021-06-08 12:57:44 UTC
Verified using the test case
tests/manage/pv_services/test_delete_provisioner_pod_while_thick_provisioning.py::TestDeleteProvisionerPodWhileThickProvisioning::test_delete_provisioner_pod_while_thick_provisioning, PR https://github.com/red-hat-storage/ocs-ci/pull/4300

Test case logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr4300-b822/jnk-pr4300-b822_20210607T052437/logs/


rbd du output from test case log:

2021-06-07 07:28:07,452 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n openshift-storage rsh rook-ceph-tools-56dc89c8c9-dw2tz rbd du -p ocs-storagecluster-cephblockpool csi-vol-8eb3c279-c761-11eb-a3b3-0a580a81020e
2021-06-07 07:28:18,042 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.499 - Command stdout: NAME                                         PROVISIONED USED   
csi-vol-8eb3c279-c761-11eb-a3b3-0a580a81020e      20 GiB 20 GiB 


Verified in version:
OCS operator	v4.8.0-409.ci
Cluster Version	4.8.0-0.nightly-2021-06-06-164529
Ceph Version	14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
rook_csi_provisioner	ose-csi-external-provisioner@sha256:611a895fdc5c9d3b1561cfa0eb01d67349985c2a2909f00c3b010a693667ff8a

Comment 9 errata-xmlrpc 2021-08-03 18:16:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

