Bug 1943137 - [Tracker for BZ #1945618] rbd: Storage is not reclaimed after persistentvolumeclaim and job that utilized it are deleted
Summary: [Tracker for BZ #1945618] rbd: Storage is not reclaimed after persistentvolumeclaim and job that utilized it are deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Ilya Dryomov
QA Contact: Joy John Pinto
URL:
Whiteboard:
Duplicates: 2011420
Depends On: 1945618 2150968
Blocks:
 
Reported: 2021-03-25 13:02 UTC by Filip Balák
Modified: 2023-08-09 16:37 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1945618
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2023:3742 0 None None None 2023-06-21 15:22:53 UTC

Description Filip Balák 2021-03-25 13:02:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
During test execution, 95% of max_avail of the Ceph pool is filled. This is achieved by creating a Kubernetes Job that fills a PV that uses the ocs-storagecluster-ceph-rbd StorageClass. At the end of the test the PVC and the Job are deleted, after which the storage should be reclaimed. In our current run this storage was not reclaimed.

Version of all relevant components (if applicable):
OCS 4.8.0-303.ci
OCP 4.8.0-0.nightly-2021-03-22-064955

If this is a regression, please provide more details to justify this:
This worked in OCS 4.7 and previous versions.

Steps to Reproduce:
1. Create a PVC on the ocs-storagecluster-ceph-rbd StorageClass. Its size should be 95% of max_avail of the Ceph pool.
2. Fill the PVC with data.
3. Delete the PVC.
4. Check ceph df (see the sketch below).

or run the test_rbd_capacity_workload_alerts test from the OCS QE test suite:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheus/test_capacity.py#L19
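
A minimal manual sketch of the same flow (the rook-ceph-tools pod name and the Job name are placeholders; the automated test uses an fio Job and a PVC named fio-target):

    # note max_avail of ocs-storagecluster-cephblockpool before the test
    oc -n openshift-storage rsh <rook-ceph-tools-pod> ceph df --format json-pretty
    # create a PVC on ocs-storagecluster-ceph-rbd sized ~95% of max_avail and fill it,
    # e.g. with an fio Job writing to the mounted volume
    oc delete job <fio-job> --wait=true
    oc delete pvc fio-target
    # the pool's used space should drop back once the Mgr finishes the trash-remove task
    oc -n openshift-storage rsh <rook-ceph-tools-pod> ceph df --format json-pretty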

Actual results:
Storage is not reclaimed.

Expected results:
Storage should be reclaimed.

Additional info:
Failed job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/284/
Test logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/ocs-ci-logs-1616421637/by_outcome/failed/tests/manage/monitoring/prometheus/test_capacity.py/test_rbd_capacity_workload_alerts/logs

Comment 3 Niels de Vos 2021-03-26 16:17:30 UTC
PVCs from the ocs-storagecluster-ceph-rbd StorageClass are deleted with the help of the Ceph Mgr, as deleting RBD images can take a non-trivial amount of time. So, Ceph-CSI moves the RBD image to the trash, and a task is scheduled with the Ceph Mgr to remove it (when that happens depends on the Ceph Mgr). That also means that there is a delay between the deletion of the PV (triggered by the PVC deletion) and the point where the space is actually available again.
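
In CLI terms, the deletion path amounts to roughly the following (a sketch only; the exact pool, image and image-id specs come from the provisioner logs quoted later in this bug):

    # 1. the RBD image backing the PV is moved to the pool's trash immediately
    rbd trash mv ocs-storagecluster-cephblockpool/csi-vol-<uuid>
    # 2. a background task is scheduled with the Ceph Mgr to purge it from the trash
    ceph rbd task add trash remove ocs-storagecluster-cephblockpool/<image-id>
    # pending or in-flight background deletions can be listed with
    ceph rbd task list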

The available logs from comment #0 make it very difficult to find the timestamp of the PVC deletion and the timestamp of the free-space check.

Can you please provide these additional details:
- timestamp of deletion
- timestamp of free-space-check
- ocs-must-gather, and the name of the PVC that was created/removed
- in case the name of the PVC is not unique for the test, provide sufficient details so that the matching PV can be identified

Comment 4 Filip Balák 2021-03-26 16:38:18 UTC
(In reply to Niels de Vos from comment #3)
> Can you please provide these additional details:
> - timestamp of deletion
2021-03-22 18:49:36,144
> - timestamp of free-space-check
It's done multiple times. You can look for the string "oc -n openshift-storage rsh rook-ceph-tools-6b8d89bc5b-hwsdv ceph df --format json-pretty".
A few lines below each call is the output of the response,
e.g. 2021-03-22 18:49:43,126 or 2021-03-22 18:57:47,373.
The first check at 2021-03-22 18:05:47,851 is done to calculate the size that should be utilized.
> - ocs-must-gather,
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/
> and the name of the PVC that was created/removed
fio-target
> - in case the name of the PVC is not unique for the test, provide sufficient
> details so that the matching PV can be identified
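
For reference, the free-space check comes down to reading max_avail for the block pool out of the ceph df JSON. A minimal sketch (the jq filter and field names are assumptions based on the usual ceph df --format json layout):

    oc -n openshift-storage rsh rook-ceph-tools-6b8d89bc5b-hwsdv \
        ceph df --format json \
        | jq '.pools[] | select(.name == "ocs-storagecluster-cephblockpool") | .stats.max_avail'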

Comment 5 Filip Balák 2021-03-29 10:24:54 UTC
It seems that this issue also affects OCS 4.7:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/317/

Versions:
OCS 4.7.0-307.ci
OCP 4.6.0-0.nightly-2021-03-24-095127

Comment 6 Mudit Agarwal 2021-03-29 11:09:36 UTC
If it exists in 4.7 then it's not a regression; can we please remove the keyword?
I guess the blocker flag was also set because of this keyword, but that can be reassessed once we have some analysis.

Comment 7 Filip Balák 2021-03-29 11:30:29 UTC
The first version with this error seems to be v4.7.0-294.ci. The issue was not present in v4.7.0-284.ci and previous versions. Regression keyword is valid.

Comment 8 Mudit Agarwal 2021-03-29 11:38:55 UTC
Thanks Filip

Comment 9 Niels de Vos 2021-03-29 15:14:51 UTC
After looking in detail through the logs, I can see the PVC creation and deletion done by the provisioner:

From http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-02dd451d97eeb95bf3dc3b31e1e6db1753d971fd4fff27d1118b08574c5a9cbd/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-f468f84bc-lttv2/csi-rbdplugin/csi-rbdplugin/logs/current.log

2021-03-22T18:49:36.730528988Z I0322 18:49:36.730467       1 utils.go:132] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC call: /csi.v1.Controller/DeleteVolume
2021-03-22T18:49:36.730559308Z I0322 18:49:36.730552       1 utils.go:133] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018"}
2021-03-22T18:49:36.733988507Z I0322 18:49:36.733933       1 omap.go:84] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.3b4064b3-8b39-11eb-8988-0a580a830018"): map[csi.imageid:5eb2a832d2fc csi.imagename:csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018 csi.volname:pvc-d4e8a2c4-c0e6-46ac-bdf9-5abb21db41ac csi.volume.owner:namespace-test-df23036b8ad3422c9c781af02]
2021-03-22T18:49:36.930303942Z I0322 18:49:36.930255       1 controllerserver.go:680] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 deleting image csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018
2021-03-22T18:49:36.930303942Z I0322 18:49:36.930293       1 rbd_util.go:452] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 rbd: delete csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018 using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
2021-03-22T18:49:37.036009850Z I0322 18:49:37.035938       1 rbd_util.go:423] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 executing [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner --keyfile=/tmp/csi/keys/keyfile-181560777 -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789] for image (csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018) using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
2021-03-22T18:49:37.422808973Z I0322 18:49:37.422725       1 cephcmds.go:59] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789]
2021-03-22T18:49:37.460384988Z I0322 18:49:37.460316       1 omap.go:118] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-d4e8a2c4-c0e6-46ac-bdf9-5abb21db41ac]
2021-03-22T18:49:37.460457720Z I0322 18:49:37.460442       1 utils.go:138] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC response: {}
2021-03-22T18:50:00.465548280Z I0322 18:50:00.465485       1 utils.go:132] ID: 440 GRPC call: /csi.v1.Identity/Probe


This part is key:

    command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner

The Ceph Mgr also accepted this command just fine:

From http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-02dd451d97eeb95bf3dc3b31e1e6db1753d971fd4fff27d1118b08574c5a9cbd/namespaces/openshift-storage/pods/rook-ceph-mgr-a-787c5b49cc-czxs7/mgr/mgr/logs/current.log

2021-03-22T18:49:37.376402090Z debug 2021-03-22 18:49:37.375 7f82daea4700  0 log_channel(audit) log [DBG] : from='client.21750 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2a832d2fc", "target": ["mgr", ""]}]: dispatch

However, the same Ceph Mgr logs contain errors:

2021-03-22T15:23:10.136860601Z debug 2021-03-22 15:23:10.136 7f82daea4700  0 log_channel(audit) log [DBG] : from='client.8910 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2b3c2024b", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.162841886Z debug 2021-03-22 15:23:10.161 7f82daea4700  0 log_channel(audit) log [DBG] : from='client.8913 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2da889c1e", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.172208614Z debug 2021-03-22 15:23:10.171 7f82e6d11700  0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash
2021-03-22T15:23:10.373028171Z debug 2021-03-22 15:23:10.372 7f82ddea7700  0 log_channel(cluster) log [DBG] : pgmap v2915: 272 pgs: 272 active+clean; 11 GiB data, 31 GiB used, 1.5 TiB / 1.5 TiB avail; 27 KiB/s rd, 124 KiB/s wr, 58 op/s
2021-03-22T15:23:10.647519823Z debug 2021-03-22 15:23:10.646 7f82daea4700  0 log_channel(audit) log [DBG] : from='client.8922 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2704434dd", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.676436136Z debug 2021-03-22 15:23:10.675 7f82e6d11700  0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash


The trace from the provisioner for "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2b3c2024b":

2021-03-22T15:23:09.679408834Z I0322 15:23:09.679351       1 utils.go:132] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC call: /csi.v1.Controller/DeleteVolume
2021-03-22T15:23:09.679454031Z I0322 15:23:09.679418       1 utils.go:133] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018"}
2021-03-22T15:23:09.682014874Z I0322 15:23:09.681976       1 omap.go:84] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.c1131ac4-8b20-11eb-8988-0a580a830018"): map[csi.imageid:5eb2da889c1e csi.imagename:csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018 csi.volname:pvc-965235f3-a296-45f8-b9b8-e78112651a56 csi.volume.owner:my-ripsaw]
..
2021-03-22T15:23:09.819020609Z I0322 15:23:09.818753       1 controllerserver.go:680] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 deleting image csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018
2021-03-22T15:23:09.819020609Z I0322 15:23:09.818993       1 rbd_util.go:452] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 rbd: delete csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018 using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
..
2021-03-22T15:23:09.845354812Z I0322 15:23:09.845289       1 rbd_util.go:423] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 executing [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2da889c1e --id csi-rbd-provisioner --keyfile=/tmp/csi/keys/keyfile-635250415 -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789] for image (csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018) using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
..
2021-03-22T15:23:10.185458110Z I0322 15:23:10.185422       1 cephcmds.go:59] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2da889c1e --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789]
2021-03-22T15:23:10.190826315Z I0322 15:23:10.190789       1 omap.go:118] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-965235f3-a296-45f8-b9b8-e78112651a56]
2021-03-22T15:23:10.190876823Z I0322 15:23:10.190865       1 utils.go:138] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC response: {}


No errors are reported here. The same goes for the other image_id_specs that should get deleted.

I cannot say why the Ceph Mgr fails to delete the image(s), or even which images (and hence which PVs/PVCs) cannot get deleted. This needs assistance from someone who understands the Ceph Mgr better.
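
For what it's worth, errno 39 is ENOTEMPTY. A sketch of commands (run from the rook-ceph-tools pod; the image id is taken from the failing task above) that could help narrow down which images are stuck in the trash and why:

    ceph rbd task list                                    # queued/pending Mgr deletion tasks
    rbd trash ls --long ocs-storagecluster-cephblockpool  # images still sitting in the trash
    # retry one removal by hand to see the exact error the Mgr ran into
    rbd trash rm ocs-storagecluster-cephblockpool/5eb2da889c1e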

Comment 16 Raz Tamir 2021-03-31 13:25:06 UTC
Thanks Filip,

Removing the regression flag for now

Comment 17 Mudit Agarwal 2021-05-24 11:52:04 UTC
The Ceph BZ is not targeted for RHCS 4.2z*.

Comment 18 Mudit Agarwal 2021-08-20 02:12:49 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1945618 is targeted for 5.1

Comment 20 Mudit Agarwal 2021-10-14 16:03:39 UTC
*** Bug 2011420 has been marked as a duplicate of this bug. ***

Comment 22 Mudit Agarwal 2022-01-27 03:01:14 UTC
Providing dev ack given that the ceph BZ is already approved for RHCS 5.1

Comment 25 Mudit Agarwal 2022-03-01 14:11:00 UTC
Not a blocker, moving it out of 4.10 because of dev freeze. Can be moved back in case the fix arrives in RHCS

Comment 49 Joy John Pinto 2023-06-19 10:21:12 UTC
Verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1945618#c0. "test_rbd_capacity_workload_alerts" passed without any errors.
OCP version: 4.13.0-0.nightly-2023-06-13-070743
ODF build: 4.13.0-218

Test log link: https://url.corp.redhat.com/6b44aba
Report Portal link : https://url.corp.redhat.com/0b54949

Comment 50 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

