Description of problem (please be as detailed as possible and provide log snippets):

During test execution, 95% of max_avail of the Ceph pool is filled. This is achieved by creating a Kubernetes job that fills a PV backed by the ocs-storagecluster-ceph-rbd StorageClass. At the end of the test, the PVC and the job are deleted, after which the storage should be reclaimed. In our current run this storage was not reclaimed.

Version of all relevant components (if applicable):
OCS 4.8.0-303.ci
OCP 4.8.0-0.nightly-2021-03-22-064955

If this is a regression, please provide more details to justify this:
This worked in OCS 4.7 and previous versions.

Steps to Reproduce:
1. Create a PVC on the ocs-storagecluster-ceph-rbd StorageClass. Its size should be 95% of max_avail of the Ceph pool.
2. Fill the PVC with data.
3. Delete the PVC.
4. Check ceph df, or run the test_rbd_capacity_workload_alerts test from the OCS QE test suite: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheus/test_capacity.py#L19

Actual results:
Storage is not reclaimed.

Expected results:
Storage should be reclaimed.

Additional info:
Failed job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/284/
Test logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/ocs-ci-logs-1616421637/by_outcome/failed/tests/manage/monitoring/prometheus/test_capacity.py/test_rbd_capacity_workload_alerts/logs
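To illustrate steps 1 and 4, here is a minimal Python sketch of how a test can derive the PVC size from the pool's max_avail. The helper name and the trimmed sample data are hypothetical; only the pools[].stats.max_avail layout follows the actual "ceph df --format json" output.

```python
import json

def pvc_size_gib(ceph_df_json, pool="ocs-storagecluster-cephblockpool", ratio=0.95):
    """Return the PVC size (GiB, rounded down) that fills `ratio` of the
    pool's max_avail, as reported by `ceph df --format json`."""
    df = json.loads(ceph_df_json)
    pool_stats = next(p["stats"] for p in df["pools"] if p["name"] == pool)
    return int(pool_stats["max_avail"] * ratio / 2**30)

# Trimmed, illustrative sample of `ceph df --format json` output:
sample = json.dumps({
    "pools": [
        {"name": "ocs-storagecluster-cephblockpool",
         "stats": {"max_avail": 512 * 2**30, "bytes_used": 0}},
    ]
})
print(pvc_size_gib(sample))  # 95% of 512 GiB -> 486
```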
PVCs from the ocs-storagecluster-ceph-rbd StorageClass are deleted through the Ceph Mgr, because deleting RBD images can take a non-trivial amount of time. Ceph-CSI moves the RBD image to the trash, and a task is scheduled with the Ceph Mgr to remove it (when that happens depends on the Ceph Mgr). That also means there is a delay between the deletion of the PV (triggered by the PVC deletion) and the point where the space actually becomes available again.

The available logs from comment #0 make it very difficult to find the timestamps of the deletion of the PVC and of the free-space check. Can you please provide these additional details:
- timestamp of the deletion
- timestamp of the free-space check
- ocs-must-gather, and the name of the PVC that was created/removed
- in case the name of the PVC is not unique for the test, provide sufficient details so that the matching PV can be identified
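Because of the trash-based deletion described above, a capacity test has to poll rather than check free space once right after the PVC is gone. A minimal sketch of such a wait loop, assuming the caller supplies a get_pool_used() callable that reads used bytes (e.g. from "ceph df"); the helper and its 5% tolerance are illustrative, not part of ocs-ci:

```python
import time

def wait_for_reclaim(get_pool_used, baseline, timeout=600, interval=30):
    """Poll pool usage (bytes) until it drops back near `baseline`.

    Returns True once usage is within 5% of baseline, False on timeout.
    The delay covers the gap between PV deletion and the Ceph Mgr
    actually removing the RBD image from the trash.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_pool_used() <= baseline * 1.05:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated readings: usage stays high for two checks, then the Mgr's
# trash-remove task completes and usage drops back to the baseline.
readings = iter([100 * 2**30, 100 * 2**30, 10 * 2**30])
print(wait_for_reclaim(lambda: next(readings),
                       baseline=10 * 2**30, timeout=5, interval=0))  # True
```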
(In reply to Niels de Vos from comment #3)
> Can you please provide these additional details:
> - timestamp of deletion

2021-03-22 18:49:36,144

> - timestamp of free-space-check

It's done multiple times. You can look for the string "oc -n openshift-storage rsh rook-ceph-tools-6b8d89bc5b-hwsdv ceph df --format json-pretty". A few lines below this call is the response output, e.g. at 2021-03-22 18:49:43,126 or 2021-03-22 18:57:47,373. The first check, at 2021-03-22 18:05:47,851, is done to calculate the size that should be utilized.

> - ocs-must-gather,

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/

> and the name of the PVC that was created/removed

fio-target

> - in case the name of the PVC is not unique for the test, provide sufficient
> details so that the matching PV can be identified
It seems that this issue also affects OCS 4.7:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/317/

Versions:
OCS 4.7.0-307.ci
OCP 4.6.0-0.nightly-2021-03-24-095127
If it exists in 4.7 then it's not a regression; can we please remove the keyword? I guess the blocker flag was also set because of this keyword, but that can be assessed once we have some analysis.
The first version with this error seems to be v4.7.0-294.ci; the issue was not present in v4.7.0-284.ci and earlier versions, so the Regression keyword is valid.
Thanks Filip
After looking in detail through the logs, I can see the PVC deletion done by the provisioner.

From http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-02dd451d97eeb95bf3dc3b31e1e6db1753d971fd4fff27d1118b08574c5a9cbd/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-f468f84bc-lttv2/csi-rbdplugin/csi-rbdplugin/logs/current.log

2021-03-22T18:49:36.730528988Z I0322 18:49:36.730467 1 utils.go:132] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC call: /csi.v1.Controller/DeleteVolume
2021-03-22T18:49:36.730559308Z I0322 18:49:36.730552 1 utils.go:133] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018"}
2021-03-22T18:49:36.733988507Z I0322 18:49:36.733933 1 omap.go:84] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.3b4064b3-8b39-11eb-8988-0a580a830018"): map[csi.imageid:5eb2a832d2fc csi.imagename:csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018 csi.volname:pvc-d4e8a2c4-c0e6-46ac-bdf9-5abb21db41ac csi.volume.owner:namespace-test-df23036b8ad3422c9c781af02]
2021-03-22T18:49:36.930303942Z I0322 18:49:36.930255 1 controllerserver.go:680] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 deleting image csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018
2021-03-22T18:49:36.930303942Z I0322 18:49:36.930293 1 rbd_util.go:452] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 rbd: delete csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018 using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
2021-03-22T18:49:37.036009850Z I0322 18:49:37.035938 1 rbd_util.go:423] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 executing [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner --keyfile=/tmp/csi/keys/keyfile-181560777 -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789] for image (csi-vol-3b4064b3-8b39-11eb-8988-0a580a830018) using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
2021-03-22T18:49:37.422808973Z I0322 18:49:37.422725 1 cephcmds.go:59] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789]
2021-03-22T18:49:37.460384988Z I0322 18:49:37.460316 1 omap.go:118] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-d4e8a2c4-c0e6-46ac-bdf9-5abb21db41ac]
2021-03-22T18:49:37.460457720Z I0322 18:49:37.460442 1 utils.go:138] ID: 439 Req-ID: 0001-0011-openshift-storage-0000000000000001-3b4064b3-8b39-11eb-8988-0a580a830018 GRPC response: {}
2021-03-22T18:50:00.465548280Z I0322 18:50:00.465485 1 utils.go:132] ID: 440 GRPC call: /csi.v1.Identity/Probe

This part is key:

  command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2a832d2fc --id csi-rbd-provisioner

The Ceph Mgr also accepted this command just fine.

From http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j010vu1cs33-t2/j010vu1cs33-t2_20210322T124736/logs/failed_testcase_ocs_logs_1616421637/test_rbd_capacity_workload_alerts_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-02dd451d97eeb95bf3dc3b31e1e6db1753d971fd4fff27d1118b08574c5a9cbd/namespaces/openshift-storage/pods/rook-ceph-mgr-a-787c5b49cc-czxs7/mgr/mgr/logs/current.log

2021-03-22T18:49:37.376402090Z debug 2021-03-22 18:49:37.375 7f82daea4700 0 log_channel(audit) log [DBG] : from='client.21750 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2a832d2fc", "target": ["mgr", ""]}]: dispatch

However, the same Ceph Mgr logs contain errors:

2021-03-22T15:23:10.136860601Z debug 2021-03-22 15:23:10.136 7f82daea4700 0 log_channel(audit) log [DBG] : from='client.8910 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2b3c2024b", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.162841886Z debug 2021-03-22 15:23:10.161 7f82daea4700 0 log_channel(audit) log [DBG] : from='client.8913 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2da889c1e", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.172208614Z debug 2021-03-22 15:23:10.171 7f82e6d11700 0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash
2021-03-22T15:23:10.373028171Z debug 2021-03-22 15:23:10.372 7f82ddea7700 0 log_channel(cluster) log [DBG] : pgmap v2915: 272 pgs: 272 active+clean; 11 GiB data, 31 GiB used, 1.5 TiB / 1.5 TiB avail; 27 KiB/s rd, 124 KiB/s wr, 58 op/s
2021-03-22T15:23:10.647519823Z debug 2021-03-22 15:23:10.646 7f82daea4700 0 log_channel(audit) log [DBG] : from='client.8922 -' entity='client.csi-rbd-provisioner' cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2704434dd", "target": ["mgr", ""]}]: dispatch
2021-03-22T15:23:10.676436136Z debug 2021-03-22 15:23:10.675 7f82e6d11700 0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash

The trace from the provisioner for "image_id_spec": "ocs-storagecluster-cephblockpool/5eb2da889c1e":

2021-03-22T15:23:09.679408834Z I0322 15:23:09.679351 1 utils.go:132] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC call: /csi.v1.Controller/DeleteVolume
2021-03-22T15:23:09.679454031Z I0322 15:23:09.679418 1 utils.go:133] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018"}
2021-03-22T15:23:09.682014874Z I0322 15:23:09.681976 1 omap.go:84] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.c1131ac4-8b20-11eb-8988-0a580a830018"): map[csi.imageid:5eb2da889c1e csi.imagename:csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018 csi.volname:pvc-965235f3-a296-45f8-b9b8-e78112651a56 csi.volume.owner:my-ripsaw]
..
2021-03-22T15:23:09.819020609Z I0322 15:23:09.818753 1 controllerserver.go:680] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 deleting image csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018
2021-03-22T15:23:09.819020609Z I0322 15:23:09.818993 1 rbd_util.go:452] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 rbd: delete csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018 using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
..
2021-03-22T15:23:09.845354812Z I0322 15:23:09.845289 1 rbd_util.go:423] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 executing [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2da889c1e --id csi-rbd-provisioner --keyfile=/tmp/csi/keys/keyfile-635250415 -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789] for image (csi-vol-c1131ac4-8b20-11eb-8988-0a580a830018) using mon 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789, pool ocs-storagecluster-cephblockpool
..
2021-03-22T15:23:10.185458110Z I0322 15:23:10.185422 1 cephcmds.go:59] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 command succeeded: ceph [rbd task add trash remove ocs-storagecluster-cephblockpool/5eb2da889c1e --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.185.31:6789,172.30.230.255:6789,172.30.57.115:6789]
2021-03-22T15:23:10.190826315Z I0322 15:23:10.190789 1 omap.go:118] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-965235f3-a296-45f8-b9b8-e78112651a56]
2021-03-22T15:23:10.190876823Z I0322 15:23:10.190865 1 utils.go:138] ID: 148 Req-ID: 0001-0011-openshift-storage-0000000000000001-c1131ac4-8b20-11eb-8988-0a580a830018 GRPC response: {}

No errors are reported here, and the same holds for the other image_id_specs that should get deleted. I cannot say why the Ceph Mgr fails to delete the image(s), or even which images (and hence which PV/PVC) cannot get deleted. This needs some assistance from someone that understands the Ceph Mgr better.
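To narrow down which trash-remove tasks failed, the dispatch and execute_task lines above can be correlated mechanically. A hypothetical triage sketch, not part of any must-gather tooling: the regexes follow the quoted log format, the nearest-preceding-dispatch pairing is only a heuristic (the Mgr runs these tasks asynchronously, which is exactly why the log alone does not prove which image failed), and errno 39 is ENOTEMPTY on Linux.

```python
import re

# Match the audit dispatch line for a trash-remove task and the
# rbd_support failure line, as seen in the Ceph Mgr log above.
DISPATCH = re.compile(r'"rbd task add trash remove", "image_id_spec": "([^"]+)"')
FAILURE = re.compile(r'execute_task: \[errno (\d+)\]')

def failed_trash_removes(log_lines):
    """Pair each execute_task failure with the most recent preceding
    trash-remove dispatch; returns a list of (image_id_spec, errno)."""
    last_spec, failures = None, []
    for line in log_lines:
        m = DISPATCH.search(line)
        if m:
            last_spec = m.group(1)
            continue
        m = FAILURE.search(line)
        if m and last_spec:
            failures.append((last_spec, int(m.group(1))))
    return failures

# Two lines trimmed from the mgr log quoted above:
sample = [
    'cmd=[{"prefix": "rbd task add trash remove", "image_id_spec": '
    '"ocs-storagecluster-cephblockpool/5eb2da889c1e", "target": ["mgr", ""]}]: dispatch',
    'mgr[rbd_support] execute_task: [errno 39] error deleting image from trash',
]
print(failed_trash_removes(sample))
```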
Thanks Filip, Removing the regression flag for now
Ceph BZ not targeted for RHCS4.2z*
https://bugzilla.redhat.com/show_bug.cgi?id=1945618 is targeted for 5.1
*** Bug 2011420 has been marked as a duplicate of this bug. ***
Providing dev ack given that the ceph BZ is already approved for RHCS 5.1
Not a blocker, moving it out of 4.10 because of dev freeze. Can be moved back in case the fix arrives in RHCS
Verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1945618#c0. "test_rbd_capacity_workload_alerts" passed without any errors.

OCP version: 4.13.0-0.nightly-2023-06-13-070743
ODF build: 4.13.0-218

Test log link: https://url.corp.redhat.com/6b44aba
Report Portal link: https://url.corp.redhat.com/0b54949
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742