Bug 1964373
Summary: | [Tracker for BZ #1947673] [Ceph-mgr] 'rbd trash purge' gets hung on rbds with clones or snapshots and should skip to continue progress | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Mudit Agarwal <muagarwa> |
Component: | ceph | Assignee: | Scott Ostapovicz <sostapov> |
Status: | CLOSED ERRATA | QA Contact: | Jilju Joy <jijoy> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.7 | CC: | bniver, ceph-eng-bugs, ceph-qe-bugs, ebenahar, gfarnum, hmunjulu, idryomov, jijoy, madam, mamccoma, mhackett, mmuench, mmurthy, mrajanna, muagarwa, ndevos, nojha, ocs-bugs, owasserm, pdhange, r.martinez, sostapov, srangana, tnielsen, vereddy, ygupta |
Target Milestone: | --- | Keywords: | AutomationTriaged |
Target Release: | OCS 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 4.8.0-416.ci | Doc Type: | No Doc Update |
Doc Text: | Story Points: | --- | |
Clone Of: | 1947673 | Environment: | |
Last Closed: | 2021-08-03 18:16:39 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1947673, 1966687 | ||
Bug Blocks: | 1898565 |
Comment 7
Jilju Joy
2021-06-17 15:53:22 UTC
Logs collected after testing (Comment #7)
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-lso-jun14/jijoy-lso-jun14_20210614T080510/logs/testcases_1623940737/

> Removing images: 11% complete...2021-06-17 12:34:17.594 7fb0f77fe700 -1 librbd::image::PreRemoveRequest: 0x7fb0f0005ea0 check_image_watchers: image has watchers - not removing
> Removing images: 50% complete...2021-06-17 12:51:51.384 7fb0f77fe700 -1 librbd::image::PreRemoveRequest: 0x7fb0f00053c0 check_image_watchers: image has watchers - not removing
> Removing images: 84% complete...failed.
> rbd: some expired images could not be removed
> Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group, and were moved to the trash successfully.
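The "image has watchers - not removing" errors above mean the affected images were still open or mapped when the purge ran. One way to see which clients still hold an image open is to list the watchers on the image's header object, which for format-2 RBD images is named rbd_header.<image-id>. This is a sketch only; the pool name is the one used in the verification steps later in this report, and IMAGE_ID is a placeholder for an id taken from 'rbd trash ls', not a value from this bug.

POOL=ocs-storagecluster-cephblockpool
IMAGE_ID=<image-id-from-rbd-trash-ls>   # placeholder

# List clients that still hold a watch on the image's header object.
rados -p "$POOL" listwatchers "rbd_header.${IMAGE_ID}"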
For this one, I need to check.
@Jilju can you please provide the steps done to verify this one?
* Were any Kubernetes PVCs or snapshots present when you were doing the trash purge?
@Jilju please check Ilya's comment #c9
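Ilya's comment #c9 is not quoted in this report, but from the follow-up below it suggests removing the trash entries one at a time so that each blocked image reports its own error. A minimal sketch of that approach, assuming the pool name used in the verification steps below:

POOL=ocs-storagecluster-cephblockpool

# Attempt to remove each trash entry individually; entries that are still
# watched, have snapshots, or belong to a group fail with a specific error
# instead of the aggregated "some expired images could not be removed".
for id in $(rbd trash ls "$POOL" | awk '{print $1}'); do
    echo "--- $id"
    rbd trash rm "$id" -p "$POOL" || true
done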
Madhu,

Snapshots were present when the rbd trash purge command was issued. Parent PVC was not present for some of the snapshots. As mentioned in comment #16, as there were snapshots (with and without parent PVC) and PVC clones already present, I did not create any new PVCs/snapshots for this test. So some rbd images were left behind, which is expected. Ceph status was also clean. But the output of the rbd trash purge command (comment #7) shows that the command failed. The rbd trash purge command should show success after skipping the relevant images. Isn't this expected?

(In reply to Jilju Joy from comment #13)
> Madhu,
>
> Snapshots were present when rbd trash purge command was issued. Parent PVC
> was not present for some of the snapshots.
> As mentioned in comment #16, as there were snapshots (with and without

As mentioned in comment #7

Jilju, did you try deleting the images one by one after the failure as mentioned by Ilya in comment #9?

Ilya, I have one question. The error message says that the image won't be deleted if it is expired. What is the meaning of "expired" here?

Madhu, I agree with Jilju here because an image in trash which is a parent of a clone/snapshot can remain in trash till all the dependants for that image are deleted. This is the expected behavior; I think we paid this cost to have parity with Kubernetes expectations.

(In reply to Mudit Agarwal from comment #15)
> Jilju, did you try deleting the images one by one after the failure as
> mentioned by Ilya in comment #9.

Mudit, I did not try that. The cluster was destroyed before I noticed Ilya's comment. I was expecting the rbd trash purge command to succeed after skipping the images which still have watchers.

Looks like expected behavior as discussed in the above comments. Moving it back to ON_QA after an offline discussion with Madhu and Jilju.

Verified in version:
ocs-operator.v4.8.0-422.ci
OCP 4.8.0-0.nightly-2021-06-19-005119
ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Steps performed:

List the images present in trash.

sh-4.4# rbd trash ls ocs-storagecluster-cephblockpool
606527837db5 csi-vol-2515acb6-d344-11eb-93d0-0a580a80020f
6065338d897 csi-vol-d2a241c0-d344-11eb-93d0-0a580a80020f
60653a65abb csi-vol-15505254-d344-11eb-93d0-0a580a80020f
606550cf1430 csi-snap-cf581a31-d344-11eb-93d0-0a580a80020f
60656fd04246 csi-snap-e8a2e21d-d344-11eb-93d0-0a580a80020f
606570a61bd csi-vol-f5554124-d343-11eb-93d0-0a580a80020f
6065ab0ab2b2 csi-vol-ec061a6a-d344-11eb-93d0-0a580a80020f
6065b3d36baf csi-vol-0ac34e57-d344-11eb-93d0-0a580a80020f
sh-4.4#

Try to delete all images using the rbd trash purge command.

sh-4.4# rbd trash purge ocs-storagecluster-cephblockpool
Removing images: 50% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group and were moved to the trash successfully.
sh-4.4#

rbd trash purge did not hang but completed after skipping the relevant images.
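As a quick cross-check of the same outcome (the purge skipping blocked entries rather than hanging), the number of trash entries can be compared before and after the purge. A sketch, assuming the pool from the steps above; in this run the count went from 8 entries to 4:

POOL=ocs-storagecluster-cephblockpool

# Entry count before the purge (8 in the run above).
rbd trash ls "$POOL" | wc -l

# rbd trash purge exits non-zero when some entries are skipped, so do not abort on it.
rbd trash purge "$POOL" || true

# Entry count after the purge (4 in the run above); the command returns
# instead of hanging, which is the behaviour being verified.
rbd trash ls "$POOL" | wc -l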
List of the images remaining in trash.

sh-4.4# rbd trash ls ocs-storagecluster-cephblockpool
606527837db5 csi-vol-2515acb6-d344-11eb-93d0-0a580a80020f
60653a65abb csi-vol-15505254-d344-11eb-93d0-0a580a80020f
60656fd04246 csi-snap-e8a2e21d-d344-11eb-93d0-0a580a80020f
6065ab0ab2b2 csi-vol-ec061a6a-d344-11eb-93d0-0a580a80020f
sh-4.4#

Check ceph status.

sh-4.4# ceph status
  cluster:
    id:     339aeda9-4b2a-4733-b085-d33fa584ca6f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 18h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 11.57k objects, 39 GiB
    usage:   118 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client: 2.8 KiB/s rd, 180 KiB/s wr, 3 op/s rd, 5 op/s wr
sh-4.4#

Try to delete each of the remaining images.

sh-4.4# rbd trash rm 606527837db5 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 60653a65abb -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 60656fd04246 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 6065ab0ab2b2 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#

Attempts to delete the remaining images from trash failed. This confirms that the rbd trash purge command did not skip any image that could have been deleted.

Check ceph status.

sh-4.4# ceph status
  cluster:
    id:     339aeda9-4b2a-4733-b085-d33fa584ca6f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 18h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 11.58k objects, 39 GiB
    usage:   118 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client: 5.8 KiB/s rd, 8.9 KiB/s wr, 7 op/s rd, 5 op/s wr
sh-4.4#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003