Bug 1964373
Summary: | [Tracker for BZ #1947673] [Ceph-mgr] 'rbd trash purge' gets hung on rbds with clones or snapshots and should skip to continue progress | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Mudit Agarwal <muagarwa> |
Component: | ceph | Assignee: | Scott Ostapovicz <sostapov> |
Status: | CLOSED ERRATA | QA Contact: | Jilju Joy <jijoy> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.7 | CC: | bniver, ceph-eng-bugs, ceph-qe-bugs, ebenahar, gfarnum, hmunjulu, idryomov, jijoy, madam, mamccoma, mhackett, mmuench, mmurthy, mrajanna, muagarwa, ndevos, nojha, ocs-bugs, owasserm, pdhange, r.martinez, sostapov, srangana, tnielsen, vereddy, ygupta |
Target Milestone: | --- | Keywords: | AutomationTriaged |
Target Release: | OCS 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 4.8.0-416.ci | Doc Type: | No Doc Update |
Doc Text: | Story Points: | --- | |
Clone Of: | 1947673 | Environment: | |
Last Closed: | 2021-08-03 18:16:39 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1947673, 1966687 | ||
Bug Blocks: | 1898565 |
Comment 7
Jilju Joy
2021-06-17 15:53:22 UTC
Logs collected after testing (Comment #7)
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-lso-jun14/jijoy-lso-jun14_20210614T080510/logs/testcases_1623940737/

> Removing images: 11% complete...2021-06-17 12:34:17.594 7fb0f77fe700 -1 librbd::image::PreRemoveRequest: 0x7fb0f0005ea0 check_image_watchers: image has watchers - not removing
> Removing images: 50% complete...2021-06-17 12:51:51.384 7fb0f77fe700 -1 librbd::image::PreRemoveRequest: 0x7fb0f00053c0 check_image_watchers: image has watchers - not removing
> Removing images: 84% complete...failed.
> rbd: some expired images could not be removed
> Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group, and were moved to the trash successfully.
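The "image has watchers - not removing" errors above mean the affected images were still open or mapped when the purge ran. One way to see which clients still hold an image open is to list the watchers on the image's header object, which for format-2 RBD images is named rbd_header.<image-id>. This is a sketch only; the pool name is the one used in the verification steps later in this report, and IMAGE_ID is a placeholder for an id taken from 'rbd trash ls', not a value from this bug.

POOL=ocs-storagecluster-cephblockpool
IMAGE_ID=<image-id-from-rbd-trash-ls>   # placeholder

# List clients that still hold a watch on the image's header object.
rados -p "$POOL" listwatchers "rbd_header.${IMAGE_ID}"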
For this one, I need to check.
@Jilju can you please provide the steps done to verify this one?
* Were any Kubernetes PVCs or snapshots present when you were doing the trash purge?
@Jilju please check Ilya's comment #c9
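Ilya's comment #c9 is not quoted in this report, but from the follow-up below it suggests removing the trash entries one at a time so that each blocked image reports its own error. A minimal sketch of that approach, assuming the pool name used in the verification steps below:

POOL=ocs-storagecluster-cephblockpool

# Attempt to remove each trash entry individually; entries that are still
# watched, have snapshots, or belong to a group fail with a specific error
# instead of the aggregated "some expired images could not be removed".
for id in $(rbd trash ls "$POOL" | awk '{print $1}'); do
    echo "--- $id"
    rbd trash rm "$id" -p "$POOL" || true
done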
Madhu,

Snapshots were present when the rbd trash purge command was issued. Parent PVC was not present for some of the snapshots. As mentioned in comment #16, as there were snapshots (with and without parent PVC) and PVC clones already present, I did not create any new PVCs/snapshots for this test. So some rbd images were left behind, which is expected. Ceph status was also clean. But the output of the rbd trash purge command (comment #7) shows that the command failed. The rbd trash purge command should show success after skipping the relevant images. Isn't this expected?

(In reply to Jilju Joy from comment #13)
> Madhu,
>
> Snapshots were present when rbd trash purge command was issued. Parent PVC
> was not present for some of the snapshots.
> As mentioned in comment #16, as there were snapshots (with and without

As mentioned in comment #7

Jilju, did you try deleting the images one by one after the failure as mentioned by Ilya in comment #9?

Ilya, I have one question. The error message says that the image won't be deleted if it is expired. What is the meaning of "expired" here?

Madhu, I agree with Jilju here because an image in trash which is a parent of a clone/snapshot can remain in trash till all the dependants for that image are deleted. This is the expected behavior; I think we paid this cost to have parity with Kubernetes expectations.

(In reply to Mudit Agarwal from comment #15)
> Jilju, did you try deleting the images one by one after the failure as
> mentioned by Ilya in comment #9.

Mudit, I did not try that. The cluster was destroyed before I noticed Ilya's comment. I was expecting the rbd trash purge command to succeed after skipping the images which still have watchers.

Looks like expected behavior as discussed in the above comments. Moving it back to ON_QA after an offline discussion with Madhu and Jilju.

Verified in version:
ocs-operator.v4.8.0-422.ci
OCP 4.8.0-0.nightly-2021-06-19-005119
ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Steps performed:

List the images present in trash.

sh-4.4# rbd trash ls ocs-storagecluster-cephblockpool
606527837db5 csi-vol-2515acb6-d344-11eb-93d0-0a580a80020f
6065338d897 csi-vol-d2a241c0-d344-11eb-93d0-0a580a80020f
60653a65abb csi-vol-15505254-d344-11eb-93d0-0a580a80020f
606550cf1430 csi-snap-cf581a31-d344-11eb-93d0-0a580a80020f
60656fd04246 csi-snap-e8a2e21d-d344-11eb-93d0-0a580a80020f
606570a61bd csi-vol-f5554124-d343-11eb-93d0-0a580a80020f
6065ab0ab2b2 csi-vol-ec061a6a-d344-11eb-93d0-0a580a80020f
6065b3d36baf csi-vol-0ac34e57-d344-11eb-93d0-0a580a80020f
sh-4.4#

Try to delete all images using the rbd trash purge command.

sh-4.4# rbd trash purge ocs-storagecluster-cephblockpool
Removing images: 50% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group and were moved to the trash successfully.
sh-4.4#

rbd trash purge did not hang but completed after skipping the relevant images.
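As a quick cross-check of the same outcome (the purge skipping blocked entries rather than hanging), the number of trash entries can be compared before and after the purge. A sketch, assuming the pool from the steps above; in this run the count went from 8 entries to 4:

POOL=ocs-storagecluster-cephblockpool

# Entry count before the purge (8 in the run above).
rbd trash ls "$POOL" | wc -l

# rbd trash purge exits non-zero when some entries are skipped, so do not abort on it.
rbd trash purge "$POOL" || true

# Entry count after the purge (4 in the run above); the command returns
# instead of hanging, which is the behaviour being verified.
rbd trash ls "$POOL" | wc -l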
List of the images remaining in trash.

sh-4.4# rbd trash ls ocs-storagecluster-cephblockpool
606527837db5 csi-vol-2515acb6-d344-11eb-93d0-0a580a80020f
60653a65abb csi-vol-15505254-d344-11eb-93d0-0a580a80020f
60656fd04246 csi-snap-e8a2e21d-d344-11eb-93d0-0a580a80020f
6065ab0ab2b2 csi-vol-ec061a6a-d344-11eb-93d0-0a580a80020f
sh-4.4#

Check ceph status.

sh-4.4# ceph status
  cluster:
    id:     339aeda9-4b2a-4733-b085-d33fa584ca6f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 18h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 11.57k objects, 39 GiB
    usage:   118 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client: 2.8 KiB/s rd, 180 KiB/s wr, 3 op/s rd, 5 op/s wr
sh-4.4#

Try to delete each of the remaining images.

sh-4.4# rbd trash rm 606527837db5 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 60653a65abb -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 60656fd04246 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#
sh-4.4# rbd trash rm 6065ab0ab2b2 -p ocs-storagecluster-cephblockpool
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
Removing image: 0% complete...failed.
sh-4.4#

Attempts to delete the remaining images from trash failed. This confirms that the rbd trash purge command did not skip any image that could have been deleted.

Check ceph status.

sh-4.4# ceph status
  cluster:
    id:     339aeda9-4b2a-4733-b085-d33fa584ca6f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 18h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 11.58k objects, 39 GiB
    usage:   118 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client: 5.8 KiB/s rd, 8.9 KiB/s wr, 7 op/s rd, 5 op/s wr
sh-4.4#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003