Description of problem (please be as detailed as possible and provide log snippets):
=======================================================================
On a VMware-based (3m+3w,i+3w) cluster, the ceph health progress section seems to be stuck at removing RBD images from the trash.

sh-4.4# ceph -s
  cluster:
    id:     fdf2f77d-1201-427f-9ca9-db0fecd2da5c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 11d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 6h), 3 in (since 3w)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 20.37k objects, 77 GiB
    usage:   234 GiB used, 534 GiB / 768 GiB avail
    pgs:     176 active+clean

  io:
    client:   6.7 KiB/s rd, 162 KiB/s wr, 8 op/s rd, 1 op/s wr

  progress:
    Removing image ocs-storagecluster-cephblockpool/a10a894b726cc from trash
      [..............................]
    Removing image ocs-storagecluster-cephblockpool/a10a87ad46bc5 from trash
      [..............................]
    Removing image ocs-storagecluster-cephblockpool/a10a862e83ce1 from trash
      [..............................]

The progress section has been in the same state for a long time; the debug messages below have been logged in the ceph-mgr pod logs for the last 3 days.
debug 2020-11-17 13:02:54.211 7f44e544e700 -1 librbd::SnapshotRemoveRequest: 0x558a65235a20 should_complete: encountered error: (16) Device or resource busy
debug 2020-11-17 13:02:54.211 7f44e544e700 -1 librbd::image::PreRemoveRequest: 0x558a64e78f20 handle_remove_snapshot: failed to auto-prune snapshot 16: (16) Device or resource busy
debug 2020-11-17 13:02:54.215 7f44e32ca700  0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash

And the ocs-storagecluster is in phase "Progressing":

[tdesala@localhost vmware]$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   26d   Progressing              2020-10-22T10:43:08Z   4.6.0

Version of all relevant components (if applicable):
=================================================================================
Upgraded to OCS 4.6.0-rc2 from rc1 (from the logs it seems this was seen before the upgrade as well).

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
I didn't observe any functional impact, and the Ceph cluster is healthy from both the CLI and the UI.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?
Not sure of the exact reproducer steps.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
==================
Not sure of the exact reproducer steps. The last tests that were run were related to snapshots and clones (https://github.com/red-hat-storage/ocs-ci/pull/3199), and then the system was left idle for a few days.

Actual results:
===============
The ceph health progress section seems to be stuck at removing the RBD images from the trash.

Expected results:
================
The ceph health progress section tasks should complete without any issues/errors.
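Not part of the original report: a quick Python check (assuming Linux errno numbering) of what the numeric error codes in the ceph-mgr log above correspond to. librbd fails with (16) when auto-pruning the snapshot, and the rbd_support trash-remove task surfaces errno 39:

```python
import errno
import os

# Map the numeric errno values from the ceph-mgr log back to their
# symbolic POSIX names (Linux numbering assumed):
#   (16)       from librbd            -> EBUSY
#   [errno 39] from mgr[rbd_support]  -> ENOTEMPTY
for code in (16, 39):
    print(code, errno.errorcode[code], os.strerror(code))
```

EBUSY is consistent with a snapshot that still has a clone child referencing it, which is what blocks the trash removal from completing.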
Re-assigning to the CSI team for assistance: the cluster has a phantom PVC that hasn't been deleted and is keeping a chain of cloned images alive as a result. OCS does not have the offending PVC, nor does the CSI's "csi.volumes.default" RADOS object:

sh-4.4# rbd --pool ocs-storagecluster-cephblockpool info csi-vol-5358e637-19cf-11eb-8626-0a580a81020f
rbd image 'csi-vol-5358e637-19cf-11eb-8626-0a580a81020f':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: a10a81fd6fe1
        block_name_prefix: rbd_data.a10a81fd6fe1
        format: 2
        features: layering, operations
        op_features: clone-child
        flags:
        create_timestamp: Thu Oct 29 10:13:42 2020
        access_timestamp: Thu Oct 29 10:13:42 2020
        modify_timestamp: Thu Oct 29 10:13:42 2020
        parent: ocs-storagecluster-cephblockpool/csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp@45d05286-a7b9-401f-848e-13284bb3cc7d
        overlap: 10 GiB

sh-4.4# rbd --pool ocs-storagecluster-cephblockpool info csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp
rbd image 'csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 1
        id: a10a8646b0555
        block_name_prefix: rbd_data.a10a8646b0555
        format: 2
        features: layering, deep-flatten, operations
        op_features: clone-parent, clone-child, snap-trash
        flags:
        create_timestamp: Thu Oct 29 10:13:40 2020
        access_timestamp: Thu Oct 29 10:13:40 2020
        modify_timestamp: Thu Oct 29 10:13:40 2020
        parent: ocs-storagecluster-cephblockpool/csi-vol-04b80b34-19cf-11eb-8626-0a580a81020f@70e2a9ac-1d6e-4257-9fc0-cbf5c8cd460a (trash a10a87ad46bc5)
        overlap: 10 GiB

sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a87ad46bc5
rbd image 'csi-vol-04b80b34-19cf-11eb-8626-0a580a81020f':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 1
        id: a10a87ad46bc5
        block_name_prefix: rbd_data.a10a87ad46bc5
        format: 2
        features: layering, operations
        op_features: clone-parent, clone-child, snap-trash
        flags:
        create_timestamp: Thu Oct 29 10:10:49 2020
        access_timestamp: Thu Oct 29 10:10:49 2020
        modify_timestamp: Thu Oct 29 10:10:49 2020
        parent: ocs-storagecluster-cephblockpool/csi-snap-d174c636-19ce-11eb-8626-0a580a81020f@d812acc6-fe6b-4404-a8ac-d77478a4d3b8 (trash a10a894b726cc)
        overlap: 10 GiB

sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a894b726cc
rbd image 'csi-snap-d174c636-19ce-11eb-8626-0a580a81020f':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 1
        id: a10a894b726cc
        block_name_prefix: rbd_data.a10a894b726cc
        format: 2
        features: layering, deep-flatten, operations
        op_features: clone-parent, clone-child, snap-trash
        flags:
        create_timestamp: Thu Oct 29 10:09:23 2020
        access_timestamp: Thu Oct 29 10:09:23 2020
        modify_timestamp: Thu Oct 29 10:09:23 2020
        parent: ocs-storagecluster-cephblockpool/csi-vol-1b68f495-19ce-11eb-8626-0a580a81020f@b6293954-6e98-48a1-a443-4c6846d1e3f9 (trash a10a862e83ce1)
        overlap: 10 GiB

sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a862e83ce1
rbd image 'csi-vol-1b68f495-19ce-11eb-8626-0a580a81020f':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 1
        id: a10a862e83ce1
        block_name_prefix: rbd_data.a10a862e83ce1
        format: 2
        features: layering, operations
        op_features: clone-parent, snap-trash
        flags:
        create_timestamp: Thu Oct 29 10:04:17 2020
        access_timestamp: Thu Oct 29 10:04:17 2020
        modify_timestamp: Thu Oct 29 10:04:17 2020

# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.volume.5358e637-19cf-11eb-8626-0a580a81020f
csi.imageid
value (12 bytes) :
00000000  61 31 30 61 38 31 66 64  36 66 65 31              |a10a81fd6fe1|
0000000c

sh-4.4# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.snaps.default

sh-4.4# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.volumes.default
csi.volume.pvc-179d2dfe-aee4-4f39-8a08-114ace2b89ec
value (36 bytes) :
00000000  35 31 64 37 61 62 66 32  2d 31 34 35 34 2d 31 31  |51d7abf2-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
csi.volume.pvc-2315044e-fcb2-47d7-b0a0-c45e2cd94f27
value (36 bytes) :
00000000  35 61 32 33 35 30 64 35  2d 31 34 35 34 2d 31 31  |5a2350d5-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
csi.volume.pvc-887b6980-63cb-4cb0-bee6-603a68792bfa
value (36 bytes) :
00000000  35 39 62 37 36 63 63 30  2d 31 34 35 34 2d 31 31  |59b76cc0-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
csi.volume.pvc-aba362bf-5dcc-48b3-bd82-a902b4abaa4b
value (36 bytes) :
00000000  31 62 37 35 30 66 61 36  2d 31 34 35 34 2d 31 31  |1b750fa6-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
csi.volume.pvc-b3682e0e-5509-4c3b-aeeb-424e84785cf9
value (36 bytes) :
00000000  35 31 61 37 34 30 31 32  2d 31 34 35 34 2d 31 31  |51a74012-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
csi.volume.pvc-db10fc9f-45c7-41e4-8768-7ee189f56d25
value (36 bytes) :
00000000  35 39 66 38 65 30 62 30  2d 31 34 35 34 2d 31 31  |59f8e0b0-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024
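For readability (not in the original comment): the "value (36 bytes)" omap dumps above are just ASCII UUID strings. A minimal sketch decoding the first entry (csi.volume.pvc-179d2dfe-...) from its hex bytes:

```python
# Hex bytes copied from the first csi.volumes.default omap value above.
first_value_hex = (
    "35 31 64 37 61 62 66 32 2d 31 34 35 34 2d 31 31"
    " 65 62 2d 61 33 63 37 2d 30 61 35 38 30 61 38 30"
    " 30 34 30 36"
)
# The omap value is the ASCII UUID of the backing CSI volume.
volume_uuid = bytes.fromhex(first_value_hex.replace(" ", "")).decode("ascii")
print(volume_uuid)  # 51d7abf2-1454-11eb-a3c7-0a580a800406
```

None of the six UUIDs stored in csi.volumes.default is 5358e637-19cf-..., consistent with the phantom volume being absent from the CSI journal.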
These images exist in the trash only; it looks like a corner case. Not a blocker at the moment, so moving it out of 4.6. Will continue the investigation in 4.7.
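A sketch (not part of the original comments) of the clone chain reconstructed from the "rbd info" output in the earlier comment. The image names and ids are copied from that output; the traversal helper itself is hypothetical:

```python
# image name -> (image id, parent image id or None), from the rbd info output
images = {
    "csi-vol-5358e637-19cf-11eb-8626-0a580a81020f":      ("a10a81fd6fe1",  "a10a8646b0555"),
    "csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp": ("a10a8646b0555", "a10a87ad46bc5"),
    "csi-vol-04b80b34-19cf-11eb-8626-0a580a81020f":      ("a10a87ad46bc5", "a10a894b726cc"),  # in trash
    "csi-snap-d174c636-19ce-11eb-8626-0a580a81020f":     ("a10a894b726cc", "a10a862e83ce1"),  # in trash
    "csi-vol-1b68f495-19ce-11eb-8626-0a580a81020f":      ("a10a862e83ce1", None),             # in trash, chain root
}
name_by_id = {img_id: name for name, (img_id, _) in images.items()}

def chain_from(name):
    """Follow parent pointers from a leaf image up to the chain root."""
    chain = [name]
    while (parent_id := images[chain[-1]][1]) is not None:
        chain.append(name_by_id[parent_id])
    return chain

print(" -> ".join(chain_from("csi-vol-5358e637-19cf-11eb-8626-0a580a81020f")))
```

As long as the leaf image (whose PVC was never deleted) exists, each trashed ancestor still has a clone child referencing its snapshot, matching the EBUSY / "failed to auto-prune snapshot" errors from librbd.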
Unable to reproduce; moving it to 4.8 while we keep trying to reproduce the issue.