Bug 1898565 - Ceph health progress section seems to be stuck at removing the rbd image from trash
Summary: Ceph health progress section seems to be stuck at removing the rbd image from trash
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mudit Agarwal
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1899566 1947673 1964373 1966687
Blocks:
 
Reported: 2020-11-17 14:26 UTC by Prasad Desala
Modified: 2023-08-09 16:37 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-12 01:04:18 UTC
Embargoed:



Description Prasad Desala 2020-11-17 14:26:22 UTC
Description of problem (please be detailed as possible and provide log
snippets):
=======================================================================
On a VMware-based (3m+3w,i+3w) cluster, the Ceph health progress section seems to be stuck at removing RBD images from trash.

sh-4.4# ceph -s
  cluster:
    id:     fdf2f77d-1201-427f-9ca9-db0fecd2da5c
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 11d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 6h), 3 in (since 3w)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 176 pgs
    objects: 20.37k objects, 77 GiB
    usage:   234 GiB used, 534 GiB / 768 GiB avail
    pgs:     176 active+clean
 
  io:
    client:   6.7 KiB/s rd, 162 KiB/s wr, 8 op/s rd, 1 op/s wr
 
  progress:
    Removing image ocs-storagecluster-cephblockpool/a10a894b726cc from trash
      [..............................]
    Removing image ocs-storagecluster-cephblockpool/a10a87ad46bc5 from trash
      [..............................]
    Removing image ocs-storagecluster-cephblockpool/a10a862e83ce1 from trash
      [..............................]

The progress section has been in the same state for a long time; the debug messages below have been logged in the ceph-mgr pod logs for the last 3 days.

debug 2020-11-17 13:02:54.211 7f44e544e700 -1 librbd::SnapshotRemoveRequest: 0x558a65235a20 should_complete: encountered error: (16) Device or resource busy
debug 2020-11-17 13:02:54.211 7f44e544e700 -1 librbd::image::PreRemoveRequest: 0x558a64e78f20 handle_remove_snapshot: failed to auto-prune snapshot 16: (16) Device or resource busy
debug 2020-11-17 13:02:54.215 7f44e32ca700  0 mgr[rbd_support] execute_task: [errno 39] error deleting image from trash
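
The errors suggest the mgr cannot auto-prune a snapshot of the trashed images because something still depends on it, so the trash removal never completes. A rough way to see what is holding them (a generic check, with the pool name and one image ID taken from the progress output above) is to list the trash and the snapshots of one of the stuck images from the rook-ceph toolbox:

sh-4.4# rbd trash ls --pool ocs-storagecluster-cephblockpool
sh-4.4# rbd snap ls --pool ocs-storagecluster-cephblockpool --image-id a10a894b726cc --all

The --all flag should also list snapshots that already sit in the snapshot trash namespace.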

Also, the ocs-storagecluster resource is stuck in the "Progressing" phase:
[tdesala@localhost vmware]$ oc get storagecluster 
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   26d   Progressing              2020-10-22T10:43:08Z   4.6.0
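
The condition keeping the StorageCluster in Progressing should be visible in its status conditions and in the underlying CephCluster resource. A generic way to check (assuming the default openshift-storage namespace; not captured here):

$ oc describe storagecluster ocs-storagecluster -n openshift-storage    # status conditions at the bottom
$ oc get cephcluster -n openshift-storage                               # rook-ceph phase and health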


Version of all relevant components (if applicable):
=================================================================================
Upgraded to OCS 4.6.0-rc2 from rc1 (from the logs, it seems this issue was present before the upgrade as well).

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
I didn't observe any functional impact, and the Ceph cluster is healthy from both the CLI and the UI.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Not sure of the exact reproducer steps.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
==================
Not sure of the exact reproducer steps. The last tests that were run were related to snapshots and clones (https://github.com/red-hat-storage/ocs-ci/pull/3199), and then the system was left idle for a few days.

Actual results:
===============
The Ceph health progress section seems to be stuck at removing RBD images from trash.

Expected results:
================
The Ceph health progress section tasks should complete without any issues or errors.

Comment 10 Jason Dillaman 2020-11-19 16:13:06 UTC
Re-assigning to the CSI team for assistance: the cluster has a phantom PVC that was never deleted and, as a result, is keeping a chain of cloned images alive. OCS does not have the offending PVC, and the CSI's "csi.volumes.default" RADOS object does not list it either:

sh-4.4# rbd --pool ocs-storagecluster-cephblockpool  info csi-vol-5358e637-19cf-11eb-8626-0a580a81020f     
rbd image 'csi-vol-5358e637-19cf-11eb-8626-0a580a81020f':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: a10a81fd6fe1
	block_name_prefix: rbd_data.a10a81fd6fe1
	format: 2
	features: layering, operations
	op_features: clone-child
	flags: 
	create_timestamp: Thu Oct 29 10:13:42 2020
	access_timestamp: Thu Oct 29 10:13:42 2020
	modify_timestamp: Thu Oct 29 10:13:42 2020
	parent: ocs-storagecluster-cephblockpool/csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp@45d05286-a7b9-401f-848e-13284bb3cc7d
	overlap: 10 GiB
sh-4.4# rbd --pool ocs-storagecluster-cephblockpool  info csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp
rbd image 'csi-vol-5358e637-19cf-11eb-8626-0a580a81020f-temp':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 1
	id: a10a8646b0555
	block_name_prefix: rbd_data.a10a8646b0555
	format: 2
	features: layering, deep-flatten, operations
	op_features: clone-parent, clone-child, snap-trash
	flags: 
	create_timestamp: Thu Oct 29 10:13:40 2020
	access_timestamp: Thu Oct 29 10:13:40 2020
	modify_timestamp: Thu Oct 29 10:13:40 2020
	parent: ocs-storagecluster-cephblockpool/csi-vol-04b80b34-19cf-11eb-8626-0a580a81020f@70e2a9ac-1d6e-4257-9fc0-cbf5c8cd460a (trash a10a87ad46bc5)
	overlap: 10 GiB
sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a87ad46bc5
rbd image 'csi-vol-04b80b34-19cf-11eb-8626-0a580a81020f':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 1
	id: a10a87ad46bc5
	block_name_prefix: rbd_data.a10a87ad46bc5
	format: 2
	features: layering, operations
	op_features: clone-parent, clone-child, snap-trash
	flags: 
	create_timestamp: Thu Oct 29 10:10:49 2020
	access_timestamp: Thu Oct 29 10:10:49 2020
	modify_timestamp: Thu Oct 29 10:10:49 2020
	parent: ocs-storagecluster-cephblockpool/csi-snap-d174c636-19ce-11eb-8626-0a580a81020f@d812acc6-fe6b-4404-a8ac-d77478a4d3b8 (trash a10a894b726cc)
	overlap: 10 GiB
sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a894b726cc
rbd image 'csi-snap-d174c636-19ce-11eb-8626-0a580a81020f':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 1
	id: a10a894b726cc
	block_name_prefix: rbd_data.a10a894b726cc
	format: 2
	features: layering, deep-flatten, operations
	op_features: clone-parent, clone-child, snap-trash
	flags: 
	create_timestamp: Thu Oct 29 10:09:23 2020
	access_timestamp: Thu Oct 29 10:09:23 2020
	modify_timestamp: Thu Oct 29 10:09:23 2020
	parent: ocs-storagecluster-cephblockpool/csi-vol-1b68f495-19ce-11eb-8626-0a580a81020f@b6293954-6e98-48a1-a443-4c6846d1e3f9 (trash a10a862e83ce1)
	overlap: 10 GiB
sh-4.4# rbd info --pool ocs-storagecluster-cephblockpool --image-id a10a862e83ce1
rbd image 'csi-vol-1b68f495-19ce-11eb-8626-0a580a81020f':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 1
	id: a10a862e83ce1
	block_name_prefix: rbd_data.a10a862e83ce1
	format: 2
	features: layering, operations
	op_features: clone-parent, snap-trash
	flags: 
	create_timestamp: Thu Oct 29 10:04:17 2020
	access_timestamp: Thu Oct 29 10:04:17 2020
	modify_timestamp: Thu Oct 29 10:04:17 2020


# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.volume.5358e637-19cf-11eb-8626-0a580a81020f
csi.imageid
value (12 bytes) :
00000000  61 31 30 61 38 31 66 64  36 66 65 31              |a10a81fd6fe1|
0000000c

sh-4.4# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.snaps.default
sh-4.4# rados --pool ocs-storagecluster-cephblockpool listomapvals csi.volumes.default
csi.volume.pvc-179d2dfe-aee4-4f39-8a08-114ace2b89ec
value (36 bytes) :
00000000  35 31 64 37 61 62 66 32  2d 31 34 35 34 2d 31 31  |51d7abf2-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

csi.volume.pvc-2315044e-fcb2-47d7-b0a0-c45e2cd94f27
value (36 bytes) :
00000000  35 61 32 33 35 30 64 35  2d 31 34 35 34 2d 31 31  |5a2350d5-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

csi.volume.pvc-887b6980-63cb-4cb0-bee6-603a68792bfa
value (36 bytes) :
00000000  35 39 62 37 36 63 63 30  2d 31 34 35 34 2d 31 31  |59b76cc0-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

csi.volume.pvc-aba362bf-5dcc-48b3-bd82-a902b4abaa4b
value (36 bytes) :
00000000  31 62 37 35 30 66 61 36  2d 31 34 35 34 2d 31 31  |1b750fa6-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

csi.volume.pvc-b3682e0e-5509-4c3b-aeeb-424e84785cf9
value (36 bytes) :
00000000  35 31 61 37 34 30 31 32  2d 31 34 35 34 2d 31 31  |51a74012-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

csi.volume.pvc-db10fc9f-45c7-41e4-8768-7ee189f56d25
value (36 bytes) :
00000000  35 39 66 38 65 30 62 30  2d 31 34 35 34 2d 31 31  |59f8e0b0-1454-11|
00000010  65 62 2d 61 33 63 37 2d  30 61 35 38 30 61 38 30  |eb-a3c7-0a580a80|
00000020  30 34 30 36                                       |0406|
00000024

Comment 11 Mudit Agarwal 2020-11-23 08:17:58 UTC
These images exist only in the trash; this looks like a corner case.
Not a blocker at the moment, so moving it out of 4.6. Will continue the investigation in 4.7.

Comment 12 Mudit Agarwal 2021-01-29 05:18:59 UTC
Unable to reproduce; moving it to 4.8 while we keep trying to reproduce the issue.

