[4.10.z clone] [DR] [tracker for Ceph BZ #2068531] ceph status is in warn state with msg snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
Product:
[Red Hat Storage] Red Hat OpenShift Data Foundation
This bug was initially created as a copy of Bug #2067056
I am copying this bug because:
Description of problem (please be as detailed as possible and provide log snippets):
[DR] ceph status is in HEALTH_WARN state with the message: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
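The 32768 threshold corresponds to the mon_osd_snap_trim_queue_warn_on option. A minimal sketch for confirming the configured threshold and the current warning, assuming the commands are run from a Ceph admin shell (e.g. the rook-ceph toolbox pod in ODF):
$ ceph config get mon mon_osd_snap_trim_queue_warn_on
$ ceph health detail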
Version of all relevant components (if applicable):
OCP version:- 4.10.0-0.nightly-2022-03-17-204457
ODF version:- 4.10.0-199
Ceph version:-
{
    "mon": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 11
    }
}
Does this issue impact your ability to continue to work with the product
(please explain the user impact in detail)?
Yes, Ceph health never returns to HEALTH_OK.
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1
Is this issue reproducible?
Yes
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy a DR cluster
2. Run IO for 2-3 days
3. Check the Ceph status (see the commands sketched below)
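A minimal sketch of the checks for step 3, assuming cluster access through the rook-ceph toolbox pod (the oc target and namespace below are assumptions and may differ per deployment):
$ oc rsh -n openshift-storage deploy/rook-ceph-tools
$ ceph -s
$ ceph health detail
$ ceph pg dump pgs    # the SNAPTRIMQ_LEN column shows the per-PG snap trim queue length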
Actual results:
$ ceph -s
  cluster:
    id:     f9c4bbbf-4acf-41cd-8f78-5c7afbad18ba
    health: HEALTH_WARN
            snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)

  services:
    mon:        3 daemons, quorum a,b,c (age 4d)
    mgr:        a(active, since 4d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4d), 3 in (since 4d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 291.55k objects, 265 GiB
    usage:   780 GiB used, 5.2 TiB / 6 TiB avail
    pgs:     170 active+clean
             4   active+clean+snaptrim
             3   active+clean+snaptrim_wait

  io:
    client: 387 KiB/s rd, 350 KiB/s wr, 403 op/s rd, 67 op/s wr
$ ceph health detail
HEALTH_WARN snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 2.d at 101487
    snap trim queue for pg 2.c at 82174
    snap trim queue for pg 2.a at 74772
    snap trim queue for pg 2.9 at 117600
    snap trim queue for pg 2.6 at 105643
    snap trim queue for pg 2.3 at 119723
    snap trim queue for pg 2.2 at 103280
    snap trim queue for pg 2.4 at 79100
    snap trim queue for pg 2.f at 118522
    snap trim queue for pg 2.10 at 75615
    snap trim queue for pg 2.11 at 34784
    snap trim queue for pg 2.12 at 77035
    snap trim queue for pg 2.14 at 114293
    snap trim queue for pg 2.15 at 116927
    snap trim queue for pg 2.16 at 87353
    snap trim queue for pg 2.17 at 97739
    snap trim queue for pg 2.19 at 107880
    snap trim queue for pg 2.1a at 36867
    snap trim queue for pg 2.1b at 45569
    snap trim queue for pg 2.1d at 99413
    snap trim queue for pg 2.1e at 102046
    snap trim queue for pg 2.1f at 51987
    longest queue on pg 2.3 at 119723
    try decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims".
Expected results:
Ceph health should return to HEALTH_OK once the snap trim queues are drained.
Additional info:
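The health detail output above suggests decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims". A minimal sketch of such tuning via runtime config changes; the values shown are illustrative assumptions only, not validated for this cluster, and would not address whatever is generating snapshots faster than they are trimmed:
$ ceph config get osd osd_snap_trim_sleep
$ ceph config get osd osd_pg_max_concurrent_snap_trims
$ ceph config set osd osd_snap_trim_sleep 0.1                 # illustrative value
$ ceph config set osd osd_pg_max_concurrent_snap_trims 3      # illustrative value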