Description of problem (please be as detailed as possible and provide log snippets):

[DR] ceph status is in HEALTH_WARN state with the message:
snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)

Version of all relevant components (if applicable):
OCP version: 4.10.0-0.nightly-2022-03-17-204457
ODF version: 4.10.0-199
Ceph versions:
{
    "mon": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 11
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, Ceph health never returns to HEALTH_OK.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a DR cluster
2. Run IO for 2-3 days
3. Check ceph status

Actual results:

$ ceph -s
  cluster:
    id:     f9c4bbbf-4acf-41cd-8f78-5c7afbad18ba
    health: HEALTH_WARN
            snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)

  services:
    mon:        3 daemons, quorum a,b,c (age 4d)
    mgr:        a(active, since 4d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4d), 3 in (since 4d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 291.55k objects, 265 GiB
    usage:   780 GiB used, 5.2 TiB / 6 TiB avail
    pgs:     170 active+clean
             4   active+clean+snaptrim
             3   active+clean+snaptrim_wait

  io:
    client: 387 KiB/s rd, 350 KiB/s wr, 403 op/s rd, 67 op/s wr

$ ceph health detail
HEALTH_WARN snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 2.d at 101487
    snap trim queue for pg 2.c at 82174
    snap trim queue for pg 2.a at 74772
    snap trim queue for pg 2.9 at 117600
    snap trim queue for pg 2.6 at 105643
    snap trim queue for pg 2.3 at 119723
    snap trim queue for pg 2.2 at 103280
    snap trim queue for pg 2.4 at 79100
    snap trim queue for pg 2.f at 118522
    snap trim queue for pg 2.10 at 75615
    snap trim queue for pg 2.11 at 34784
    snap trim queue for pg 2.12 at 77035
    snap trim queue for pg 2.14 at 114293
    snap trim queue for pg 2.15 at 116927
    snap trim queue for pg 2.16 at 87353
    snap trim queue for pg 2.17 at 97739
    snap trim queue for pg 2.19 at 107880
    snap trim queue for pg 2.1a at 36867
    snap trim queue for pg 2.1b at 45569
    snap trim queue for pg 2.1d at 99413
    snap trim queue for pg 2.1e at 102046
    snap trim queue for pg 2.1f at 51987
    longest queue on pg 2.3 at 119723
    try decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims"

Expected results:
Ceph health should return to HEALTH_OK once snap trimming catches up.

Additional info:
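Note: the health detail output above points at two OSD tunables. If it helps with triage, they can be changed at runtime from the rook-ceph toolbox pod; the sketch below uses the standard Ceph config keys named in the warning, and the values shown are illustrative assumptions, not recommendations.

# Reduce the sleep between snap trim operations (a per-device-class
# variant such as osd_snap_trim_sleep_hdd may be the one in effect):
$ ceph config set osd osd_snap_trim_sleep 0
# Allow more snap trim operations to run concurrently per PG:
$ ceph config set osd osd_pg_max_concurrent_snap_trims 4
# Re-check whether the snaptrim queues are draining:
$ ceph health detail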
After investigating a similar setup from Paul Cuzner, it's clear this is due to https://tracker.ceph.com/issues/52026
Raising the priority since this can lead to OOM and out-of-space issues, and it is easily reproducible with snapshot mirroring.
For testing purposes, you can avoid hitting the problem by disabling scrubbing in Ceph, which would allow longevity testing to proceed. To disable scrubbing, use the toolbox pod to run 'ceph osd set noscrub' on all clusters.
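For reference, a minimal sketch of that workaround on an ODF cluster, assuming the default openshift-storage namespace and the rook-ceph-tools pod label (both are assumptions; adjust to your deployment):

# Locate the toolbox pod (label/namespace assumed from a default ODF install):
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
# Disable scrubbing on this cluster; repeat on every cluster in the DR setup:
$ oc -n openshift-storage rsh $TOOLS_POD ceph osd set noscrub
# Re-enable once testing is done:
$ oc -n openshift-storage rsh $TOOLS_POD ceph osd unset noscrub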
*** Bug 2021079 has been marked as a duplicate of this bug. ***
*** Bug 2017429 has been marked as a duplicate of this bug. ***
Moving DR BZs to 4.10.z/4.11
Aman, can you reproduce with higher log levels (debug_osd = 20, debug_ms = 1, log_to_file = true for all OSDs)? These kinds of bugs can't be investigated without more detailed logs.
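For whoever reproduces this, a sketch of setting those levels via the centralized config from the toolbox (note the caveat in the comments below that logging likely needs to be enabled before the issue occurs):

# Raise OSD debug levels cluster-wide and persist them in the mon config store:
$ ceph config set osd debug_osd 20
$ ceph config set osd debug_ms 1
$ ceph config set osd log_to_file true
# Verify the settings took effect:
$ ceph config get osd debug_osd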
@Josh is it fine if we increase the log level after hitting this issue?
@prsurve - it's highly unlikely to help if the logging isn't enabled before the bug occurs.
Have we reproduced this on a cluster with the higher log level set yet (see comment 25)?
Is this a TP blocker? If not, I will move it out of 4.11.
(In reply to Mudit Agarwal from comment #32)
> Is this a TP blocker? If not, I will move it out of 4.11.

This is not considered a TP blocker.
Please provide doc text.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days