Bug 2067056

Summary:	[RDR] [tracker for Ceph BZ #2068531] ceph status is in warn state with msg snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Pratik Surve <prsurve>
Component:	ceph	Assignee:	Ronen Friedman <rfriedma>
ceph sub component:	RADOS	QA Contact:	Elad <ebenahar>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	amagrawa, bniver, edonnell, ekuric, etamir, idryomov, jdurgin, jespy, kramdoss, kseeger, madam, mmuench, muagarwa, nojha, ocs-bugs, odf-bz-bot, owasserm, pcuzner, pdhiran, pnataraj, rcyriac, rfriedma, sostapov, srangana
Version:	4.10	Keywords:	TestBlocker, Tracking
Target Milestone:	---
Target Release:	ODF 4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.11.0-50	Doc Type:	Bug Fix
Doc Text:	.Ceph OSD snap trimming is no longer blocked by a running scrub. Previously, OSD snap trimming, once blocked by a running scrub, was not restarted. As a result, no trimming was performed until an OSD reset. This release fixes the restart handling so that trimming blocked by a scrub resumes after the scrub, and snap trimming works as expected.	Story Points:	---
Clone Of:
Clones:	2068531 2095674		Environment:
Last Closed:	2022-09-06 08:15:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
Embargoed:
Bug Depends On:
Bug Blocks:	2068531, 2078956, 2094357, 2095674

Description Pratik Surve 2022-03-23 06:56:15 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
[DR] ceph status is in HEALTH_WARN state with the message: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)


Version of all relevant components (if applicable):
OCP version:- 4.10.0-0.nightly-2022-03-17-204457
ODF version:- 4.10.0-199
Ceph version:- {
    "mon": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)": 11
    }
}

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, Ceph health never returns to HEALTH_OK.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a DR cluster
2. Run I/O for 2-3 days
3. Check ceph status


Actual results:
$ ceph -s
  cluster:
    id:     f9c4bbbf-4acf-41cd-8f78-5c7afbad18ba
    health: HEALTH_WARN
            snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
 
  services:
    mon:        3 daemons, quorum a,b,c (age 4d)
    mgr:        a(active, since 4d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4d), 3 in (since 4d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 291.55k objects, 265 GiB
    usage:   780 GiB used, 5.2 TiB / 6 TiB avail
    pgs:     170 active+clean
             4   active+clean+snaptrim
             3   active+clean+snaptrim_wait
 
  io:
    client:   387 KiB/s rd, 350 KiB/s wr, 403 op/s rd, 67 op/s wr


$ ceph health detail
HEALTH_WARN snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 2.d at 101487
    snap trim queue for pg 2.c at 82174
    snap trim queue for pg 2.a at 74772
    snap trim queue for pg 2.9 at 117600
    snap trim queue for pg 2.6 at 105643
    snap trim queue for pg 2.3 at 119723
    snap trim queue for pg 2.2 at 103280
    snap trim queue for pg 2.4 at 79100
    snap trim queue for pg 2.f at 118522
    snap trim queue for pg 2.10 at 75615
    snap trim queue for pg 2.11 at 34784
    snap trim queue for pg 2.12 at 77035
    snap trim queue for pg 2.14 at 114293
    snap trim queue for pg 2.15 at 116927
    snap trim queue for pg 2.16 at 87353
    snap trim queue for pg 2.17 at 97739
    snap trim queue for pg 2.19 at 107880
    snap trim queue for pg 2.1a at 36867
    snap trim queue for pg 2.1b at 45569
    snap trim queue for pg 2.1d at 99413
    snap trim queue for pg 2.1e at 102046
    snap trim queue for pg 2.1f at 51987
    longest queue on pg 2.3 at 119723
    try decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims".
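
As a rough illustration, the per-PG lines in the `ceph health detail` output above can be parsed to rank the queues. This is a minimal sketch that assumes the line format shown in this report; the function names and the `SAMPLE` text (three PG lines copied from the output above) are hypothetical and not part of any Ceph tooling.

```python
import re

# Matches per-PG detail lines such as "snap trim queue for pg 2.d at 101487".
# The summary line ("snap trim queue for 22 pg(s) ...") and the
# "longest queue on pg ..." line do not match this pattern.
QUEUE_LINE = re.compile(r"snap trim queue for pg (\S+) at (\d+)")


def parse_snap_trim_queues(health_detail):
    """Return {pgid: queue_length} for every per-PG line in the text."""
    return {pgid: int(length) for pgid, length in QUEUE_LINE.findall(health_detail)}


def longest_queue(queues):
    """Return the (pgid, queue_length) pair with the largest queue."""
    return max(queues.items(), key=lambda item: item[1])


# Excerpt of the output in this report (assumed format).
SAMPLE = """\
[WRN] PG_SLOW_SNAP_TRIMMING: snap trim queue for 22 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 2.d at 101487
    snap trim queue for pg 2.3 at 119723
    snap trim queue for pg 2.11 at 34784
    longest queue on pg 2.3 at 119723
"""

if __name__ == "__main__":
    queues = parse_snap_trim_queues(SAMPLE)
    print(longest_queue(queues))
```

Sampling this at intervals would show whether the queues are shrinking (trimming is progressing) or only growing, which is the symptom described here.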


Expected results:


Additional info:

Comment 6 Josh Durgin 2022-03-24 15:04:14 UTC

After investigating a similar setup from Paul Cuzner, it's clear this is due to https://tracker.ceph.com/issues/52026

Comment 7 Josh Durgin 2022-03-24 15:18:18 UTC

Raising the priority since this can lead to OOM and out of space issues and is easily reproducible with snap mirroring.

Comment 8 Josh Durgin 2022-03-24 16:00:13 UTC

For testing purposes, you can avoid hitting the problem by disabling scrubbing in ceph. This would allow longevity testing to proceed.

To disable scrub in ceph, use the toolbox pod to run 'ceph osd set noscrub' on all clusters.
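
A minimal sketch of the workaround above, assuming the `ceph` CLI is available where the script runs (for example inside the toolbox pod). The helper names are hypothetical. The `nodeep-scrub` flag is an optional addition that is not mentioned in this comment; it is included only as an assumption for setups where deep scrubs should also be paused.

```python
import subprocess


def scrub_flag_commands(enable_scrub, include_deep=False):
    """Build the ceph commands that set or unset the scrub flags.

    enable_scrub=False -> "ceph osd set noscrub" (the workaround above)
    enable_scrub=True  -> "ceph osd unset noscrub" (undo it after testing)
    include_deep adds the nodeep-scrub flag (assumption, not from the comment).
    """
    action = "unset" if enable_scrub else "set"
    flags = ["noscrub"]
    if include_deep:
        flags.append("nodeep-scrub")
    return [["ceph", "osd", action, flag] for flag in flags]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: only prints the command that would disable scrubbing.
    run(scrub_flag_commands(enable_scrub=False))
```

The same helper with `enable_scrub=True` produces the command to re-enable scrubbing once longevity testing is done.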

Comment 10 Josh Durgin 2022-03-24 17:00:42 UTC

*** Bug 2021079 has been marked as a duplicate of this bug. ***

Comment 11 Josh Durgin 2022-03-24 17:01:05 UTC

*** Bug 2017429 has been marked as a duplicate of this bug. ***

Comment 13 Mudit Agarwal 2022-04-05 13:44:59 UTC

Moving DR BZs to 4.10.z/4.11

Comment 25 Josh Durgin 2022-06-06 14:23:29 UTC

Aman, can you reproduce with higher log levels (debug_osd = 20, debug_ms = 1, log_to_file=true for all osds) - these kinds of bugs can't be investigated without more detailed logs.
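
One possible way to apply the log settings requested above is through `ceph config set`, sketched below. This assumes the settings are applied to the `osd` section so that they cover all OSDs, and that they are in place before the reproduction run starts. The helper names are hypothetical.

```python
# Settings requested in this comment, applied to all OSDs.
DEBUG_SETTINGS = {
    "debug_osd": "20",
    "debug_ms": "1",
    "log_to_file": "true",
}


def debug_enable_commands(who="osd"):
    """Build one 'ceph config set' command per requested setting."""
    return [["ceph", "config", "set", who, name, value]
            for name, value in DEBUG_SETTINGS.items()]


def debug_revert_commands(who="osd"):
    """Build 'ceph config rm' commands that drop the overrides again."""
    return [["ceph", "config", "rm", who, name] for name in DEBUG_SETTINGS]


if __name__ == "__main__":
    for argv in debug_enable_commands() + debug_revert_commands():
        print(" ".join(argv))
```

The revert commands matter because `debug_osd = 20` produces large log volumes over a multi-day run.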

Comment 26 Pratik Surve 2022-06-06 14:51:05 UTC

@Josh is it fine if we increase the log level after hitting this issue?

Comment 27 Ronen Friedman 2022-06-06 15:10:19 UTC

@prsurve - it's highly unlikely to help if the higher log levels are not already on before the bug occurs.

Comment 30 Scott Ostapovicz 2022-06-09 08:07:53 UTC

Have we reproduced this on a cluster with the higher log level set yet (see comment 25)?

Comment 32 Mudit Agarwal 2022-06-29 13:28:57 UTC

Is this a TP blocker? If not, I will move it out of 4.11.

Comment 33 krishnaram Karthick 2022-07-05 10:26:12 UTC

(In reply to Mudit Agarwal from comment #32)
> Is this a TP blocker? If not, I will move it out of 4.11.

This is not considered a TP blocker.

Comment 40 Mudit Agarwal 2022-08-11 05:10:53 UTC

Please provide doc text.

Comment 46 Red Hat Bugzilla 2023-12-08 04:28:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days