Bug 1998330 - [RFE] Limit slow request details to cluster log
Summary: [RFE] Limit slow request details to cluster log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.0
Assignee: Prashant Dhange
QA Contact: Tintu Mathew
Docs Contact: Masauso Lungu
URL:
Whiteboard:
Depends On:
Blocks: 2048681 2126050
 
Reported: 2021-08-26 20:18 UTC by Vikhyat Umrao
Modified: 2023-06-06 03:53 UTC
CC List: 14 users

Fixed In Version: ceph-17.2.3-1.el9cp
Doc Type: Enhancement
Doc Text:
.OSDs report slow operation details in an aggregated format to the Ceph cluster log
Previously, slow requests would overwhelm the cluster log with too many details, filling up the monitor database. With this release, slow requests are logged to the cluster log in an aggregated format, broken down by operation type and by pool.
Clone Of:
: 2048681
Environment:
Last Closed: 2023-03-20 18:55:34 UTC
Embargoed:


Links
Ceph Project Bug Tracker 52424 (2021-08-26 20:19:42 UTC)
Ceph Project Bug Tracker 53978 (2022-01-31 17:24:22 UTC)
Github ceph ceph pull 43732, Merged: osd/OSD: Log aggregated slow ops detail to cluster logs (2022-08-23 07:53:22 UTC)
Red Hat Issue Tracker RHCEPH-1029 (2021-08-26 20:19:48 UTC)
Red Hat Product Errata RHBA-2023:1360 (2023-03-20 18:56:13 UTC)

Description Vikhyat Umrao 2021-08-26 20:18:05 UTC
Description of problem:
[RFE] Limit slow request details to mgr log

Slow requests can overwhelm a cluster log with too many details, filling up the monitor DB.

There's no need to log details of every single slow request or send any of those details to the cluster log. Instead, OSDs can report the top N slow ops details to the mgr, and they can be output in the mgr log.
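For illustration only, here is a minimal Python sketch of the "top N" selection, assuming each in-flight op is represented as an (age_seconds, description) tuple; this is not Ceph code, just a sketch of the idea:

# Illustrative sketch only, not Ceph code: keep just the N longest-outstanding
# requests so that only those would be reported to the mgr.
import heapq

def top_n_slow_ops(in_flight_ops, n=10):
    """in_flight_ops: iterable of (age_seconds, op_description) tuples (assumed shape)."""
    return heapq.nlargest(n, in_flight_ops, key=lambda op: op[0])

# Example:
ops = [(75.2, "osd_op A"), (31.0, "osd_op B"), (120.5, "osd_op C")]
print(top_n_slow_ops(ops, n=2))   # -> [(120.5, 'osd_op C'), (75.2, 'osd_op A')]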


Version-Release number of selected component (if applicable):
RHCS 4.2

Comment 2 Prashant Dhange 2021-11-18 04:07:46 UTC
@linuxkidd In the Pacific release (to be backported), our plan is to log only aggregated slow ops details instead of logging every slow op to the cluster log. Below is the aggregated format that will be used to log slow ops in the cluster log.


<date>T<time> osd.<osdid> (osd.<osdid>) : cluster [WRN] <number-of-slow-requests> slow requests (by type [ 'slow-op-type1' : <number-of-slow-op-of-type1> 'slow-op-type2' : <number-of-slow-op-of-type2> .. ] most affected pool [ '<max-slow-for-pool-name>' : <number-of-slow-ops-for-pool> ])

e.g.
2021-10-29T09:35:48.104306-0400 osd.0 (osd.0) 3 : cluster [WRN] 10 slow requests (by type [ 'delayed' : 4 'queued for pg' : 2 'started' : 3 'waiting for sub ops' : 1 ] most affected pool [ 'scbench' : 10 ])

We are not logging all of the pools and/or PGs observing slow requests, since many pools/PGs may have slow requests at once, and with that information included the log message would be too long to print in the cluster log.
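For illustration, a minimal Python sketch of this aggregation, assuming each slow op is reduced to an (op_type, pool_name) pair; this is not the actual OSD change from the linked pull request, only a sketch of the format shown above:

# Illustrative sketch only: aggregate slow ops by operation type and by the
# most-affected pool, then build one cluster-log warning line in the new format.
from collections import Counter

def format_slow_op_warning(slow_ops):
    """slow_ops: list of (op_type, pool_name) tuples for currently-slow requests."""
    slow_ops = list(slow_ops)
    by_type = Counter(op_type for op_type, _ in slow_ops)
    by_pool = Counter(pool for _, pool in slow_ops)
    worst_pool, worst_count = by_pool.most_common(1)[0]
    type_part = " ".join(f"'{t}' : {n}" for t, n in by_type.items())
    return (f"{len(slow_ops)} slow requests (by type [ {type_part} ] "
            f"most affected pool [ '{worst_pool}' : {worst_count} ])")

# Feeding it the ops from the sample above reproduces the example line:
ops = ([("delayed", "scbench")] * 4 + [("queued for pg", "scbench")] * 2 +
       [("started", "scbench")] * 3 + [("waiting for sub ops", "scbench")])
print(format_slow_op_warning(ops))
# -> 10 slow requests (by type [ 'delayed' : 4 'queued for pg' : 2 'started' : 3
#    'waiting for sub ops' : 1 ] most affected pool [ 'scbench' : 10 ])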

What other information would be useful, apart from logging the slow op type and pool in aggregated format?

Comment 3 Michael J. Kidd 2021-12-15 19:02:19 UTC
Some quick feedback:

In support, we usually slice up slow requests by a few parameters:
1: By OSD ( still possible with this format )
2: By pool ( still possible with this format )
3: By type ( still possible with this format )
4: By PG ( hidden )
5: By object ( hidden )
-- Usually filtered to only 1st log ( 30 to 59 seconds old ), and not re-logged events ( 60+ seconds old )
-- We have used the ratio of initial to re-logged events to help understand if a given batch of slow requests is just stuck on a log queue, and not reflective of a current issue.
-- We also parse for 'deep-scrub' event entries against a given PG, to see if the timing aligns as the cause of the slow request ( a rough sketch of this slicing follows below )
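For illustration, a minimal Python sketch of this kind of slicing, assuming the per-request fields (OSD, pool, PG, object, type, age) have already been extracted from the old-format log lines; the record shape is hypothetical, and the raw log-line parsing itself is format-dependent and omitted:

# Rough sketch of the slicing described above, operating on slow-request records
# already pulled out of the cluster log. Field names are assumptions for
# illustration only.
from collections import Counter
from dataclasses import dataclass

@dataclass
class SlowRequest:
    osd: str
    pool: str
    pg: str
    obj: str
    op_type: str
    age_seconds: float

def slice_slow_requests(requests):
    # Keep only first-time reports (30 to 59 seconds old); 60+ seconds means the
    # same request was re-logged and would otherwise be double counted.
    first_logged = [r for r in requests if 30 <= r.age_seconds < 60]
    relogged = [r for r in requests if r.age_seconds >= 60]
    return {
        "by_osd": Counter(r.osd for r in first_logged),
        "by_pool": Counter(r.pool for r in first_logged),
        "by_type": Counter(r.op_type for r in first_logged),
        "by_pg": Counter(r.pg for r in first_logged),
        "by_object": Counter(r.obj for r in first_logged),
        # Ratio of re-logged to initial events, used to judge whether a batch is
        # just a backed-up log queue rather than an ongoing problem.
        "relog_ratio": len(relogged) / max(len(first_logged), 1),
    }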


## Observations:
- I don't see many ways to get the By PG / By Object outputs in any way beyond 1 line per slow op.
  + Maybe the 'by pg' could be added to a 'ceph pg dump' output? Or maybe only with --format json/json-pretty?
  + The problem here though, is it's "point in time", not historical, so usefulness of 'pg dump' data would be extremely limited.
- I also don't see that this new format takes into account the age of the request.
- Nor do I see a way to account for slow requests as a side-effect of deep-scrub (without per-PG specifics).


I definitely agree we have an issue with MONs getting flooded from aggressive logging, especially when some major turbulence is occurring.  Maybe there would be a way for the MONs to set a state once they are > X entries ( or > X time ) behind in their log processing, that the other services recognize and cease clog submissions until the MONs are caught back up, dropping all logs between the "I'm flooded" and "I'm caught up" states.  I know we've tried some intelligent back-off at the service levels, but that hasn't seemed to help ( or help enough ).
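Purely as a sketch of the back-pressure idea proposed here (this is not existing Ceph behaviour; the class, thresholds, and state names are hypothetical):

# Hypothetical sketch of the proposed MON-side back-pressure state, not Ceph code.
import time

class ClogBackpressure:
    """Tracks whether a monitor is too far behind on cluster-log processing."""

    def __init__(self, max_backlog_entries=10_000, max_backlog_seconds=30.0):
        self.max_backlog_entries = max_backlog_entries
        self.max_backlog_seconds = max_backlog_seconds
        self.flooded = False

    def update(self, queued_entries, oldest_entry_ts):
        behind_for = time.time() - oldest_entry_ts if queued_entries else 0.0
        if queued_entries > self.max_backlog_entries or behind_for > self.max_backlog_seconds:
            self.flooded = True    # advertise "I'm flooded": services pause clog submissions
        elif queued_entries == 0:
            self.flooded = False   # advertise "I'm caught up": services resume submissions
        return self.flooded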

Comment 4 Prashant Dhange 2022-01-10 19:54:20 UTC
Thanks Michael.

(In reply to Michael J. Kidd from comment #3)
> Some quick feedback:
> 
> In support, we usually slice up slow requests by a few parameters:
> 1: By OSD ( still possible with this format )
> 2: By pool ( still possible with this format )
> 3: By type ( still possible with this format )
> 4: By PG ( hidden )
> 5: By object ( hidden )
> -- Usually filtered to only 1st log ( 30 to 59 seconds old ), and not
> re-logged events ( 60+ seconds old )
> -- We have used the ratio of initial to re-logged events to help understand
> if a given batch of slow requests is just stuck on a log queue, and not
> reflective of a current issue.
> -- We also parse for 'deep-scrub' event entries against a given PG, to see
> if the timing aligns as the cause of the slow request.
Yes, the by-PG and by-object info will be missing in the new format, but we can still send the old format to the cluster log by setting osd_log_slow_op_to_clog to true for the OSD(s), although historical information will still be missing.
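For reference, a small hypothetical wrapper showing how that option could be flipped for all OSDs (or a single one) with the standard ceph config set command; the helper function name is made up for the example:

# Minimal sketch: enable the old per-op slow request logging by flipping the
# osd_log_slow_op_to_clog option (option name taken from the comment above)
# through the standard 'ceph config set' CLI.
import subprocess

def enable_per_op_slow_logging(target="osd"):
    # target can be "osd" for all OSD daemons or e.g. "osd.0" for a single one.
    subprocess.run(
        ["ceph", "config", "set", target, "osd_log_slow_op_to_clog", "true"],
        check=True,
    )

if __name__ == "__main__":
    enable_per_op_slow_logging()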
  
> 
> 
> ## Observations:
> - I don't see many ways to get the By PG / By Object outputs in any way
> beyond 1 line per slow op.
>   + Maybe the 'by pg' could be added to a 'ceph pg dump' output? Or maybe
> only with --format json/json-pretty?
>   + The problem here though, is it's "point in time", not historical, so
> usefulness of 'pg dump' data would be extremely limited.
> - I also don't see that this new format takes into account the age of the
> request.
> - Nor do I see a way to account for slow requests as a side-effect of
> deep-scrub (without per-PG specifics).

We can introduce timestamps for the oldest and newest slow op observed when the aggregated information is logged, so that it can be correlated with deep-scrub events.

> 
> 
> I definitely agree we have an issue with MONs getting flooded from
> aggressive logging, especially when some major turbulence is occurring. 
> Maybe there would be a way for the MONs to set a state once they are > X
> entries ( or > X time ) behind in their log processing, that the other
> services recognize and cease clog submissions until the MONs are caught back
> up, dropping all logs between the "I'm flooded" and "I'm caught up" states. 
We may miss slow ops info if we start dropping logs between the "I'm flooded" and "I'm caught up" states. Also, cluster logging happens in quick succession, so we would still consume significant monstore space. Maybe we can use the osd_log_slow_op_to_clog config option on a case-by-case basis while keeping the aggregated slow ops info in the cluster log?

> I know we've tried some intelligent back-off at the service levels, but that
> hasn't seemed to help ( or help enough ).

Let us discuss these inputs for future enhancement.

Comment 34 errata-xmlrpc 2023-03-20 18:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

