Bug 1998330 - [RFE] Limit slow request details to cluster log
Summary: [RFE] Limit slow request details to cluster log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.0
Assignee: Prashant Dhange
QA Contact: Tintu Mathew
Docs Contact: Masauso Lungu
URL:
Whiteboard:
Depends On:
Blocks: 2048681 2126050
 
Reported: 2021-08-26 20:18 UTC by Vikhyat Umrao
Modified: 2023-06-06 03:53 UTC
CC List: 14 users

Fixed In Version: ceph-17.2.3-1.el9cp
Doc Type: Enhancement
Doc Text:
.OSDs report slow operation details in an aggregated format to the Ceph cluster log
Previously, slow requests would overwhelm the cluster log with too many details, filling up the monitor database. With this release, slow requests are logged to the cluster log in an aggregated format, broken down by operation type and by pool.
Clone Of:
: 2048681
Environment:
Last Closed: 2023-03-20 18:55:34 UTC
Embargoed:


Links
Ceph Project Bug Tracker 52424 (2021-08-26 20:19:42 UTC)
Ceph Project Bug Tracker 53978 (2022-01-31 17:24:22 UTC)
Github ceph ceph pull 43732, Merged: osd/OSD: Log aggregated slow ops detail to cluster logs (2022-08-23 07:53:22 UTC)
Red Hat Issue Tracker RHCEPH-1029 (2021-08-26 20:19:48 UTC)
Red Hat Product Errata RHBA-2023:1360 (2023-03-20 18:56:13 UTC)

Description Vikhyat Umrao 2021-08-26 20:18:05 UTC
Description of problem:
[RFE] Limit slow request details to mgr log

Slow requests can overwhelm a cluster log with too many details, filling up the monitor DB.

There's no need to log details of every single slow request or send any of those details to the cluster log. Instead, OSDs can report the top N slow ops details to the mgr, and they can be output in the mgr log.
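For illustration only, here is a minimal Python sketch of the "top N" selection, assuming each in-flight op is represented as an (age_seconds, description) tuple; this is not Ceph code, just a sketch of the idea:

# Illustrative sketch only, not Ceph code: keep just the N longest-outstanding
# requests so that only those would be reported to the mgr.
import heapq

def top_n_slow_ops(in_flight_ops, n=10):
    """in_flight_ops: iterable of (age_seconds, op_description) tuples (assumed shape)."""
    return heapq.nlargest(n, in_flight_ops, key=lambda op: op[0])

# Example:
ops = [(75.2, "osd_op A"), (31.0, "osd_op B"), (120.5, "osd_op C")]
print(top_n_slow_ops(ops, n=2))   # -> [(120.5, 'osd_op C'), (75.2, 'osd_op A')]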


Version-Release number of selected component (if applicable):
RHCS 4.2

Comment 2 Prashant Dhange 2021-11-18 04:07:46 UTC
@linuxkidd In the Pacific release (to be backported), our plan is to log only aggregated slow ops details instead of logging every slow op to the cluster log. Below is the aggregated format that will be used to log slow ops in the cluster log.


<date>T<time> osd.<osdid> (osd.<osdid>) : cluster [WRN] <number-of-slow-requests> slow requests (by type [ 'slow-op-type1' : <number-of-slow-op-of-type1> 'slow-op-type2' : <number-of-slow-op-of-type2> .. ] most affected pool [ '<max-slow-for-pool-name>' : <number-of-slow-ops-for-pool> ])

e.g.
2021-10-29T09:35:48.104306-0400 osd.0 (osd.0) 3 : cluster [WRN] 10 slow requests (by type [ 'delayed' : 4 'queued for pg' : 2 'started' : 3 'waiting for sub ops' : 1 ] most affected pool [ 'scbench' : 10 ])

We are not logging all of the pools and/or PGs observing slow requests, since many pools/PGs may have slow requests at once, and with that information included the log message would be too long to print in the cluster log.
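For illustration, a minimal Python sketch of this aggregation, assuming each slow op is reduced to an (op_type, pool_name) pair; this is not the actual OSD change from the linked pull request, only a sketch of the format shown above:

# Illustrative sketch only: aggregate slow ops by operation type and by the
# most-affected pool, then build one cluster-log warning line in the new format.
from collections import Counter

def format_slow_op_warning(slow_ops):
    """slow_ops: list of (op_type, pool_name) tuples for currently-slow requests."""
    slow_ops = list(slow_ops)
    by_type = Counter(op_type for op_type, _ in slow_ops)
    by_pool = Counter(pool for _, pool in slow_ops)
    worst_pool, worst_count = by_pool.most_common(1)[0]
    type_part = " ".join(f"'{t}' : {n}" for t, n in by_type.items())
    return (f"{len(slow_ops)} slow requests (by type [ {type_part} ] "
            f"most affected pool [ '{worst_pool}' : {worst_count} ])")

# Feeding it the ops from the sample above reproduces the example line:
ops = ([("delayed", "scbench")] * 4 + [("queued for pg", "scbench")] * 2 +
       [("started", "scbench")] * 3 + [("waiting for sub ops", "scbench")])
print(format_slow_op_warning(ops))
# -> 10 slow requests (by type [ 'delayed' : 4 'queued for pg' : 2 'started' : 3
#    'waiting for sub ops' : 1 ] most affected pool [ 'scbench' : 10 ])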

What other information would be useful, apart from logging the slow op type and pool in aggregated format?

Comment 3 Michael J. Kidd 2021-12-15 19:02:19 UTC
Some quick feedback:

In support, we usually slice up slow requests by a few parameters:
1: By OSD ( still possible with this format )
2: By pool ( still possible with this format )
3: By type ( still possible with this format )
4: By PG ( hidden )
5: By object ( hidden )
-- Usually filtered to only 1st log ( 30 to 59 seconds old ), and not re-logged events ( 60+ seconds old )
-- We have used the ratio of initial to re-logged events to help understand if a given batch of slow requests is just stuck on a log queue, and not reflective of a current issue.
-- We also parse for 'deep-scrub' event entries against a given PG, to see if the timing aligns as the cause of the slow request ( a rough sketch of this slicing follows below )
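For illustration, a minimal Python sketch of this kind of slicing, assuming the per-request fields (OSD, pool, PG, object, type, age) have already been extracted from the old-format log lines; the record shape is hypothetical, and the raw log-line parsing itself is format-dependent and omitted:

# Rough sketch of the slicing described above, operating on slow-request records
# already pulled out of the cluster log. Field names are assumptions for
# illustration only.
from collections import Counter
from dataclasses import dataclass

@dataclass
class SlowRequest:
    osd: str
    pool: str
    pg: str
    obj: str
    op_type: str
    age_seconds: float

def slice_slow_requests(requests):
    # Keep only first-time reports (30 to 59 seconds old); 60+ seconds means the
    # same request was re-logged and would otherwise be double counted.
    first_logged = [r for r in requests if 30 <= r.age_seconds < 60]
    relogged = [r for r in requests if r.age_seconds >= 60]
    return {
        "by_osd": Counter(r.osd for r in first_logged),
        "by_pool": Counter(r.pool for r in first_logged),
        "by_type": Counter(r.op_type for r in first_logged),
        "by_pg": Counter(r.pg for r in first_logged),
        "by_object": Counter(r.obj for r in first_logged),
        # Ratio of re-logged to initial events, used to judge whether a batch is
        # just a backed-up log queue rather than an ongoing problem.
        "relog_ratio": len(relogged) / max(len(first_logged), 1),
    }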


## Observations:
- I don't see many ways to get the By PG / By Object outputs in any way beyond 1 line per slow op.
  + Maybe the 'by pg' could be added to a 'ceph pg dump' output? Or maybe only with --format json/json-pretty?
  + The problem here though, is it's "point in time", not historical, so usefulness of 'pg dump' data would be extremely limited.
- I also don't see that this new format takes into account the age of the request.
- Nor do I see a way to account for slow requests as a side-effect of deep-scrub (without per-PG specifics).


I definitely agree we have an issue with MONs getting flooded from aggressive logging, especially when some major turbulence is occurring.  Maybe there would be a way for the MONs to set a state once they are > X entries ( or > X time ) behind in their log processing, that the other services recognize and cease clog submissions until the MONs are caught back up, dropping all logs between the "I'm flooded" and "I'm caught up" states.  I know we've tried some intelligent back-off at the service levels, but that hasn't seemed to help ( or help enough ).
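Purely as a sketch of the back-pressure idea proposed here (this is not existing Ceph behaviour; the class, thresholds, and state names are hypothetical):

# Hypothetical sketch of the proposed MON-side back-pressure state, not Ceph code.
import time

class ClogBackpressure:
    """Tracks whether a monitor is too far behind on cluster-log processing."""

    def __init__(self, max_backlog_entries=10_000, max_backlog_seconds=30.0):
        self.max_backlog_entries = max_backlog_entries
        self.max_backlog_seconds = max_backlog_seconds
        self.flooded = False

    def update(self, queued_entries, oldest_entry_ts):
        behind_for = time.time() - oldest_entry_ts if queued_entries else 0.0
        if queued_entries > self.max_backlog_entries or behind_for > self.max_backlog_seconds:
            self.flooded = True    # advertise "I'm flooded": services pause clog submissions
        elif queued_entries == 0:
            self.flooded = False   # advertise "I'm caught up": services resume submissions
        return self.flooded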

Comment 4 Prashant Dhange 2022-01-10 19:54:20 UTC
Thanks Michael.

(In reply to Michael J. Kidd from comment #3)
> Some quick feedback:
> 
> In support, we usually slice up slow requests by a few parameters:
> 1: By OSD ( still possible with this format )
> 2: By pool ( still possible with this format )
> 3: By type ( still possible with this format )
> 4: By PG ( hidden )
> 5: By object ( hidden )
> -- Usually filtered to only 1st log ( 30 to 59 seconds old ), and not
> re-logged events ( 60+ seconds old )
> -- We have used the ratio of initial to re-logged events to help understand
> if a given batch of slow requests is just stuck on a log queue, and not
> reflective of a current issue.
> -- We also parse for 'deep-scrub' event entries against a given PG, to see
> if the timing aligns as the cause of the slow request.
Yes, the by-PG and by-object info will be missing in the new format, but we can still send the old format to the cluster log by setting osd_log_slow_op_to_clog to true for the OSD(s), although historical information will still be missing.
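For reference, a small hypothetical wrapper showing how that option could be flipped for all OSDs (or a single one) with the standard ceph config set command; the helper function name is made up for the example:

# Minimal sketch: enable the old per-op slow request logging by flipping the
# osd_log_slow_op_to_clog option (option name taken from the comment above)
# through the standard 'ceph config set' CLI.
import subprocess

def enable_per_op_slow_logging(target="osd"):
    # target can be "osd" for all OSD daemons or e.g. "osd.0" for a single one.
    subprocess.run(
        ["ceph", "config", "set", target, "osd_log_slow_op_to_clog", "true"],
        check=True,
    )

if __name__ == "__main__":
    enable_per_op_slow_logging()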
  
> 
> 
> ## Observations:
> - I don't see many ways to get the By PG / By Object outputs in any way
> beyond 1 line per slow op.
>   + Maybe the 'by pg' could be added to a 'ceph pg dump' output? Or maybe
> only with --format json/json-pretty?
>   + The problem here though, is it's "point in time", not historical, so
> usefulness of 'pg dump' data would be extremely limited.
> - I also don't see that this new format takes into account the age of the
> request.
> - Nor do I see a way to account for slow requests as a side-effect of
> deep-scrub (without per-PG specifics).

We can introduce timestamps for the oldest and newest slow op observed when the aggregated information is logged, so that it can be correlated with deep-scrub events.

> 
> 
> I definitely agree we have an issue with MONs getting flooded from
> aggressive logging, especially when some major turbulence is occurring. 
> Maybe there would be a way for the MONs to set a state once they are > X
> entries ( or > X time ) behind in their log processing, that the other
> services recognize and cease clog submissions until the MONs are caught back
> up, dropping all logs between the "I'm flooded" and "I'm caught up" states. 
We may miss slow ops info if we start dropping logs between the "I'm flooded" and "I'm caught up" states. Also, cluster logging happens in quick succession, so we would still consume significant monstore space. Maybe we can use the osd_log_slow_op_to_clog config option on a case-by-case basis while keeping the aggregated slow ops info in the cluster log?

> I know we've tried some intelligent back-off at the service levels, but that
> hasn't seemed to help ( or help enough ).

Let us discuss these inputs for future enhancement.

Comment 34 errata-xmlrpc 2023-03-20 18:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

