Bug 1929760 - [RFE] [Ceph-Dashboard] [Ceph-mgr] Dashboard to display per OSD slow op counter and type of slow op
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 6.1
Assignee: Pere Diaz Bou
QA Contact: Sayalee
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2180567 2186095
Blocks:
Reported: 2021-02-17 15:25 UTC by Mike Hackett
Modified: 2023-10-14 04:25 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
A metric to track slow operations per daemon is added. Previously, tracking slow operations was cumbersome because it required parsing logs. With this release, a metric that tracks slow operations per Ceph daemon is added to Prometheus.
Clone Of:
Environment:
Last Closed: 2023-06-15 09:15:28 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph pull 49519 | 0 | None | Merged | quincy: mgr/prometheus: expose daemon health metrics | 2023-02-08 12:46:13 UTC
Red Hat Issue Tracker | RHCSDASH-318 | 0 | None | None | None | 2021-11-04 19:53:25 UTC
Red Hat Product Errata | RHSA-2023:3623 | 0 | None | None | None | 2023-06-15 09:16:16 UTC

Description Mike Hackett 2021-02-17 15:25:12 UTC
Description of problem:
This request expands on the work done in https://bugzilla.redhat.com/show_bug.cgi?id=1929756. The dashboard should display a counter for the number of slow ops against each specific OSD device, and also list counters for the types of slow request seen against that OSD.

We currently have to use bash scripts to generate these numbers from the Ceph cluster logs. For example, here is the slow request breakdown by OSD from the cluster log:
$ grep 'slow request [3-5][0-9]\.' ceph.log | awk '{print $3}' | sort -g | uniq -c | sort -g
      2 osd.1869
      4 osd.1446
      7 osd.1145
      8 osd.1045
      8 osd.2084
     13 osd.1172
     22 osd.0
     35 osd.17
     49 osd.309
    106 osd.2361
    196 osd.1651
    450 osd.2484
    533 osd.1629
   1849 osd.1237
   4228 osd.2301
   9332 osd.118
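
For anyone who would rather not chain grep/awk/sort, the same per-OSD count can be sketched in Python. This is only an illustration of the pipeline above: it assumes, as the awk `$3` does, that the daemon name is the third whitespace-separated field of each log line. The sample lines are hypothetical and not taken from a real cluster log.

```python
import re
from collections import Counter

# Same pattern as the grep above: slow requests aged 30.x to 59.x seconds.
SLOW_REQUEST = re.compile(r"slow request [3-5][0-9]\.")


def count_slow_requests(lines):
    """Count matching log lines per daemon (third field, like awk's $3)."""
    counts = Counter()
    for line in lines:
        if not SLOW_REQUEST.search(line):
            continue
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical log lines, shaped only to exercise the parser.
SAMPLE_LOG = [
    "2021-02-17 15:20:01.000001 osd.118 [WRN] slow request 30.412 seconds old",
    "2021-02-17 15:20:02.000002 osd.118 [WRN] slow request 45.120 seconds old",
    "2021-02-17 15:20:03.000003 osd.2301 [WRN] slow request 31.007 seconds old",
    "2021-02-17 15:20:04.000004 osd.17 [WRN] slow request 120.500 seconds old",
    "2021-02-17 15:20:05.000005 mon.a [INF] osdmap e100: 2500 osds",
]

counts = count_slow_requests(SAMPLE_LOG)

# Print in ascending order of count, like the trailing `sort -g`.
for osd, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} {osd}")
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`, for example `count_slow_requests(open("ceph.log"))`.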

We can also break the counts down by slow request type.

Displaying these metrics per OSD in the dashboard will help customers find where the actual issue resides, because identifying the largest offender is usually what resolves these issues. Being able to group these slow requests per bucket (host, rack) is also a large part of troubleshooting slow request issues.
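
The per-bucket grouping mentioned above could look roughly like the following sketch. The per-OSD counts are copied from the output earlier in this description; the OSD-to-host mapping is entirely hypothetical (in practice it would have to come from the cluster's own topology, for example the output of `ceph osd tree`), and `aggregate_by_bucket` is an illustrative helper name, not an existing Ceph function.

```python
from collections import Counter


def aggregate_by_bucket(osd_counts, osd_to_bucket):
    """Sum per-OSD slow request counts into per-bucket totals.

    OSDs missing from the mapping are grouped under "unknown".
    """
    totals = Counter()
    for osd, n in osd_counts.items():
        totals[osd_to_bucket.get(osd, "unknown")] += n
    return totals


# Counts taken from the breakdown shown earlier in this report.
osd_counts = {
    "osd.118": 9332,
    "osd.2301": 4228,
    "osd.1237": 1849,
    "osd.1629": 533,
}

# Hypothetical placement of OSDs on hosts; osd.1629 is left unmapped on purpose.
osd_to_host = {
    "osd.118": "host-a",
    "osd.1237": "host-a",
    "osd.2301": "host-b",
}

host_totals = aggregate_by_bucket(osd_counts, osd_to_host)

# Largest offender first.
for host, n in host_totals.most_common():
    print(f"{n:7d} {host}")
```

The same function works for racks by passing an OSD-to-rack mapping instead.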

Comment 16 Ken Dreyer (Red Hat) 2023-03-31 16:04:21 UTC
https://github.com/ceph/ceph/pull/49519 will be in v17.2.6
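
Once a build with that change is deployed, the per-daemon values could be pulled out of the mgr Prometheus exporter's text output with something like the sketch below. The metric name `ceph_daemon_health_metrics` and the `type` / `ceph_daemon` labels are assumptions based on the pull request title and should be checked against the actual exporter output. The sample text is hypothetical.

```python
import re

# Assumed metric and label names; verify against the real exporter output.
METRIC_NAME = "ceph_daemon_health_metrics"

# Matches one labelled sample line of the Prometheus text format: name{labels} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def slow_ops_by_daemon(text, metric_name=METRIC_NAME):
    """Return {daemon: value} for samples whose type label is SLOW_OPS."""
    result = {}
    for line in text.splitlines():
        m = SAMPLE_LINE.match(line.strip())
        if not m or m.group("name") != metric_name:
            continue
        labels = dict(LABEL.findall(m.group("labels")))
        if labels.get("type") == "SLOW_OPS":
            daemon = labels.get("ceph_daemon", "unknown")
            result[daemon] = float(m.group("value"))
    return result


# Hypothetical exporter output, for illustration only.
SAMPLE_METRICS = """\
# HELP ceph_daemon_health_metrics Health metrics for Ceph daemons
# TYPE ceph_daemon_health_metrics counter
ceph_daemon_health_metrics{type="SLOW_OPS",ceph_daemon="osd.0"} 0.0
ceph_daemon_health_metrics{type="SLOW_OPS",ceph_daemon="osd.1"} 3.0
ceph_daemon_health_metrics{type="PENDING_CREATING_PGS",ceph_daemon="osd.1"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 1.0
"""

slow_ops = slow_ops_by_daemon(SAMPLE_METRICS)
print(slow_ops)
```

A regular expression is used here only to keep the sketch dependency-free; a proper Prometheus client library or a PromQL query would be the more robust choice.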

Comment 33 errata-xmlrpc 2023-06-15 09:15:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

Comment 34 Red Hat Bugzilla 2023-10-14 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

