Bug 2230801 - Unable to get "dump_osd_network" output via mgr admin socket
Summary: Unable to get "dump_osd_network" output via mgr admin socket
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-10 06:32 UTC by Pawan
Modified: 2023-08-11 16:43 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-11 16:43:56 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-7188 (Last Updated: 2023-08-10 06:33:49 UTC)

Description Pawan 2023-08-10 06:32:05 UTC
Description of problem:
We observed slow heartbeats on the front and back interfaces of the OSDs, and health warnings were generated for them:

[WARN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 3477.396ms)
        Slow OSD heartbeats on back from osd.4 [] to osd.11 [] 3477.396 msec
        Slow OSD heartbeats on back from osd.4 [] to osd.8 [] 3470.730 msec
        Slow OSD heartbeats on back from osd.13 [] to osd.8 [] 3229.386 msec
        Slow OSD heartbeats on back from osd.7 [] to osd.11 [] 3019.577 msec
        Slow OSD heartbeats on back from osd.7 [] to osd.8 [] 3012.205 msec
        Slow OSD heartbeats on back from osd.7 [] to osd.2 [] 2450.898 msec
        Slow OSD heartbeats on back from osd.7 [] to osd.5 [] 2450.715 msec
        Slow OSD heartbeats on back from osd.7 [] to osd.14 [] 2436.617 msec
        Slow OSD heartbeats on back from osd.13 [] to osd.14 [] 1833.005 msec
        Slow OSD heartbeats on back from osd.13 [] to osd.2 [] 1832.006 msec
        Truncated long network list.  Use ceph daemon mgr.# dump_osd_network for more information
[WARN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 3019.537ms)
        Slow OSD heartbeats on front from osd.7 [] to osd.11 [] 3019.537 msec
        Slow OSD heartbeats on front from osd.7 [] to osd.8 [] 3014.470 msec
        Slow OSD heartbeats on front from osd.7 [] to osd.14 [] 2451.640 msec
        Slow OSD heartbeats on front from osd.7 [] to osd.5 [] 2450.600 msec
        Slow OSD heartbeats on front from osd.7 [] to osd.2 [] 2438.592 msec
        Slow OSD heartbeats on front from osd.13 [] to osd.2 [] 1826.537 msec
        Slow OSD heartbeats on front from osd.13 [] to osd.5 [] 1826.496 msec
        Slow OSD heartbeats on front from osd.13 [] to osd.11 [] 1820.281 msec
        Slow OSD heartbeats on front from osd.13 [] to osd.14 [] 1819.868 msec
        Slow OSD heartbeats on front from osd.13 [] to osd.8 [] 1816.324 msec
        Truncated long network list.  Use ceph daemon mgr.# dump_osd_network for more information
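(For completeness: the warning detail above can be re-gathered at any point with the standard health command; nothing cluster-specific is assumed here.)

# ceph health detail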

Following the suggestion in the health warning, we tried to get the "dump_osd_network" output from the mgr, but the command fails with "invalid command":

# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_osd_network
no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get <var>
config get <var>
config help [<var>]
config set <var> <val>...
admin_socket: invalid command
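As an extra sanity check from our side (not a step the referenced doc asks for), the generic "help" admin socket command lists everything a daemon's socket actually registers, so it can show whether "dump_osd_network" is exposed on this mgr socket at all:

# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok help | grep -i dump_osd_network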

We also ran the command as specified in the document referenced below, but with no luck:

# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_osd_network 0
no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get <var>
config get <var>
config help [<var>]
config set <var> <val>...
admin_socket: invalid command
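To rule out querying a standby mgr (purely our assumption; the guide does not mention this), the active mgr can be checked and compared against the mgr id in the socket path (argo012.odttqx here):

# ceph mgr stat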


I am connected to the correct admin socket, since I get output for other commands:
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_cache
{
    "cache": []
}

I tried the same command against an OSD admin socket, and there the command works:
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-osd.23.asok dump_osd_network
{
    "threshold": 1000,
    "entries": []
}
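Assuming the OSD variant accepts the same optional threshold argument that the guide documents for the mgr, passing 0 should also list entries below the current 1000 ms threshold (again only on the OSD socket, since that is where the command responds for us):

# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-osd.23.asok dump_osd_network 0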

The upstream guide says: "This command is usually sent to a Ceph Manager Daemon, but it can be used to collect information about a specific OSD’s interactions by sending it to that OSD." Even though the guide says it works on both the MGR and the OSDs, I am observing it working only on OSDs.

Reference: https://docs.ceph.com/en/latest/rados/operations/monitoring/#network-performance-checks


Questions:
1. Are we running the command with the wrong options? If so, can you please share the correct usage for the command?
2. Has support for this command on the mgr been removed? If so, the guides and the health warning text should be updated.

Version-Release number of selected component (if applicable):
# ceph version
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 6.1 cluster and upgrade it to 6.1z1.
2. Observe slow heartbeat warnings after the upgrade.
3. Run the "dump_osd_network" command against the mgr admin socket to get details.
4. The command fails with "invalid command".

Actual results:
The command errors out with "admin_socket: invalid command".

Expected results:
The command runs on the mgr admin socket as expected and returns the OSD network information.

Additional info:

