Bug 2230801
| Summary: | Unable to get "dump_osd_network" output via mgr admin socket | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Pawan <pdhiran> |
| Component: | RADOS | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| Status: | CLOSED NOTABUG | QA Contact: | Pawan <pdhiran> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | bhubbard, ceph-eng-bugs, cephqe-warriors, nojha, vumrao |
| Target Milestone: | --- | | |
| Target Release: | 7.1 | | |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-11 16:43:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description of problem:

We observed slow heartbeats on the front and back interfaces of the OSDs, and a health warning was generated for them:

```
[WARN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 3477.396ms)
    Slow OSD heartbeats on back from osd.4 [] to osd.11 [] 3477.396 msec
    Slow OSD heartbeats on back from osd.4 [] to osd.8 [] 3470.730 msec
    Slow OSD heartbeats on back from osd.13 [] to osd.8 [] 3229.386 msec
    Slow OSD heartbeats on back from osd.7 [] to osd.11 [] 3019.577 msec
    Slow OSD heartbeats on back from osd.7 [] to osd.8 [] 3012.205 msec
    Slow OSD heartbeats on back from osd.7 [] to osd.2 [] 2450.898 msec
    Slow OSD heartbeats on back from osd.7 [] to osd.5 [] 2450.715 msec
    Slow OSD heartbeats on back from osd.7 [] to osd.14 [] 2436.617 msec
    Slow OSD heartbeats on back from osd.13 [] to osd.14 [] 1833.005 msec
    Slow OSD heartbeats on back from osd.13 [] to osd.2 [] 1832.006 msec
    Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
[WARN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 3019.537ms)
    Slow OSD heartbeats on front from osd.7 [] to osd.11 [] 3019.537 msec
    Slow OSD heartbeats on front from osd.7 [] to osd.8 [] 3014.470 msec
    Slow OSD heartbeats on front from osd.7 [] to osd.14 [] 2451.640 msec
    Slow OSD heartbeats on front from osd.7 [] to osd.5 [] 2450.600 msec
    Slow OSD heartbeats on front from osd.7 [] to osd.2 [] 2438.592 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.2 [] 1826.537 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.5 [] 1826.496 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.11 [] 1820.281 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.14 [] 1819.868 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.8 [] 1816.324 msec
    Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
```

Following the hint in the health warning, we tried to get the "dump_osd_network" output as suggested, but the command fails with "invalid command":

```
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_osd_network
no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get <var>
config get <var>
config help [<var>]
config set <var> <val>...
admin_socket: invalid command
```

We also ran the command as specified in the document referenced below, but with no luck:

```
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_osd_network 0
no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get <var>
config get <var>
config help [<var>]
config set <var> <val>...
admin_socket: invalid command
```

I am connected to the correct admin socket, as I am getting output for other commands:

```
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok dump_cache
{
    "cache": []
}
```

Trying the same command against an OSD admin socket works:

```
# ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-osd.23.asok dump_osd_network
{
    "threshold": 1000,
    "entries": []
}
```

The upstream guide says: "This command is usually sent to a Ceph Manager Daemon, but it can be used to collect information about a specific OSD's interactions by sending it to that OSD." Even though the guide says the command works on both the MGR and the OSDs, I am observing it to work only on OSDs.

Reference: https://docs.ceph.com/en/latest/rados/operations/monitoring/#network-performance-checks
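To check whether the command is registered on the mgr socket at all (as opposed to being mis-invoked), the admin socket's built-in `help` command lists every command the socket accepts. A minimal diagnostic sketch, assuming the same socket paths as in the outputs above:

```sh
# List every command registered on this mgr admin socket; if
# dump_osd_network does not appear, the socket does not register it.
ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-mgr.argo012.odttqx.asok help

# The same check against an OSD socket, where the command is known to work:
ceph daemon /var/run/ceph/66070a80-2f84-11ee-bc2c-0cc47af3ea56/ceph-osd.23.asok help | grep -i dump_osd_network
```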
Questions:
1. Are we running the command with the wrong options? If yes, can you please share the correct usage for the command?
2. Has support for this command on the mgr been revoked? If yes, the guides and the health warning text should be updated.

Version-Release number of selected component (if applicable):

```
# ceph version
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
```

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 6.1 cluster and upgrade it to 6.1z1.
2. Observe slow heartbeat warnings post upgrade.
3. Try running the "dump_osd_network" command against the mgr admin socket to get details.
4. The command errors out.

Actual results:
The command errors out stating "invalid command".

Expected results:
The command runs as expected and returns the OSD network report.

Additional info:
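One possible workaround, given that the command does answer on the per-OSD sockets: gather the report host by host by iterating over the OSD admin sockets. This is a sketch under the assumption that the per-OSD socket naming matches the paths shown above; a threshold of 0 requests all tracked ping times rather than only those above the default 1000 ms, per the referenced upstream doc.

```sh
#!/bin/sh
# Workaround sketch: collect dump_osd_network from every OSD admin
# socket on this host, since the mgr socket rejects the command here.
# FSID matches the cluster fsid in the socket paths shown above.
FSID=66070a80-2f84-11ee-bc2c-0cc47af3ea56

for sock in /var/run/ceph/"$FSID"/ceph-osd.*.asok; do
    echo "== $sock =="
    # Threshold of 0 ms dumps all tracked ping times, not only the
    # ones exceeding the default 1000 ms warning threshold.
    ceph daemon "$sock" dump_osd_network 0
done
```

If per-host socket access is inconvenient, `ceph tell osd.<id> dump_osd_network` may be an alternative, since recent Ceph releases expose many admin-socket commands via tell as well; treat that as an assumption to verify on this build.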