Bug 1535634 - [RFE] (Ceph) Better method besides heartbeats to differentiate between network and performance issues
Summary: [RFE] (Ceph) Better method besides heartbeats to differentiate between networ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 4.1
Assignee: David Zafman
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On:
Blocks: 1816167
Reported: 2018-01-17 19:08 UTC by Benjamin Schmaus
Modified: 2023-12-15 16:01 UTC (History)

Fixed In Version: 14.2.8-50.el7cp and 14.2.8-59.el8cp
Doc Type: Enhancement
Doc Text:
.Update to use ping times to track network performance
Previously, when network problems occurred, they were difficult to distinguish from other performance issues. With this release, a health warning is generated if the average {storage-product} OSD heartbeat ping time exceeds a configurable threshold for any of the computed intervals. The {storage-product} OSD computes 1-minute, 5-minute, and 15-minute intervals with the average, minimum, and maximum values.
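The interval logic described in the Doc Text can be illustrated with a minimal Python sketch. It is not the Ceph implementation: the `PingTracker` class, the `slow_intervals` helper, and the threshold value are hypothetical names chosen for illustration, and the only behavior taken from the Doc Text is the 1, 5, and 15 minute intervals with average, minimum, and maximum values and the warning when an average exceeds a configurable threshold.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple

# 1, 5 and 15 minute intervals, expressed in seconds.
WINDOWS_SEC = (60, 300, 900)


@dataclass
class IntervalStats:
    avg_ms: float
    min_ms: float
    max_ms: float


class PingTracker:
    """Hypothetical helper: keeps recent heartbeat ping times for one peer."""

    def __init__(self) -> None:
        # Each sample is (timestamp in seconds, ping time in milliseconds).
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, timestamp: float, ping_ms: float) -> None:
        self._samples.append((timestamp, ping_ms))
        # Drop samples that are older than the largest interval.
        cutoff = timestamp - max(WINDOWS_SEC)
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def stats(self, now: float) -> Dict[int, Optional[IntervalStats]]:
        """Return average/min/max per interval, or None if it has no samples."""
        result: Dict[int, Optional[IntervalStats]] = {}
        for window in WINDOWS_SEC:
            values = [ms for ts, ms in self._samples if now - window < ts <= now]
            if values:
                result[window] = IntervalStats(
                    avg_ms=sum(values) / len(values),
                    min_ms=min(values),
                    max_ms=max(values),
                )
            else:
                result[window] = None
        return result


def slow_intervals(
    stats: Dict[int, Optional[IntervalStats]], threshold_ms: float
) -> List[int]:
    """Intervals whose average ping time exceeds the (assumed) threshold."""
    return [
        window
        for window, s in stats.items()
        if s is not None and s.avg_ms > threshold_ms
    ]


if __name__ == "__main__":
    tracker = PingTracker()
    tracker.record(100.0, 10.0)
    tracker.record(700.0, 10.0)
    tracker.record(880.0, 2000.0)  # one very slow heartbeat in the last minute
    current = tracker.stats(now=900.0)
    # A non-empty list here would correspond to raising a health warning.
    print(slow_intervals(current, threshold_ms=1000.0))
```

In this sketch a single slow heartbeat pushes the 1-minute and 5-minute averages over the threshold while the 15-minute average stays below it, which suggests why tracking several intervals helps separate a short network spike from a sustained problem.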
Clone Of:
Environment:
Last Closed: 2020-09-30 17:46:08 UTC
Embargoed:




Links
System                     ID           Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker   40640        0        None      None    None     2019-08-28 20:49:02 UTC
Red Hat Issue Tracker      RHCEPH-1537  0        None      None    None     2021-09-09 13:04:26 UTC

Description Benjamin Schmaus 2018-01-17 19:08:55 UTC
Description of problem: The customer would like a better method for determining whether an issue lies in the networking infrastructure or in the Ceph application itself.

For example, the customer had a large Ceph cluster to which millions of objects were continuously being added. During this process, OSDs would hang. Initial findings from engineering indicated this was probably a networking issue. However, when the customer reduced the number of objects, the hang behaviour went away or was reduced, showing that the networking infrastructure was sound.

A method to better distinguish Ceph-related issues from external infrastructure issues would help focus troubleshooting in complex cases.


Version-Release number of selected component (if applicable):
2.4

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
NA

Expected results:
NA

Additional info:

Comment 10 Giridhar Ramaraju 2019-08-05 13:09:12 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 11 Giridhar Ramaraju 2019-08-05 13:10:31 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 12 Josh Durgin 2020-04-22 20:08:50 UTC
This is in all 5.0 builds - needs qa ack.

Comment 13 Josh Durgin 2020-06-17 15:27:15 UTC
Not sure why the bot didn't change this, but it has all acks and is in all builds, so it should be ON_QA.

