Bug 1535634

Summary: [RFE] (Ceph) Better method besides heartbeats to differentiate between network and performance issues
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Benjamin Schmaus <bschmaus>
Component: RADOS Assignee: David Zafman <dzafman>
Status: CLOSED CURRENTRELEASE QA Contact: Manohar Murthy <mmurthy>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.4 CC: asakthiv, asriram, assingh, bhubbard, ceph-eng-bugs, ceph-qe-bugs, dzafman, hklein, jdurgin, kchai, mhackett, pasik, tchandra, vumrao
Target Milestone: rc Keywords: FutureFeature
Target Release: 4.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 14.2.8-50.el7cp and 14.2.8-59.el8cp Doc Type: Enhancement
Doc Text:
.Update to use ping times to track network performance
Previously, when network problems occurred, they were difficult to distinguish from other performance issues. With this release, a health warning is generated if the average {storage-product} OSD heartbeat ping time exceeds a configurable threshold for any of the computed intervals. The {storage-product} OSD computes 1 minute, 5 minute, and 15 minute intervals with the average, minimum, and maximum values.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-30 17:46:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1816167    
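
The behaviour described in the Doc Text above can be illustrated with a minimal sketch. This is not the Ceph implementation; the class name `PingTracker`, the `threshold_ms` parameter, and the `WINDOWS` table are hypothetical names chosen for illustration. It only shows the general idea of keeping heartbeat ping samples, computing average, minimum, and maximum over 1, 5, and 15 minute windows, and flagging any window whose average exceeds a configurable threshold.

```python
from collections import deque
from statistics import mean

# Hypothetical window lengths in seconds (1, 5 and 15 minutes, as in the Doc Text).
WINDOWS = {"1min": 60, "5min": 300, "15min": 900}


class PingTracker:
    """Keeps recent heartbeat ping times for one peer (illustrative only)."""

    def __init__(self, threshold_ms):
        self.threshold_ms = threshold_ms  # hypothetical configurable threshold
        self.samples = deque()            # (timestamp_s, ping_ms) pairs

    def record(self, timestamp_s, ping_ms):
        self.samples.append((timestamp_s, ping_ms))
        # Drop samples that fall outside the longest window.
        horizon = timestamp_s - max(WINDOWS.values())
        while self.samples and self.samples[0][0] <= horizon:
            self.samples.popleft()

    def stats(self, now_s):
        """Return {window: (avg, min, max)} for every window that has samples."""
        result = {}
        for name, length in WINDOWS.items():
            values = [ms for ts, ms in self.samples if now_s - length < ts <= now_s]
            if values:
                result[name] = (mean(values), min(values), max(values))
        return result

    def slow_windows(self, now_s):
        """Names of the windows whose average ping time exceeds the threshold."""
        return [name for name, (avg, _, _) in self.stats(now_s).items()
                if avg > self.threshold_ms]
```

In this sketch a single slow ping trips the 1 minute window first, while the 5 and 15 minute averages absorb it. That is one reason to report several intervals: a short spike and a sustained slowdown look different.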

Description Benjamin Schmaus 2018-01-17 19:08:55 UTC
Description of problem: The customer has indicated they would like a better method for determining whether an issue lies in the networking infrastructure or in the Ceph application itself.

For example, the customer had a large Ceph cluster to which millions of objects were continuously being added. During this process, OSDs would hang. Initial findings from engineering indicated this was probably a networking issue. However, when the customer reduced the number of objects, the hang behaviour went away or was reduced, which showed that the networking infrastructure was sound.

Having a method to better distinguish Ceph-related issues from external infrastructure issues would help focus troubleshooting in a complex case.


Version-Release number of selected component (if applicable):
2.4

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
NA

Expected results:
NA

Additional info:

Comment 10 Giridhar Ramaraju 2019-08-05 13:09:12 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 11 Giridhar Ramaraju 2019-08-05 13:10:31 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 12 Josh Durgin 2020-04-22 20:08:50 UTC
This is in all 5.0 builds - needs qa ack.

Comment 13 Josh Durgin 2020-06-17 15:27:15 UTC
Not sure why the bot didn't change this, but it has all acks and is in all builds, so it should be ON_QA.