Bug 1255877 - [RFE] Provide more thorough live cluster statistics via a new command or within the 'ceph -s' and 'ceph osd pool stats' commands
Status: NEW
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 3.*
Assigned To: Josh Durgin
QA Contact: ceph-qe-bugs
Keywords: FutureFeature, Triaged
Depends On:
Blocks: 1258382 1319075
 
Reported: 2015-08-21 14:33 EDT by Kyle Squizzato
Modified: 2018-05-09 07:27 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker                            Tracker ID  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  2898771     None      None    None     2017-02-02 09:11 EST

Description Kyle Squizzato 2015-08-21 14:33:50 EDT
1. Proposed title of this feature request 
-------------------------------
Provide more thorough live cluster statistics via a new command or within the 'ceph -s' and 'ceph osd pool stats' commands
    
2. Who is the customer behind the request? 
------------------------------- 
Account: Cisco Cloud Services - Account #5563522
  
TAM customer: yes  
SRM customer: yes
Strategic: yes 
  
3. What is the nature and description of the request?
-------------------------------
To provide statistics for live operations across the cluster.  
  
4. Why does the customer need this? (List the business requirements here)
-------------------------------
From the customer's update:

We have a few clusters that appear to be running into an IOPS capacity issue, such that tenant performance is impacted due to IOPS overload.
We are trying to implement a solution to monitor and measure cluster IOPS against a threshold for capacity expansion decisions. However, IOPS depends on many aspects (block size, queue depth, read/write mix, random vs. sequential workload, etc.), and measuring and defining total cluster IOPS capacity and usage would require normalizing the workload pattern.  The current "ceph -s" report on client ops does not show any details of block size, read/write mix, etc., hence it is difficult to predict.

I am wondering if RHCS has a formula or algorithm to 1) determine the cluster IOPS capacity based on hardware and Ceph configuration, and 2) measure IOPS consumption and predict the usage trend.
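
As a rough illustration of point 1), the only estimate possible today is a back-of-envelope calculation from device counts and an assumed per-device IOPS figure, along the lines of the sketch below (the per-device number, replication handling, and write-amplification factor are my own assumptions, not an RHCS formula). The customer wants something measured and normalized rather than estimated like this:

# Rough, illustrative estimate of raw cluster IOPS capacity.
# NOT an official RHCS formula; the per-device IOPS figure and the
# replication / write-amplification handling are assumptions.

def estimate_cluster_iops(num_osds, per_device_iops, replication=3,
                          write_amplification=2.0, read_fraction=0.7):
    """Very rough 4K random IOPS estimate for a replicated cluster.

    per_device_iops: measured 4K random IOPS of one backing device
                     (e.g. from fio or iostat against an idle disk).
    write_amplification: extra device writes per client write
                         (journal double-write, metadata, etc.).
    """
    raw = num_osds * per_device_iops
    # Reads are served from one replica; writes hit every replica
    # and pay the write-amplification penalty.
    read_capacity = raw
    write_capacity = raw / (replication * write_amplification)
    # Blend by the expected read/write mix of the workload.
    blended = 1.0 / (read_fraction / read_capacity +
                     (1.0 - read_fraction) / write_capacity)
    return {"read_only": read_capacity,
            "write_only": write_capacity,
            "blended": blended}

# Example: 120 OSDs on 7.2k-RPM HDDs (~150 IOPS each), 3x replication.
print(estimate_cluster_iops(num_osds=120, per_device_iops=150))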
  
5. How would the customer like to achieve this? (List the functional requirements here) 
-------------------------------
The customer would like an update to the outputs provided by 'ceph -s' or 'ceph osd pool stats' to include more thorough results regarding the live ops occurring across the cluster, specifically: 

 * Block size
 * IOPS
 * Read/write mix
 * Random vs Sequential workload

Some of the aforementioned commands provide some of this functionality, but not to the level the customer is expecting.  They already utilize the various 'bench' commands in their environment, as well as host-based toolsets such as 'dd' or 'iostat', to measure disk performance. 

I believe utilizing the functionality baked into the admin sockets' perf dumps would be a good start.  If there is some way we can take this data across all the OSDs and MONs in the cluster and output it in a meaningful, readable way, I believe that could satisfy this request.
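
For illustration only, a minimal sketch of the kind of aggregation I have in mind is below, assuming it runs on each OSD host and that the 'osd' section of 'perf dump' exposes op_r/op_w/op_in_bytes/op_out_bytes counters (counter names differ between releases, so treat them as placeholders and check 'perf schema' first):

#!/usr/bin/env python
# Minimal sketch: roll per-OSD "perf dump" counters from the admin
# sockets on one host into a read/write mix and average op size.
# Counter names (op_r, op_w, op_in_bytes, op_out_bytes) are assumptions
# and vary between Ceph releases; adjust to what `perf schema` reports.
import glob
import json
import subprocess

totals = {"op_r": 0, "op_w": 0, "op_in_bytes": 0, "op_out_bytes": 0}

# Admin sockets live under /var/run/ceph/ on each OSD host.
for sock in glob.glob("/var/run/ceph/ceph-osd.*.asok"):
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "perf", "dump"])
    osd = json.loads(out).get("osd", {})
    for key in totals:
        totals[key] += osd.get(key, 0)

reads, writes = totals["op_r"], totals["op_w"]
ops = reads + writes
if ops:
    print("read/write mix: %.1f%% reads, %.1f%% writes"
          % (100.0 * reads / ops, 100.0 * writes / ops))
if writes:
    print("avg write size: %.1f KiB" % (totals["op_in_bytes"] / float(writes) / 1024))
if reads:
    print("avg read size:  %.1f KiB" % (totals["op_out_bytes"] / float(reads) / 1024))

Note these counters are cumulative since each daemon started, so to get live rates the script would need to sample twice and diff over the interval, and the per-host results would still need to be gathered centrally across the cluster.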

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
-------------------------------
They can run the tool across their environment and see if the results line up with what they're expecting or seeing in other host-based tools.  
  
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  
-------------------------------
Not that I can see.  There's a variety of community offerings for monitoring solutions, but they all seem to piggyback off of either the bench commands or the pool stats and then average that data.  They don't provide what the customer is looking for.
  
8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
-------------------------------
In regards to timeline, there isn't a specific release they'd like to target.  However, to communicate the importance of this request, the customer has stated that this gap is preventing them from predicting usage trends and planning cluster expansion to meet growing demand. 

9. Is the sales team involved in this request and do they have any additional input?  
-------------------------------
No
  
10. List any affected packages or components.  
-------------------------------  
ceph-common

11. Would the customer be able to assist in testing this functionality if implemented?  
-------------------------------
Yes
Comment 2 Vikhyat Umrao 2015-08-25 09:49:16 EDT
(In reply to Kyle Squizzato from comment #0)

> 4. Why does the customer need this? (List the business requirements here)
> -------------------------------
> From the customer's update:
> 
> We have a few clusters that appear to be running into an IOPS capacity
> issue, such that tenant performance is impacted due to IOPS overload.
> We are trying to implement a solution to monitor and measure cluster
> IOPS against a threshold for capacity expansion decisions. However, IOPS
> depends on many aspects (block size, queue depth, read/write mix, random
> vs. sequential workload, etc.), and measuring and defining total cluster
> IOPS capacity and usage would require normalizing the workload pattern.
> The current "ceph -s" report on client ops does not show any details of
> block size, read/write mix, etc., hence it is difficult to predict. 
> 

Thanks, Kyle, for the detailed description.  

We do have the Calamari GUI and REST APIs (http://ceph.com/calamari/docs/calamari_rest/index.html).
Calamari is capable of showing system-, disk-, and network-related statistics for any node in the cluster, as long as that node is attached to Calamari.

1. System covers:
- CPU summary 
- Load Average 
- Memory 
- All CPU Detail for that node

2. Disk
- Bytes
- IOPS
- RW Await
- Capacity (Disk Space)
- Inodes

3. Network
- Bytes
- Packets (Network TX/RX Packets,  Network TX/RX Errors)

All of the above data is presented as graphs. If this is okay with Cisco, they can use the Calamari GUI, or they can develop their own application using the Calamari REST APIs (we will cross-verify with the Calamari team).

Or does Cisco want all of these details via ceph commands?
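
If Cisco prefers to script against the REST APIs rather than use the GUI, the idea would be roughly as below. The host, credentials, endpoint path, and field names are placeholders based loosely on the docs linked above, so please verify them against the actual API reference (we will also cross-check with the Calamari team):

# Sketch only: pull per-OSD status from the Calamari REST API so Cisco
# can feed it into their own monitoring. Host, auth, endpoint path and
# field names below are placeholders; verify against the API reference.
import requests

CALAMARI = "http://calamari.example.com"       # placeholder host
FSID = "00000000-0000-0000-0000-000000000000"  # placeholder cluster fsid

session = requests.Session()
session.auth = ("admin", "secret")             # adjust auth to your setup

resp = session.get("%s/api/v2/cluster/%s/osd" % (CALAMARI, FSID))
resp.raise_for_status()

for osd in resp.json():
    # Field names are illustrative; inspect the real payload first.
    print(osd.get("id"), osd.get("up"), osd.get("in"))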
Comment 3 Kyle Squizzato 2015-08-25 13:21:36 EDT
Hi Vikhyat, 

I think the current issue isn't that Cisco is unwilling to use tools like Calamari or other built-in options; it's that those monitoring offerings aren't robust enough for what they need.  Cisco was specifically looking for information about the read/write mix across the cluster and specific details on workload, i.e. sequential vs. random writes, etc.

I'll ask them to provide a more thorough list of what they're looking for.
Comment 4 Yuming Ma 2015-10-23 12:28:42 EDT
What we are looking for is a way to measure 1) storage performance capacity in terms of IOPS and bandwidth, and 2) run-time performance consumption/utilization in terms of IOPS and bandwidth, such that we can tell how much of the performance (IOPS/BW) is being used, how much is still available, and when to expand the cluster to sustain demand. 

The problem with Ceph right now is that it does not have a normalized way to measure and report performance capacity and utilization, hence we cannot tell how much is being used and how much is available. A good example of such measurement is SolidFire, which normalizes all performance measurements to a 4 KB IO size.
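
Roughly what I mean by normalization, as a toy illustration only (my own simplification, not SolidFire's or Ceph's actual method): scale measured IOPS by the ratio of the observed average IO size to 4 KB, so that workloads with different block sizes become comparable.

# Toy sketch of 4 KB-normalized IOPS; my own simplification, not an
# official formula from SolidFire or Ceph.

def normalized_iops(measured_iops, avg_io_size_bytes, base=4096):
    """One 64 KiB op counts the same as sixteen 4 KiB ops."""
    return measured_iops * (avg_io_size_bytes / float(base))

# Example: 2,000 ops/s at an average 64 KiB block size is roughly
# 32,000 4 KiB-normalized ops/s.
print(normalized_iops(2000, 64 * 1024))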
