Bug 1261700

Summary: RFE : Feature: Periodic FOP statistics dumps for v3.6.x/v3.7.x
Product: [Community] GlusterFS Reporter: rwareing
Component: core Assignee: rwareing
Status: CLOSED NEXTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 3.6.6 CC: asengupt, bengland, bugs, kaushal, sshreyas
Target Milestone: --- Keywords: FutureFeature, Triaged
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: glusterfs-3.8.0 Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
: 1266476 Environment:
Last Closed: 2016-08-23 12:34:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1266476    
Attachments:
Description Flags
Patch for stats dump code. none
Example output for the dumps (nfsd) none

Description rwareing 2015-09-10 02:13:39 UTC
Created attachment 1071977
Patch for stats dump code.

Description of problem:
Patch to add periodic JSON dumps of FOP latency and hit-rate statistics from the io-stats translator. Dumps are controlled by the diagnostics.stats-dump-interval <dump interval sec> option and are stored in /var/lib/glusterd/stats, under the FUSE, gNFSd or brick instance that produced them.
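
For anyone consuming the dumps, a minimal Python sketch of collecting them might look like the following. It assumes only that each file under the stats directory holds a single JSON document. The key names inside the dumps are not assumed, and flatten_numeric and read_dumps are hypothetical helper names, not part of the patch.

```python
import json
from pathlib import Path


def flatten_numeric(obj, prefix=""):
    """Yield (dotted_key, number) pairs from arbitrarily nested JSON data."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten_numeric(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten_numeric(value, f"{prefix}.{index}" if prefix else str(index))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        # Booleans are ints in Python, so they are excluded explicitly.
        yield prefix, obj


def read_dumps(stats_dir):
    """Map each parseable dump file under stats_dir to its numeric metrics."""
    metrics = {}
    for path in sorted(Path(stats_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            # Skip unreadable files and files that are not (yet) valid JSON,
            # for example a dump that is still being written.
            continue
        metrics[str(path)] = dict(flatten_numeric(data))
    return metrics
```

Calling read_dumps("/var/lib/glusterd/stats") would then return one dictionary of numeric values per dump file, ready to be forwarded to whatever analytics backend is in use.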

This is immensely useful for reliably extracting diagnostics and performance metrics from GlusterFS and feeding them into an analytics backend for later analysis or alerting. It is in heavy use here at Facebook.

The patch applies cleanly to the release-3.6 and release-3.7 branches as of this bug's creation.

Version-Release number of selected component (if applicable):
v3.6.x or v3.7.x; should be trivial to port to master.

How reproducible:
100%

Steps to Reproduce:
N/A

Actual results:


Expected results:


Additional info:

Comment 1 Ben England 2015-09-22 20:55:47 UTC
Richard, 

This is an extremely good idea. I have had to parse gluster volume profile output and it is extremely hard to do. JSON would make it much easier. Also, the io-stats translator can run client-side, so you get client-side latency, not server-side. It would be great if /usr/sbin/gluster could initiate the profiling so we didn't have to edit a volfile.

Can you provide an attachment with JSON output from the patch so that lazy folks like me can see what it looks like?

thx

-Ben England, Perf. Engr., Red Hat

Comment 2 rwareing 2015-09-22 21:25:55 UTC
Created attachment 1076043
Example output for the dumps (nfsd)

Comment 3 rwareing 2015-09-22 21:34:08 UTC
Added example output.  Also, this is automatically engaged when either of these options is enabled:

diagnostics.latency-measurement
diagnostics.count-fop-hits

and 

diagnostics.ios-dump-interval 

...is set to something non-zero.
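
Read as a predicate, the condition above could be sketched like this in Python. This only illustrates the rule as stated in this comment and is not code from the patch: dumps_engaged and the options dictionary are hypothetical, and the set of strings treated as "enabled" is an assumption.

```python
def dumps_engaged(options):
    """Return True when periodic dumps would be active under the stated rule.

    `options` is a hypothetical dict mapping volume option names to string
    values, e.g. {"diagnostics.latency-measurement": "on"}.
    """
    def is_on(value):
        # Assumed spellings for an enabled boolean option.
        return str(value).strip().lower() in {"on", "yes", "true", "enable", "1"}

    # Either measurement option is enough...
    measuring = (
        is_on(options.get("diagnostics.latency-measurement", "off"))
        or is_on(options.get("diagnostics.count-fop-hits", "off"))
    )
    # ...but the dump interval must also be non-zero.
    interval = int(options.get("diagnostics.ios-dump-interval", 0))
    return measuring and interval != 0
```

For example, enabling diagnostics.count-fop-hits alone returns False until diagnostics.ios-dump-interval is also given a non-zero value.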

We run with these enabled 24x7 on all clusters, and as you have noted, it's extremely powerful to be able to look at performance from all layers of the stack (FUSE client, gNFSd and bricks). And with lockless counters (also in this patch), we haven't observed any performance hit.

Comment 4 Avra Sengupta 2015-09-25 11:23:19 UTC
Cloning this bug to master.

Comment 5 Kaushal 2016-08-23 12:34:51 UTC
This bug is being closed as GlusterFS-3.6 is nearing its End-Of-Life and only important security bugs will be fixed. This bug has been fixed in more recent GlusterFS releases. If you still face this bug with the newer GlusterFS versions, please open a new bug.