Bug 1271310

Summary: RFE : Feature: Tunable FOP sampling for v3.6.x/v3.7.x
Product: [Community] GlusterFS
Reporter: Jeff Darcy <jdarcy>
Component: core
Assignee: bugs <bugs>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: mainline
CC: bugs, kkeithle, rwareing, skoduri, sshreyas
Target Milestone: ---
Keywords: FutureFeature, Reopened, Triaged
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: glusterfs-3.8rc2
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: 1262092
Environment:
Last Closed: 2016-06-16 13:39:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1262092
Bug Blocks:

Description Jeff Darcy 2015-10-13 15:24:02 UTC
Cloning for master.

+++ This bug was initially created as a clone of Bug #1262092 +++

Description of problem:

debug/io-stats: FOP sampling feature

- Using the sampling feature you can record details about every Nth FOP.
  The fields in each sample are: FOP type, hostname, uid, gid, FOP priority,
  port and time taken (latency) to fulfill the request.
- Implemented using a ring buffer which is not (m/c)allocated in the IO path;
  this should make the sampling process pretty cheap.
- DNS resolution is done at dump time, not at sample time, for performance,
  and goes through a cache.
- Metrics can be used for diagnostics, traffic/IO profiling, and
  P95/P99 calculations.
- To control this feature there are two new volume options:
  diagnostics.fop-sample-interval - The sampling interval, e.g. 1 means
  sample every FOP, 100 means sample every 100th FOP.
  diagnostics.fop-sample-buf-size - The size (in bytes) of the ring
  buffer used to store the samples.  In the event that more samples
  are collected in the stats dump interval than can be held in this buffer,
  the oldest samples are discarded.  Samples are stored in the log
  directory under /var/log/glusterfs/samples.
- Uses the DNS cache written by sshreyas (thank you!). The DNS cache
  TTL is controlled by the diagnostics.stats-dnscache-ttl-sec option
  and defaults to 24 hours.
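
As an illustration of the sampling scheme described above, here is a minimal Python sketch of "keep every Nth FOP in a fixed ring buffer, overwrite the oldest when full". It is not the io-stats implementation (which is C). The class name FopSampleRing, its method names, and the tuple layout are hypothetical, and the capacity is counted in samples rather than bytes for simplicity.

```python
class FopSampleRing:
    """Fixed-capacity ring buffer that keeps every Nth FOP sample.

    Hypothetical sketch: names and sample layout are assumptions,
    not taken from the io-stats translator.
    """

    def __init__(self, sample_interval, capacity):
        if sample_interval < 1 or capacity < 1:
            raise ValueError("sample_interval and capacity must be >= 1")
        self.sample_interval = sample_interval
        self.slots = [None] * capacity  # slot array is allocated once, up front
        self.head = 0                   # index of the next slot to write
        self.count = 0                  # number of valid samples currently held
        self.fop_counter = 0            # FOPs seen so far

    def record(self, fop_type, uid, gid, priority, addr, port, latency_us):
        """Call once per FOP; stores only every Nth call. Returns True if stored."""
        self.fop_counter += 1
        if self.fop_counter % self.sample_interval != 0:
            return False
        # The client address is stored as-is; hostname resolution is left
        # for dump time, as the description states.
        self.slots[self.head] = (fop_type, uid, gid, priority, addr, port, latency_us)
        self.head = (self.head + 1) % len(self.slots)  # wraps, overwriting the oldest
        self.count = min(self.count + 1, len(self.slots))
        return True

    def drain(self):
        """Return the held samples oldest-first and empty the buffer (dump time)."""
        cap = len(self.slots)
        start = (self.head - self.count) % cap
        out = [self.slots[(start + i) % cap] for i in range(self.count)]
        self.count = 0
        return out


# Ten FOPs, sample every 2nd, room for 3 samples: the two oldest are dropped.
ring = FopSampleRing(sample_interval=2, capacity=3)
for latency in range(1, 11):
    ring.record("WRITE", 0, 0, 0, "10.0.0.1", 49152, latency)
print([sample[-1] for sample in ring.drain()])  # prints [6, 8, 10]
```

Unlike the C ring buffer, this Python version still allocates a tuple per stored sample; only the slot array is preallocated.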

Pre-requisite: Requires stats dump patch from bug 1261700 to function.
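
The dump-time DNS resolution with a TTL-bounded cache, mentioned in the list above, could look roughly like the following Python sketch. This is not sshreyas's cache: the class DnsCache, its parameters, and the injectable resolver and clock are hypothetical. Only the 24-hour default TTL comes from the description.

```python
import socket
import time


class DnsCache:
    """Reverse-lookup cache with a TTL (hypothetical sketch)."""

    def __init__(self, ttl_sec=24 * 60 * 60, resolver=None, clock=time.monotonic):
        self.ttl_sec = ttl_sec
        self.resolver = resolver or self._reverse_lookup
        self.clock = clock
        self.entries = {}  # addr -> (hostname, expiry time)

    @staticmethod
    def _reverse_lookup(addr):
        # Fall back to the raw address when no hostname can be resolved.
        try:
            return socket.gethostbyaddr(addr)[0]
        except OSError:
            return addr

    def lookup(self, addr):
        """Return the cached hostname, resolving only on a miss or after expiry."""
        now = self.clock()
        hit = self.entries.get(addr)
        if hit is not None and hit[1] > now:
            return hit[0]
        name = self.resolver(addr)
        self.entries[addr] = (name, now + self.ttl_sec)
        return name
```

At dump time each sample's client address would go through lookup() once, so repeated addresses within the TTL cost a dictionary access instead of a DNS query.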

Version-Release number of selected component (if applicable):
3.6.x, 3.7.x

How reproducible:
100%

Steps to Reproduce:
n/a

Actual results:
n/a

Expected results:
n/a

Additional info:
n/a

--- Additional comment from Vijay Bellur on 2015-10-09 14:17:24 EDT ---

REVIEW: http://review.gluster.org/12210 (debug/io-stats: Add FOP sampling feature) posted (#3) for review on master by Vijay Bellur (vbellur)

Comment 1 Vijay Bellur 2015-10-13 15:38:28 UTC
REVIEW: http://review.gluster.org/12210 (debug/io-stats: Add FOP sampling feature) posted (#4) for review on master by Jeff Darcy (jdarcy)

Comment 2 Vijay Bellur 2015-10-19 16:13:49 UTC
REVIEW: http://review.gluster.org/12210 (debug/io-stats: Add FOP sampling feature) posted (#5) for review on master by Jeff Darcy (jdarcy)

Comment 3 Vijay Bellur 2015-11-01 17:14:38 UTC
COMMIT: http://review.gluster.org/12210 committed in master by Vijay Bellur (vbellur) 
------
commit d3e496cbcd35b9d9b840e328ae109c44f59083ce
Author: Richard Wareing <rwareing>
Date:   Tue Jun 23 17:03:11 2015 -0700

    debug/io-stats: Add FOP sampling feature
    
    Summary:
    - Using the sampling feature you can record details about every Nth FOP.
      The fields in each sample are: FOP type, hostname, uid, gid, FOP priority,
      port and time taken (latency) to fulfill the request.
    - Implemented using a ring buffer which is not (m/c)allocated in the IO path;
      this should make the sampling process pretty cheap.
    - DNS resolution is done at dump time, not at sample time, for performance,
      and goes through a cache.
    - Metrics can be used for diagnostics, traffic/IO profiling, and
      P95/P99 calculations.
    - To control this feature there are two new volume options:
      diagnostics.fop-sample-interval - The sampling interval, e.g. 1 means
      sample every FOP, 100 means sample every 100th FOP.
      diagnostics.fop-sample-buf-size - The size (in bytes) of the ring
      buffer used to store the samples.  In the event that more samples
      are collected in the stats dump interval than can be held in this buffer,
      the oldest samples are discarded.  Samples are stored in the log
      directory under /var/log/glusterfs/samples.
    - Uses the DNS cache written by sshreyas (thank you!). The DNS cache
      TTL is controlled by the diagnostics.stats-dnscache-ttl-sec option
      and defaults to 24 hours.
    
    Test Plan:
    - Valgrind'd to ensure it's leak free
    - Run prove test(s)
    - Shadow testing on 100+ brick cluster
    
    Change-Id: I9ee14c2fa18486b7efb38e59f70687249d3f96d8
    BUG: 1271310
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/12210
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
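
The commit summary above lists P95/P99 calculations as one use of the sampled latencies. A minimal Python sketch of that calculation, using the nearest-rank method, follows. The function name percentile and the choice of nearest-rank are assumptions for illustration; the commit does not say how percentiles are to be computed.

```python
def percentile(latencies, pct):
    """Nearest-rank percentile of a list of latencies.

    pct is an integer in 1..100. Hypothetical helper, not part of io-stats.
    """
    if not latencies:
        raise ValueError("no samples")
    if not isinstance(pct, int) or not 1 <= pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(latencies)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


latencies_us = list(range(1, 101))  # 100 samples: 1, 2, ..., 100
print(percentile(latencies_us, 95), percentile(latencies_us, 99))  # prints 95 99
```

Because only every Nth FOP is sampled, percentiles computed this way are estimates over the sampled FOPs, not exact values over all traffic.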

Comment 4 Kaleb KEITHLEY 2016-05-17 12:39:00 UTC
committed in master branch in 2016-06-xx. closing as current release (3.7.x and/or later)

Comment 5 Niels de Vos 2016-06-16 13:39:57 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user