Bug 1467614

Summary: Gluster read/write performance improvements on NVMe backend
Product: [Community] GlusterFS
Reporter: Krutika Dhananjay <kdhananj>
Component: core
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: mainline
CC: bugs, ingard, jahernan, jcall, kdhananj, mchangir, moagrawa, mpillai, pasik, rgowdapp, sabose, zhhuan
Target Milestone: ---
Keywords: Performance, Reopened, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-25 16:30:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1495397, 1495430, 1501013
Attachments (Description / Flags):
* I attached one of the bricks to mutrace and ran the same fio randrd job again. Attached is the output of the run. (none)
* iobuf.diff (none)
* modified inode_unref to reduce lock contention on table->lock (none)
* mutrace output for fio randread 24 concurrent jobs (none)
* mutrace output for fio randread with 48 concurrent jobs (none)

Description Krutika Dhananjay 2017-07-04 11:04:05 UTC
Description of problem:

For more information, refer to http://lists.gluster.org/pipermail/gluster-devel/2017-June/052993.html

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Worker Ant 2017-07-04 11:24:58 UTC
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2017-07-04 11:25:08 UTC
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

Comment 3 Worker Ant 2017-07-27 06:37:05 UTC
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#5) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Worker Ant 2017-07-27 06:37:09 UTC
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#5) for review on master by Krutika Dhananjay (kdhananj)

Comment 5 Worker Ant 2017-08-01 09:32:19 UTC
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#6) for review on master by Krutika Dhananjay (kdhananj)

Comment 6 Worker Ant 2017-08-01 09:32:23 UTC
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#6) for review on master by Krutika Dhananjay (kdhananj)

Comment 7 Worker Ant 2017-08-02 17:10:14 UTC
REVIEW: https://review.gluster.org/17958 (rpc: Eliminate conn->lock contention by using more granular locks) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 8 Worker Ant 2017-08-14 04:49:18 UTC
REVIEW: https://review.gluster.org/17958 (rpc: Eliminate conn->lock contention by using more granular locks) posted (#2) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 9 Manoj Pillai 2017-08-28 05:00:27 UTC
Profiling a run for a simplified test: single-brick gluster volume mounted on the same server; brick on an NVMe device; 4K random read fio test with direct=1. 

From: perf record -F 99 -p <glusterfs pid>

<quote>
Overhead  Command    Shared Object          Symbol
  11.34%  glusterfs  libpthread-2.17.so     [.] pthread_mutex_lock
   6.10%  glusterfs  libpthread-2.17.so     [.] pthread_mutex_unlock
   5.66%  glusterfs  libc-2.17.so           [.] vfprintf
   3.12%  glusterfs  [kernel.kallsyms]      [k] try_to_wake_up
   2.47%  glusterfs  libglusterfs.so.0.0.1  [.] _gf_msg
   2.07%  glusterfs  libc-2.17.so           [.] __memset_sse2
   1.88%  glusterfs  libc-2.17.so           [.] _IO_default_xsputn
   1.82%  glusterfs  libc-2.17.so           [.] _int_free
   1.59%  glusterfs  [kernel.kallsyms]      [k] native_queued_spin_lock_slowpath
   1.56%  glusterfs  libc-2.17.so           [.] __libc_calloc
   1.54%  glusterfs  libc-2.17.so           [.] _int_malloc
   1.41%  glusterfs  [kernel.kallsyms]      [k] _raw_qspin_lock
   1.35%  glusterfs  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
   1.14%  glusterfs  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
   1.04%  glusterfs  [kernel.kallsyms]      [k] futex_wait_setup
   1.01%  glusterfs  libglusterfs.so.0.0.1  [.] mem_get_from_pool
   0.97%  glusterfs  fuse.so                [.] fuse_attr_cbk
   0.91%  glusterfs  [kernel.kallsyms]      [k] futex_wake
   0.90%  glusterfs  libpthread-2.17.so     [.] pthread_spin_lock
   0.78%  glusterfs  libglusterfs.so.0.0.1  [.] mem_put
   0.77%  glusterfs  libglusterfs.so.0.0.1  [.] mem_get
   0.73%  glusterfs  libc-2.17.so           [.] _itoa_word
...
</quote>

From: perf record -F 99 -p <glusterfsd pid>

<quote>
Overhead  Command     Shared Object          Symbol
   6.25%  glusterfsd  libpthread-2.17.so     [.] pthread_mutex_lock
   4.48%  glusterfsd  libglusterfs.so.0.0.1  [.] _gf_msg
   3.60%  glusterfsd  libpthread-2.17.so     [.] pthread_mutex_unlock
   3.10%  glusterfsd  libc-2.17.so           [.] vfprintf
   1.37%  glusterfsd  [kernel.kallsyms]      [k] __bufio_new
   1.37%  glusterfsd  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
   1.36%  glusterfsd  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
   1.31%  glusterfsd  libc-2.17.so           [.] _IO_default_xsputn
   1.21%  glusterfsd  [kernel.kallsyms]      [k] __d_lookup_rcu
   1.12%  glusterfsd  [kernel.kallsyms]      [k] blk_throtl_bio
   1.10%  glusterfsd  [kernel.kallsyms]      [k] _raw_qspin_lock
   1.10%  glusterfsd  libc-2.17.so           [.] __libc_calloc
   0.92%  glusterfsd  [kernel.kallsyms]      [k] avc_has_perm_noaudit
   0.86%  glusterfsd  libpthread-2.17.so     [.] pthread_getspecific
   0.84%  glusterfsd  libc-2.17.so           [.] __memset_sse2
   0.82%  glusterfsd  [kernel.kallsyms]      [k] try_to_wake_up
   0.81%  glusterfsd  [kernel.kallsyms]      [k] kmem_cache_alloc
   0.79%  glusterfsd  libc-2.17.so           [.] _int_malloc
   0.75%  glusterfsd  libc-2.17.so           [.] __strchrnul
...
</quote>

A lot of cycles are being spent in pthread mutex lock/unlock. However, AFAIK, this doesn't tell us whether there is lock contention, or which locks are contended. We will need to use something like mutrace for that.

Comment 10 Krutika Dhananjay 2017-08-28 06:35:23 UTC
That's interesting.

I actually ran pure randrd job again on Sanjay's setup (through vms) with the best configuration known yet and with mempool and mem-accounting disabled in code last week. And I saw that the FUSE thread is consistently at 95% CPU utilization. I'm running perf-record on this build right now to see where the thread is spending all its time.

As for the perf record output above, I think using the "-call-graph=dwarf" option should also get us the backtrace, though this would require glusterfs-debuginfo package to be installed.

-Krutika

Comment 11 Krutika Dhananjay 2017-08-28 06:37:30 UTC
(In reply to Krutika Dhananjay from comment #10)
> That's interesting.
> 
> I actually ran pure randrd job again on Sanjay's setup (through vms) with
> the best configuration known yet and with mempool and mem-accounting
> disabled in code last week. And I saw that the FUSE thread is consistently
> at 95% CPU utilization. I'm running perf-record on this build right now to
> see where the thread is spending all its time.
> 
> As for the perf record output above, I think using the "-call-graph=dwarf"
> option should also get us the backtrace, though this would require
> glusterfs-debuginfo package to be installed.
> 
> -Krutika

The 95% CPU utilization was *with* client-io-threads enabled. Also double-checked to see that the translator does appear in the client volfile too.

Comment 12 Manoj Pillai 2017-08-28 07:01:21 UTC
(In reply to Krutika Dhananjay from comment #10)
> That's interesting.
> 
> I actually ran pure randrd job again on Sanjay's setup (through vms) with
> the best configuration known yet and with mempool and mem-accounting
> disabled in code last week. And I saw that the FUSE thread is consistently
> at 95% CPU utilization. I'm running perf-record on this build right now to
> see where the thread is spending all its time.

What was the gain, if any, in IOPs over previous runs?

FYI, the build I'm running with is a downstream build. This is from a few quick runs over the weekend on an existing setup that has other work in progress. The build has https://review.gluster.org/17391, and client-io-threads enabled.

> 
> As for the perf record output above, I think using the "-call-graph=dwarf"
> option should also get us the backtrace, though this would require
> glusterfs-debuginfo package to be installed.
> 
> -Krutika

The debuginfo package is installed, and I have some data captured with call-graph enabled. I found the above cleaner for just getting the functions where most of the time is being spent.

Comment 13 Krutika Dhananjay 2017-08-28 07:05:51 UTC
(In reply to Manoj Pillai from comment #12)
> (In reply to Krutika Dhananjay from comment #10)
> > That's interesting.
> > 
> > I actually ran pure randrd job again on Sanjay's setup (through vms) with
> > the best configuration known yet and with mempool and mem-accounting
> > disabled in code last week. And I saw that the FUSE thread is consistently
> > at 95% CPU utilization. I'm running perf-record on this build right now to
> > see where the thread is spending all its time.
> 
> What was the gain, if any, in IOPs over previous runs?

There was no gain or drop in IOPs. The only difference I noticed was that fuse thread used to be at 78% CPU utilization before; now with memacct and mempool disabled, it shot up to 95%.

> 
> FYI, the build I'm running with is a downstream build. This is from a few
> quick runs over the weekend on an existing setup that has other work in
> progress. The build has https://review.gluster.org/17391, and
> client-io-threads enabled.

OK.

> 
> > 
> > As for the perf record output above, I think using the "-call-graph=dwarf"
> > option should also get us the backtrace, though this would require
> > glusterfs-debuginfo package to be installed.
> > 
> > -Krutika
> 
> debuginfo package is installed, and I have some data captured with
> call-graph enabled. I found the above to be cleaner to just get the
> functions where it is spending most time.

Comment 14 Krutika Dhananjay 2017-08-29 06:51:15 UTC
Created attachment 1319327 [details]
I attached one of the bricks to mutrace and ran the same fio randrd job again. Attached is the output of the run.

Comment 15 Krutika Dhananjay 2017-09-01 08:50:41 UTC
Created attachment 1320865 [details]
iobuf.diff

Changed the code in iobuf.c so that all iobuf allocations happen directly through standard allocation (calloc), without going through the cycle of taking iobuf_pool->mutex, selecting the right iobuf_arena and then selecting an iobuf within it; tested all of this on Sanjay's setup. Somehow IOPS dropped by 10K with this change and the CPU utilization of the fuse thread further increased to 97%. I don't have an explanation for what's happening and why.
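
For context, a minimal sketch of what the experiment does conceptually; the struct and functions below are simplified stand-ins, not the actual iobuf.c code in the attached diff:

#include <stdlib.h>

/* The point of the experiment: hand out a buffer straight from calloc()
 * instead of taking iobuf_pool->mutex, picking an arena and then an
 * iobuf inside it. */
struct iobuf_sketch {
        void   *ptr;
        size_t  size;
};

static struct iobuf_sketch *
iobuf_get_direct(size_t page_size)
{
        struct iobuf_sketch *iobuf = calloc(1, sizeof(*iobuf));

        if (!iobuf)
                return NULL;
        iobuf->ptr = calloc(1, page_size);
        if (!iobuf->ptr) {
                free(iobuf);
                return NULL;
        }
        iobuf->size = page_size;
        return iobuf;                  /* no pool/arena locks taken */
}

static void
iobuf_put_direct(struct iobuf_sketch *iobuf)
{
        free(iobuf->ptr);              /* again, no iobuf_pool->mutex */
        free(iobuf);
}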

Raghavendra has identified inode_ref() and inode_unref() as places where we can do away with inode_table->lock and use inode->lock instead, but that change alone leads to deadlocks in lots of places, so there's more that needs to be fixed there. I will test it thoroughly and then run it on Sanjay's setup.

-Krutika

Comment 16 Krutika Dhananjay 2017-09-07 11:01:26 UTC
Some more updates since comment #15:

NOTE: All of these tests were done on ovirt-gluster hyperconverged setup with 3 hosts, 9 vms, and bricks carved out of NVMe drives.

* iobuf.diff had a bug, and had not eliminated one case of contention on iobuf_pool->mutex in __iobuf_put(). I corrected that and ran randrd tests again. This time iobuf_pool->mutex didn't show up in mutrace info captured on brick. But the iops continued to drop by 10K.

* Replacing inode_table->lock with inode->lock will lead to correctness issues and possible corruption in inode_unref() since inode_unref() does not merely decrement ref count but also operates on inode_table and the lists within it. So it is necessary to acquire inode_table->lock when performing inode_unref(). The old (existing) code is correct.

* Applied https://review.gluster.org/#/c/16832/ from facebook to eliminate lock contention on conf->mutex in io-threads. Somehow this brought the perf down. This was with the default value of fops-per-thread-ratio - 20. This actually caused iops to drop from ~60K to 30K.

* Default io-thread-count is 16. So I decreased io-thread-count to 1 and yet there was no drop in IOPs. Only disabling client-io-threads altogether causes drop in performance.

The last point possibly means either that the io-threads themselves are getting held up at lower layers - perhaps epoll, rpc, etc. - or that not enough parallel requests reach client-io-threads for the thread count to matter.

* Most importantly I disabled fuse event history in the crudest manner possible (by deleting calls to fuse_log_eh() and fuse_log_eh_fop()) and I'm seeing that the IOPs increase by 5K and disk utilization is up to 88-90%. This change takes the total IOPs we are seeing in the best case scenario for randrds to 66K from 61K. That still did not bring down fuse thread's CPU utilization though!

That's it for now.

Comment 17 Raghavendra G 2017-09-08 01:49:51 UTC
> * Replacing inode_table->lock with inode->lock will lead to correctness issues and possible corruption in inode_unref() since inode_unref() does not merely decrement ref count but also operates on inode_table and the lists within it. So it is necessary to acquire inode_table->lock when performing inode_unref(). The old (existing) code is correct.

We can acquire the table lock only when the refcount becomes zero (that's when we modify the various lists). I guess there are lots of inode_unref() calls whose execution doesn't result in the refcount becoming 0.

Comment 18 Raghavendra G 2017-09-08 02:04:12 UTC
Created attachment 1323547 [details]
modified inode_unref to reduce lock contention on table->lock

Comment 19 Raghavendra G 2017-09-08 02:10:57 UTC
(In reply to Raghavendra G from comment #18)
> Created attachment 1323547 [details]
> modified inode_unref to reduce lock contention on table->lock

Similar improvements can be made to inode_ref and inode_ref_reduce_by_n. The key is to follow this pattern (a rough sketch follows the list):

* modify inode->ref and note the resulting ref count under inode->lock; unlock after the modification.
* if the value of ref before or after the modification implies that the table's lists need to change,
     - acquire table->lock
     - acquire inode->lock
     - ref = inode->ref
     - release inode->lock
     - if the value of ref still indicates that the table's lists need to be modified, modify them
     - release table->lock
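
A rough sketch of this pattern applied to inode_unref(), using simplified stand-in types (not the real inode_t / inode_table_t definitions):

#include <pthread.h>

struct table {
        pthread_mutex_t lock;    /* guards the table's active/lru lists */
        /* ... lists of inodes ... */
};

struct inode {
        pthread_mutex_t lock;    /* guards only this inode's refcount */
        int             ref;
        struct table   *table;
};

void
inode_unref_sketch(struct inode *inode)
{
        int ref;

        /* step 1: drop the ref under the cheap per-inode lock */
        pthread_mutex_lock(&inode->lock);
        ref = --inode->ref;
        pthread_mutex_unlock(&inode->lock);

        if (ref > 0)
                return;          /* common case: table lists untouched */

        /* step 2: the inode may need to move to the lru list, so take
         * table->lock and re-check the ref under inode->lock, since
         * another thread may have taken a ref in the meantime */
        pthread_mutex_lock(&inode->table->lock);
        pthread_mutex_lock(&inode->lock);
        ref = inode->ref;
        pthread_mutex_unlock(&inode->lock);
        if (ref == 0) {
                /* move the inode from the active list to the lru list */
        }
        pthread_mutex_unlock(&inode->table->lock);
}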

Comment 20 Krutika Dhananjay 2017-09-08 02:12:34 UTC
(In reply to Raghavendra G from comment #18)
> Created attachment 1323547 [details]
> modified inode_unref to reduce lock contention on table->lock

Mind sending the patch on gerrit? It's easier to view the diff there. Thanks.

-Krutika

Comment 21 Worker Ant 2017-09-08 05:50:11 UTC
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 22 Worker Ant 2017-09-08 16:18:52 UTC
COMMIT: https://review.gluster.org/17687 committed in master by Jeff Darcy (jeff.us) 
------
commit ecc5b30289be1a63e2616733a022b3a9a35f12b7
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Jun 21 12:56:14 2017 +0530

    socket: Use granular mutex locks during pollin and pollout event processing
    
    ... instead of one global lock. This is because pollin and pollout
    processing code operate on mutually exclusive members of
    socket_private_t. Keeping this in mind, this patch introduces
    the more granular priv->in_lock and priv->out_lock locks.
    
    For pollerr, which modifies both priv->incoming and priv->ioq, both
    locks need to be taken.
    
    Change-Id: Id7aeb608dc7755551b6b404470d5d80709c81960
    BUG: 1467614
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17687
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: mohammed rafi  kc <rkavunga>
    CentOS-regression: Gluster Build System <jenkins.org>
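
For illustration, a stripped-down sketch of the locking split described in the commit message above; the struct and handlers are simplified stand-ins for socket_private_t and the real pollin/pollout/pollerr handlers, not the actual socket.c code:

#include <pthread.h>

struct sock_priv {
        pthread_mutex_t in_lock;     /* guards priv->incoming (pollin state) */
        pthread_mutex_t out_lock;    /* guards priv->ioq (pollout state)     */
};

static void
handle_pollin(struct sock_priv *priv)
{
        pthread_mutex_lock(&priv->in_lock);
        /* read and decode incoming data; only touches incoming state */
        pthread_mutex_unlock(&priv->in_lock);
}

static void
handle_pollout(struct sock_priv *priv)
{
        pthread_mutex_lock(&priv->out_lock);
        /* flush queued requests; only touches the ioq */
        pthread_mutex_unlock(&priv->out_lock);
}

static void
handle_pollerr(struct sock_priv *priv)
{
        /* error teardown touches both sides, so take both locks,
         * always in the same order */
        pthread_mutex_lock(&priv->in_lock);
        pthread_mutex_lock(&priv->out_lock);
        /* reset incoming state and drain the ioq */
        pthread_mutex_unlock(&priv->out_lock);
        pthread_mutex_unlock(&priv->in_lock);
}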

Comment 23 Worker Ant 2017-09-11 05:12:22 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#1) for review on master by Raghavendra G (rgowdapp)

Comment 24 Worker Ant 2017-09-11 06:55:52 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 25 Worker Ant 2017-09-12 11:17:28 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#3) for review on master by Raghavendra G (rgowdapp)

Comment 26 Worker Ant 2017-09-14 11:20:30 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#4) for review on master by Raghavendra G (rgowdapp)

Comment 27 Krutika Dhananjay 2017-09-15 12:37:19 UTC
(In reply to Krutika Dhananjay from comment #16)
> Some more updates since comment #15:
> 
> NOTE: All of these tests were done on ovirt-gluster hyperconverged setup
> with 3 hosts, 9 vms, and bricks carved out of NVMe drives.
> 
> * iobuf.diff had a bug, and had not eliminated one case of contention on
> iobuf_pool->mutex in __iobuf_put(). I corrected that and ran randrd tests
> again. This time iobuf_pool->mutex didn't show up in mutrace info captured
> on brick. But the iops continued to drop by 10K.
> 
> * Replacing inode_table->lock with inode->lock will lead to correctness
> issues and possible corruption in inode_unref() since inode_unref() does not
> merely decrement ref count but also operates on inode_table and the lists
> within it. So it is necessary to acquire inode_table->lock when performing
> inode_unref(). The old (existing) code is correct.
> 
> * Applied https://review.gluster.org/#/c/16832/ from facebook to eliminate
> lock contention on conf->mutex in io-threads. Somehow this brought the perf
> down. This was with the default value of fops-per-thread-ratio - 20. This
> actually caused iops to drop from ~60K to 30K.
> 
> * Default io-thread-count is 16. So I decreased io-thread-count to 1 and yet
> there was no drop in IOPs. Only disabling client-io-threads altogether
> causes drop in performance.
> 
> The last point is possibly means either that even the io-threads are getting
> held up at lower layers - perhaps epoll, rpc, etc or not enough parallel
> requests are received by client-io-threads for thread count to matter.
> 
> * Most importantly I disabled fuse event history in the crudest manner
> possible (by deleting calls to fuse_log_eh() and fuse_log_eh_fop()) and I'm
> seeing that the IOPs increase by 5K and disk utilization is up to 88-90%.
> This change takes the total IOPs we are seeing in the best case scenario for
> randrds to 66K from 61K. That still did not bring down fuse thread's CPU
> utilization though!

I stand corrected. The patch I was using had a bug. I fixed that and ran the test again and now I see iops go up from 61K to ~73K. I saw disk utilization of up to 92%.

Also, on a single brick distribute volume, I now see IOPs go up from ~18K to ~26K. The disk was 100% utilized (yes, I know that doesn't necessarily mean we've reached a dead end, just saying :)).

All of this was with randrd.job. I am hoping some of the patches we'd tried earlier (such as fb's patch in io-threads) will be more relevant now, because it turns out the lock contention due to event history was not only affecting the callbacks of fops in fuse-bridge but also the call path (see FUSE_FOP; in fact this was the bug I was mentioning above). More updates next week.

-Krutika


> 
> That's it for now.

Comment 28 Worker Ant 2017-09-18 07:16:47 UTC
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 29 Worker Ant 2017-09-19 09:30:28 UTC
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#3) for review on master by Krutika Dhananjay (kdhananj)

Comment 30 Worker Ant 2017-09-24 06:22:12 UTC
COMMIT: https://review.gluster.org/18242 committed in master by Amar Tumballi (amarts) 
------
commit 956d43d6e89d40ee683547003b876f1f456f03b6
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Sep 7 18:48:34 2017 +0530

    mount/fuse: Make event-history feature configurable
    
    ... and disable it by default.
    
    This is because having it disabled seems to improve performance.
    This could be due to lock contention among the different epoll threads
    on the circular buffer lock in the fop cbks just before writing their response
    to /dev/fuse.
    
    Just to provide some data - in an ovirt-gluster hyperconverged
    environment, I saw an increase in IOPS of 12K with event-history
    disabled for a random read workload.
    
    Usage:
    mount -t glusterfs -o event-history=on $HOSTNAME:$VOLNAME $MOUNTPOINT
    OR
    glusterfs --event-history=on --volfile-server=$HOSTNAME --volfile-id=$VOLNAME $MOUNTPOINT
    
    Change-Id: Ia533788d309c78688a315dc8cd04d30fad9e9485
    BUG: 1467614
    Signed-off-by: Krutika Dhananjay <kdhananj>
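
A simplified sketch of the idea (not the actual fuse-bridge code): when event-history is off, the fop callbacks skip the circular-buffer lock entirely:

#include <pthread.h>
#include <stdbool.h>

struct eh_buf {
        pthread_mutex_t lock;
        /* ... fixed-size ring of formatted event strings ... */
};

struct fuse_private_sketch {
        bool          event_history;   /* mount option, default off */
        struct eh_buf history;
};

static void
fuse_log_event(struct fuse_private_sketch *priv, const char *msg)
{
        if (!priv->event_history)
                return;                /* fast path: no lock taken at all */

        pthread_mutex_lock(&priv->history.lock);
        /* append msg to the ring buffer */
        (void)msg;
        pthread_mutex_unlock(&priv->history.lock);
}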

Comment 31 Worker Ant 2017-09-26 09:51:47 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#5) for review on master by Raghavendra G (rgowdapp)

Comment 32 Worker Ant 2017-09-26 10:32:48 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#6) for review on master by Raghavendra G (rgowdapp)

Comment 33 Worker Ant 2017-09-27 05:44:47 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#7) for review on master by Raghavendra G (rgowdapp)

Comment 34 Worker Ant 2017-09-27 08:20:46 UTC
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#8) for review on master by Raghavendra G (rgowdapp)

Comment 35 Manoj Pillai 2017-10-26 11:35:37 UTC
Did some tests to understand the iops limits on the client-side and on the brick-side. Updating the bz with the results.

Results are based on: glusterfs-3.12.1-2.el7.x86_64

First, given that NVMe-equipped servers are not that easy to get in our lab, I wanted to see if I could use a ram disk instead as a fast brick device to identify bottlenecks in the gluster software stack. That works well, IMO. Runs here were done with the brick on a ramdisk, and results are similar to what Krutika has been reporting on NVMe drives.

Server-side setup:
# create the ramdisk of size 16g:
modprobe brd rd_nr=1 rd_size=16777216 max_part=0
# check with "ls -l /dev/ram*"

# create a single-brick volume on ramdisk
mkfs.xfs -f -i size=512 /dev/ram0
mount -t xfs /dev/ram0 /mnt/rhs_brick1

gluster v create perfvol ${server}:/mnt/rhs_brick1 force
gluster v start perfvol

Since this is a random I/O test, applying these settings:
performance.strict-o-direct: on
network.remote-dio: disable
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.write-behind: off
performance.client-io-threads: on
performance.io-thread-count: 4
performance.read-ahead: off

I have up to 4 clients that connect to the server over a 10GbE link.

Test steps:
* mount the volume on each client on /mnt/glustervol
* create a separate directory for each client: /mnt/glustervol/<hostname>
* prepare for fio dist test: "fio --server --daemonize=/var/run/fio-svr.pid"
* create data set using fio seq. write test
* sync and drop caches all around
* perform fio randread test

Single-client results:

command and job file for fio seq. write test to create a 12g data set:

fio --output=out.fio.write --client=<hosts file> job.fio.write

[global]
rw=write
create_on_open=1
fsync_on_close=1
bs=1024k
startdelay=0
ioengine=sync

[seqwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m

job file for fio randread test (reads 25% of previously written data):
[global]
rw=randread
startdelay=0
ioengine=sync
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m
io_size=128m

Result for above case (24 concurrent jobs):
read: IOPS=22.9k, BW=89.5Mi (93.9M)(3072MiB/34309msec)

The IOPS number is in the ball park of what was reported in comment #27 for the single-client, single-brick case.

Result for 48 concurrent jobs:
[changed job file so numjobs=48, but filesize is proportionately reduced so that total data set size is 12g. Note: the brick is a 16g ram disk]
read: IOPS=23.1k, BW=90.4Mi (94.8M)(3072MiB/33985msec)

So we are getting only a fraction of the IOPS that the brick device is capable of, and increasing the number of jobs doesn't help -- IOPS stays close to 23k. Tuning event-threads and io-thread-count does not help either -- IOPS stays stuck at around 23k.
 
Question is, is the bottleneck on the brick-side or the client-side? Probably on the client-side. To test that, we'll run the test with 4 clients. If the brick is the bottleneck, IOPs should stay the same. If not, IOPs should increase, ideally to 4x.

Result with 4 clients, single brick:
[fio filesize, size and io_size are adjusted to keep the data set to 12g]
read: IOPS=66.4k, BW=259Mi (272M)(3072MiB/11845msec)
For this above test, I did change io-thread-count=8.

So, IOPS increases but not 4x. This tells me that the bottleneck responsible for the 23k IOPS limit in the single-client, single-brick test is on the client side. There is a bottleneck at the brick as well, but we hit it at a higher IOPS level. Had there been no bottleneck on the brick, the IOPS on the 4-client test would have been close to 90k. To confirm that, I did a test, but this time with 2 bricks (each on a separate ramdisk).

Results with 4 clients, 2-brick distribute volume:
[io-thread-count is set to 8 in this test, with io-thread-count at 4, I was getting only about 76k IOPs]
read: IOPS=87.8k, BW=343Mi (360M)(3072MiB/8953msec)

To summarize, from these tests it looks like the gluster stack has fairly low limits on how many IOPs it can push through, on both client and brick sides:

fuse-client limit: 23k IOPs
brick limit: 66k IOPs

Comment 36 Manoj Pillai 2017-10-27 06:35:27 UTC
While comment #35 separately establishes what the limits are on the client-side and brick-side, it does not require us to change course. The priority should be fixing the client-side bottleneck, which is what previous comments are dealing with anyway. But when we do analyze brick-side bottlenecks we should use a multi-client test to ensure enough IOPS are being sent in to push the brick to its limit.

For the single-client, single-brick test, the best result I saw was with io-thread-count=2 (not much improvement from io-thread-count=4, but still). That ideal number might change as we remove existing bottlenecks, but that's where things stand as of now. In contrast, for the brick to reach its limit, io-thread-count needs to be at least 8. I think the investigation here would benefit from being able to separately control no. of io-threads on the client-side and brick i.e. from a fix for bz #1499649.

Comment 38 Manoj Pillai 2017-10-27 08:36:51 UTC
Switched to a 32g ramdisk (the server has 56g) so that I can have longer runs with a larger data set of 24g, instead of the 12g in comment #35.

Repeated the 4 client, single brick run (io-thread-count=8, event-threads=4):
read: IOPS=59.0k, BW=234Mi (246M)(6144MiB/26220msec)
[IOPs dropped slightly with the longer run.] 

Output of "top -bH -d 10" ON THE BRICK during randread looks like this:

<quote>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2500 root      20   0 2199708  32556   4968 R 97.6  0.1   4:35.79 glusterrpcs+
 7040 root      20   0 2199708  32556   4968 S 74.3  0.1   0:19.62 glusteriotw+
 7036 root      20   0 2199708  32556   4968 S 73.5  0.1   0:24.13 glusteriotw+
 7039 root      20   0 2199708  32556   4968 S 73.0  0.1   0:23.73 glusteriotw+
 6854 root      20   0 2199708  32556   4968 S 72.7  0.1   0:35.91 glusteriotw+
 7035 root      20   0 2199708  32556   4968 S 72.7  0.1   0:23.99 glusteriotw+
 7038 root      20   0 2199708  32556   4968 R 72.5  0.1   0:23.75 glusteriotw+
 7034 root      20   0 2199708  32556   4968 S 72.3  0.1   0:23.60 glusteriotw+
 7037 root      20   0 2199708  32556   4968 R 72.3  0.1   0:23.42 glusteriotw+
 2510 root      20   0 2199708  32556   4968 S 34.0  0.1   1:28.11 glusterposi+
</quote>

pstack on the thread showing 97+% CPU utilization:

<quote>
# pstack 2500
Thread 1 (process 2500):
#0  0x00007f0301398945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f03022f733b in rpcsvc_request_handler (arg=0x7f02f003f530) at rpcsvc.c:1881
#2  0x00007f0301394e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0300c6134d in clone () from /lib64/libc.so.6
</quote>

Comment 39 Manoj Pillai 2017-10-30 03:16:31 UTC
(In reply to Manoj Pillai from comment #36)
> I think the investigation here would
> benefit from being able to separately control no. of io-threads on the
> client-side and brick i.e. from a fix for bz #1499649.

In the meantime, I'd suggest keeping io-thread-count=4. Here's why...

Result from 4 client, 1 brick test with io-thread-count=2:
read: IOPS=26.1k, BW=102Mi (107M)(6144MiB/60198msec)

And output from top on the brick:
<quote>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3900 root      20   0 1319128  30236   4968 R 99.2  0.1   1:13.16 glusteriotw+
 3345 root      20   0 1319128  30236   4968 R 98.2  0.1   1:56.45 glusteriotw+
 2500 root      20   0 1319128  30236   4968 S 36.0  0.1   1:20.12 glusterrpcs+
</quote>

So the brick can only deliver about 26k IOPs with this setting. That is close to the client-side limit of 23k IOPs. If we use io-thread-count=2, there is a risk that we'll discard valuable client-side enhancements thinking that they are not giving any benefits when in fact they might be. Also, we should re-evaluate any enhancements we have discarded in this way :).

Comment 40 Manoj Pillai 2017-11-01 08:01:05 UTC
Back to the client-side analysis. Single-client, single brick runs. Client system separate from server, 10GbE interconnect. io-thread-count=4.

Trying an approach where I'm doing runs with increasing number of concurrent jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in each case. Comparing mutrace output for the runs and looking for locks that rise up in the ranking of contended locks when concurrency increases. Based on this, I'm looking at the following as prime suspects:

Mutex #28293810 (0x0x7fdc240ca940) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /usr/lib64/glusterfs/3.12.1/rpc-transport/socket.so(+0xb4c8) [0x7fdc2b66c4c8]

Mutex #27743163 (0x0x7fdc240c9890) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /lib64/libgfrpc.so.0(rpc_clnt_new+0xf0) [0x7fdc36f3c840]


Maybe to a lesser extent, but also this one:
Mutex #21500558 (0x0x5608175976c8) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /lib64/libglusterfs.so.0(cb_buffer_new+0x7c) [0x7fdc371ccafc]

Might be a good idea to have a discussion on these.

Comment 41 Krutika Dhananjay 2017-11-02 06:37:10 UTC
(In reply to Manoj Pillai from comment #40)
> Back to the client-side analysis. Single-client, single brick runs. Client
> system separate from server, 10GbE interconnect. io-thread-count=4.
> 
> Trying an approach where I'm doing runs with increasing number of concurrent
> jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> each case. Comparing mutrace output for the runs and looking for locks that
> rise up in the ranking of contended locks when concurrency increases. Based
> on this, I'm looking at the following as prime suspects:
> 
> Mutex #28293810 (0x0x7fdc240ca940) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae)
> [0x7fdc3744d95a]
>         /usr/lib64/glusterfs/3.12.1/rpc-transport/socket.so(+0xb4c8)
> [0x7fdc2b66c4c8]
> 
> Mutex #27743163 (0x0x7fdc240c9890) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae)
> [0x7fdc3744d95a]
>         /lib64/libgfrpc.so.0(rpc_clnt_new+0xf0) [0x7fdc36f3c840]
> 
> 
> Maybe to a lesser extent, but also this one:
> Mutex #21500558 (0x0x5608175976c8) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae)
> [0x7fdc3744d95a]
>         /lib64/libglusterfs.so.0(cb_buffer_new+0x7c) [0x7fdc371ccafc]
> 
> Might be a good idea to have a discussion on these.

So the last mutex (created in cb_buffer_new) is associated with event-history.
I would suggest you try 3.12.2, which has event-history disabled, and see what the new results are.

Comment 42 Manoj Pillai 2017-11-06 17:41:24 UTC
Created attachment 1348697 [details]
mutrace output for fio randread  24 concurrent jobs

Comment 43 Manoj Pillai 2017-11-06 17:42:49 UTC
Created attachment 1348698 [details]
mutrace output for fio randread  with 48 concurrent jobs

Comment 44 Manoj Pillai 2017-11-06 18:00:27 UTC
(In reply to Manoj Pillai from comment #40)
> Back to the client-side analysis. Single-client, single brick runs. Client
> system separate from server, 10GbE interconnect. io-thread-count=4.
> 
> Trying an approach where I'm doing runs with increasing number of concurrent
> jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> each case. Comparing mutrace output for the runs 

Looking at this more, I'm not convinced that lock contention is the root cause of the IOPs limit we are hitting...

fio output for run with 24 concurrent jobs:
   read: IOPS=23.7k, BW=92.4Mi (96.9M)(6144MiB/66483msec)
    clat (usec): min=209, max=2951, avg=1010.89, stdev=62.19
     lat (usec): min=209, max=2952, avg=1011.16, stdev=62.18

fio output for run with 48 concurrent jobs:
   read: IOPS=23.5k, BW=91.8Mi (96.3M)(6144MiB/66932msec)
    clat (usec): min=237, max=4431, avg=2038.93, stdev=120.72
     lat (usec): min=237, max=4431, avg=2039.21, stdev=120.71

IOPs is about the same but avg latency is double in the 48 jobs case, up by about 1ms. If lock contention is what is limiting IOPS, we should have seen a huge spike in contention times reported by mutrace for the run with 48 jobs. mutrace outputs are attached in comment #42 and comment #43. Contention times reported vary from run to run, but I'm not seeing the kind of increases I'd expect if lock contention was the root cause.

I think we need to look for other possible bottlenecks as well.

Comment 45 Manoj Pillai 2017-11-07 02:29:18 UTC
(In reply to Manoj Pillai from comment #44)
> (In reply to Manoj Pillai from comment #40)
> > Back to the client-side analysis. Single-client, single brick runs. Client
> > system separate from server, 10GbE interconnect. io-thread-count=4.
> > 

The data in comments 42-44 are actually from runs on a single system i.e. client and server co-located. Doesn't make any difference to the inference, but just to be accurate...

Comment 46 Krutika Dhananjay 2017-11-09 15:03:39 UTC
(In reply to Manoj Pillai from comment #44)
> (In reply to Manoj Pillai from comment #40)
> > Back to the client-side analysis. Single-client, single brick runs. Client
> > system separate from server, 10GbE interconnect. io-thread-count=4.
> > 
> > Trying an approach where I'm doing runs with increasing number of concurrent
> > jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> > each case. Comparing mutrace output for the runs 
> 
> Looking at this more, I'm not convinced that lock contention is the root
> cause of the IOPs limit we are hitting...
> 
> fio output for run with 24 concurrent jobs:
>    read: IOPS=23.7k, BW=92.4Mi (96.9M)(6144MiB/66483msec)
>     clat (usec): min=209, max=2951, avg=1010.89, stdev=62.19
>      lat (usec): min=209, max=2952, avg=1011.16, stdev=62.18
> 
> fio output for run with 48 concurrent jobs:
>    read: IOPS=23.5k, BW=91.8Mi (96.3M)(6144MiB/66932msec)
>     clat (usec): min=237, max=4431, avg=2038.93, stdev=120.72
>      lat (usec): min=237, max=4431, avg=2039.21, stdev=120.71
> 
> IOPs is about the same but avg latency is double in the 48 jobs case, up by
> about 1ms. If lock contention is what is limiting IOPS, we should have seen
> a huge spike in contention times reported by mutrace for the run with 48
> jobs. mutrace outputs are attached in comment #42 and comment #43.
> Contention times reported vary from run to run, but I'm not seeing the kind
> of increases I'd expect if lock contention was the root cause.
> 
> I think we need to look for other possible bottlenecks as well.

So I compared the mutrace output attachments from comment #42 and #43 and I see that the contention time of the lock in iobuf_pool_new() increased by 10 seconds upon increasing iodepth to 48 (with iodepth=24 it was 28283ms or 28s and with iodepth=48 it is 38450ms or 38s). This is for my own understanding - based on your calculations in comment #44, is a 10s spike not big enough for this to be taken seriously?

-Krutika

Comment 47 Manoj Pillai 2017-11-13 14:48:20 UTC
(In reply to Krutika Dhananjay from comment #46)

I am hedging my bets a bit with phrases like "not convinced" :). If you have a strong feeling about one of these locks being the root cause, don't let me talk you out of it :).

I looked at how mutrace calculates contention time (mutrace.c; functions pthread_mutex_lock and mutex_lock). Consider an example where a critical section with 1ms execution time is guarded by a mutex, and 10 concurrent requests are active. Total execution time should be ~10ms. Contention time as calculated by mutrace, IIUC, would be roughly 0+1+2+...+9 = 45ms. With 50 concurrently active requests, execution time would be ~50ms and contention time would be 49*(49+1)/2 = 1225ms, or 24.5 times the execution time. 
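
To make the back-of-the-envelope model explicit, here is a tiny (purely illustrative) helper for the expected mutrace-style contention time when n concurrent requests serialize on one mutex with a critical section of t_ms each:

/* waits sum to 0 + t + 2t + ... + (n-1)t = n*(n-1)/2 * t */
static double
expected_contention_ms(int n, double t_ms)
{
        return (double)n * (n - 1) / 2.0 * t_ms;
}

/* expected_contention_ms(10, 1.0) -> 45 ms
 * expected_contention_ms(50, 1.0) -> 1225 ms, i.e. 24.5x the ~50ms run time */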

In comment #44, the fio output for the run with 48 jobs shows an increase in avg latency of ~1ms compared to the run with 24 jobs. The total no. of requests is IOPS * run-time, i.e. 23K*66. If the increase in latency for the 48-job run is coming solely from lock contention, the reported contention time could in theory go up by as much as 23K*66*1ms, i.e. roughly 1500 seconds. 

Those are the reasons why I talked about expecting big increases in reported contention time if lock contention is indeed the culprit here.

But since you asked I looked at the mutrace runs again and saw some problems there. Not sure we can take decisions confidently based on the output we are getting. Let me talk about that in a separate comment.

Comment 48 Manoj Pillai 2017-11-13 16:40:21 UTC
(In reply to Manoj Pillai from comment #47)

for the mutrace runs:

fio output for run with 24 concurrent jobs:
   read: IOPS=2868, BW=11.2Mi (11.7M)(6144MiB/548349msec)
    clat (usec): min=1471, max=23086, avg=8363.15, stdev=4030.90
     lat (usec): min=1471, max=23087, avg=8363.44, stdev=4030.89

fio output for run with 48 concurrent jobs:
   read: IOPS=2833, BW=11.1Mi (11.6M)(6144MiB/555024msec)
    clat (usec): min=2975, max=44776, avg=16932.81, stdev=7990.35
     lat (usec): min=2975, max=44776, avg=16933.09, stdev=7990.35

When you run without mutrace, IOPS on these randread runs is ~23k. With mutrace it drops to less than 3k. For the normal run with 48 jobs, top shows the fuse thread at <70% CPU utilization; for the mutrace run, top shows ~97% CPU utilization. That's a concern, because it could mean that the bottleneck is the fuse thread, and hence the lock contention results are not useful.

Comment 49 Worker Ant 2017-11-28 13:02:37 UTC
COMMIT: https://review.gluster.org/17958 committed in master by \"Krutika Dhananjay\" <kdhananj> with a commit message- rpc: Eliminate conn->lock contention by using more granular locks

rpc_clnt_submit() acquires conn->lock before the call to
rpc_transport_submit_request() and the subsequent queuing of the frame into
the saved_frames list. However, as part of handling RPC_TRANSPORT_MSG_RECEIVED
and RPC_TRANSPORT_MSG_SENT notifications in rpc_clnt_notify(), conn->lock
is again used to atomically update the conn->last_received and conn->last_sent
event timestamps.

So when conn->lock is acquired as part of submitting a request,
a parallel POLLIN notification gets blocked at the rpc layer until the request
submission completes and the lock is released.

To get around this, this patch calls clock_gettime() (instead of gettimeofday())
to update the event timestamps in conn->last_received and conn->last_sent.
clock_gettime() is thread-safe, so these timestamp updates no longer need to
take conn->lock.

Note: Ran fio on a VM after applying the patch; IOPS improved.

Change-Id: I347b5031d61c426b276bc5e07136a7172645d763
BUG: 1467614
Signed-off-by: Krutika Dhananjay <kdhananj>
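
A minimal sketch of the change, with simplified stand-in types (not the real rpc_clnt_connection_t): the notify path records the timestamps with the thread-safe clock_gettime() and no longer touches conn->lock:

#include <pthread.h>
#include <time.h>

struct clnt_conn {
        pthread_mutex_t lock;           /* still guards saved_frames etc. */
        struct timespec last_sent;
        struct timespec last_received;
};

static void
conn_note_msg_received(struct clnt_conn *conn)
{
        /* no mutex needed just to record the event time */
        clock_gettime(CLOCK_REALTIME, &conn->last_received);
}

static void
conn_note_msg_sent(struct clnt_conn *conn)
{
        clock_gettime(CLOCK_REALTIME, &conn->last_sent);
}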

Comment 50 Shyamsundar 2017-12-08 17:34:10 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 51 Raghavendra G 2017-12-20 04:28:21 UTC
(In reply to Manoj Pillai from comment #38)
> Switched to a 32g ramdisk (the server has 56g) so that I can have longer
> runs with larger data set of 24g instead of 12g in comment #35.
> 
> Repeated the 4 client, single brick run (io-thread-count=8, event-threads=4):
> read: IOPS=59.0k, BW=234Mi (246M)(6144MiB/26220msec)
> [IOPs dropped slightly with the longer run.] 
> 
> Output of "top -bH -d 10" ON THE BRICK during randread looks like this:
> 
> <quote>
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>  2500 root      20   0 2199708  32556   4968 R 97.6  0.1   4:35.79
> glusterrpcs+
>  7040 root      20   0 2199708  32556   4968 S 74.3  0.1   0:19.62
> glusteriotw+
>  7036 root      20   0 2199708  32556   4968 S 73.5  0.1   0:24.13
> glusteriotw+
>  7039 root      20   0 2199708  32556   4968 S 73.0  0.1   0:23.73
> glusteriotw+
>  6854 root      20   0 2199708  32556   4968 S 72.7  0.1   0:35.91
> glusteriotw+
>  7035 root      20   0 2199708  32556   4968 S 72.7  0.1   0:23.99
> glusteriotw+
>  7038 root      20   0 2199708  32556   4968 R 72.5  0.1   0:23.75
> glusteriotw+
>  7034 root      20   0 2199708  32556   4968 S 72.3  0.1   0:23.60
> glusteriotw+
>  7037 root      20   0 2199708  32556   4968 R 72.3  0.1   0:23.42
> glusteriotw+
>  2510 root      20   0 2199708  32556   4968 S 34.0  0.1   1:28.11
> glusterposi+
> </quote>
> 
> pstack on the thread showing 97+% CPU utilization:
> 
> <quote>
> # pstack 2500
> Thread 1 (process 2500):
> #0  0x00007f0301398945 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007f03022f733b in rpcsvc_request_handler (arg=0x7f02f003f530) at
> rpcsvc.c:1881
> #2  0x00007f0301394e25 in start_thread () from /lib64/libpthread.so.0
> #3  0x00007f0300c6134d in clone () from /lib64/libc.so.6
> </quote>

This thread was introduced as part of our efforts to not execute any of the glusterfs program's code in the event threads [1]. I had considered a multithreaded model, but preferred a single-threaded model, as we didn't see much of a performance regression in our limited performance tests and the multithreaded model resulted in regression failures. I'll send a patch to make sure there is a 1:1 mapping between event threads and threads executing rpcsvc_request_handler.

A related comment from Manoj:

<comment>

one thing though. the thread is in the high 90s cpu utilization, not quite 100. it is possible there is another bottleneck, and at this time we won't see a benefit from the multithreaded model. but seems quite clear that the single-threaded model will not scale much beyond 60k. 

</comment>

[1] https://review.gluster.org/17105
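
A loose sketch of the 1:1 model mentioned above (illustrative only, not the actual rpcsvc code): each event thread hands requests to its own handler thread, with its own queue, mutex and condition variable, instead of all event threads funnelling into a single handler:

#include <pthread.h>
#include <stdlib.h>

struct req {
        struct req *next;
        /* ... decoded rpc request ... */
};

struct handler {
        pthread_t       thread;
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        struct req     *queue;
};

static void *
handler_loop(void *arg)
{
        struct handler *h = arg;

        for (;;) {
                pthread_mutex_lock(&h->lock);
                while (h->queue == NULL)
                        pthread_cond_wait(&h->cond, &h->lock);
                struct req *r = h->queue;
                h->queue = r->next;
                pthread_mutex_unlock(&h->lock);

                /* process r, then release it */
                free(r);
        }
        return NULL;
}

/* called by an event thread to hand a request to *its* handler */
static void
handler_enqueue(struct handler *h, struct req *r)
{
        pthread_mutex_lock(&h->lock);
        r->next = h->queue;
        h->queue = r;
        pthread_cond_signal(&h->cond);
        pthread_mutex_unlock(&h->lock);
}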

Comment 52 Manoj Pillai 2017-12-21 06:19:50 UTC
We [Raghavendra, Krutika, Mohit, Milind, me] met up and Raghavendra went over the rpc layer details so we could brainstorm on potential inefficiencies that can explain the current low limit on IOPS on the client side.  We discussed the possibility that the current iops limit may be due to delay in adding back the socket for polling. Similar to https://review.gluster.org/#/c/17391/, but now the delay being due to socket_event_handler () in rpc/rpc-transport/socket.

One suggestion from Milind was to have multiple connections between the client and a brick. We can test this hypothesis with a plain distribute volume with multiple bricks, each on NVMe SSDs; if we see that going from a single brick to multiple bricks improves IOPS, multiple connections per brick could be promising to have. IIRC, earlier tests with multiple bricks were not effective in scaling IOPS. However, it would be good to check whether recent improvements have changed that.
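
Purely as an illustration of the suggestion (hypothetical names; not the patch that was later posted for review): the client keeps several transports open to the same brick and picks one per request, so a single socket and its epoll handler stop being the per-brick ceiling:

#include <stdatomic.h>

#define CONNS_PER_BRICK 4

struct brick_conns {
        void       *transport[CONNS_PER_BRICK]; /* stand-ins for rpc transports */
        atomic_uint next;
};

static void *
pick_transport(struct brick_conns *bc)
{
        /* simple round-robin over the available connections */
        unsigned idx = atomic_fetch_add(&bc->next, 1) % CONNS_PER_BRICK;
        return bc->transport[idx];
}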

The short of it is, yes, I saw IOPS go up from 34k with a single brick to 45.7k with a 3-brick distribute volume. That is a good bump in IOPS, maybe better than any single bump we have seen so far. At 45.7k, on my setup, the fuse thread looks to be a bottleneck at 98% CPU utilization, so the gain could potentially have been higher. [In order to get this higher throughput, I had to bump up the no. of concurrent requests to 48.]

Details::

version:
glusterfs-3.13.0-1.el7.x86_64

fio job file for random read:
[global]
rw=randread
exitall
startdelay=0
ioengine=sync
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=48
nrfiles=1
openfiles=1
filesize=5g
size=5g
io_size=512m

results:
single-brick: read: IOPS=34.0k, BW=137Mi (143M)(23.0GiB/179719msec)
3-brick distribute volume: read: IOPS=45.7k, BW=178Mi (187M)(23.0GiB/132038msec)

volume info (options):
Options Reconfigured:
performance.strict-o-direct: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.client-io-threads: on
performance.io-thread-count: 8
performance.read-ahead: off
transport.address-family: inet
nfs.disable: on

client top output (thread-level) for 3-brick case:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24932 root      20   0 1284136  27576   3680 R 97.9  0.1   4:08.73 glusterfuse+
24925 root      20   0 1284136  27576   3680 R 72.5  0.1   2:26.06 glusterepol+
24928 root      20   0 1284136  27576   3680 S 72.5  0.1   2:26.01 glusterepol+
24927 root      20   0 1284136  27576   3680 R 72.4  0.1   2:25.79 glusterepol+
24929 root      20   0 1284136  27576   3680 S 72.4  0.1   2:26.17 glusterepol+
25093 root      20   0 1284136  27576   3680 R 41.4  0.1   1:07.35 glusteriotw+
25095 root      20   0 1284136  27576   3680 S 41.4  0.1   1:06.56 glusteriotw+

Conclusions:
1. results indicate that we are on the right track with suspecting the delayed add-back of socket to be one of the culprits for limited IOPS per client.
2. we need to deal with the fuse-thread bottleneck in order to make more progress here.

Comment 53 Mohit Agrawal 2017-12-21 07:36:12 UTC
Hi,

I think at the fuse level we have two threads (glusterfuseproc, glusterfusenoti) whose names start with glusterfuse; from the previous output it is difficult to tell which thread is consuming the most CPU.

Can we run top for this specific mount process to see the CPU consumption of the individual threads?

top -H -p <mount_pid>

Regards
Mohit Agrawal

Comment 54 Manoj Pillai 2017-12-21 08:05:36 UTC
(In reply to Mohit Agrawal from comment #53)
> Hi,
> 
> I think at fuse level we have two threads(glusterfuseproc,glusterfusenoti)
> those name start from glusterfuse, from the previous output it is difficult
> to assume which thread is consuming most CPU??
> 
> Can we run top for this specific mount process to see the cpu consumption
> for specific threads
> 
> top -H -p <mount_pid>
> 
> Regards
> Mohit Agrawal

Here it is:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                             
27665 root      20   0 1284136  29340   3680 R 98.3  0.1   2:28.84 glusterfuseproc                     
27658 root      20   0 1284136  29340   3680 S 70.3  0.1   1:13.76 glusterepoll0                       
27660 root      20   0 1284136  29340   3680 R 70.3  0.1   1:13.78 glusterepoll1                       
27661 root      20   0 1284136  29340   3680 R 70.3  0.1   1:12.80 glusterepoll2                       
27662 root      20   0 1284136  29340   3680 R 70.3  0.1   1:12.99 glusterepoll3                       
27826 root      20   0 1284136  29340   3680 S 40.0  0.1   0:21.38 glusteriotwr1                       
27831 root      20   0 1284136  29340   3680 R 40.0  0.1   0:20.92 glusteriotwr5                       
27832 root      20   0 1284136  29340   3680 S 40.0  0.1   0:21.05 glusteriotwr6                       
27659 root      20   0 1284136  29340   3680 S 39.7  0.1   0:21.12 glusteriotwr0                       
27827 root      20   0 1284136  29340   3680 R 39.7  0.1   0:20.72 glusteriotwr2                       
27830 root      20   0 1284136  29340   3680 S 39.7  0.1   0:21.17 glusteriotwr4                       
27833 root      20   0 1284136  29340   3680 S 39.3  0.1   0:20.70 glusteriotwr7                       
27829 root      20   0 1284136  29340   3680 R 38.7  0.1   0:20.50 glusteriotwr3                       
27651 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glusterfs                           
27652 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustertimer                        
27653 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersigwait                      
27654 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustermemsweep                     
27655 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersproc0                       
27656 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersproc1                       
27666 root      20   0 1284136  29340   3680 S  0.0  0.1   0:08.27 glusterfusenoti

Comment 55 Mohit Agrawal 2017-12-21 09:10:52 UTC
Hi,

  I think we can try limiting the CPU quota for the fuse thread only. I am assuming that if we set a CPU quota limit for the fuse thread, IOPS could improve, because it could be that the other threads are not able to get enough CPU (as we can see in the previous output, the io-threads are not consuming much CPU while the fuse thread is consuming a lot), since most of the CPU time is consumed by the fuse thread.

  If we are interested in trying this quick-and-dirty approach, I can provide the options to use.

Regards
Mohit Agrawal

Comment 56 Manoj Pillai 2017-12-21 11:27:37 UTC
(In reply to Mohit Agrawal from comment #55)
> Hi,
> 
>   I think we can try one way to limit cpu quota usage for fuse thread only.I
> am assuming if we do set cpu quota limit for fuse start thread then IOPS can
> improve because it could be happening other threads are not able to consume
> CPU(as we can see in previous output io thread
> is not consuming more CPU and fuse thread is consuming high CPU) because
> most of the time CPU is consumed by fuse start threads.
> 
>   If we are interested to try this quick and dirty way then I can provide
> options to use this.
> 
> Regards
> Mohit Agrawal

I'll paste the top -H output so that it shows overall CPU utilization as well. us+sys is <32%; there are plenty of CPU cycles available for other threads. The problem is that the fuse thread has too much "work" -- either real work or spinlock contention -- so it becomes the bottleneck. Limiting its CPU usage doesn't seem like the right thing to do.

<quote>
top - 00:09:30 up 13 days, 17:47,  4 users,  load average: 37.42, 15.02, 5.71
Threads: 371 total,  12 running, 359 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.0 us, 11.7 sy,  0.0 ni, 28.4 id, 38.9 wa,  0.0 hi,  1.1 si,  0.0 st
KiB Mem : 49279472 total, 48222328 free,   604892 used,   452252 buff/cache
KiB Swap: 24772604 total, 24772604 free,        0 used. 48025148 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24932 root      20   0 1284136  27576   3680 R 97.9  0.1   4:08.73 glusterfuse+
24925 root      20   0 1284136  27576   3680 R 72.5  0.1   2:26.06 glusterepol+
24928 root      20   0 1284136  27576   3680 S 72.5  0.1   2:26.01 glusterepol+
24927 root      20   0 1284136  27576   3680 R 72.4  0.1   2:25.79 glusterepol+

</quote>

Comment 57 Mohit Agrawal 2018-01-02 05:28:17 UTC
Hi,

After using the splice call (for github https://github.com/gluster/glusterfs/issues/372) on the server side (only for writes), I got some performance improvement in writes. Though this is only initial testing, I hope we can get some improvement in IOPS if we use splice on both sides (client and server) and for both read and write fops.
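
Not the actual patch, but an illustrative sketch of the server-side write path with splice(2): data is moved from the socket into the brick file through a pipe, avoiding a copy through user-space buffers (function name and signature here are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move 'len' bytes from a connected socket into file_fd at its current
 * offset. A pipe is needed because splice() requires one end of each
 * transfer to be a pipe. Returns bytes moved, or -1 on error. */
static ssize_t
splice_sock_to_file(int sock_fd, int file_fd, size_t len)
{
        int pipefd[2];
        ssize_t total = 0;

        if (pipe(pipefd) < 0)
                return -1;

        while ((size_t)total < len) {
                ssize_t in = splice(sock_fd, NULL, pipefd[1], NULL,
                                    len - total, SPLICE_F_MOVE | SPLICE_F_MORE);
                if (in <= 0)
                        break;
                /* drain what was just pushed into the pipe onto the file */
                while (in > 0) {
                        ssize_t out = splice(pipefd[0], NULL, file_fd, NULL,
                                             in, SPLICE_F_MOVE);
                        if (out <= 0) {
                                close(pipefd[0]);
                                close(pipefd[1]);
                                return -1;
                        }
                        in -= out;
                        total += out;
                }
        }

        close(pipefd[0]);
        close(pipefd[1]);
        return total;
}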



Without splice patch

for i in `(seq 1 5)`
do
rm -rf /mnt/rpmbuild/glusterfs-4.0dev1/*
echo 3 > /proc/sys/vm/drop_caches
time cp -rf rpmbuild_build/glusterfs-4.0dev1/* /mnt/rpmbuild/glusterfs-4.0dev1/
done

real	22m51.189s
user	0m0.086s
sys	0m1.511s

real	22m57.531s
user	0m0.090s
sys	0m2.025s

real	22m57.845s
user	0m0.115s
sys	0m1.834s

real	22m51.257s
user	0m0.113s
sys	0m1.966s

real	22m57.857s
user	0m0.113s
sys	0m1.966s

# After applying splice patch

for i in `(seq 1 5)`; do rm -rf /mnt/rpmbuild/glusterfs-4.0dev1/*; echo 3 > /proc/sys/vm/drop_caches; time cp -rf rpmbuild_build/glusterfs-4.0dev1/* /mnt/rpmbuild/glusterfs-4.0dev1/; done

real	17m51.073s
user	0m0.104s
sys	0m1.862s

real	17m50.057s
user	0m0.097s
sys	0m2.005s

real	17m50.022s
user	0m0.096s
sys	0m1.928s

real	17m49.073s
user	0m0.101s
sys	0m1.828s

real	17m47.594s
user	0m0.077s
sys	0m1.753s


Regards
Mohit Agrawal

Comment 58 Worker Ant 2018-01-03 12:15:04 UTC
REVIEW: https://review.gluster.org/19133 (protocol/client: multiple rpc connections between single client and brick) posted (#1) for review on master by Raghavendra G

Comment 59 Worker Ant 2018-01-08 09:50:09 UTC
REVIEW: https://review.gluster.org/19161 (Revert \"program/GF-DUMP: Shield ping processing from traffic to Glusterfs\") posted (#1) for review on master by Raghavendra G

Comment 60 Worker Ant 2018-01-18 09:03:35 UTC
REVIEW: https://review.gluster.org/19226 (WIP: fuse multi-threaded reader) posted (#1) for review on master by Krutika Dhananjay

Comment 61 Krutika Dhananjay 2018-01-29 07:46:02 UTC
I loaded io-stats below fuse-bridge and a couple of other places on the gluster stack to profile the fops at different levels (randrd workload).

And here is an excerpt from the profile taken just below gluster's fuse-bridge translator even with attribute-timeout set to 600s:

<excerpt>
...
...
Fop           Call Count    Avg-Latency    Min-Latency    Max-Latency
---           ----------    -----------    -----------    -----------
READ             3145728     1145.73 us       77.66 us    42739.02 us
FLUSH                 96        2.23 us        1.21 us       23.29 us
FSTAT            3145632        2.66 us        0.67 us     3435.86 us

...
...
</excerpt>

I also captured strace output of the application doing I/O on the glusterfs mount (in this case fio), and grep'd for the number of fstats it invoked. Here are the counts of syscalls executed by fio:

[root@rhs-srv-07 tmp]# grep -iarT 'fstat' fio-strace.140* | wc -l
95
[root@rhs-srv-07 tmp]# grep -iarT 'lstat' fio-strace.140* | wc -l
3269
[root@rhs-srv-07 tmp]# grep -iarT '^stat' fio-strace.140* | wc -l
1132
[root@rhs-srv-07 tmp]# grep -iarT 'read' fio-strace.140* | wc -l
3146696

Based on this data we can conclude that neither the application nor the glusterfs stack is winding as many fstats as there are reads. So the only suspect left is the kernel.

-Krutika

Comment 62 Raghavendra G 2018-01-29 08:10:19 UTC
(In reply to Worker Ant from comment #58)
> REVIEW: https://review.gluster.org/19133 (protocol/client: multiple rpc
> connections between single client and brick) posted (#1) for review on
> master by Raghavendra G

Credits for this patch should go to "Milind Changire" <mchangir>.

Comment 63 Shyamsundar 2018-03-15 11:17:12 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-4.0.0, please open a new bug report.

glusterfs-4.0.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-March/000092.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 65 Krutika Dhananjay 2018-05-24 10:38:06 UTC
So I wrote a quick and dirty patch in inode.c to minimize contention on itable->lock, particularly in inode_{ref,unref,lookup} codepaths.

With 8 fuse threads, I was getting ~60K iops in randrd fio test earlier (with all of the FOSDEM improvements put together).
Here's some mutrace data with contention time in itable->lock:

With 1 fuse reader thread I was seeing ~2K ms contention time.
With 8 fuse threads, this increased to ~33K ms.

With the itable->lock contention fix, iops slightly increased to 63K.
And as for mutrace, I see that the contention time has dropped to ~8K ms (with 8 reader threads).

So it doesn't seem like 3K is much of an improvement?

Manoj,
Do you think it would make sense for me to repeat the io-stats-loaded-at-multiple-points-on-the-stack experiment once more on top of all these patches (well, for want of other ideas more than anything else)?

-Krutika

Comment 66 Manoj Pillai 2018-05-24 11:20:41 UTC
(In reply to Krutika Dhananjay from comment #65)

> 
> Manoj,
> Do you think it would make sense for me to repeat the
> io-stats-loaded-at-multiple-points-on-the-stack experiment once more on top
> of all these patches (well, for want of other ideas more than anything else)?
> 
> -Krutika

A few suggestions before getting into that:

1. add a random write test in addition to random read, to see impact there.
2. turn off xlators other than stat-prefetch. that is the only one that adds value for this workload, IIRC, and that too only on the randread test.
3. what is the iodepth/#jobs? try increasing that no.

Comment 67 Yaniv Kaul 2018-08-20 12:33:24 UTC
What's the next step here?

Comment 68 Krutika Dhananjay 2018-08-20 13:53:59 UTC
(In reply to Yaniv Kaul from comment #67)
> What's the next step here?

I had a patch I was working on to optimize the way we read messages off socket before I got pulled into some customer issues. I'll resume work on that sometime in the next week or so and then share the results on this bz.

-Krutika

Comment 69 Yaniv Kaul 2018-10-04 19:53:56 UTC
(In reply to Krutika Dhananjay from comment #68)
> (In reply to Yaniv Kaul from comment #67)
> > What's the next step here?
> 
> I had a patch I was working on to optimize the way we read messages off
> socket before I got pulled into some customer issues. I'll resume work on
> that sometime in the next week or so and then share the results on this bz.
> 
> -Krutika

Any updates?

Comment 70 Krutika Dhananjay 2018-10-05 03:42:25 UTC
(In reply to Yaniv Kaul from comment #69)
> (In reply to Krutika Dhananjay from comment #68)
> > (In reply to Yaniv Kaul from comment #67)
> > > What's the next step here?
> > 
> > I had a patch I was working on to optimize the way we read messages off
> > socket before I got pulled into some customer issues. I'll resume work on
> > that sometime in the next week or so and then share the results on this bz.
> > 
> > -Krutika
> 
> Any updates?

No, not yet. Had to drop this in the middle in order to complete other tasks. I'll resume working on this from Oct 08.

-Krutika

Comment 71 Worker Ant 2018-12-21 04:33:01 UTC
REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock in incoming message handling) posted (#1) for review on master by Krutika Dhananjay

Comment 72 Yaniv Kaul 2018-12-24 12:53:21 UTC
(In reply to Worker Ant from comment #71)
> REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock
> in incoming message handling) posted (#1) for review on master by Krutika
> Dhananjay

Is this patch above enough to satisfy this BZ? The bot moved this BZ to POST automatically, but I'm not sure it was intended.

Comment 73 Krutika Dhananjay 2018-12-24 13:35:35 UTC
(In reply to Yaniv Kaul from comment #72)
> (In reply to Worker Ant from comment #71)
> > REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock
> > in incoming message handling) posted (#1) for review on master by Krutika
> > Dhananjay
> 
> Is this patch above enough to satisfy this BZ? The bot moved this BZ to POST
> automatically, but I'm not sure it was intended.

Not really. This is just an umbrella bz to capture issues as and when they're found and fixed.
POST is OK, I guess. The bug won't be moved to MODIFIED until a patch sender specifically says
this is their last patch against this bz. So I guess, we're good.

Comment 74 Worker Ant 2018-12-26 05:15:58 UTC
REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock in incoming message handling) posted (#4) for review on master by Amar Tumballi

Comment 75 Worker Ant 2019-01-08 10:04:37 UTC
REVIEW: https://review.gluster.org/21999 (socket: fix issue on concurrent socket event handle) posted (#1) for review on master by Zhang Huan

Comment 76 Worker Ant 2019-01-28 04:53:08 UTC
REVIEW: https://review.gluster.org/21999 (socket: fix issue on concurrent handle of a socket) merged (#4) on master by Amar Tumballi

Comment 77 Shyamsundar 2019-03-25 16:30:11 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/