Description of problem:

For more information, refer to
http://lists.gluster.org/pipermail/gluster-devel/2017-June/052993.html

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#5) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#5) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17687 (socket: Use granular mutex locks during pollin and pollout event processing) posted (#6) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17688 (socket: Allow parallel processing of pollin and pollout events wherever possible) posted (#6) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17958 (rpc: Eliminate conn->lock contention by using more granular locks) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17958 (rpc: Eliminate conn->lock contention by using more granular locks) posted (#2) for review on master by MOHIT AGRAWAL (moagrawa)
Profiling a run for a simplified test: single-brick gluster volume mounted on the same server; brick on an NVMe device; 4K random read fio test with direct=1.

From: perf record -F 99 -p <glusterfs pid>
<quote>
Overhead  Command    Shared Object          Symbol
  11.34%  glusterfs  libpthread-2.17.so     [.] pthread_mutex_lock
   6.10%  glusterfs  libpthread-2.17.so     [.] pthread_mutex_unlock
   5.66%  glusterfs  libc-2.17.so           [.] vfprintf
   3.12%  glusterfs  [kernel.kallsyms]      [k] try_to_wake_up
   2.47%  glusterfs  libglusterfs.so.0.0.1  [.] _gf_msg
   2.07%  glusterfs  libc-2.17.so           [.] __memset_sse2
   1.88%  glusterfs  libc-2.17.so           [.] _IO_default_xsputn
   1.82%  glusterfs  libc-2.17.so           [.] _int_free
   1.59%  glusterfs  [kernel.kallsyms]      [k] native_queued_spin_lock_slowpath
   1.56%  glusterfs  libc-2.17.so           [.] __libc_calloc
   1.54%  glusterfs  libc-2.17.so           [.] _int_malloc
   1.41%  glusterfs  [kernel.kallsyms]      [k] _raw_qspin_lock
   1.35%  glusterfs  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
   1.14%  glusterfs  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
   1.04%  glusterfs  [kernel.kallsyms]      [k] futex_wait_setup
   1.01%  glusterfs  libglusterfs.so.0.0.1  [.] mem_get_from_pool
   0.97%  glusterfs  fuse.so                [.] fuse_attr_cbk
   0.91%  glusterfs  [kernel.kallsyms]      [k] futex_wake
   0.90%  glusterfs  libpthread-2.17.so     [.] pthread_spin_lock
   0.78%  glusterfs  libglusterfs.so.0.0.1  [.] mem_put
   0.77%  glusterfs  libglusterfs.so.0.0.1  [.] mem_get
   0.73%  glusterfs  libc-2.17.so           [.] _itoa_word
   ...
</quote>

From: perf record -F 99 -p <glusterfsd pid>
<quote>
Overhead  Command     Shared Object          Symbol
   6.25%  glusterfsd  libpthread-2.17.so     [.] pthread_mutex_lock
   4.48%  glusterfsd  libglusterfs.so.0.0.1  [.] _gf_msg
   3.60%  glusterfsd  libpthread-2.17.so     [.] pthread_mutex_unlock
   3.10%  glusterfsd  libc-2.17.so           [.] vfprintf
   1.37%  glusterfsd  [kernel.kallsyms]      [k] __bufio_new
   1.37%  glusterfsd  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
   1.36%  glusterfsd  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
   1.31%  glusterfsd  libc-2.17.so           [.] _IO_default_xsputn
   1.21%  glusterfsd  [kernel.kallsyms]      [k] __d_lookup_rcu
   1.12%  glusterfsd  [kernel.kallsyms]      [k] blk_throtl_bio
   1.10%  glusterfsd  [kernel.kallsyms]      [k] _raw_qspin_lock
   1.10%  glusterfsd  libc-2.17.so           [.] __libc_calloc
   0.92%  glusterfsd  [kernel.kallsyms]      [k] avc_has_perm_noaudit
   0.86%  glusterfsd  libpthread-2.17.so     [.] pthread_getspecific
   0.84%  glusterfsd  libc-2.17.so           [.] __memset_sse2
   0.82%  glusterfsd  [kernel.kallsyms]      [k] try_to_wake_up
   0.81%  glusterfsd  [kernel.kallsyms]      [k] kmem_cache_alloc
   0.79%  glusterfsd  libc-2.17.so           [.] _int_malloc
   0.75%  glusterfsd  libc-2.17.so           [.] __strchrnul
   ...
</quote>

A lot of cycles are being spent in pthread mutex lock/unlock. However, AFAIK, this doesn't tell us whether there is lock contention, or which locks are contended. We will need to use something like mutrace for that.
That's interesting.

I actually ran pure randrd job again on Sanjay's setup (through vms) with the best configuration known yet and with mempool and mem-accounting disabled in code last week. And I saw that the FUSE thread is consistently at 95% CPU utilization. I'm running perf-record on this build right now to see where the thread is spending all its time.

As for the perf record output above, I think using the "--call-graph=dwarf" option should also get us the backtrace, though this would require glusterfs-debuginfo package to be installed.

-Krutika
(In reply to Krutika Dhananjay from comment #10)
> That's interesting.
> 
> I actually ran pure randrd job again on Sanjay's setup (through vms) with
> the best configuration known yet and with mempool and mem-accounting
> disabled in code last week. And I saw that the FUSE thread is consistently
> at 95% CPU utilization. I'm running perf-record on this build right now to
> see where the thread is spending all its time.
> 
> As for the perf record output above, I think using the "--call-graph=dwarf"
> option should also get us the backtrace, though this would require
> glusterfs-debuginfo package to be installed.
> 
> -Krutika

The 95% CPU utilization was *with* client-io-threads enabled. Also double-checked to see that the translator does appear in the client volfile too.
(In reply to Krutika Dhananjay from comment #10)
> That's interesting.
> 
> I actually ran pure randrd job again on Sanjay's setup (through vms) with
> the best configuration known yet and with mempool and mem-accounting
> disabled in code last week. And I saw that the FUSE thread is consistently
> at 95% CPU utilization. I'm running perf-record on this build right now to
> see where the thread is spending all its time.

What was the gain, if any, in IOPs over previous runs?

FYI, the build I'm running with is a downstream build. This is from a few quick runs over the weekend on an existing setup that has other work in progress. The build has https://review.gluster.org/17391, and client-io-threads enabled.

> 
> As for the perf record output above, I think using the "--call-graph=dwarf"
> option should also get us the backtrace, though this would require
> glusterfs-debuginfo package to be installed.
> 
> -Krutika

debuginfo package is installed, and I have some data captured with call-graph enabled. I found the above to be cleaner to just get the functions where it is spending most time.
(In reply to Manoj Pillai from comment #12)
> (In reply to Krutika Dhananjay from comment #10)
> > That's interesting.
> > 
> > I actually ran pure randrd job again on Sanjay's setup (through vms) with
> > the best configuration known yet and with mempool and mem-accounting
> > disabled in code last week. And I saw that the FUSE thread is consistently
> > at 95% CPU utilization. I'm running perf-record on this build right now to
> > see where the thread is spending all its time.
> 
> What was the gain, if any, in IOPs over previous runs?

There was no gain or drop in IOPs. The only difference I noticed was that the fuse thread used to be at 78% CPU utilization before; now with memacct and mempool disabled, it shot up to 95%.

> FYI, the build I'm running with is a downstream build. This is from a few
> quick runs over the weekend on an existing setup that has other work in
> progress. The build has https://review.gluster.org/17391, and
> client-io-threads enabled.

OK.

> > As for the perf record output above, I think using the "--call-graph=dwarf"
> > option should also get us the backtrace, though this would require
> > glusterfs-debuginfo package to be installed.
> > 
> > -Krutika
> 
> debuginfo package is installed, and I have some data captured with
> call-graph enabled. I found the above to be cleaner to just get the
> functions where it is spending most time.
Created attachment 1319327 [details] I attached one of the bricks to mutrace and ran the same fio randrd job again. Attached is the output of the run.
Created attachment 1320865 [details]
iobuf.diff

Changed the code in iobuf.c such that all iobuf allocations happen directly through standard allocation (calloc), without going through the cycle of taking iobuf_pool->mutex, selecting the right iobuf_arena and then selecting an iobuf within it; and tested all of this on Sanjay's setup. Somehow the iops have dropped by 10K with this change, and the CPU utilization of the fuse thread further increased to 97%. I don't have an explanation for what's happening and why.

Raghavendra has found inode_ref() and inode_unref() as places where we can do away with inode_table->lock and replace it with inode->lock instead, but that change alone is leading to deadlocks in lots of places. So there's more that needs to be fixed there. Will test it thoroughly and then run it on Sanjay's setup.

-Krutika
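For readers following along, here is a heavily simplified sketch of the kind of change being described: bypassing the pool/arena machinery so that no pool-wide mutex is taken on the allocation path. The structure and function names below are illustrative stand-ins, not the actual iobuf.c code; the real change is the attached iobuf.diff.

/* Illustrative only: simplified stand-ins for the real iobuf structures. */
#include <stdlib.h>

struct iobuf_sketch {
        void  *ptr;    /* data buffer handed to callers */
        size_t size;
        int    ref;    /* refcounting kept non-atomic for brevity */
};

/* Allocate straight from the C library, so no pool-wide mutex is taken. */
static struct iobuf_sketch *
iobuf_get_std (size_t page_size)
{
        struct iobuf_sketch *iob = calloc (1, sizeof (*iob));
        if (!iob)
                return NULL;
        iob->ptr = calloc (1, page_size);
        if (!iob->ptr) {
                free (iob);
                return NULL;
        }
        iob->size = page_size;
        iob->ref  = 1;
        return iob;
}

static void
iobuf_put_std (struct iobuf_sketch *iob)
{
        if (iob && --iob->ref == 0) {
                free (iob->ptr);
                free (iob);
        }
}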
Some more updates since comment #15:

NOTE: All of these tests were done on an ovirt-gluster hyperconverged setup with 3 hosts, 9 vms, and bricks carved out of NVMe drives.

* iobuf.diff had a bug, and had not eliminated one case of contention on iobuf_pool->mutex in __iobuf_put(). I corrected that and ran randrd tests again. This time iobuf_pool->mutex didn't show up in mutrace info captured on the brick. But the iops continued to drop by 10K.

* Replacing inode_table->lock with inode->lock will lead to correctness issues and possible corruption in inode_unref(), since inode_unref() does not merely decrement the ref count but also operates on the inode_table and the lists within it. So it is necessary to acquire inode_table->lock when performing inode_unref(). The old (existing) code is correct.

* Applied https://review.gluster.org/#/c/16832/ from facebook to eliminate lock contention on conf->mutex in io-threads. Somehow this brought the perf down. This was with the default value of fops-per-thread-ratio - 20. This actually caused iops to drop from ~60K to 30K.

* Default io-thread-count is 16. So I decreased io-thread-count to 1 and yet there was no drop in IOPs. Only disabling client-io-threads altogether causes a drop in performance.

The last point possibly means either that even the io-threads are getting held up at lower layers - perhaps epoll, rpc, etc. - or that not enough parallel requests are received by client-io-threads for the thread count to matter.

* Most importantly, I disabled fuse event history in the crudest manner possible (by deleting calls to fuse_log_eh() and fuse_log_eh_fop()) and I'm seeing that the IOPs increase by 5K and disk utilization is up to 88-90%. This change takes the total IOPs we are seeing in the best case scenario for randrd to 66K from 61K. That still did not bring down the fuse thread's CPU utilization though!

That's it for now.
> * Replacing inode_table->lock with inode->lock will lead to correctness issues and possible corruption in inode_unref() since inode_unref() does not merely decrement ref count but also operates on inode_table and the lists within it. So it is necessary to acquire inode_table->lock when performing inode_unref(). The old (existing) code is correct.

We can acquire the table lock only when the refcount becomes zero (that's when we modify the various lists). I guess there are lots of inode_unrefs whose execution doesn't result in the refcount becoming 0.
Created attachment 1323547 [details] modified inode_unref to reduce lock contention on table->lock
(In reply to Raghavendra G from comment #18)
> Created attachment 1323547 [details]
> modified inode_unref to reduce lock contention on table->lock

Similar improvements can be made to inode_ref and inode_ref_reduce_by_n. The key point is to follow this pattern (see the sketch after this list):

* modify inode->ref and note the current ref under inode->lock. Unlock post modification.
* if the value of ref before or after modification results in modification of various lists of the table,
  - acquire table->lock
  - acquire inode->lock
  - ref = inode->ref
  - release inode->lock
  - if the value of ref still indicates that lists of the table need to be modified, modify the various lists
  - release table->lock
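A minimal sketch of this pattern in C, using simplified stand-in types (the real code is in libglusterfs' inode.c and the attached patch; the names below are illustrative, not the actual gluster structures):

#include <pthread.h>

struct table_sketch {
        pthread_mutex_t lock;
        /* lru/active/purge lists would live here */
};

struct inode_sketch {
        pthread_mutex_t      lock;
        unsigned int         ref;
        struct table_sketch *table;
};

void
inode_unref_sketch (struct inode_sketch *inode)
{
        unsigned int ref;

        /* fast path: only inode->lock is taken */
        pthread_mutex_lock (&inode->lock);
        ref = --inode->ref;
        pthread_mutex_unlock (&inode->lock);

        if (ref != 0)
                return;          /* common case: table lists untouched */

        /* slow path: the table's lists may need to change */
        pthread_mutex_lock (&inode->table->lock);
        pthread_mutex_lock (&inode->lock);
        ref = inode->ref;        /* re-check: a parallel ref may have raced in */
        pthread_mutex_unlock (&inode->lock);
        if (ref == 0) {
                /* move the inode to the table's lru/purge list here */
        }
        pthread_mutex_unlock (&inode->table->lock);
}

The re-check under both locks is what keeps the fast path safe: most unrefs never touch table->lock at all, and the rare ones that do still see a consistent view of the lists.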
(In reply to Raghavendra G from comment #18) > Created attachment 1323547 [details] > modified inode_unref to reduce lock contention on table->lock Mind sending the patch on gerrit? It's easier to view the diff there. Thanks. -Krutika
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17687 committed in master by Jeff Darcy (jeff.us)
------
commit ecc5b30289be1a63e2616733a022b3a9a35f12b7
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Jun 21 12:56:14 2017 +0530

    socket: Use granular mutex locks during pollin and pollout event processing

    ... instead of one global lock. This is because pollin and pollout
    processing code operate on mutually exclusive members of
    socket_private_t. Keeping this in mind, this patch introduces the more
    granular priv->in_lock and priv->out_lock locks. For pollerr, which
    modifies both priv->incoming and priv->ioq, both locks need to be taken.

    Change-Id: Id7aeb608dc7755551b6b404470d5d80709c81960
    BUG: 1467614
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17687
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: mohammed rafi kc <rkavunga>
    CentOS-regression: Gluster Build System <jenkins.org>
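The idea in the commit, sketched with illustrative names (the actual fields live in socket_private_t under rpc/rpc-transport/socket; this is a shape-of-the-change sketch, not the patch itself):

#include <pthread.h>

struct socket_priv_sketch {
        pthread_mutex_t in_lock;   /* guards incoming-message state */
        pthread_mutex_t out_lock;  /* guards the outgoing ioq */
};

void handle_pollin (struct socket_priv_sketch *priv)
{
        pthread_mutex_lock (&priv->in_lock);
        /* read and decode incoming fragments */
        pthread_mutex_unlock (&priv->in_lock);
}

void handle_pollout (struct socket_priv_sketch *priv)
{
        pthread_mutex_lock (&priv->out_lock);
        /* flush queued requests */
        pthread_mutex_unlock (&priv->out_lock);
}

void handle_pollerr (struct socket_priv_sketch *priv)
{
        /* error path touches both sides, so take both locks,
           always in the same order to avoid deadlock */
        pthread_mutex_lock (&priv->in_lock);
        pthread_mutex_lock (&priv->out_lock);
        /* tear down incoming state and the ioq */
        pthread_mutex_unlock (&priv->out_lock);
        pthread_mutex_unlock (&priv->in_lock);
}

Because pollin and pollout handlers touch disjoint state, splitting the single global lock lets them run concurrently on the same connection.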
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#1) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#2) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#3) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#4) for review on master by Raghavendra G (rgowdapp)
(In reply to Krutika Dhananjay from comment #16)
> Some more updates since comment #15:
> 
> NOTE: All of these tests were done on ovirt-gluster hyperconverged setup
> with 3 hosts, 9 vms, and bricks carved out of NVMe drives.
> 
> * iobuf.diff had a bug, and had not eliminated one case of contention on
> iobuf_pool->mutex in __iobuf_put(). I corrected that and ran randrd tests
> again. This time iobuf_pool->mutex didn't show up in mutrace info captured
> on brick. But the iops continued to drop by 10K.
> 
> * Replacing inode_table->lock with inode->lock will lead to correctness
> issues and possible corruption in inode_unref() since inode_unref() does not
> merely decrement ref count but also operates on inode_table and the lists
> within it. So it is necessary to acquire inode_table->lock when performing
> inode_unref(). The old (existing) code is correct.
> 
> * Applied https://review.gluster.org/#/c/16832/ from facebook to eliminate
> lock contention on conf->mutex in io-threads. Somehow this brought the perf
> down. This was with the default value of fops-per-thread-ratio - 20. This
> actually caused iops to drop from ~60K to 30K.
> 
> * Default io-thread-count is 16. So I decreased io-thread-count to 1 and yet
> there was no drop in IOPs. Only disabling client-io-threads altogether
> causes drop in performance.
> 
> The last point possibly means either that even the io-threads are getting
> held up at lower layers - perhaps epoll, rpc, etc. - or not enough parallel
> requests are received by client-io-threads for thread count to matter.
> 
> * Most importantly I disabled fuse event history in the crudest manner
> possible (by deleting calls to fuse_log_eh() and fuse_log_eh_fop()) and I'm
> seeing that the IOPs increase by 5K and disk utilization is up to 88-90%.
> This change takes the total IOPs we are seeing in the best case scenario for
> randrd to 66K from 61K. That still did not bring down fuse thread's CPU
> utilization though!

I stand corrected. The patch I was using had a bug. I fixed that and ran the test again, and now I see iops go up from 61K to ~73K. I saw disk utilization of up to 92%.

Also, on a single-brick distribute volume, I now see IOPs go up from ~18K to ~26K. The disk was 100% utilized (yes, I know that doesn't necessarily mean we've reached a dead end, just saying :)).

All of this was with randrd.job.

I am hoping some of the patches we'd tried earlier (such as fb's patch in io-threads) will be more relevant now, because it turns out the lock contention due to event history was not only affecting the callbacks of fops in fuse-bridge but also the call path (see FUSE_FOP; in fact, this was the bug I was mentioning above).

More updates next week..

-Krutika

> 
> That's it for now.
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/18242 (mount/fuse: Make event-history feature configurable) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/18242 committed in master by Amar Tumballi (amarts)
------
commit 956d43d6e89d40ee683547003b876f1f456f03b6
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Sep 7 18:48:34 2017 +0530

    mount/fuse: Make event-history feature configurable

    ... and disable it by default.

    This is because having it disabled seems to improve performance. This
    could be due to the lock contention by the different epoll threads on
    the circular buff lock in the fop cbks just before writing their
    response to /dev/fuse.

    Just to provide some data - wrt ovirt-gluster hyperconverged
    environment, I saw an increase in IOPs by 12K with event-history
    disabled for random read workload.

    Usage:
    mount -t glusterfs -o event-history=on $HOSTNAME:$VOLNAME $MOUNTPOINT
    OR
    glusterfs --event-history=on --volfile-server=$HOSTNAME --volfile-id=$VOLNAME $MOUNTPOINT

    Change-Id: Ia533788d309c78688a315dc8cd04d30fad9e9485
    BUG: 1467614
    Signed-off-by: Krutika Dhananjay <kdhananj>
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#5) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#6) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#7) for review on master by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/18255 (libglusterfs/inode: reduce lock contention on itable->lock) posted (#8) for review on master by Raghavendra G (rgowdapp)
Did some tests to understand the iops limits on the client-side and on the brick-side. Updating the bz with the results.

Results are based on: glusterfs-3.12.1-2.el7.x86_64

First, given that NVMe-equipped servers are not that easy to get in our lab, I wanted to see if I could use a ram disk instead as a fast brick device to identify bottlenecks in the gluster software stack. That works well, IMO. Runs here were done with the brick on a ramdisk, and results are similar to what Krutika has been reporting on the NVMe drive.

Server-side setup:

# create the ramdisk of size 16g:
modprobe brd rd_nr=1 rd_size=16777216 max_part=0
# check with "ls -l /dev/ram*"

# create a single-brick volume on ramdisk
mkfs.xfs -f -i size=512 /dev/ram0
mount -t xfs /dev/ram0 /mnt/rhs_brick1
gluster v create perfvol ${server}:/mnt/rhs_brick1 force
gluster v start perfvol

Since this is a random I/O test, applying these settings:
performance.strict-o-direct: on
network.remote-dio: disable
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.write-behind: off
performance.client-io-threads: on
performance.io-thread-count: 4
performance.read-ahead: off

I have up to 4 clients that connect to the server over a 10GbE link.

Test steps:
* mount the volume on each client on /mnt/glustervol
* create a separate directory for each client: /mnt/glustervol/<hostname>
* prepare for fio dist test: "fio --server --daemonize=/var/run/fio-svr.pid"
* create data set using fio seq. write test
* sync and drop caches all around
* perform fio randread test

Single-client results:

Command and job file for the fio seq. write test to create a 12g data set:
fio --output=out.fio.write --client=<hosts file> job.fio.write

[global]
rw=write
create_on_open=1
fsync_on_close=1
bs=1024k
startdelay=0
ioengine=sync

[seqwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m

Job file for the fio randread test (reads 25% of previously written data):

[global]
rw=randread
startdelay=0
ioengine=sync
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m
io_size=128m

Result for the above case (24 concurrent jobs):
   read: IOPS=22.9k, BW=89.5Mi (93.9M)(3072MiB/34309msec)

The IOPS number is in the ballpark of what was reported in comment #27 for the single-client, single-brick case.

Result for 48 concurrent jobs:
[changed the job file so numjobs=48, but filesize is proportionately reduced so that the total data set size is 12g. Note: the brick is a 16g ram disk]
   read: IOPS=23.1k, BW=90.4Mi (94.8M)(3072MiB/33985msec)

So we are getting only a fraction of the iops that the brick device is capable of, and increasing the number of jobs doesn't help -- IOPS stays close to 23k. Tuning event-threads and io-thread-count does not help either -- IOPS stays stuck at around 23k.

The question is, is the bottleneck on the brick side or the client side? Probably on the client side. To test that, we'll run the test with 4 clients. If the brick is the bottleneck, IOPs should stay the same. If not, IOPs should increase, ideally to 4x.

Result with 4 clients, single brick:
[fio filesize, size and io_size are adjusted to keep the data set to 12g]
   read: IOPS=66.4k, BW=259Mi (272M)(3072MiB/11845msec)

For this above test, I did change io-thread-count=8.

So, IOPs increases, but not 4x.
This tells me that the bottleneck responsible for the 23k IOPs limit in the single-client, single-brick test is on the client side. There is a bottleneck at the brick as well, but we hit it at a higher IOPs. Had there been no bottleneck on the brick, the IOPs on the 4-client test would have been close to 90k. To confirm that, I did a test, but this time with 2 bricks (each on a separate ramdisk).

Results with 4 clients, 2-brick distribute volume:
[io-thread-count is set to 8 in this test; with io-thread-count at 4, I was getting only about 76k IOPs]
   read: IOPS=87.8k, BW=343Mi (360M)(3072MiB/8953msec)

To summarize, from these tests it looks like the gluster stack has fairly low limits on how many IOPs it can push through, on both the client and brick sides:
fuse-client limit: 23k IOPs
brick limit: 66k IOPs
While comment #35 separately establishes what the limits are on the client side and brick side, it does not require us to change course. The priority should be fixing the client-side bottleneck, which is what previous comments are dealing with anyway. But when we do analyze brick-side bottlenecks, we should use a multi-client test to ensure enough IOPs are being sent in to push the brick to its limit.

For the single-client, single-brick test, the best result I saw was with io-thread-count=2 (not much improvement over io-thread-count=4, but still). That ideal number might change as we remove existing bottlenecks, but that's where things stand as of now. In contrast, for the brick to reach its limit, io-thread-count needs to be at least 8. I think the investigation here would benefit from being able to separately control the no. of io-threads on the client side and the brick, i.e. from a fix for bz #1499649.
Switched to a 32g ramdisk (the server has 56g) so that I can have longer runs with a larger data set of 24g instead of the 12g in comment #35.

Repeated the 4 client, single brick run (io-thread-count=8, event-threads=4):
   read: IOPS=59.0k, BW=234Mi (246M)(6144MiB/26220msec)
[IOPs dropped slightly with the longer run.]

Output of "top -bH -d 10" ON THE BRICK during randread looks like this:

<quote>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2500 root      20   0 2199708  32556   4968 R 97.6  0.1   4:35.79 glusterrpcs+
 7040 root      20   0 2199708  32556   4968 S 74.3  0.1   0:19.62 glusteriotw+
 7036 root      20   0 2199708  32556   4968 S 73.5  0.1   0:24.13 glusteriotw+
 7039 root      20   0 2199708  32556   4968 S 73.0  0.1   0:23.73 glusteriotw+
 6854 root      20   0 2199708  32556   4968 S 72.7  0.1   0:35.91 glusteriotw+
 7035 root      20   0 2199708  32556   4968 S 72.7  0.1   0:23.99 glusteriotw+
 7038 root      20   0 2199708  32556   4968 R 72.5  0.1   0:23.75 glusteriotw+
 7034 root      20   0 2199708  32556   4968 S 72.3  0.1   0:23.60 glusteriotw+
 7037 root      20   0 2199708  32556   4968 R 72.3  0.1   0:23.42 glusteriotw+
 2510 root      20   0 2199708  32556   4968 S 34.0  0.1   1:28.11 glusterposi+
</quote>

pstack on the thread showing 97+% CPU utilization:

<quote>
# pstack 2500
Thread 1 (process 2500):
#0  0x00007f0301398945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f03022f733b in rpcsvc_request_handler (arg=0x7f02f003f530) at rpcsvc.c:1881
#2  0x00007f0301394e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0300c6134d in clone () from /lib64/libc.so.6
</quote>
(In reply to Manoj Pillai from comment #36)
> I think the investigation here would
> benefit from being able to separately control no. of io-threads on the
> client-side and brick i.e. from a fix for bz #1499649.

In the meantime, I'd suggest keeping io-thread-count=4. Here's why...

Result from the 4 client, 1 brick test with io-thread-count=2:
   read: IOPS=26.1k, BW=102Mi (107M)(6144MiB/60198msec)

And output from top on the brick:
<quote>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3900 root      20   0 1319128  30236   4968 R 99.2  0.1   1:13.16 glusteriotw+
 3345 root      20   0 1319128  30236   4968 R 98.2  0.1   1:56.45 glusteriotw+
 2500 root      20   0 1319128  30236   4968 S 36.0  0.1   1:20.12 glusterrpcs+
</quote>

So the brick can only deliver about 26k IOPs with this setting. That is close to the client-side limit of 23k IOPs. If we use io-thread-count=2, there is a risk that we'll discard valuable client-side enhancements thinking that they are not giving any benefits when in fact they might be. Also, we should re-evaluate any enhancements we have discarded in this way :).
Back to the client-side analysis. Single-client, single brick runs. Client system separate from server, 10GbE interconnect. io-thread-count=4.

Trying an approach where I'm doing runs with increasing number of concurrent jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in each case. Comparing mutrace output for the runs and looking for locks that rise up in the ranking of contended locks when concurrency increases. Based on this, I'm looking at the following as prime suspects:

Mutex #28293810 (0x0x7fdc240ca940) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /usr/lib64/glusterfs/3.12.1/rpc-transport/socket.so(+0xb4c8) [0x7fdc2b66c4c8]

Mutex #27743163 (0x0x7fdc240c9890) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /lib64/libgfrpc.so.0(rpc_clnt_new+0xf0) [0x7fdc36f3c840]

Maybe to a lesser extent, but also this one:

Mutex #21500558 (0x0x5608175976c8) first referenced by:
        /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
        /lib64/libglusterfs.so.0(cb_buffer_new+0x7c) [0x7fdc371ccafc]

Might be a good idea to have a discussion on these.
(In reply to Manoj Pillai from comment #40)
> Back to the client-side analysis. Single-client, single brick runs. Client
> system separate from server, 10GbE interconnect. io-thread-count=4.
> 
> Trying an approach where I'm doing runs with increasing number of concurrent
> jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> each case. Comparing mutrace output for the runs and looking for locks that
> rise up in the ranking of contended locks when concurrency increases. Based
> on this, I'm looking at the following as prime suspects:
> 
> Mutex #28293810 (0x0x7fdc240ca940) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
>         /usr/lib64/glusterfs/3.12.1/rpc-transport/socket.so(+0xb4c8) [0x7fdc2b66c4c8]
> 
> Mutex #27743163 (0x0x7fdc240c9890) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
>         /lib64/libgfrpc.so.0(rpc_clnt_new+0xf0) [0x7fdc36f3c840]
> 
> Maybe to a lesser extent, but also this one:
> Mutex #21500558 (0x0x5608175976c8) first referenced by:
>         /usr/local/lib/libmutrace.so(pthread_mutex_init+0x1ae) [0x7fdc3744d95a]
>         /lib64/libglusterfs.so.0(cb_buffer_new+0x7c) [0x7fdc371ccafc]
> 
> Might be a good idea to have a discussion on these.

So the last mutex (created in cb_buffer_new) is associated with event-history. I would suggest you try 3.12.2, which has event-history disabled, and see what the new results are.
Created attachment 1348697 [details] mutrace output for fio randread 24 concurrent jobs
Created attachment 1348698 [details] mutrace output for fio randread with 48 concurrent jobs
(In reply to Manoj Pillai from comment #40)
> Back to the client-side analysis. Single-client, single brick runs. Client
> system separate from server, 10GbE interconnect. io-thread-count=4.
> 
> Trying an approach where I'm doing runs with increasing number of concurrent
> jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> each case. Comparing mutrace output for the runs

Looking at this more, I'm not convinced that lock contention is the root cause of the IOPs limit we are hitting...

fio output for run with 24 concurrent jobs:
   read: IOPS=23.7k, BW=92.4Mi (96.9M)(6144MiB/66483msec)
    clat (usec): min=209, max=2951, avg=1010.89, stdev=62.19
     lat (usec): min=209, max=2952, avg=1011.16, stdev=62.18

fio output for run with 48 concurrent jobs:
   read: IOPS=23.5k, BW=91.8Mi (96.3M)(6144MiB/66932msec)
    clat (usec): min=237, max=4431, avg=2038.93, stdev=120.72
     lat (usec): min=237, max=4431, avg=2039.21, stdev=120.71

IOPs is about the same but avg latency is double in the 48 jobs case, up by about 1ms. If lock contention is what is limiting IOPS, we should have seen a huge spike in contention times reported by mutrace for the run with 48 jobs. mutrace outputs are attached in comment #42 and comment #43. Contention times reported vary from run to run, but I'm not seeing the kind of increases I'd expect if lock contention was the root cause.

I think we need to look for other possible bottlenecks as well.
(In reply to Manoj Pillai from comment #44)
> (In reply to Manoj Pillai from comment #40)
> > Back to the client-side analysis. Single-client, single brick runs. Client
> > system separate from server, 10GbE interconnect. io-thread-count=4.

The data in comments 42-44 are actually from runs on a single system, i.e. client and server co-located. Doesn't make any difference to the inference, but just to be accurate...
(In reply to Manoj Pillai from comment #44)
> (In reply to Manoj Pillai from comment #40)
> > Back to the client-side analysis. Single-client, single brick runs. Client
> > system separate from server, 10GbE interconnect. io-thread-count=4.
> > 
> > Trying an approach where I'm doing runs with increasing number of concurrent
> > jobs -- 24, 48 and 96 -- and attaching mutrace to the glusterfs client in
> > each case. Comparing mutrace output for the runs
> 
> Looking at this more, I'm not convinced that lock contention is the root
> cause of the IOPs limit we are hitting...
> 
> fio output for run with 24 concurrent jobs:
>    read: IOPS=23.7k, BW=92.4Mi (96.9M)(6144MiB/66483msec)
>     clat (usec): min=209, max=2951, avg=1010.89, stdev=62.19
>      lat (usec): min=209, max=2952, avg=1011.16, stdev=62.18
> 
> fio output for run with 48 concurrent jobs:
>    read: IOPS=23.5k, BW=91.8Mi (96.3M)(6144MiB/66932msec)
>     clat (usec): min=237, max=4431, avg=2038.93, stdev=120.72
>      lat (usec): min=237, max=4431, avg=2039.21, stdev=120.71
> 
> IOPs is about the same but avg latency is double in the 48 jobs case, up by
> about 1ms. If lock contention is what is limiting IOPS, we should have seen
> a huge spike in contention times reported by mutrace for the run with 48
> jobs. mutrace outputs are attached in comment #42 and comment #43.
> Contention times reported vary from run to run, but I'm not seeing the kind
> of increases I'd expect if lock contention was the root cause.
> 
> I think we need to look for other possible bottlenecks as well.

So I compared the mutrace output attachments from comment #42 and #43, and I see that the contention time of the lock in iobuf_pool_new() increased by 10 seconds upon increasing iodepth to 48 (with iodepth=24 it was 28283ms or 28s, and with iodepth=48 it is 38450ms or 38s).

This is for my own understanding - based on your calculations in comment #44, is a 10s spike not big enough for this to be taken seriously?

-Krutika
(In reply to Krutika Dhananjay from comment #46) I am hedging my bets a bit with phrases like "not convinced" :). If you have a strong feeling about one of these locks being the root cause, don't let me talk you out of it :). Looking at how mutrace calculates contention time (mutrace.c; functions pthread_mutex_lock and mutex_lock). Consider an example where a critical section with 1ms execution time is guarded by a mutex, and 10 concurrent requests are active. Total execution time should be ~10ms. Contention time as calculated by mutrace, IIUC, would be roughly 0+1+2+...+9=45ms. With 50 concurrently active requests, execution time would be ~50ms and contention time would be 49*(49+1)/2 or 24.5 times the execution time. In comment #44, fio output for the run with 48 jobs shows increase in avg latency of ~1ms compared to the run with 24 jobs. Total no. of requests is IOPs*run-time i.e. 23K*66. If the increase in latency for the 48-job run is coming solely due to lock contention, contention time reported could in theory go up by as much as 23*66 seconds. Those are the reasons why I talked about expecting big increases in reported contention time if lock contention is indeed the culprit here. But since you asked I looked at the mutrace runs again and saw some problems there. Not sure we can take decisions confidently based on the output we are getting. Let me talk about that in a separate comment.
(In reply to Manoj Pillai from comment #47) for the mutrace runs: fio output for run with 24 concurrent jobs: read: IOPS=2868, BW=11.2Mi (11.7M)(6144MiB/548349msec) clat (usec): min=1471, max=23086, avg=8363.15, stdev=4030.90 lat (usec): min=1471, max=23087, avg=8363.44, stdev=4030.89 fio output for run with 48 concurrent jobs: read: IOPS=2833, BW=11.1Mi (11.6M)(6144MiB/555024msec) clat (usec): min=2975, max=44776, avg=16932.81, stdev=7990.35 lat (usec): min=2975, max=44776, avg=16933.09, stdev=7990.35 When you run without mutrace, IOPs on these randread runs is ~23k. With mutrace it drops to less than 3k. For the normal run with 48 jobs, top shows fuse thread at <70% CPU utlization; for the mutrace run, top shows ~97% CPU utilization. That's a concern, because it could mean that the bottleneck is the fuse thread, and hence the lock contention results are not useful.
COMMIT: https://review.gluster.org/17958 committed in master by "Krutika Dhananjay" <kdhananj> with a commit message-

    rpc: Eliminate conn->lock contention by using more granular locks

    rpc_clnt_submit() acquires conn->lock before the call to
    rpc_transport_submit_request() and the subsequent queuing of the frame
    into the saved_frames list. However, as part of handling
    RPC_TRANSPORT_MSG_RECEIVED and RPC_TRANSPORT_MSG_SENT notifications in
    rpc_clnt_notify(), conn->lock is again used to atomically update the
    conn->last_received and conn->last_sent event timestamps. So when
    conn->lock is acquired as part of submitting a request, a parallel
    POLLIN notification gets blocked at the rpc layer until the request
    submission completes and the lock is released.

    To get around this, this patch calls clock_gettime (instead of
    gettimeofday) to update the event timestamps in conn->last_received
    and conn->last_sent; the clock_gettime call does not need to be made
    under the mutex because it (clock_gettime) is a thread safe call.

    Note: running fio on a vm after applying the patch, iops is improved.

    Change-Id: I347b5031d61c426b276bc5e07136a7172645d763
    BUG: 1467614
    Signed-off-by: Krutika Dhananjay <kdhananj>
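A sketch of the change the commit message describes, with simplified stand-in types (the real code is in the rpc-clnt notification path; as the commit notes, clock_gettime() itself is thread safe, and these are heartbeat-style timestamps where an occasionally stale read is acceptable):

#include <time.h>

/* Simplified stand-in for the connection state; illustrative only. */
struct conn_sketch {
        struct timespec last_sent;
        struct timespec last_received;
};

/* Previously these updates were done with gettimeofday() while holding
 * conn->lock, which made MSG_SENT/MSG_RECEIVED notifications contend
 * with request submission on the same mutex. Taking the timestamp
 * without the mutex removes that contention. */
void note_msg_sent (struct conn_sketch *conn)
{
        clock_gettime (CLOCK_REALTIME, &conn->last_sent);
}

void note_msg_received (struct conn_sketch *conn)
{
        clock_gettime (CLOCK_REALTIME, &conn->last_received);
}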
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
(In reply to Manoj Pillai from comment #38)
> Switched to a 32g ramdisk (the server has 56g) so that I can have longer
> runs with larger data set of 24g instead of 12g in comment #35.
> 
> Repeated the 4 client, single brick run (io-thread-count=8, event-threads=4):
>    read: IOPS=59.0k, BW=234Mi (246M)(6144MiB/26220msec)
> [IOPs dropped slightly with the longer run.]
> 
> Output of "top -bH -d 10" ON THE BRICK during randread looks like this:
> 
> <quote>
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>  2500 root      20   0 2199708  32556   4968 R 97.6  0.1   4:35.79 glusterrpcs+
>  7040 root      20   0 2199708  32556   4968 S 74.3  0.1   0:19.62 glusteriotw+
>  7036 root      20   0 2199708  32556   4968 S 73.5  0.1   0:24.13 glusteriotw+
>  7039 root      20   0 2199708  32556   4968 S 73.0  0.1   0:23.73 glusteriotw+
>  6854 root      20   0 2199708  32556   4968 S 72.7  0.1   0:35.91 glusteriotw+
>  7035 root      20   0 2199708  32556   4968 S 72.7  0.1   0:23.99 glusteriotw+
>  7038 root      20   0 2199708  32556   4968 R 72.5  0.1   0:23.75 glusteriotw+
>  7034 root      20   0 2199708  32556   4968 S 72.3  0.1   0:23.60 glusteriotw+
>  7037 root      20   0 2199708  32556   4968 R 72.3  0.1   0:23.42 glusteriotw+
>  2510 root      20   0 2199708  32556   4968 S 34.0  0.1   1:28.11 glusterposi+
> </quote>
> 
> pstack on the thread showing 97+% CPU utilization:
> 
> <quote>
> # pstack 2500
> Thread 1 (process 2500):
> #0  0x00007f0301398945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f03022f733b in rpcsvc_request_handler (arg=0x7f02f003f530) at rpcsvc.c:1881
> #2  0x00007f0301394e25 in start_thread () from /lib64/libpthread.so.0
> #3  0x00007f0300c6134d in clone () from /lib64/libc.so.6
> </quote>

This thread was introduced as part of our efforts to not execute any of the glusterfs program's code in event threads [1]. I had considered a multithreaded model, but preferred a single threaded model as we didn't see much of a performance regression in our limited performance tests, and the former resulted in regression failures. I'll send a patch to make sure there is a 1:1 mapping between event threads and threads executing rpcsvc_request_handler.

A related comment from Manoj:

<comment>
one thing though. the thread is in the high 90s cpu utilization, not quite 100. it is possible there is another bottleneck, and at this time we won't see a benefit from the multithreaded model. but seems quite clear that the single-threaded model will not scale much beyond 60k.
</comment>

[1] https://review.gluster.org/17105
We [Raghavendra, Krutika, Mohit, Milind, me] met up and Raghavendra went over the rpc layer details so we could brainstorm on potential inefficiencies that can explain the current low limit on IOPS on the client side.

We discussed the possibility that the current iops limit may be due to the delay in adding back the socket for polling. Similar to https://review.gluster.org/#/c/17391/, but now the delay being due to socket_event_handler() in rpc/rpc-transport/socket. One suggestion from Milind was to have multiple connections between the client and the brick.

We can test this hypothesis with a plain distribute volume with multiple bricks, each on NVMe SSDs; if we see that going from a single brick to multiple bricks improves IOPS, multiple connections per brick could be promising to have. IIRC, earlier tests with multiple bricks were not effective in scaling IOPS. However, it would be good to check whether recent improvements have changed that.

The short of it is, yes, I saw IOPS go up from 34k with a single brick to 45.7k with a 3-brick distribute volume. That is a good bump in IOPS, maybe better than any single bump we have seen so far. At 45.7k, on my setup, the fuse thread looks to be a bottleneck at 98% CPU utilization, so the gain could potentially have been higher. [In order to get this higher throughput, I had to bump up the no. of concurrent requests to 48.]

Details::

version: glusterfs-3.13.0-1.el7.x86_64

fio job file for random read:

[global]
rw=randread
exitall
startdelay=0
ioengine=sync
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=48
nrfiles=1
openfiles=1
filesize=5g
size=5g
io_size=512m

results:
single-brick:
   read: IOPS=34.0k, BW=137Mi (143M)(23.0GiB/179719msec)
3-brick distribute volume:
   read: IOPS=45.7k, BW=178Mi (187M)(23.0GiB/132038msec)

volume info (options):
Options Reconfigured:
performance.strict-o-direct: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.client-io-threads: on
performance.io-thread-count: 8
performance.read-ahead: off
transport.address-family: inet
nfs.disable: on

client top output (thread-level) for the 3-brick case:
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24932 root      20   0 1284136  27576   3680 R 97.9  0.1   4:08.73 glusterfuse+
24925 root      20   0 1284136  27576   3680 R 72.5  0.1   2:26.06 glusterepol+
24928 root      20   0 1284136  27576   3680 S 72.5  0.1   2:26.01 glusterepol+
24927 root      20   0 1284136  27576   3680 R 72.4  0.1   2:25.79 glusterepol+
24929 root      20   0 1284136  27576   3680 S 72.4  0.1   2:26.17 glusterepol+
25093 root      20   0 1284136  27576   3680 R 41.4  0.1   1:07.35 glusteriotw+
25095 root      20   0 1284136  27576   3680 S 41.4  0.1   1:06.56 glusteriotw+

Conclusions:
1. The results indicate that we are on the right track in suspecting the delayed add-back of the socket to be one of the culprits for limited IOPS per client.
2. We need to deal with the fuse-thread bottleneck in order to make more progress here.
Hi,

I think at the fuse level we have two threads (glusterfuseproc, glusterfusenoti) whose names start with glusterfuse; from the previous output it is difficult to tell which thread is consuming the most CPU.

Can we run top for this specific mount process to see the CPU consumption of its individual threads?

top -H -p <mount_pid>

Regards
Mohit Agrawal
(In reply to Mohit Agrawal from comment #53)
> Hi,
> 
> I think at the fuse level we have two threads (glusterfuseproc, glusterfusenoti)
> whose names start with glusterfuse; from the previous output it is difficult
> to tell which thread is consuming the most CPU.
> 
> Can we run top for this specific mount process to see the CPU consumption
> of its individual threads?
> 
> top -H -p <mount_pid>
> 
> Regards
> Mohit Agrawal

Here it is:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
27665 root      20   0 1284136  29340   3680 R 98.3  0.1   2:28.84 glusterfuseproc
27658 root      20   0 1284136  29340   3680 S 70.3  0.1   1:13.76 glusterepoll0
27660 root      20   0 1284136  29340   3680 R 70.3  0.1   1:13.78 glusterepoll1
27661 root      20   0 1284136  29340   3680 R 70.3  0.1   1:12.80 glusterepoll2
27662 root      20   0 1284136  29340   3680 R 70.3  0.1   1:12.99 glusterepoll3
27826 root      20   0 1284136  29340   3680 S 40.0  0.1   0:21.38 glusteriotwr1
27831 root      20   0 1284136  29340   3680 R 40.0  0.1   0:20.92 glusteriotwr5
27832 root      20   0 1284136  29340   3680 S 40.0  0.1   0:21.05 glusteriotwr6
27659 root      20   0 1284136  29340   3680 S 39.7  0.1   0:21.12 glusteriotwr0
27827 root      20   0 1284136  29340   3680 R 39.7  0.1   0:20.72 glusteriotwr2
27830 root      20   0 1284136  29340   3680 S 39.7  0.1   0:21.17 glusteriotwr4
27833 root      20   0 1284136  29340   3680 S 39.3  0.1   0:20.70 glusteriotwr7
27829 root      20   0 1284136  29340   3680 R 38.7  0.1   0:20.50 glusteriotwr3
27651 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glusterfs
27652 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustertimer
27653 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersigwait
27654 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustermemsweep
27655 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersproc0
27656 root      20   0 1284136  29340   3680 S  0.0  0.1   0:00.00 glustersproc1
27666 root      20   0 1284136  29340   3680 S  0.0  0.1   0:08.27 glusterfusenoti
Hi,

I think we can try one way: limit CPU quota usage for the fuse thread only. I am assuming that if we set a CPU quota limit for the fuse start thread then IOPS can improve, because it could be that the other threads are not able to consume CPU (as we can see in the previous output, the io threads are not consuming much CPU while the fuse thread is consuming high CPU) because most of the time the CPU is consumed by the fuse start threads.

If we are interested in trying this quick and dirty way then I can provide options to use this.

Regards
Mohit Agrawal
(In reply to Mohit Agrawal from comment #55)
> Hi,
> 
> I think we can try one way: limit CPU quota usage for the fuse thread only.
> I am assuming that if we set a CPU quota limit for the fuse start thread
> then IOPS can improve, because it could be that the other threads are not
> able to consume CPU (as we can see in the previous output, the io threads
> are not consuming much CPU while the fuse thread is consuming high CPU)
> because most of the time the CPU is consumed by the fuse start threads.
> 
> If we are interested in trying this quick and dirty way then I can provide
> options to use this.
> 
> Regards
> Mohit Agrawal

I'll paste the top -H output so that it shows overall cpu utilization as well. us+sys is <32%; there are plenty of CPU cycles for the other threads. The problem is that the fuse thread has too much "work" -- either real work or spinlock contention -- so it becomes the bottleneck. Limiting its cpu usage doesn't seem like the right thing to do.

<quote>
top - 00:09:30 up 13 days, 17:47,  4 users,  load average: 37.42, 15.02, 5.71
Threads: 371 total,  12 running, 359 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.0 us, 11.7 sy,  0.0 ni, 28.4 id, 38.9 wa,  0.0 hi,  1.1 si,  0.0 st
KiB Mem : 49279472 total, 48222328 free,   604892 used,   452252 buff/cache
KiB Swap: 24772604 total, 24772604 free,        0 used. 48025148 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24932 root      20   0 1284136  27576   3680 R 97.9  0.1   4:08.73 glusterfuse+
24925 root      20   0 1284136  27576   3680 R 72.5  0.1   2:26.06 glusterepol+
24928 root      20   0 1284136  27576   3680 S 72.5  0.1   2:26.01 glusterepol+
24927 root      20   0 1284136  27576   3680 R 72.4  0.1   2:25.79 glusterepol+
</quote>
Hi,

After using the splice call (for github issue https://github.com/gluster/glusterfs/issues/372) on the server side (only for writes), I got some performance improvement in writes. Though this is only initial-level testing, I hope we can get some improvement in IOPS if we use splice on both sides (client and server, and for both read and write fops).

Without the splice patch:

for i in `(seq 1 5)`
do
rm -rf /mnt/rpmbuild/glusterfs-4.0dev1/*
echo 3 > /proc/sys/vm/drop_caches
time cp -rf rpmbuild_build/glusterfs-4.0dev1/* /mnt/rpmbuild/glusterfs-4.0dev1/
done

real    22m51.189s
user    0m0.086s
sys     0m1.511s

real    22m57.531s
user    0m0.090s
sys     0m2.025s

real    22m57.845s
user    0m0.115s
sys     0m1.834s

real    22m51.257s
user    0m0.113s
sys     0m1.966s

real    22m57.857s
user    0m0.113s
sys     0m1.966s

# After applying the splice patch

for i in `(seq 1 5)`; do rm -rf /mnt/rpmbuild/glusterfs-4.0dev1/*; echo 3 > /proc/sys/vm/drop_caches; time cp -rf rpmbuild_build/glusterfs-4.0dev1/* /mnt/rpmbuild/glusterfs-4.0dev1/; done

real    17m51.073s
user    0m0.104s
sys     0m1.862s

real    17m50.057s
user    0m0.097s
sys     0m2.005s

real    17m50.022s
user    0m0.096s
sys     0m1.928s

real    17m49.073s
user    0m0.101s
sys     0m1.828s

real    17m47.594s
user    0m0.077s
sys     0m1.753s

Regards
Mohit Agrawal
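For reference, a minimal sketch of the splice() idea being tested. This is not the actual patch; it only shows the generic socket-to-pipe-to-file shape that splice() requires (one end of every splice() call must be a pipe), with error handling and pipe-draining edge cases trimmed for brevity.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Move up to 'len' bytes arriving on sock_fd into file_fd at *off while
 * staying in the kernel: socket -> pipe -> file. Assumes the second
 * splice drains what the first splice just put into the pipe. */
ssize_t
splice_sock_to_file (int sock_fd, int file_fd, loff_t *off, size_t len)
{
        int p[2];
        ssize_t moved = 0;

        if (pipe (p) < 0)
                return -1;

        while ((size_t) moved < len) {
                ssize_t in = splice (sock_fd, NULL, p[1], NULL, len - moved,
                                     SPLICE_F_MOVE | SPLICE_F_MORE);
                if (in <= 0)
                        break;
                ssize_t out = splice (p[0], NULL, file_fd, off, in,
                                      SPLICE_F_MOVE | SPLICE_F_MORE);
                if (out <= 0)
                        break;
                moved += out;
        }

        close (p[0]);
        close (p[1]);
        return moved;
}

The win comes from avoiding the copy through user-space buffers on the write path, which is consistent with the reduced wall-clock times shown above.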
REVIEW: https://review.gluster.org/19133 (protocol/client: multiple rpc connections between single client and brick) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19161 (Revert \"program/GF-DUMP: Shield ping processing from traffic to Glusterfs\") posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19226 (WIP: fuse multi-threaded reader) posted (#1) for review on master by Krutika Dhananjay
I loaded io-stats below fuse-bridge and at a couple of other places on the gluster stack to profile the fops at different levels (randrd workload). Here is an excerpt from the profile taken just below gluster's fuse-bridge translator, even with attribute-timeout set to 600s:

<excerpt>
...
...
Fop      Call Count    Avg-Latency    Min-Latency    Max-Latency
---      ----------    -----------    -----------    -----------
READ        3145728     1145.73 us       77.66 us    42739.02 us
FLUSH            96        2.23 us        1.21 us       23.29 us
FSTAT       3145632        2.66 us        0.67 us     3435.86 us
...
...
</excerpt>

I also captured strace output of the application driving the I/O on the glusterfs mount - in this case fio - and grep'd for the number of fstats invoked by it. Here are the counts of syscalls executed by fio:

[root@rhs-srv-07 tmp]# grep -iarT 'fstat' fio-strace.140* | wc -l
95
[root@rhs-srv-07 tmp]# grep -iarT 'lstat' fio-strace.140* | wc -l
3269
[root@rhs-srv-07 tmp]# grep -iarT '^stat' fio-strace.140* | wc -l
1132
[root@rhs-srv-07 tmp]# grep -iarT 'read' fio-strace.140* | wc -l
3146696

Based on this data, we can conclude that neither the application nor the glusterfs stack is winding as many fstats as the number of reads. So the only suspect left is the kernel.

-Krutika
(In reply to Worker Ant from comment #58) > REVIEW: https://review.gluster.org/19133 (protocol/client: multiple rpc > connections between single client and brick) posted (#1) for review on > master by Raghavendra G Credits for this patch should go to "Milind Changire" <mchangir>.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-4.0.0, please open a new bug report.

glusterfs-4.0.0 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-March/000092.html
[2] https://www.gluster.org/pipermail/gluster-users/
So I wrote a quick and dirty patch in inode.c to minimize contention on itable->lock, particularly in the inode_{ref,unref,lookup} codepaths.

With 8 fuse threads, I was getting ~60K iops in the randrd fio test earlier (with all of the FOSDEM improvements put together). Here's some mutrace data on contention time in itable->lock: with 1 fuse reader thread I was seeing ~2K ms of contention time; with 8 fuse threads, this increased to ~33K ms.

With the itable->lock contention fix, iops slightly increased to 63K. And as for mutrace, I see that the contention time has dropped to ~8K ms (with 8 reader threads). So it doesn't seem like 3K is much of an improvement?

Manoj,
Do you think it would make sense for me to repeat the io-stats-loaded-at-multiple-points-on-the-stack experiment once more on top of all these patches (well, for want of other ideas more than anything else)?

-Krutika
(In reply to Krutika Dhananjay from comment #65)
> 
> Manoj,
> Do you think it would make sense for me to repeat the
> io-stats-loaded-at-multiple-points-on-the-stack experiment once more on top
> of all these patches (well, for want of other ideas more than anything else)?
> 
> -Krutika

A few suggestions before getting into that:
1. add a random write test in addition to random read, to see the impact there.
2. turn off xlators other than stat-prefetch. that is the only one that adds value for this workload, IIRC, and that too only on the randread test.
3. what is the iodepth/#jobs? try increasing that no.
What's the next step here?
(In reply to Yaniv Kaul from comment #67)
> What's the next step here?

I had a patch I was working on to optimize the way we read messages off the socket before I got pulled into some customer issues. I'll resume work on that sometime in the next week or so and then share the results on this bz.

-Krutika
(In reply to Krutika Dhananjay from comment #68)
> (In reply to Yaniv Kaul from comment #67)
> > What's the next step here?
> 
> I had a patch I was working on to optimize the way we read messages off the
> socket before I got pulled into some customer issues. I'll resume work on
> that sometime in the next week or so and then share the results on this bz.
> 
> -Krutika

Any updates?
(In reply to Yaniv Kaul from comment #69)
> (In reply to Krutika Dhananjay from comment #68)
> > (In reply to Yaniv Kaul from comment #67)
> > > What's the next step here?
> > 
> > I had a patch I was working on to optimize the way we read messages off the
> > socket before I got pulled into some customer issues. I'll resume work on
> > that sometime in the next week or so and then share the results on this bz.
> > 
> > -Krutika
> 
> Any updates?

No, not yet. Had to drop this in the middle in order to complete other tasks. I'll resume working on this from Oct 08.

-Krutika
REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock in incoming message handling) posted (#1) for review on master by Krutika Dhananjay
(In reply to Worker Ant from comment #71) > REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock > in incoming message handling) posted (#1) for review on master by Krutika > Dhananjay Is this patch above enough to satisfy this BZ? The bot moved this BZ to POST automatically, but I'm not sure it was intended.
(In reply to Yaniv Kaul from comment #72) > (In reply to Worker Ant from comment #71) > > REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock > > in incoming message handling) posted (#1) for review on master by Krutika > > Dhananjay > > Is this patch above enough to satisfy this BZ? The bot moved this BZ to POST > automatically, but I'm not sure it was intended. Not really. This is just an umbrella bz to capture issues as and when they're found and fixed. POST is OK, I guess. The bug won't be moved to MODIFIED until a patch sender specifically says this is their last patch against this bz. So I guess, we're good.
REVIEW: https://review.gluster.org/21910 (socket: Remove redundant in_lock in incoming message handling) posted (#4) for review on master by Amar Tumballi
REVIEW: https://review.gluster.org/21999 (socket: fix issue on concurrent socket event handle) posted (#1) for review on master by Zhang Huan
REVIEW: https://review.gluster.org/21999 (socket: fix issue on concurrent handle of a socket) merged (#4) on master by Amar Tumballi
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/