+++ This bug was initially created as a clone of Bug #1448364 +++
+++ This bug was initially created as a clone of Bug #1358606 +++

Description of problem:

The test uses a 1x(4+2) disperse volume created on 6 servers, where each brick is on a RAID-6 device capable of delivering high throughput. A simple iozone write test reports low throughput. However, when the volume is changed to 4x(4+2) by carving 4 bricks out of each h/w RAID device, the throughput improves significantly -- close to the network line speed. This indicates some inefficiencies within gluster that are limiting throughput in the 1x(4+2) case.

Version-Release number of selected component (if applicable):
glusterfs*-3.8.0-1.el7.x86_64 (on both clients and servers)
RHEL 7.1 (clients)
RHEL 7.2 (servers)

How reproducible:
Consistently

Actual results:

command:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

result for 1x(4+2) volume:
throughput for 8 initial writers = 268871.84 kB/sec

result for 4x(4+2) volume (identical h/w setup):
throughput for 8 initial writers = 755955.41 kB/sec

Additional info:

From gluster vol info:
<quote>
[...]
Options Reconfigured:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.client-io-threads: on
transport.address-family: inet
performance.readdir-ahead: on
</quote>

--- Additional comment from Manoj Pillai on 2016-07-21 03:22:29 EDT ---

Based on a conversation with Pranith, re-ran the test by creating the 1x(4+2) volume (as before, each brick on a RAID-6 device, 6 servers, 1 brick per server). But this time, used multiple mount points on the client:

<quote>
[...]
gprfs045-10ge:/perfvol  4.0T  135M  4.0T  1%  /mnt/glustervol1
gprfs045-10ge:/perfvol  4.0T  135M  4.0T  1%  /mnt/glustervol2
gprfs045-10ge:/perfvol  4.0T  135M  4.0T  1%  /mnt/glustervol3
gprfs045-10ge:/perfvol  4.0T  135M  4.0T  1%  /mnt/glustervol4
</quote>

command:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol1/f1.ioz /mnt/glustervol1/f2.ioz /mnt/glustervol2/f3.ioz /mnt/glustervol2/f4.ioz /mnt/glustervol3/f5.ioz /mnt/glustervol3/f6.ioz /mnt/glustervol4/f7.ioz /mnt/glustervol4/f8.ioz

result:
throughput for 8 initial writers = 791249.52 kB/sec

So when the volume is mounted multiple times on the client, or when there are multiple bricks on the server, the performance is good. Pranith has a hypothesis for what is going on here, which is what led to the multiple-mount test. I'll let him explain.

--- Additional comment from Raghavendra G on 2016-07-22 01:54:08 EDT ---

From the discussion I had with Pranith and Manoj, the issue seems to be the way sockets are added to/removed from polling for events while higher layers (socket, programs like the glusterfs client) are processing a single event. The current workflow when an event is received is:

1. There is an incoming msg on socket "s". A pollin event is received in the epoll layer.

2. epoll removes the socket from the poll array. So, we don't receive any more events on socket "s" until it is added back for polling.

3. Now the handler for pollin is called. This handler
   3a. reads the entire msg from the socket (we assume the entire rpc msg is available).
   3b. invokes a notification function to tell higher layers that a msg has been received. Once this notification call returns, the handler returns control to the epoll layer. However, as part of this notification function, higher layers (like EC) can do significant work, making the time taken by the handler a significant chunk.

4. Once the handler returns, the epoll layer adds the socket back for polling, and new events on socket "s" can be received.
The hypothesis is that the "handler" in step 3 is taking more time and hence delaying the reading of further responses on the same socket "s". If more files reside on the brick that socket "s" connects to, they contend for the connection, slowing down file operations.

I'll send out a patch to make sure that the socket is added back for polling just after reading the msg from the socket, but before we hand over the msg to higher layers for processing.

regards,
Raghavendra

--- Additional comment from Vijay Bellur on 2016-07-28 08:36:00 EDT ---

REVIEW: http://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#1) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Vijay Bellur on 2016-07-29 06:20:50 EDT ---

REVIEW: http://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#2) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Vijay Bellur on 2016-07-29 07:55:29 EDT ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Vijay Bellur on 2016-08-01 07:49:00 EDT ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#2) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Raghavendra G on 2016-08-23 07:15:36 EDT ---

I've built rpms with patch [1] on rhel-7.1. They can be found at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11645502

Though [1] has not passed regression yet, we can use these rpms to validate whether it increases performance (provided tests pass through :) ).
[1] http://review.gluster.org/15046

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2016-08-23 07:16:34 EDT ---

(In reply to Raghavendra G from comment #7)
> I've built rpms with patch [1] on rhel-7.1. They can be found at:

Credits to Krutika for assisting with the rpms. I had some problems building them.

regards,
Raghavendra

--- Additional comment from Manoj Pillai on 2016-08-23 13:24:55 EDT ---

Comparing glusterfs-3.8.2-1 with a private build that incorporates the logic in comment #2.

Config: (as before) 1x(4+2) disperse volume on 6 servers, 1 RAID-6 brick per server, with client-io-threads enabled; event-threads set to 4; lookup-optimize on.

Test: single-client iozone write test:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

glusterfs-3.8.2-1:
throughput for 8 initial writers = 281965.19 kB/sec

private build:
throughput for 8 initial writers = 760830.82 kB/sec

That's exactly the kind of improvement we were expecting, based on comment #0 and comment #1. So that's great news -- the RCA in comment #2 appears correct. As I understand from Raghavendra, the patch is failing regressions, so it might need to be reworked and we cannot declare victory yet. But it is good to know that we are on the right track.

--- Additional comment from Niels de Vos on 2016-09-12 01:37:54 EDT ---

All 3.8.x bugs are now reported against version 3.8 (without .x).
For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html

--- Additional comment from Worker Ant on 2016-11-03 06:49:50 EDT ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#3) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-07 09:15:16 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#4) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 01:50:40 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#5) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 01:50:45 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 02:39:00 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#2) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 02:39:04 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#6) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 03:27:27 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#3) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 03:27:31 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#7) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 03:38:30 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#4) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 03:38:34 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#8) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 05:37:50 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#5) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-08 05:37:54 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#9) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-10 00:28:53 EST ---

REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#6) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-10 00:28:57 EST ---

REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#10) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2016-11-10 00:29:03 EST ---

REVIEW: http://review.gluster.org/15815 (mount/fuse: Handle racing notify on more than one graph properly) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Manoj Pillai on 2017-04-20 00:14:53 EDT ---

We hit this bug in an "EC over RAID-6 bricks" configuration. But I expect that it will hurt us badly in an EC+JBOD+brick-multiplexing config as well. Do you folks agree?

--- Additional comment from Zhang Huan on 2017-04-29 21:27:13 EDT ---

FYI, we also hit this issue in our testing. Instead of testing glusterfs directly, we are testing glusterfs with samba. We have a distributed gluster volume with 4 servers and 1 brick each. We use samba to export the volume to Windows, and the test is run on Windows. Of course, the glusterfs plugin is used by samba. The io-threads xlator is disabled in our test.

Here is the performance difference. These numbers are from 72 streams on a single Windows client, with samba's aio option enabled; thus there is only one glusterfs client running in the cluster.

            read      write
w/o patch   1.07GB/s  1.20GB/s
w/ patch    1.68GB/s  1.33GB/s

I wrote a patch to fix this before I found that a fix was already being worked on. My patch uses a method similar to Raghavendra's, with a slightly different implementation. I just put it here FYI.
https://github.com/zhanghuan/glusterfs-1/commit/f21595d8ac5623cad6c8e6c7079684f6c29365c9

--- Additional comment from Worker Ant on 2017-05-02 06:29:35 EDT ---

REVIEW: https://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#11) for review on release-3.8 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 01:42:23 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#3) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 01:42:28 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#1) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 02:11:33 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#2) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 02:11:38 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#4) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 02:31:53 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#3) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 02:31:58 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#5) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 03:02:49 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#4) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 03:02:54 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#6) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 10:19:56 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#5) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-08 10:20:01 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#7) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-09 07:29:46 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#6) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-09 07:29:52 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#8) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-09 07:39:02 EDT ---

REVIEW: https://review.gluster.org/17200 (mount/fuse: Handle racing notify on more than one graph properly) posted (#7) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-09 07:39:06 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#9) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-10 05:14:33 EDT ---

COMMIT: https://review.gluster.org/17200 committed in master by Raghavendra G (rgowdapp)

------

commit e71119e942eb016ba5a11c3f986715f16da10b82
Author: Raghavendra G <rgowdapp>
Date: Thu Nov 10 10:56:26 2016 +0530

mount/fuse: Handle racing notify on more than one graph properly

Make sure that we always use latest graph as a candidate for active-subvol.

Change-Id: Ie37c818366f28ba6b1570d65a9eb17697d38a6c5
BUG: 1448364
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: https://review.gluster.org/17200
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Amar Tumballi <amarts>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Jeff Darcy <jeff.us>
CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Worker Ant on 2017-05-10 06:16:57 EDT ---

REVIEW: https://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#10) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-10 06:17:02 EDT ---

REVIEW: https://review.gluster.org/17234 (tests/lock_revocation: mark as bad) posted (#1) for review on master by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-10 10:56:57 EDT ---

COMMIT: https://review.gluster.org/17234 committed in master by Jeff Darcy (jeff.us)

------

commit d5865881de5653a0e810093a9867ab3962d00f67
Author: Raghavendra G <rgowdapp>
Date: Wed May 10 15:44:33 2017 +0530

tests/lock_revocation: mark as bad

The test is failing in master. see gluster-devel for more details.
Change-Id: I7a589ad2c54bd55d62f4e66fdf8037c19fc123ea
BUG: 1448364
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: https://review.gluster.org/17234
NetBSD-regression: NetBSD Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Jeff Darcy <jeff.us>

--- Additional comment from Worker Ant on 2017-05-12 01:26:45 EDT ---

COMMIT: https://review.gluster.org/15036 committed in master by Raghavendra G (rgowdapp)

------

commit cea8b702506ff914deadd056f4b7dd20a3ca7670
Author: Raghavendra G <rgowdapp>
Date: Fri May 5 15:21:30 2017 +0530

event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire

Currently socket is added back for future events after higher layers (rpc, xlators etc) have processed the message. If message processing involves signficant delay (as in writev replies processed by Erasure Coding), performance takes hit. Hence this patch modifies transport/socket to add back the socket for polling of events immediately after reading the entire rpc message, but before notification to higher layers.

credits: Thanks to "Kotresh Hiremath Ravishankar" <khiremat> for assitance in fixing a regression in bitrot caused by this patch.
Change-Id: I04b6b9d0b51a1cfb86ecac3c3d87a5f388cf5800
BUG: 1448364
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: https://review.gluster.org/15036
CentOS-regression: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Amar Tumballi <amarts>

--- Additional comment from Worker Ant on 2017-05-28 03:39:29 EDT ---

REVIEW: https://review.gluster.org/17391 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#2) for review on release-3.11 by Raghavendra G (rgowdapp)

--- Additional comment from Worker Ant on 2017-05-28 06:28:52 EDT ---

REVIEW: https://review.gluster.org/17391 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#3) for review on release-3.11 by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/17391 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#4) for review on release-3.11 by Atin Mukherjee (amukherj)
*** Bug 1450355 has been marked as a duplicate of this bug. ***
REVIEW: https://review.gluster.org/17391 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#5) for review on release-3.11 by Raghavendra G (rgowdapp)
REVIEW: https://review.gluster.org/17391 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#6) for review on release-3.11 by Raghavendra G (rgowdapp)
COMMIT: https://review.gluster.org/17391 committed in release-3.11 by Shyamsundar Ranganathan (srangana)

------

commit a8971426fe8e3f49f5670e4f5d6d9b7192bd455f
Author: Raghavendra G <rgowdapp>
Date: Fri May 5 15:21:30 2017 +0530

event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire

Currently socket is added back for future events after higher layers (rpc, xlators etc) have processed the message. If message processing involves signficant delay (as in writev replies processed by Erasure Coding), performance takes hit. Hence this patch modifies transport/socket to add back the socket for polling of events immediately after reading the entire rpc message, but before notification to higher layers.

credits: Thanks to "Kotresh Hiremath Ravishankar" <khiremat> for assitance in fixing a regression in bitrot caused by this patch.

>Reviewed-on: https://review.gluster.org/15036
>CentOS-regression: Gluster Build System <jenkins.org>
>NetBSD-regression: NetBSD Build System <jenkins.org>
>Smoke: Gluster Build System <jenkins.org>
>Reviewed-by: Amar Tumballi <amarts>

Change-Id: I04b6b9d0b51a1cfb86ecac3c3d87a5f388cf5800
BUG: 1456259
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: https://review.gluster.org/17391
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.11.1, please open a new bug report.

glusterfs-3.11.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-June/000074.html
[2] https://www.gluster.org/pipermail/gluster-users/