Bug 1358606 - limited throughput with disperse volume over small number of bricks
Summary: limited throughput with disperse volume over small number of bricks
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 3.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Duplicates: 1450357
Depends On: 1448364 1456259
Blocks: 1420796
 
Reported: 2016-07-21 06:47 UTC by Manoj Pillai
Modified: 2023-09-14 03:28 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1420796 1448364
Environment:
Last Closed: 2017-08-21 10:10:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Manoj Pillai 2016-07-21 06:47:05 UTC
Description of problem:

The test uses a 1x(4+2) disperse volume created on 6 servers, where each brick is on a RAID-6 device capable of delivering high throughput. A simple iozone write test reports low throughput.

However, when the volume is changed to 4x(4+2) by carving 4 bricks on each h/w RAID device, the throughput improves significantly -- close to the network line speed.

This indicates an inefficiency within gluster that limits throughput in the 1x(4+2) case.

Version-Release number of selected component (if applicable):
glusterfs*-3.8.0-1.el7.x86_64 (on both clients and servers)
RHEL 7.1 (clients)
RHEL 7.2 (servers)

How reproducible:
Consistently

Actual results:

command:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

result for 1x(4+2) volume:
throughput for  8 initial writers  =  268871.84 kB/sec

result for 4x(4+2) volume (identical h/w setup):
throughput for  8 initial writers  =  755955.41 kB/sec

Additional info:

From gluster vol info:
<quote>
[...]
Options Reconfigured:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.client-io-threads: on
transport.address-family: inet
performance.readdir-ahead: on
</quote>

Comment 1 Manoj Pillai 2016-07-21 07:22:29 UTC
Based on a conversation with Pranith, re-ran the test by creating the 1x(4+2) volume (as before, each brick on a RAID-6 device, 6 servers, 1 brick per server). But this time, used multiple mount points on the client:

<quote>
[...]
gprfs045-10ge:/perfvol          4.0T  135M  4.0T   1% /mnt/glustervol1
gprfs045-10ge:/perfvol          4.0T  135M  4.0T   1% /mnt/glustervol2
gprfs045-10ge:/perfvol          4.0T  135M  4.0T   1% /mnt/glustervol3
gprfs045-10ge:/perfvol          4.0T  135M  4.0T   1% /mnt/glustervol4
</quote>

command:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol1/f1.ioz /mnt/glustervol1/f2.ioz /mnt/glustervol2/f3.ioz /mnt/glustervol2/f4.ioz /mnt/glustervol3/f5.ioz /mnt/glustervol3/f6.ioz /mnt/glustervol4/f7.ioz /mnt/glustervol4/f8.ioz

result:
throughput for  8 initial writers  =  791249.52 kB/sec

So when the volume is mounted multiple times on the client, or when there are multiple bricks per server, the performance is good.

Pranith has a hypothesis for what is going on here, which is what led to the multiple-mount test. I'll let him explain.

Comment 2 Raghavendra G 2016-07-22 05:54:08 UTC
From the discussion I had with Pranith and Manoj, the issue seems to be the way sockets are added to/removed from polling for events while higher layers (the socket layer, programs like the glusterfs client) are processing a single event. The current workflow when an event is received is:

1. There is an incoming msg on socket s. A pollin event is received in epoll layer.
2. epoll removes the socket from the poll_array. So, we don't receive any more events on socket "s", till it is added back for polling.
3. Now the handler for pollin is called. This handler
   3a. reads the entire msg from the socket (we assume the entire rpc msg is available).
   3b. invokes a notification function to tell higher layers that a msg has been received. Once this notification call returns, the handler returns control to the epoll layer. However, as part of this notification function, higher layers (like EC) can do significant work, making the time taken by the handler a significant chunk.
4. Once the handler returns, the epoll layer adds the socket back for polling and new events on socket "s" can be received.

The hypothesis is that the "handler" in step 3 is taking more time and hence delaying the reading of further responses on the same socket "s". If there are more files on the brick that socket "s" connects to, there is contention, resulting in the slowdown of file operations.

I'll send out a patch to make sure that the socket is added back for polling just after reading the msg from socket, but before we hand over the msg to higher layers for processing.

regards,
Raghavendra

Comment 3 Vijay Bellur 2016-07-28 12:36:00 UTC
REVIEW: http://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#1) for review on master by Raghavendra G (rgowdapp)

Comment 4 Vijay Bellur 2016-07-29 10:20:50 UTC
REVIEW: http://review.gluster.org/15036 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 5 Vijay Bellur 2016-07-29 11:55:29 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 6 Vijay Bellur 2016-08-01 11:49:00 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#2) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 9 Manoj Pillai 2016-08-23 17:24:55 UTC
Comparing glusterfs-3.8.2-1 with a private build that incorporates the logic in comment #2.

Config: (like before)
1x(4+2) disperse volume on 6 servers,  1 RAID-6 brick per server
with client-io-threads enabled; event-threads set to 4; lookup-optimize on

Test: single client iozone write test:
iozone -i 0 -w -+n -c -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

glusterfs-3.8.2-1:
throughput for  8 initial writers  =  281965.19 kB/sec

private build:
throughput for  8 initial writers  =  760830.82 kB/sec

That's exactly the kind of improvement we were expecting, based on comment #0 and comment #1.

So that's great news -- the RCA in comment #2 appears correct. As I understand from Raghavendra, the patch is failing regressions, so it might need to be reworked and we cannot declare victory yet. But it is good to know that we are on the right track.

Comment 10 Niels de Vos 2016-09-12 05:37:54 UTC
All 3.8.x bugs are now reported against version 3.8 (without .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html

Comment 11 Worker Ant 2016-11-03 10:49:50 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#3) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 12 Worker Ant 2016-11-07 14:15:16 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#4) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 13 Worker Ant 2016-11-08 06:50:40 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#5) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 14 Worker Ant 2016-11-08 06:50:45 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 15 Worker Ant 2016-11-08 07:39:00 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#2) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 16 Worker Ant 2016-11-08 07:39:04 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#6) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 17 Worker Ant 2016-11-08 08:27:27 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#3) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 18 Worker Ant 2016-11-08 08:27:31 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#7) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 19 Worker Ant 2016-11-08 08:38:30 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#4) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 20 Worker Ant 2016-11-08 08:38:34 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#8) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 21 Worker Ant 2016-11-08 10:37:50 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#5) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 22 Worker Ant 2016-11-08 10:37:54 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#9) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 23 Worker Ant 2016-11-10 05:28:53 UTC
REVIEW: http://review.gluster.org/15793 (cluster/dht: Fix memory corruption during reconfigure) posted (#6) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 24 Worker Ant 2016-11-10 05:28:57 UTC
REVIEW: http://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#10) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 25 Worker Ant 2016-11-10 05:29:03 UTC
REVIEW: http://review.gluster.org/15815 (mount/fuse: Handle racing notify on more than one graph properly) posted (#1) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 26 Manoj Pillai 2017-04-20 04:14:53 UTC
We hit this bug in an "EC over RAID-6 bricks" configuration. But I expect that it will hurt us badly in an EC+JBOD+brick-multiplexing config as well.

Do you folks agree?

Comment 27 Zhang Huan 2017-04-30 01:27:13 UTC
FYI, we also hit this issue in our testing. Instead of testing glusterfs directly, we are testing glusterfs with samba. We have a distributed gluster volume with 4 servers, 1 brick each. We use samba to export the volume to Windows, and the test is run on Windows; the glusterfs plugin is used by samba. The io-threads xlator is disabled in our test.

Here is the performance difference. These numbers are from 72 streams from a single Windows client, with samba's aio option enabled. Thus there is only one glusterfs client running in the cluster.
                   read           write
w/o patch          1.07GB/s       1.20GB/s
w/ patch           1.68GB/s       1.33GB/s

I wrote a patch to fix this before I found that a fix was already in the works. My patch uses a similar method to Raghavendra's, but the implementation differs a little. I am putting it here FYI.
https://github.com/zhanghuan/glusterfs-1/commit/f21595d8ac5623cad6c8e6c7079684f6c29365c9

Comment 28 Worker Ant 2017-05-02 10:29:35 UTC
REVIEW: https://review.gluster.org/15046 (event/epoll: Add back socket for polling of events immediately after reading the entire rpc message from the wire) posted (#11) for review on release-3.8 by Raghavendra G (rgowdapp)

Comment 29 Raghavendra G 2017-05-19 07:00:51 UTC
(In reply to Manoj Pillai from comment #26)
> We hit this bug in an "EC over RAID-6 bricks" configuration. But I expect
> that it will hurt us badly in an EC+JBOD+brick-multiplexing config as well.
> 
> Do you folks agree?

Need some time to go through brick-multiplexing. Will have a definitive answer sometime next week.

Comment 30 Raghavendra G 2017-05-19 10:14:25 UTC
(In reply to Manoj Pillai from comment #26)
> We hit this bug in an "EC over RAID-6 bricks" configuration. But I expect
> that it will hurt us badly in an EC+JBOD+brick-multiplexing config as well.
> 
> Do you folks agree?

Manoj,

Based on our discussions, you were concerned about connections being shared by multiple bricks. But each brick gets an exclusive socket (not shared by any other brick) for each client connected to it; this is true without brick-mux, and brick-mux doesn't change it. This socket is used for data (I/O) traffic. Considering just that parameter, brick multiplexing should have no effect, neither positive nor negative. However, other resources -- threads in the server process, server process memory, the OS, the physical network, disks, etc. -- are still shared by bricks. I am not sure whether that will have any effect on performance.

regards,
Raghavendra

Comment 31 Niels de Vos 2017-05-28 08:31:14 UTC
We need a VERY strong argument for including this in the stable 3.8 release. My current preference is to keep this only in the master branch for now.

More details at https://lists.gluster.org/pipermail/maintainers/2017-May/002617.html

Comment 32 Niels de Vos 2017-05-28 08:54:18 UTC
*** Bug 1450357 has been marked as a duplicate of this bug. ***

Comment 33 Raghavendra G 2017-08-10 11:07:59 UTC
(In reply to Niels de Vos from comment #31)
> We need a VERY strong argument for including this in the stable 3.8 release.
> My current preference is to keep this only in the master branch for now.
> 
> More details at
> https://lists.gluster.org/pipermail/maintainers/2017-May/002617.html

Can we close this bug then?

Comment 34 Milind Changire 2017-08-21 10:10:48 UTC
As per comment #33 - NEXTRELEASE
Closing bug.

Comment 35 Red Hat Bugzilla 2023-09-14 03:28:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

