Bug 799177

Summary: reads not balanced across servers for 2-replica 2-server filesystem
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterfs
Version: 1.0
Status: CLOSED EOL
Severity: high
Priority: medium
Hardware: x86_64
OS: Linux
Keywords: Reopened
Reporter: Ben England <bengland>
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
QA Contact: Ben England <bengland>
CC: bengland, bmarson, bturner, gluster-bugs, jeder, matt, mpillai, pkarampu, rhs-bugs, rwheeler, sdharane, vbellur
Doc Type: Bug Fix
Last Closed: 2015-12-03 17:22:37 UTC

Attachments:
OpenOffice Calc spreadsheet containing graph of read throughput by server (flags: none)

Description Ben England 2012-03-02 02:54:00 UTC
Description of problem:

Gluster sequential read throughput is significantly lower for 2-replica filesystem than for identical 1-replica filesystem.

Version-Release number of selected component (if applicable):

server: appliance linux 2.6.32-131.17.1.el6 glusterfs-core-3.2.5-2.el6.x86_64
clients: RHEL6.2 + glusterfs-core-3.2.5-2.el6.x86_64

How reproducible:

In this configuration, every time.  2 servers, 4 clients, 10-GbE network.
See http://perf1.lab.bos.redhat.com/bengland/laptop/matte/2repl-problem for the iozone config file data.

Steps to Reproduce:
1. Using 2 servers, create a 2-replica gluster filesystem and mount it from all 4 clients.
2. Using 4 clients, 1 thread/client, write 4 files to the filesystem.
3. Using 4 clients, 1 thread/client, read the same 4 files.

Sometimes this happens with as many as 4 files/client.
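
A minimal command sequence for this reproduction might look like the following sketch (hostnames, brick paths, volume name, and mount point are placeholders, not taken from the actual test environment):

# on one server: create and start a 2-replica volume across the 2 servers
gluster volume create rep2vol replica 2 server1:/bricks/b0 server2:/bricks/b0
gluster volume start rep2vol
# on each of the 4 clients: mount with the native client
mount -t glusterfs server1:/rep2vol /mnt/glusterfs
# from one client, drive all 4 clients with iozone distributed mode;
# the 'clients' file lists client host, working directory, and iozone path per line
iozone -+m clients -w -c -e -i 0 -+n -r 16384k -s 4g -t 4   # write 4 files, one per client
iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 4   # re-read the same 4 files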
  
Actual results:

The write phase generated network receive traffic on both servers' p2p1 (10-GbE) network interfaces, but the read phase generated network transmit traffic on only a single server's interface.  If replica selection by the client were random, you would expect at least one client to have chosen a different server than the others, particularly when the test is repeated over and over.

So I was seeing server-cached read throughput of 500 MB/s, when it should be close to 2000 MB/s using both servers.

However, if you delete and recreate the files, and do this often enough with enough files, load balancing does sometimes happen.  It's not a network problem, because writes always go to both servers.

Expected results:

At least some of the time, the clients should have balanced their load across both servers.

This still happens with 16 files; the odds are negligible that a random algorithm would read all 16 files from one server while the second sat idle.

Additional info:

See the above URL for details.

Comment 1 Ben England 2012-04-16 15:36:21 UTC
cannot reproduce in Gluster V3.3 beta qa33.

Comment 2 Ben England 2013-01-20 14:28:12 UTC
Created attachment 683565 [details]
OpenOffice Calc spreadsheet containing graph of read throughput by server

Comment 3 Ben England 2013-01-20 16:35:22 UTC
Problem: with a 2-replica volume, if there is only one file/client, then only half of the Gluster servers are used to process reads.

Questions: what real-world applications does this impact?  What happens if a single thread on the client accesses multiple files in sequence?
  
This problem has been reproduced with iozone using 8 servers, 512 KVM guest clients, and 512 threads.  The result is the same: only 4 out of 8 servers return data, reducing throughput by 50% and creating a massive I/O imbalance.  I came up with a simpler test case below:

8 servers, 16 KVM hosts, 1 KVM guest/host, each guest is a gluster client.

I also reproduced this using bare-metal Gluster clients.
(You could probably also reproduce this problem with 2 servers, 1 KVM host, and 1 GbE.)

Each guest VM mounts the Gluster volume with the native client.  All the files are opened before any of them are read.  (Consequently, response-time-based load balancing won't work.)

The preceding attachment contains the data showing the imbalance.  Each curve represents throughput of a single Gluster server.  There are 3 test cases: 1, 2, and 4 iozone threads/client.  Each test case is run two times in a row.  With 1 thread/client, 4 of the 8 servers are totally unused.  With 2 threads/client, all 8 are used, and with 4 threads/client we finally reach line speed.

The smaller difference between 2 threads/client and 4 threads/client is less of a problem -- in both of these cases all the NICs are being used, but with 2 threads/client the distribution of files across servers is less even because of the consistent hashing algorithm.


______Software versions:

RHS servers: 
- RHEL 6.2, kernel 2.6.32-220.23.1.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM hosts:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM guests:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

______Configuration:

The 10-GbE network uses a single Cisco switch with jumbo frames.  Expected per-server read speed is therefore around 1000 MB/s (10 Gbit/s is roughly 1250 MB/s raw, less protocol overhead).

[root@gprfs029 ~]# gluster volume info
 
Volume Name: kvmfs
Type: Distributed-Replicate
Volume ID: ce4cdea2-ee7d-409c-8c3d-dfa726d4e719
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gprfs025-10ge:/mnt/brick0
Brick2: gprfs026-10ge:/mnt/brick0
Brick3: gprfs027-10ge:/mnt/brick0
Brick4: gprfs028-10ge:/mnt/brick0
Brick5: gprfs029-10ge:/mnt/brick0
Brick6: gprfs030-10ge:/mnt/brick0
Brick7: gprfs031-10ge:/mnt/brick0
Brick8: gprfs032-10ge:/mnt/brick0
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: disable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
network.remote-dio: on

____Example workloads with no virtualization:

with 1 thread/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 16
        Children see throughput for 16 readers          = 5849412.03 KB/sec

with 4 threads/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 64
        Children see throughput for 64 readers          = 9334280.93 KB/sec

[root@gprfc032-10ge read-prob]#  cat clients
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone

Comment 5 Ben England 2014-01-20 22:01:27 UTC
In the workload in comment 3, iozone opens all of its files before it reads any of them; this means that glusterfs has no way to tell how to balance reads across the available replicas.

We do see this problem EVERY TIME when using libgfapi on virtual machine disk images (think RHEV and OpenStack), because libgfapi is in effect mounting the volume every time you start a Gluster application.
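
For illustration, this is the kind of libgfapi access path involved -- e.g. QEMU opening a VM image directly over a gluster:// URL (the image path is hypothetical; kvmfs is the volume from the configuration above):

qemu-img info gluster://gprfs025-10ge/kvmfs/vm1.img

Each such invocation initializes a fresh libgfapi connection to the volume, so every run starts with no response-time history.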
 
One workaround is cluster.read-hash-mode=2.  From gluster volume set help: "Description: inode-read fops happen only on one of the bricks in replicate. AFR will prefer the one computed using the method specified using this option. 0 = first responder, 1 = hash by GFID of file (all clients use same subvolume), 2 = hash by GFID of file and client PID".
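
Applied from any of the servers, the workaround is a one-line volume option change (shown here against the kvmfs volume from the configuration above):

gluster volume set kvmfs cluster.read-hash-mode 2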

We could also test this behavior with the smallfile benchmark, which creates more than one file per thread; if we do not see the imbalance there, then this is an artifact of running a large-file read workload on a volume that has just been mounted (with libgfapi or with FUSE).

One possible solution might be to randomly select the replica in the absence of any response time data from the servers. Perhaps we could add a small random number (say between 0 and 100 nanosec) to the observed server response time in the client.  If there are no observed server response times yet, then presumably the server response times are zero and the replica selection is random, but if response times have been observed from the server, then replica selection is directed to the least busy server.

Comment 6 Pranith Kumar K 2015-03-17 10:14:11 UTC
Ben
In AFR v2 there is an option, enabled by default, that distributes the read workload based on GFIDs. Please let us know if that addresses this bug.

Pranith

Comment 7 Ben England 2015-03-20 18:12:19 UTC
Pranith, that should fix it.  I assume it hashes the GFID to a replica.  It's easy to test with the typical BAGL config: just unmount and remount before each iozone test.
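
For reference, a quick way to check how the option is exposed in a given build (a sketch; the exact help text varies by release, and VOLNAME is a placeholder):

gluster volume set help | grep -A 4 read-hash-mode
gluster volume info VOLNAME    # "Options Reconfigured" lists the option only if it has been set explicitly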

I think you said AFR V2 was in Everglades, not in RHS 3.0.4, right?  Everglades uses gluster 3.7, right?  I don't see any 3.7 builds in

http://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL

thx

-ben

Comment 8 Ben England 2015-05-04 14:49:45 UTC
version: glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64

Command line used: iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F /mnt/repl3/ioz/f1.ioz /mnt/repl3/ioz/f2.ioz /mnt/repl3/ioz/f3.ioz /mnt/repl3/ioz/f4.ioz /mnt/repl3/ioz/f5.ioz /mnt/repl3/ioz/f6.ioz /mnt/repl3/ioz/f7.ioz /mnt/repl3/ioz/f8.ioz

We got to ~1000 MB/s, but balancing isn't perfect: if the number of files concurrently accessed isn't >= 4x the number of hosts, you tend to find hosts that aren't involved in reads at all.  The question is whether or not this is just a consequence of consistent hashing.

For example, we had 6 servers (one, gprfs040, was down) and 2 DHT subvolumes, gprfs{022,024,037} and gprfs{038,039,040}.  There were 5 files in the first subvolume and 3 in the second, and the second is where you see a host with zero throughput.  That seems unlikely: with 3 files hashed independently across the 2 live hosts, the chance that a given host serves none of them is (1/2)^3 = 1/8.

gprfs022 2664
gprfs024 1200
gprfs037 2083
gprfs038 2800
gprfs039 ~0
gprfs040 down

I'll try creating a volume with just 3 hosts to more clearly identify whether there is a problem here or not.

BTW, for the corresponding write workload, we hit 10-GbE line speed from the client to the servers, and application throughput was 400 MB/s (consistent with each write being sent to all 3 replicas, i.e. ~1200 MB/s on the wire).

Comment 9 Ben England 2015-05-04 16:56:12 UTC
So I created a 3-server volume with 3-way JBOD replication and then tried an 8-thread iozone sequential read; the results are:

host MBit/sec
gprfs022 2302
gprfs024 3200
gprfs037 3300
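
For reference, a sketch of how such a 3-way replicated volume would be created (the volume name and brick paths are assumptions; a real JBOD layout would have one brick per disk on each server rather than the single brick shown here):

gluster volume create repl3vol replica 3 gprfs022-10ge:/mnt/brick0 gprfs024-10ge:/mnt/brick0 gprfs037-10ge:/mnt/brick0
gluster volume start repl3vol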

Because this was a JBOD config, the disks were pegged at ~130 MB/s.

pbench dataset http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-3way-seqrd/

So balancing is quite good across servers.  For a test with a 2-server, 2-way replicated volume, here's my workload:

mtpt=/mnt/two ; pfa svrs.list 'echo 1 > /proc/sys/vm/drop_caches' ; umount $mtpt ; mount -t glusterfs gprfs022-10ge:/two $mtpt ; sleep 1 ;  user-benchmark "iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F `echo $mtpt/ioz/f{1,2,3,4,5,6,7,8}.ioz` " 

and here's the balancing:

host MBit/sec
gprfs022 4551
gprfs024 4400

Here's the data: http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-repl2-jbod-2svr-1clnt/

The same test ran at 10-GbE LINE SPEED on the client when the cache was not dropped on the servers.

Looks pretty good to me so far.  The remaining balancing problems should come down to insufficient threads (not engaging enough disks), or not enough threads for DHT to spread load evenly across its subvolumes with a random uniform distribution.  The first problem could be helped by a good tier translator, and the second can be avoided by using RAID if the workload is primarily very large files.

Comment 10 Vivek Agarwal 2015-12-03 17:22:37 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you asked us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.