Description of problem:
Gluster sequential read throughput is significantly lower for a 2-replica filesystem than for an identical 1-replica filesystem.

Version-Release number of selected component (if applicable):
server: appliance linux 2.6.32-131.17.1.el6, glusterfs-core-3.2.5-2.el6.x86_64
clients: RHEL6.2 + glusterfs-core-3.2.5-2.el6.x86_64

How reproducible:
Every time in this configuration: 2 servers, 4 clients, 10-GbE network. See http://perf1.lab.bos.redhat.com/bengland/laptop/matte/2repl-problem for the iozone config file data.

Steps to Reproduce:
1. Using 2 servers, create a 2-replica gluster filesystem and mount it from all 4 clients.
2. Using 4 clients, 1 thread/client, write 4 files to the filesystem.
3. Using 4 clients, 1 thread/client, read the same 4 files.

Sometimes this happens with as many as 4 files/client.

Actual results:
Writes generated network receive traffic on both servers' p2p1 (10-GbE) network interfaces, but reads only generated network transmit traffic on a single server's interface. If replica selection by the client were random, you would expect at least one client to have chosen a different server than the others, particularly when the test is repeated over and over. So I was seeing server-cached read throughput of 500 MB/s when it should be close to 2000 MB/s using both servers. However, if you delete the files and recreate them, and do this often enough with enough files, load balancing sometimes happens. It is not a network problem, because writes always go to both servers.

Expected results:
At least some of the time the clients should balance their load across both servers. This still happens with 16 files; the odds are negligible that a random algorithm would read all 16 files from one server while the 2nd one sat idle.

Additional info:
See the above URL for details.
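For reference, a minimal command sketch of the reproduction steps above. Host names (server1/server2), brick path, volume name, and the record/file sizes (borrowed from the later 16-client example) are hypothetical; the actual iozone config used is at the URL above.

# on one server: create and start a 2-replica volume (hypothetical host/brick names)
gluster volume create repl2 replica 2 transport tcp server1:/mnt/brick0 server2:/mnt/brick0
gluster volume start repl2

# on each of the 4 clients: mount with the native client
mount -t glusterfs server1:/repl2 /mnt/glusterfs

# from one client, drive all 4 clients via an iozone -+m config file
# (one line per client: "<hostname> <directory> <path-to-iozone>")
iozone -+m clients -w -c -e -i 0 -+n -r 16384k -s 4g -t 4    # write 4 files, keep them (-w)
iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 4    # re-read the same 4 files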
Cannot reproduce in Gluster V3.3 beta qa33.
Created attachment 683565 [details] OpenOffice Calc spreadsheet containing graph of read throughput by server
Problem: with a 2-replica volume, if there is only one file/client, then only half of the Gluster servers are used to process reads.

Questions: what real-world applications does this impact? What happens if a single thread on the client accesses multiple files in sequence?

This problem has been reproduced with iozone using 8 servers, 512 KVM guest clients, and 512 threads. The result is the same: only 4 out of 8 servers return data, reducing throughput by 50% and creating a massive I/O imbalance. I came up with a simpler test case below: 8 servers, 16 KVM hosts, 1 KVM guest/host, each guest a gluster client. I reproduced this with bare-metal Gluster clients as well. (You could also reproduce this problem with 2 servers and 1 KVM host, probably with 1 GbE.) Each guest VM mounts the Gluster volume with the native client. All the files are opened before any of them are read (consequently, response-time-based load balancing won't work).

The preceding attachment contains the data showing the imbalance. Each curve represents the throughput of a single Gluster server. There are 3 test cases: 1, 2, and 4 iozone threads/client; each test case is run twice in a row. With 1 thread/client, 4 of the 8 servers are totally unused. With 2 threads/client, all 8 are used, and with 4 threads/client we finally get to line speed. The smaller difference between 2 threads/client and 4 threads/client is not as significant a problem -- in these two cases both NICs are being used, but with 2 threads/client the distribution of files across servers is not as even because of the consistent hashing algorithm.

______Software versions:

RHS servers:
- RHEL 6.2, kernel 2.6.32-220.23.1.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM hosts:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM guests:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

______Configuration:

10-GbE network using a single Cisco switch with jumbo frames. Expected per-server read speed is therefore around 1000 MB/s.

[root@gprfs029 ~]# gluster volume info

Volume Name: kvmfs
Type: Distributed-Replicate
Volume ID: ce4cdea2-ee7d-409c-8c3d-dfa726d4e719
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gprfs025-10ge:/mnt/brick0
Brick2: gprfs026-10ge:/mnt/brick0
Brick3: gprfs027-10ge:/mnt/brick0
Brick4: gprfs028-10ge:/mnt/brick0
Brick5: gprfs029-10ge:/mnt/brick0
Brick6: gprfs030-10ge:/mnt/brick0
Brick7: gprfs031-10ge:/mnt/brick0
Brick8: gprfs032-10ge:/mnt/brick0
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: disable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
network.remote-dio: on

______Example workloads with no virtualization:

with 1 thread/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 16
        Children see throughput for 16 readers = 5849412.03 KB/sec

with 4 threads/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 64
        Children see throughput for 64 readers = 9334280.93 KB/sec

[root@gprfc032-10ge read-prob]# cat clients
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
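A simple way to watch the per-server imbalance while the read test runs (a sketch; assumes sysstat's sar is installed on the servers and that p2p1 is the 10-GbE interface, as in the original report):

# on each Gluster server, sample NIC throughput once per second;
# rxkB/s shows write traffic arriving, txkB/s shows read traffic leaving
sar -n DEV 1 | grep p2p1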
In the workload in comment 3, iozone opens all of its files before it reads any of them; this means glusterfs has no way to tell how to balance reads across the available replicas. We see this problem EVERY TIME when using libgfapi on virtual machine disk images (think RHEV and OpenStack), because libgfapi in effect mounts the volume every time you start a Gluster application.

One workaround is cluster.read-hash-mode=2. From gluster volume set help: "Description: inode-read fops happen only on one of the bricks in replicate. AFR will prefer the one computed using the method specified using this option. 0 = first responder, 1 = hash by GFID of file (all clients use same subvolume), 2 = hash by GFID of file and client PID".

We could also test this behavior with the smallfile benchmark, which creates more than one file per thread; if we do not see the imbalance when using it, then this is an artifact of running a large-file read workload on a volume that has just been mounted (with libgfapi or with FUSE).

One possible solution might be to randomly select the replica in the absence of any response-time data from the servers. Perhaps we could add a small random number (say between 0 and 100 nanosec) to the observed server response time in the client. If there are no observed server response times yet, then presumably the server response times are zero and replica selection is effectively random, but once response times have been observed from the servers, replica selection is directed to the least busy server.
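A sketch of applying that workaround, using the volume name kvmfs from the gluster volume info output above:

# hash the read subvolume by GFID and client PID so different clients/threads spread across replicas
gluster volume set kvmfs cluster.read-hash-mode 2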
Ben,
There is an option in AFR (afr-v2) that distributes the read workload based on GFIDs, and it is enabled by default. Please let us know if that addresses this bug.
Pranith
Pranith,
That should fix it; I assume it hashes the GFID to a replica. It's easy to test with a typical BAGL config: just unmount and remount before each iozone test. I think you said AFR V2 was in Everglades, not in RHS 3.0.4, right? And Everglades uses gluster 3.7, right? I don't see any 3.7 builds in http://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL
thx
-ben
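A sketch of that per-run test procedure on a 3.7 client, assuming the option Pranith refers to is cluster.read-hash-mode and that "gluster volume set help" / "gluster volume get" are available in the 3.7 build; volume name, server name, and mount point are placeholders matching the later tests:

# confirm the GFID-hash read distribution default on the volume (assumption: this is the option in question)
gluster volume get repl3 cluster.read-hash-mode

# remount before each iozone run so no prior response-time data influences replica choice
umount /mnt/repl3 ; mount -t glusterfs gprfs022-10ge:/repl3 /mnt/repl3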
version: glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64

Command line used:
iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F /mnt/repl3/ioz/f1.ioz /mnt/repl3/ioz/f2.ioz /mnt/repl3/ioz/f3.ioz /mnt/repl3/ioz/f4.ioz /mnt/repl3/ioz/f5.ioz /mnt/repl3/ioz/f6.ioz /mnt/repl3/ioz/f7.ioz /mnt/repl3/ioz/f8.ioz

We got to ~1000 MB/s but balancing isn't perfect: if the number of files concurrently accessed isn't >= 4x the number of hosts, then you tend to find hosts that aren't involved in reads. The question is whether or not this is just a consequence of consistent hashing. For example, we had 6 servers (1, gprfs040, was down) and 2 DHT subvolumes, gprfs{022,024,037} and gprfs{038,039,040}. There were 5 files in the first subvolume and 3 in the 2nd. The 2nd is where you see a host with zero throughput. This seems unlikely with 3 files read from 2 hosts: with gprfs040 down, each file is served by one of the 2 remaining replicas, so the chance that a given host gets none of the 3 files is (1/2)^3 = 1/8.

host       MBit/sec
gprfs022   2664
gprfs024   1200
gprfs037   2083
gprfs038   2800
gprfs039   ~0
gprfs040   down

I'll try creating a volume with just 3 hosts to more clearly identify whether there is a problem here or not. BTW, for the corresponding write workload, we hit 10-GbE line speed from the client to the servers, and throughput was 400 MB/s (consistent with 3-way replication: each byte written is sent to 3 replicas, so ~1200 MB/s on the wire corresponds to ~400 MB/s of file data).
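One way to confirm which replica set each file landed on is the pathinfo virtual xattr queried on the client mount (a sketch; mount point and file names taken from the iozone command above):

# prints the brick paths that hold each file, i.e. which DHT subvolume / replica set it hashed to
for f in /mnt/repl3/ioz/f*.ioz; do
    getfattr -n trusted.glusterfs.pathinfo -e text $f
done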
So I created a 3-server volume with 3-way JBOD replication and then tried an 8-thread iozone sequential read. Results:

host       MBit/sec
gprfs022   2302
gprfs024   3200
gprfs037   3300

Because this was a JBOD config, the disks were pegged at ~130 MB/s. pbench dataset: http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-3way-seqrd/

So balancing is quite good across servers. On a test with a 2-way replicated volume on 2 servers, here's my workload:

mtpt=/mnt/two ; pfa svrs.list 'echo 1 > /proc/sys/vm/drop_caches' ; umount $mtpt ; mount -t glusterfs gprfs022-10ge:/two $mtpt ; sleep 1 ; user-benchmark "iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F `echo $mtpt/ioz/f{1,2,3,4,5,6,7,8}.ioz` "

and here's the balancing:

host       MBit/sec
gprfs022   4551
gprfs024   4400

Here's the data: http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-repl2-jbod-2svr-1clnt/

The same test ran at 10-GbE LINE SPEED on the client when the cache was not dropped on the servers. Looks pretty good to me so far. So the remaining balancing problems should come from insufficient threads (not engaging enough disks) or not enough threads for DHT to evenly spread load across its subvolumes using a random uniform distribution. The first problem could be helped by a good tier translator, and the 2nd can be avoided through use of RAID if the workload is primarily very large files.
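In the workload line above, pfa appears to be a local parallel-ssh wrapper that runs the quoted command on every host in svrs.list (an assumption); a plain-shell equivalent for the cache-dropping step would be:

# drop the page cache on every Gluster server listed in svrs.list before the timed read
for h in $(cat svrs.list); do
    ssh $h 'echo 1 > /proc/sys/vm/drop_caches'
done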
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.