Bug 799177
Summary: reads not balanced across servers for 2-replica 2-server filesystem

Product:      [Red Hat Storage] Red Hat Gluster Storage
Component:    glusterfs
Version:      1.0
Status:       CLOSED EOL
Severity:     high
Priority:     medium
Hardware:     x86_64
OS:           Linux
Reporter:     Ben England <bengland>
Assignee:     Bug Updates Notification Mailing List <rhs-bugs>
QA Contact:   Ben England <bengland>
CC:           bengland, bmarson, bturner, gluster-bugs, jeder, matt, mpillai, pkarampu, rhs-bugs, rwheeler, sdharane, vbellur
Keywords:     Reopened
Doc Type:     Bug Fix
Last Closed:  2015-12-03 17:22:37 UTC
Description
Ben England
2012-03-02 02:54:00 UTC
(Note: cannot reproduce in Gluster v3.3 beta qa33.)

Created attachment 683565 [details]
OpenOffice Calc spreadsheet containing graph of read throughput by server
Problem: With a 2-replica volume, if there is only one file per client, only half of the Gluster servers are used to process reads.

Questions: What real-world applications does this impact? What happens if a single thread on the client accesses multiple files in sequence?

This problem has been reproduced with iozone using 8 servers, 512 KVM guest clients, and 512 threads. The result is the same: only 4 out of 8 servers return data, reducing throughput by 50% and creating a massive I/O imbalance.

I came up with a simpler test case below: 8 servers, 16 KVM hosts, 1 KVM guest per host, each guest being a Gluster client. I also reproduced this using bare-metal Gluster clients. (You could probably reproduce this with just 2 servers and 1 KVM host, even over 1 GbE.) Each guest VM mounts the Gluster volume with the native client. All the files are opened before any of them are read; consequently, response-time-based load balancing cannot work.

The preceding attachment contains the data showing the imbalance. Each curve represents the throughput of a single Gluster server. There are 3 test cases: 1, 2, and 4 iozone threads per client, and each test case is run twice in a row. With 1 thread/client, 4 of the 8 servers are totally unused. With 2 threads/client, all 8 are used, and with 4 threads/client we finally reach line speed. The smaller difference between 2 and 4 threads/client is less significant a problem: in both cases both NICs are in use, but with 2 threads/client the distribution of files across servers is less even because of the consistent hashing algorithm.

Software versions:

RHS servers:
- RHEL 6.2, kernel 2.6.32-220.23.1.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM hosts:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

KVM guests:
- RHEL 6.3, kernel 2.6.32-279.el6.x86_64
- glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

Configuration: the 10-GbE network uses a single Cisco switch with jumbo frames.
Expected per-server read speed is therefore around 1000 MB/s.

[root@gprfs029 ~]# gluster volume info
Volume Name: kvmfs
Type: Distributed-Replicate
Volume ID: ce4cdea2-ee7d-409c-8c3d-dfa726d4e719
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gprfs025-10ge:/mnt/brick0
Brick2: gprfs026-10ge:/mnt/brick0
Brick3: gprfs027-10ge:/mnt/brick0
Brick4: gprfs028-10ge:/mnt/brick0
Brick5: gprfs029-10ge:/mnt/brick0
Brick6: gprfs030-10ge:/mnt/brick0
Brick7: gprfs031-10ge:/mnt/brick0
Brick8: gprfs032-10ge:/mnt/brick0
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: disable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
network.remote-dio: on

Example workloads with no virtualization:

With 1 thread/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 16
Children see throughput for 16 readers = 5849412.03 KB/sec

With 4 threads/client:
[root@gprfc032-10ge read-prob]# iozone -+m clients -w -c -e -i 1 -+n -r 16384k -s 4g -t 64
Children see throughput for 64 readers = 9334280.93 KB/sec

[root@gprfc032-10ge read-prob]# cat clients
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
gprfc033 /mnt/glusterfs/iozone.d /usr/local/bin/iozone
...
gprfc048 /mnt/glusterfs/iozone.d /usr/local/bin/iozone

In the workload in comment 3, iozone opens all of its files before it reads any of them; this means that glusterfs has no way to tell how to balance reads across the available replicas.
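The failure mode described above can be sketched as a toy model. This is a hypothetical simulation, not actual AFR code: it assumes a "first responder" style replica choice in which every client, having no response-time history at open time, lands on the same brick of each replica pair, which is enough to leave half the bricks idle.

```python
# Hypothetical sketch (NOT actual AFR code): model why a first-responder
# replica policy can leave half of a 2-replica volume's bricks idle when
# all files are opened before any reads are issued.

def pick_replica_first_responder(pair):
    # With no response-time history yet, every client sees the same
    # "first" brick of the pair, so all reads land on it.
    return pair[0]

def simulate(num_pairs, files):
    # Model DHT placement of each file onto a replica pair as a simple
    # modulus (the real translator uses consistent hashing of the name).
    load = {}
    for f in files:
        pair = (2 * (f % num_pairs), 2 * (f % num_pairs) + 1)  # brick ids
        brick = pick_replica_first_responder(pair)
        load[brick] = load.get(brick, 0) + 1
    return load

load = simulate(num_pairs=4, files=range(16))
# Only the first brick of each pair receives reads; the other 4 bricks
# are idle, matching the 50% imbalance reported above.
print(load)
```

Under these assumptions the even-numbered bricks absorb all 16 files' reads, which mirrors the "4 of 8 servers totally unused" observation with 1 thread/client.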
We do see this problem EVERY TIME when using libgfapi on virtual machine disk images (think RHEV and OpenStack), because libgfapi in effect mounts the volume every time a Gluster application starts.

One workaround is cluster.read-hash-mode=2. From `gluster volume set help`: "Description: inode-read fops happen only on one of the bricks in replicate. AFR will prefer the one computed using the method specified using this option. 0 = first responder, 1 = hash by GFID of file (all clients use same subvolume), 2 = hash by GFID of file and client PID".

We could also test this behavior with the smallfile benchmark, which creates more than one file per thread; if we do not see the imbalance when using it, then this is an artifact of running a large-file read workload on a volume that has just been mounted (with libgfapi or with FUSE).

One possible solution might be to randomly select the replica in the absence of any response-time data from the servers. Perhaps we could add a small random number (say between 0 and 100 ns) to the observed server response time in the client. If there are no observed server response times yet, then presumably the recorded response times are all zero and replica selection becomes effectively random; but once response times have been observed from the servers, replica selection is directed to the least busy server.

Ben

---

There is an option in AFR to distribute the workload, in afr-v2, based on GFIDs, and it is enabled by default. Please let us know if that addresses this bug.

Pranith

---

Pranith,

That should fix it. I assume it hashes GFID to replica; it is easy to test with the typical BAGL config, just unmount and remount before each iozone test. I think you said AFR v2 was in Everglades, not in RHS 3.0.4, right? And Everglades uses gluster 3.7, right?
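The read-hash-mode behaviors quoted from `gluster volume set help` can be illustrated with a small model. This is an assumption-laden sketch: the real AFR translator uses its own internal hash, not Python's hashlib, and the function names here are invented for illustration.

```python
# Illustrative model (assumption: real AFR uses its own internal hash,
# not MD5) of cluster.read-hash-mode replica selection:
#   mode 1 = hash by GFID          -> all clients pick the same replica
#   mode 2 = hash by GFID and PID  -> different client processes can
#                                     pick different replicas per file
import hashlib

def _h(data: bytes, n: int) -> int:
    # Reduce a hash digest to a replica index in [0, n).
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big") % n

def pick_replica(gfid: str, num_replicas: int, mode: int, pid: int = 0) -> int:
    if mode == 1:
        return _h(gfid.encode(), num_replicas)
    if mode == 2:
        return _h(f"{gfid}:{pid}".encode(), num_replicas)
    raise ValueError("this sketch only models modes 1 and 2")

gfid = "ce4cdea2-ee7d-409c-8c3d-dfa726d4e719"  # example GFID-like string
# Mode 1: every client agrees on one replica for this GFID.
same = {pick_replica(gfid, 2, mode=1) for _ in range(5)}
# Mode 2: varying the client PID spreads reads across both replicas.
spread = {pick_replica(gfid, 2, mode=2, pid=p) for p in range(100)}
print(same, spread)
```

This shows why mode 2 helps the libgfapi VM-image case: each qemu process has a different PID, so reads on the same image file fan out across replicas instead of all hashing to one brick.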
I don't see any 3.7 builds in http://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL

thx -ben

---

version: glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64

Command line used:
iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F /mnt/repl3/ioz/f1.ioz /mnt/repl3/ioz/f2.ioz /mnt/repl3/ioz/f3.ioz /mnt/repl3/ioz/f4.ioz /mnt/repl3/ioz/f5.ioz /mnt/repl3/ioz/f6.ioz /mnt/repl3/ioz/f7.ioz /mnt/repl3/ioz/f8.ioz

We got to ~1000 MB/s, but balancing isn't perfect: if the number of files concurrently accessed isn't >= 4x the number of hosts, you tend to find hosts that aren't involved in reads. The question is whether or not this is just a consequence of consistent hashing. For example, we had 6 servers (one, gprfs040, was down) and 2 DHT subvolumes, gprfs{022,024,037} and gprfs{038,039,040}. There were 5 files in the first subvolume and 3 in the second; the second is where you see a host with zero throughput. This seems unlikely with 3 files read from 2 hosts (1/8 probability).

host       MBit/sec
gprfs022   2664
gprfs024   1200
gprfs037   2083
gprfs038   2800
gprfs039   ~0
gprfs040   down

I'll try creating a volume with just 3 hosts to more clearly identify whether there is a problem here or not. BTW, for the corresponding write workload, we hit 10-GbE line speed from the client to the servers, and per-server throughput was 400 MB/s.

---

So I created a 3-server volume with 3-way JBOD replication and then tried an 8-thread iozone sequential read; results are:

host       MBit/sec
gprfs022   2302
gprfs024   3200
gprfs037   3300

Because this was a JBOD config, disks were pegged at ~130 MB/s.

pbench dataset: http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-3way-seqrd/

So balancing is quite good across servers.
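The 1/8 figure quoted above can be checked by enumeration. The sketch assumes the simplified model implied in the comment: each of the 3 files on the second subvolume is hashed independently and uniformly to one of its 2 live replicas, and we ask how often one particular host ends up serving zero reads.

```python
# Back-of-envelope check of the 1/8 probability: 3 files, each hashed
# independently and uniformly to one of 2 replica hosts; how likely is
# it that a given host (here host 0) serves no reads at all?
from itertools import product

outcomes = list(product([0, 1], repeat=3))  # replica choice per file
host0_idle = sum(1 for o in outcomes if 0 not in o)  # all 3 picked host 1
probability = host0_idle / len(outcomes)
print(probability)  # 0.125
```

So seeing gprfs039 at ~0 throughput is possible under pure hashing, but at 1-in-8 odds it is uncommon enough to justify the follow-up 3-server experiment.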
On a test with 2 servers, here's my workload:

mtpt=/mnt/two ; pfa svrs.list 'echo 1 > /proc/sys/vm/drop_caches' ; umount $mtpt ; mount -t glusterfs gprfs022-10ge:/two $mtpt ; sleep 1 ; user-benchmark "iozone -t 8 -w -c -e -i 1 -+n -r 64k -s 8g -F `echo $mtpt/ioz/f{1,2,3,4,5,6,7,8}.ioz` "

And here's the balancing:

host       MBit/sec
gprfs022   4551
gprfs024   4400

Here's the data: http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/gl37-repl2-jbod-2svr-1clnt/

The same test ran at 10-GbE LINE SPEED on the client when the cache was not dropped on the server.

Looks pretty good to me so far. The remaining balancing problems, then, should be with insufficient threads (not engaging enough disks), or with not enough threads for DHT to spread load evenly across its subvolumes using a random uniform distribution. The first problem could be helped by a good tier translator, and the second can be avoided through use of RAID if the workload is primarily very large files.

---

Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested our review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.