Bug 831219
Summary: | Hadoop/Gluster reads perform poorly with 2-replica filesystem | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Ben England <bengland>
Component: | gluster-hadoop | Assignee: | Diane Feddema <dfeddema>
Status: | CLOSED EOL | QA Contact: | hcfs-gluster-bugs
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | mainline | CC: | bugs, cww, eboyd, enakai, esammons, gluster-bugs, matt, mbukatov, perfbz, poelstra, rwheeler
Target Milestone: | --- | Keywords: | Reopened, Triaged
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-22 15:46:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Ben England, 2012-06-12 14:00:35 UTC)
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Not a priority for RHS 2.1.0, but we need this bug open nevertheless. I still see this bug in glusterfs-3.3.0.5rhs-40.el6rhs.x86_64 using a simple iozone test on the mountpoint, without using Hadoop. It is not easy to see -- the first test I ran did not show the problem, but after a remount it did show up.

Hypothesis: perhaps the order in which responses are received from the glusterfsd brick servers determines which replica is selected. Because 10-GbE is fast, a response can arrive from a remote server before the response from the local server, and glusterfs has no logic to be patient and wait for the second response to arrive. In light of the design goal of collapsing the Hadoop and virt tiers into the storage tier, we need Gluster to try harder to select the local replica. If we see that the file is large (> 10 MB), can we defer the selection process for a couple of milliseconds? Why? Chances are that the increased latency of the open() call will not matter with a large file, but the increased network traffic caused by selecting the wrong replica will matter a great deal. It should be easy to return the file size with the response if it is not already in the cache.

Here is an annotated log of a test reproducing the problem.

Delete the old files:

```
[root@gprfs025 ~]# rm -fv /mnt/glusterfs/ioz/f*ioz
removed `/mnt/glusterfs/ioz/f1.ioz'
removed `/mnt/glusterfs/ioz/f2.ioz'
removed `/mnt/glusterfs/ioz/f3.ioz'
removed `/mnt/glusterfs/ioz/f4.ioz'
```

(Re-)create the files:

```
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 0 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
Children see throughput for 4 initial writers = 650307.14 KB/sec
iozone test complete.
```

Read them -- only local replicas are read, no network traffic:

```
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
Children see throughput for 4 readers = 1128869.44 KB/sec
```

Read them again, same result:

```
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
Children see throughput for 4 readers = 1172931.12 KB/sec
```

Unmount and remount the filesystem:

```
[root@gprfs025 ~]# umount /mnt/glusterfs && mount -t glusterfs gprfs025-10ge:/kvmfs /mnt/glusterfs
```

Now read them again -- the WRONG REPLICA is selected for a subset of the files:

```
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
Children see throughput for 4 readers = 1058206.78 KB/sec
```

And the problem PERSISTS after the first read:

```
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
Children see throughput for 4 readers = 1011318.64 KB/sec
```

Here is the network activity while the read took place:

```
[root@gprfs025 ~]# ./net-poller.py 2 p1p1
checking network interfaces: ['p1p1']
found interface p1p1
poll rate = 2, network interfaces = ['p1p1']
p1p1_tx_MB/s, p1p1_rcv_MB/s, p1p1_tx_pkts/s, p1p1_rcv_pkts/s,
0.00, 0.00, 0.50, 2.50,
0.00, 0.00, 1.00, 0.00,
0.13, 0.00, 1045.50, 1.00,
1.29, 352.93, 11386.00, 13017.50,
1.33, 519.47, 11637.50, 19293.00,
1.25, 503.54, 10973.00, 18767.00,
1.02, 573.09, 8730.00, 20984.50,
0.00, 106.18, 1.00, 3894.50,
0.00, 0.00, 1.00, 1.00,
```

Here is what goes on inside glusterfs when selecting the replica; I see no attempt to select a local replica here:

```
[root@gprfs025 ~]# iozone -t 1 -w -c -e -i 0 -+n -r 4k -s 16k -F /mnt/glusterfs/ioz/f1.ioz
...
```
```
[2013-02-16 12:04:58.359349] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.359381] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.359399] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.359412] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 1
[2013-02-16 12:04:58.359420] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 1 for /
[2013-02-16 12:04:58.359428] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 1
[2013-02-16 12:04:58.360073] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360095] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360107] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.360118] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 0
[2013-02-16 12:04:58.360130] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 0 for /ioz
[2013-02-16 12:04:58.360144] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 0
[2013-02-16 12:04:58.360849] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360871] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360879] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.360886] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 1
[2013-02-16 12:04:58.360894] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 1 for /ioz/f1.ioz
[2013-02-16 12:04:58.360902] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 1
[2013-02-16 12:04:58.362620] D [afr-transaction.c:1035:afr_post_nonblocking_inodelk_cbk] 0-kvmfs-replicate-0: Non blocking inodelks done. Proceeding to FOP
```

Per Ben's "hypothesis": is gluster implementing a "greedy" algorithm, where it simply waits for the "first" server to respond and starts reading immediately? If so, is there a way to configure gluster to always try to read from the closest or local server, rather than implementing a greedy "read from the first server that responds"?

(In reply to Jay Vyas from comment #6)

> per bens "hypothesis": Is gluster implementing a "greedy" algorithm, where it simply waits for the "first" server to respond, and starts reading immediately?
>
> If so, is there a way to configure gluster to always try to read from the closest or local server, rather than implementing a greedy "read from the first server that responds"?
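The "greedy" behavior asked about here can be illustrated with a toy simulation (hypothetical Python, not GlusterFS code; AFR's real selection logic lives in the C files quoted elsewhere in this bug, and the brick names and 2 ms grace period below are assumptions): a selector that takes the first LOOKUP reply loses locality whenever the remote brick happens to answer first, while a selector that waits briefly for the local brick keeps reads local.

```python
def pick_replica_greedy(replies):
    """Choose whichever brick's reply arrives first.

    replies: list of (arrival_time_ms, brick_name) tuples.
    """
    return min(replies)[1]

def pick_replica_prefer_local(replies, local, grace_ms=2.0):
    """Wait up to grace_ms after the first reply; if the local
    brick has answered by then, choose it anyway."""
    first = min(replies)[0]
    arrived = {name for t, name in replies if t <= first + grace_ms}
    return local if local in arrived else min(replies)[1]

# Remote brick answers 0.3 ms before the local one (10-GbE is fast).
replies = [(1.0, "kvmfs-client-1"),   # remote brick
           (1.3, "kvmfs-client-0")]   # local brick

print(pick_replica_greedy(replies))                          # kvmfs-client-1
print(pick_replica_prefer_local(replies, "kvmfs-client-0"))  # kvmfs-client-0
```

The short grace period only matters for large files, per Ben's suggestion: for a multi-gigabyte read, a couple of milliseconds added to open() is noise compared with shipping the whole file across the network.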
That would be "cluster.read-subvolume" and "cluster.read-subvolume-index". These set the read_child field in "replicate", which could be configured to use the "closest" or "local" server. This hasn't been documented, since that option is currently volume-level and not useful from the Gluster CLI. It is generally nicer to use it as a client-side option at mount time, for example:

```
# mount -t glusterfs -oxlator-option=read-subvolume=<subvolume-name>
```

where subvolume-name is of the form <volname>-client-<0..N-1>. I personally haven't tested this yet, but it would be a worthwhile test for Hadoop workloads.

Gluster should be able to determine which replica is the local one automatically; in fact, in Gluster 3.4 the default behavior should always be to choose the local replica for reads. Doesn't Hadoop work with RHS 2.1 now? RHS 2.1 is based on Gluster 3.4. The original post for this bug was filed against RHS 2.0, which was based on Gluster 3.3, yes?

```
[root@ben-test-driver2-10ge glusterfs-3.4.0.24rhs]# rpm -q glusterfs
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
[root@gprfs020 ~]# gluster v set help | grep local
Option: cluster.choose-local
Description: Choose a local subvolume(i.e. Brick) to read from if read-subvolume is not explicitly set.
```

In the gluster 3.4 source tree (from the glusterfs-3.4.0.24rhs source RPM), ./xlators/cluster/afr/src/afr.c:

```
{ .key = {"choose-local"},
  .type = GF_OPTION_TYPE_BOOL,
  .default_value = "true",
  .description = "Choose a local subvolume(i.e. Brick) to read from if "
                 "read-subvolume is not explicitly set.",
},
```

Note that it is on by default. If we have evidence that this is not working, it is a bug. You should be able to see it choosing the local replica; from line 2183 of afr-common.c in the same directory:

```
gf_log (this->name, GF_LOG_INFO, "selecting local read_child %s",
        priv->children[child_index]->name);
```

I just used "sar -n DEV 2" and FUSE-mounted a 1 GB file from the 2 servers in the replication pair and from a server outside the replication pair. The 2 servers in the replication pair never read the file over the network -- 10-GbE network traffic was zero while I continually read the file. The server outside the replication pair always read the file over the network, of course, and you could see the traffic leaving one of the servers in the replication pair. So unless you have some evidence that Gluster is not choosing the local replica in RHS 2.1, I suggest we mark this fixed, since this bug was targeted at problems involving the choice of a non-local replica.

Because of the large number of bugs filed against it, the "mainline" version is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.
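The net-poller.py helper used earlier in this report is not included. A minimal stand-in (a hypothetical sketch; interface name and poll interval are assumptions) can read the same per-interface byte and packet counters that "sar -n DEV" reports, straight from /proc/net/dev, and print MB/s and pkts/s deltas in the same column order as the output above:

```python
import time

def read_counters(text, iface):
    """Parse /proc/net/dev content; return (rx_bytes, tx_bytes,
    rx_pkts, tx_pkts) for the named interface."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, fields = line.split(":", 1)
        if name.strip() == iface:
            f = fields.split()
            # field layout: rx bytes=0, rx packets=1, tx bytes=8, tx packets=9
            return int(f[0]), int(f[8]), int(f[1]), int(f[9])
    raise ValueError("interface %s not found" % iface)

def poll(iface="p1p1", interval=2.0):
    """Sample the interface every `interval` seconds and print rate deltas."""
    prev = read_counters(open("/proc/net/dev").read(), iface)
    print("%s_tx_MB/s, %s_rcv_MB/s, %s_tx_pkts/s, %s_rcv_pkts/s,"
          % (iface, iface, iface, iface))
    while True:
        time.sleep(interval)
        cur = read_counters(open("/proc/net/dev").read(), iface)
        d = [(c - p) / interval for c, p in zip(cur, prev)]
        # d = [rx_bytes/s, tx_bytes/s, rx_pkts/s, tx_pkts/s]
        print("%.2f, %.2f, %.2f, %.2f," % (d[1] / 2**20, d[0] / 2**20, d[3], d[2]))
        prev = cur
```

Running poll() on one replica server while a client outside the replication pair reads a large file should show sustained tx traffic; running it while a client inside the pair reads should show near-zero traffic if choose-local is working.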