Bug 831219

Summary: Hadoop/Gluster reads perform poorly with 2-replica filesystem
Product: [Community] GlusterFS
Reporter: Ben England <bengland>
Component: gluster-hadoop
Assignee: Diane Feddema <dfeddema>
Status: CLOSED EOL
QA Contact: hcfs-gluster-bugs
Severity: high
Priority: high
Version: mainline
CC: bugs, cww, eboyd, enakai, esammons, gluster-bugs, matt, mbukatov, perfbz, poelstra, rwheeler
Target Milestone: ---
Target Release: ---
Keywords: Reopened, Triaged
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-10-22 15:46:38 UTC

Description Ben England 2012-06-12 14:00:35 UTC
Description of problem:

Hadoop applications performing reads pull a large fraction of their data from non-local replicas, EVEN when the network interfaces are run at 1-GbE speed instead of 10-GbE speed.  This defeats a key design principle of Hadoop: being able to run map-reduce jobs without transferring large amounts of data over the network.   Note that this problem does NOT occur in a 1-replica filesystem, meaning that the Hadoop plugin did its job but the Gluster FUSE client did not.  Also, HDFS does not show this behavior in a 2-replica configuration.

Version-Release number of selected component (if applicable):

RHS 2.0 RC1

How reproducible:

Very

Steps to Reproduce:
1. load a few hundred 60-MB files into Hadoop
2. run a hadoop grep job that matches at most a few records within these files (an illustrative command is shown below the steps).
3. observe network performance
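
For reference, one way to drive step 2 is with the grep example that ships with the Hadoop distribution; the jar name, input/output paths and pattern below are illustrative and will need adjusting for the local install:

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar grep \
    /user/test/60mb-files /user/test/grep-out 'pattern-that-rarely-matches'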
  
Actual results:

There is a large amount of network throughput (see Tim Wilkinson's data at 
http://irish.lab.bos.redhat.com/pub/tim/projects/gluster/hadoop/hdfs-glfs.ods )

Expected results:

The Gluster plugin for Hadoop should help Hadoop run jobs against local replicas, and the Gluster FUSE client should select a local replica when one is available.

Additional info:

Comment 2 RHEL Program Management 2012-07-20 06:56:00 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 3 Vidya Sakar 2012-07-20 10:23:01 UTC
Not a priority for RHS 2.1.0, but we need this bug open nevertheless.

Comment 5 Ben England 2013-02-16 17:46:15 UTC
I still see this bug in glusterfs-3.3.0.5rhs-40.el6rhs.x86_64 using a simple iozone test on the mountpoint, without using Hadoop.  It is not easy to see -- the first test I ran did not show the problem, but after a remount it did show up.  Hypothesis: perhaps the order in which responses are received from the glusterfsd brick servers determines which replica is selected.  Because 10-GbE is fast, a response can come in from a remote server before the local server's response is received.  And does glusterfs not have logic to be patient and wait for the 2nd response to arrive?

In light of the design goal to collapse the Hadoop and virt tiers into the storage tier, we need to have Gluster try harder to select the local replica.  If we see that the file is large (> 10 MB), can we defer the selection process for a couple of milliseconds?  Why?  Chances are that the increased latency of the open() call will not matter for a large file, but the increased network traffic caused by selecting the wrong replica will matter a great deal.

It should be easy to return the file size with the response if it is not already in the cache.



---------- and here is an annotated log of a test reproducing the problem --------

---- Delete the old files ----

[root@gprfs025 ~]# rm -fv /mnt/glusterfs/ioz/f*ioz
removed `/mnt/glusterfs/ioz/f1.ioz'
removed `/mnt/glusterfs/ioz/f2.ioz'
removed `/mnt/glusterfs/ioz/f3.ioz'
removed `/mnt/glusterfs/ioz/f4.ioz'

----- (Re-)Create the files

[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 0 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
        Children see throughput for  4 initial writers  =  650307.14 KB/sec


---- read them, only local replicas are read, no network traffic -------
iozone test complete.
[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
        Children see throughput for  4 readers          = 1128869.44 KB/sec

----- read them again, same result ------

[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
        Children see throughput for  4 readers          = 1172931.12 KB/sec

----- unmount and remount the filesystem

[root@gprfs025 ~]# umount /mnt/glusterfs && mount -t glusterfs gprfs025-10ge:/kvmfs /mnt/glusterfs 

------ now read them again, WRONG REPLICA selected for a subset of files -----

[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
        Children see throughput for  4 readers          = 1058206.78 KB/sec

------ and the problem PERSISTS after the first read -------

[root@gprfs025 ~]# iozone -t 4 -w -c -e -i 1 -+n -r 64k -s 2g -F /mnt/glusterfs/ioz/f{1,2,3,4}.ioz
        Children see throughput for  4 readers          = 1011318.64 KB/sec

-------- here's network activity while the read took place --------

[root@gprfs025 ~]# ./net-poller.py 2 p1p1
checking network interfaces: ['p1p1']
found interface p1p1
poll rate = 2, network interfaces = ['p1p1']
 p1p1_tx_MB/s, p1p1_rcv_MB/s, p1p1_tx_pkts/s, p1p1_rcv_pkts/s,
      0.00,      0.00,      0.50,      2.50,
      0.00,      0.00,      1.00,      0.00,
      0.13,      0.00,   1045.50,      1.00,
      1.29,    352.93,  11386.00,  13017.50,
      1.33,    519.47,  11637.50,  19293.00,
      1.25,    503.54,  10973.00,  18767.00,
      1.02,    573.09,   8730.00,  20984.50,
      0.00,    106.18,      1.00,   3894.50,
      0.00,      0.00,      1.00,      1.00,
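
net-poller.py is a local helper script; roughly the same per-interface throughput can be read with sar from the sysstat package (the same tool used in comment 8 below):

sar -n DEV 2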



Here is what goes on inside glusterfs when selecting the replica; I see no attempt to select a local replica here.

[root@gprfs025 ~]# iozone -t 1 -w -c -e -i 0 -+n -r 4k -s 16k -F /mnt/glusterfs/ioz/f1.ioz
...
[2013-02-16 12:04:58.359349] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.359381] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.359399] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.359412] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 1
[2013-02-16 12:04:58.359420] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 1 for /
[2013-02-16 12:04:58.359428] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 1
[2013-02-16 12:04:58.360073] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360095] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360107] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.360118] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 0
[2013-02-16 12:04:58.360130] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 0 for /ioz
[2013-02-16 12:04:58.360144] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 0
[2013-02-16 12:04:58.360849] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360871] D [afr-self-heal-common.c:139:afr_sh_print_pending_matrix] 0-kvmfs-replicate-0: pending_matrix: [ 0 0 ]
[2013-02-16 12:04:58.360879] D [afr-self-heal-common.c:829:afr_mark_sources] 0-kvmfs-replicate-0: Number of sources: 0
[2013-02-16 12:04:58.360886] D [afr-self-heal-data.c:863:afr_lookup_select_read_child_by_txn_type] 0-kvmfs-replicate-0: returning read_child: 1
[2013-02-16 12:04:58.360894] D [afr-common.c:1294:afr_lookup_select_read_child] 0-kvmfs-replicate-0: Source selected as 1 for /ioz/f1.ioz
[2013-02-16 12:04:58.360902] D [afr-common.c:1097:afr_lookup_build_response_params] 0-kvmfs-replicate-0: Building lookup response from 1
[2013-02-16 12:04:58.362620] D [afr-transaction.c:1035:afr_post_nonblocking_inodelk_cbk] 0-kvmfs-replicate-0: Non blocking inodelks done. Proceeding to FOP
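
The debug-level AFR messages above only show up when the client log level is raised; one way to do that (an assumption on my part, not necessarily how this trace was captured) is to remount with the log-level option:

# mount -t glusterfs -o log-level=DEBUG gprfs025-10ge:/kvmfs /mnt/glusterfs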

Comment 6 Jay Vyas 2013-11-19 01:27:11 UTC
Per Ben's "hypothesis": is gluster implementing a "greedy" algorithm, where it simply waits for the "first" server to respond and starts reading immediately?

If so, is there a way to configure gluster to always try to read from the closest or local server, rather than implementing a greedy "read from the first server that responds"?

Comment 7 Harshavardhana 2013-11-19 02:03:56 UTC
(In reply to Jay Vyas from comment #6)
> Per Ben's "hypothesis": is gluster implementing a "greedy" algorithm, where
> it simply waits for the "first" server to respond and starts reading
> immediately?
>
> If so, is there a way to configure gluster to always try to read from the
> closest or local server, rather than implementing a greedy "read from the
> first server that responds"?

That would be "cluster.read-subvolume" and "cluster.read-subvolume-index"; these set the read_child field in "replicate", which could be configured to use the "closest" or "local" server.

This hasn't been documented, since currently the option is volume-level, which is not useful from the "Gluster CLI".

It is generally nicer to use it as a client-side option at mount time, for example:

# mount -t glusterfs -o xlator-option=read-subvolume=<subvolume-name> <server>:/<volname> <mount-point>

subvolume-name is of the form <volname>-client-<0..N-1>.

I personally haven't tested this yet, but it would be a worthwhile test for Hadoop workloads.
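
For a two-brick replica volume like the kvmfs volume used elsewhere in this report, a concrete (equally untested) form of the above, scoping the option to the volume's replicate translator, might be:

# mount -t glusterfs -o xlator-option=kvmfs-replicate-0.read-subvolume=kvmfs-client-0 gprfs025-10ge:/kvmfs /mnt/glusterfs

where kvmfs-client-0 stands for whichever client subvolume points at the brick local to the mounting host.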

Comment 8 Ben England 2013-12-04 20:01:03 UTC
Gluster should be able to determine which replica is the local one automatically, and in fact in Gluster 3.4 the default behavior should always be to read from the local replica.

Doesn't Hadoop work with RHS 2.1 now?  RHS 2.1 is based on Gluster 3.4.  The original post for this bug was filed against RHS 2.0, which was based on Gluster 3.3, yes?

[root@ben-test-driver2-10ge glusterfs-3.4.0.24rhs]#  rpm -q glusterfs
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64

[root@gprfs020 ~]# gluster v set help | grep local
Option: cluster.choose-local
Description: Choose a local subvolume(i.e. Brick) to read from if read-subvolume is not explicitly set.

In the gluster 3.4 source tree, from the glusterfs-3.4.0.24rhs source RPM:

./xlators/cluster/afr/src/afr.c

        { .key  = {"choose-local" },
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "true",
          .description = "Choose a local subvolume(i.e. Brick) to read from if "
                         "read-subvolume is not explicitly set.",
        },

Note that it is on by default.  If we have evidence that this is not working, it is a bug.  You should be able to see it choosing the local replica.
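
If a volume has had it turned off, it can be checked and re-enabled from the CLI; a minimal sketch, using the kvmfs volume name from comment 5:

gluster volume info kvmfs                            # reconfigured options are listed here; choose-local is absent when left at its default
gluster volume set kvmfs cluster.choose-local on     # explicitly enable local-replica reads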

From line 2183 of afr-common.c in the same directory:

                gf_log (this->name, GF_LOG_INFO,
                        "selecting local read_child %s",
                        priv->children[child_index]->name);

I just used "sar -n DEV 2" and read a 1-GB file through FUSE mounts on the 2 servers in the replication pair and on a server outside the replication pair.  The 2 servers in the replication pair never read the file over the network -- 10-GbE network traffic was zero while I continually read the file.  The server outside the replication pair always read the file over the network, of course, and you could see the traffic leaving one of the servers in the replication pair.
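
A rough sketch of that check (file path, size and interface name are illustrative; drop the page cache between runs so re-reads actually go to the bricks):

sar -n DEV 2 &                                         # watch per-interface traffic
echo 3 > /proc/sys/vm/drop_caches                      # drop the client page cache
dd if=/mnt/glusterfs/ioz/f1.ioz of=/dev/null bs=1M     # read a replicated file through the FUSE mount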

So unless you have some evidence that Gluster is not choosing the local replica in RHS 2.1, I suggest we mark this fixed, since this bug was targeted at problems involving choice of non-local replica.

Comment 9 Kaleb KEITHLEY 2015-10-22 15:46:38 UTC
Because of the large number of bugs filed against the "mainline" version, which is ambiguous and about to be removed as a choice, this bug is being closed.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.