Bug 1278565 - samba-vfs-glusterfs reads all from a single brick in gluster replica
Summary: samba-vfs-glusterfs reads all from a single brick in gluster replica
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.6.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-05 19:49 UTC by Adam Neale
Modified: 2016-06-22 10:19 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-06-22 10:19:39 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
testparm output for samba configuration used (1.70 KB, text/plain)
2015-11-05 19:49 UTC, Adam Neale

Description Adam Neale 2015-11-05 19:49:24 UTC
Created attachment 1090321
testparm output for samba configuration used

Description of problem:

Using samba-vfs-glusterfs with gluster set to replica 4, all read calls (multiple reads to/from multiple files) are served from a single brick. When the same test is run using a FUSE mount instead, reads are served from all replicas, resulting in higher ops/s.

samba-vfs-glusterfs:
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
Brick: lzreptest001:/export/brick2/lzone
Brick: lzreptest002:/export/brick1/lzone
Brick: lzreptest002:/export/brick2/lzone
      1.57      51.21 us      21.00 us    2974.00 us          13815        READ

Fuse mount:
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
      0.95     159.62 us      22.00 us   14080.00 us           7887        READ
Brick: lzreptest001:/export/brick2/lzone
      0.94     170.46 us      20.00 us    2985.00 us           7490        READ
Brick: lzreptest002:/export/brick1/lzone
      0.88     267.29 us      26.00 us   17699.00 us           7853        READ
Brick: lzreptest002:/export/brick2/lzone
      0.90     258.96 us      25.00 us   53134.00 us           7903        READ

Version-Release number of selected component (if applicable):
glusterfs-3.6.6-1.el6.x86_64
glusterfs-api-3.6.6-1.el6.x86_64
glusterfs-cli-3.6.6-1.el6.x86_64
glusterfs-fuse-3.6.6-1.el6.x86_64
glusterfs-geo-replication-3.6.6-1.el6.x86_64
glusterfs-libs-3.6.6-1.el6.x86_64
glusterfs-server-3.6.6-1.el6.x86_64

samba-4.1.17-4.el6rhs.x86_64
samba-client-4.1.17-4.el6rhs.x86_64
samba-common-4.1.17-4.el6rhs.x86_64
samba-libs-4.1.17-4.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-4.el6rhs.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Create a cluster with 2 servers and 4 bricks, configured as replica 4 (see the sketch below)
2. Set up Samba with the gluster VFS module
3. Turn on gluster profiling for the volume
4. Mount the Samba share on a remote box and run "filebench" with the fileserver profile
5. Observe the results in the gluster profile output
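
A minimal sketch of these steps, assuming the hostnames and brick paths from the profile output above; the mount point, Samba user, smb.conf parameters and the filebench workload path are illustrative placeholders and may differ from the actual setup (the real share configuration is in the attached testparm output):

# 1. Create and start the replica-4 volume (run on one of the storage servers);
#    "force" is needed because two bricks of the replica set live on the same server
gluster peer probe lzreptest002
gluster volume create lzone replica 4 \
    lzreptest001:/export/brick1/lzone lzreptest001:/export/brick2/lzone \
    lzreptest002:/export/brick1/lzone lzreptest002:/export/brick2/lzone force
gluster volume start lzone

# 2. Example Samba share using the gluster VFS module (smb.conf on the storage server)
#    [lzone]
#        path = /
#        read only = no
#        kernel share modes = no
#        vfs objects = glusterfs
#        glusterfs:volume = lzone
service smb restart

# 3. Turn on profiling for the volume
gluster volume profile lzone start

# 4. On the remote client: mount the share and run the fileserver workload
#    (workload file location varies between filebench packages)
mount -t cifs //lzreptest001/lzone /mnt/lzone -o username=testuser
filebench -f /usr/share/filebench/workloads/fileserver.f

# 5. Check the per-brick READ counts
gluster volume profile lzone info | egrep "Brick|READ$"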

Actual results:
Only a single brick used for READ calls.
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
Brick: lzreptest001:/export/brick2/lzone
Brick: lzreptest002:/export/brick1/lzone
Brick: lzreptest002:/export/brick2/lzone
      1.57      51.21 us      21.00 us    2974.00 us          13815        READ

Expected results:
All bricks used for READ calls.
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
      0.95     159.62 us      22.00 us   14080.00 us           7887        READ
Brick: lzreptest001:/export/brick2/lzone
      0.94     170.46 us      20.00 us    2985.00 us           7490        READ
Brick: lzreptest002:/export/brick1/lzone
      0.88     267.29 us      26.00 us   17699.00 us           7853        READ
Brick: lzreptest002:/export/brick2/lzone
      0.90     258.96 us      25.00 us   53134.00 us           7903        READ

Additional info:

Comment 1 Michael Adam 2015-11-13 13:33:11 UTC
Samba does not do anything with bricks.
Might be a libgfapi bug.

Comment 2 Niels de Vos 2015-11-13 15:06:23 UTC
On a replicated volume, READs are served by the brick that replies to a LOOKUP first (as decided by the AFR xlator). To explain the reported behaviour, it is important to know where the volume/share is mounted and which server was used for mounting.

For example, I guess you have a setup like this:
- 4 gluster storage servers
    - one of these storage servers exports the volume over Samba
- 1 client system (not on a storage server)

AFR takes care of the replication and talks to the brick processes running on all the storage servers. This is all done on the Gluster client side (FUSE mount, or vfs_glusterfs/libgfapi). When a file is opened for reading, a LOOKUP is done as the first step; it is sent to all the bricks in the volume.

When mounting the Gluster volume over FUSE on a client system, AFR runs on the client side too. All storage servers are connected equally over the network. Whichever storage server replies to the LOOKUP first will be used for READs. The first replies come more or less randomly from the different storage servers, so the load caused by READ procedures is distributed relatively evenly.
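
For comparison, a FUSE mount on the client system can be set up as sketched below (mount point is a placeholder); with AFR running on the client, the profile output then shows READs spread over all four bricks, as in the report:

# On the client system (not a storage server): mount the volume over FUSE
mount -t glusterfs lzreptest001:/lzone /mnt/lzone-fuse

# Run the same workload against the FUSE mount, then compare per-brick READ counts
# (the profile command itself is run on one of the storage servers)
gluster volume profile lzone info | egrep "Brick|READ$"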

If the Gluster client (vfs_glusterfs/libgfapi) runs locally on a Gluster server, the brick on 'localhost' will most of the time be the quickest to reply to the LOOKUP. The other bricks are reached over a network connection and will normally need more time to reply. As a result, the "gluster volume profile" output shows the server running Samba handling most of the READ requests.


I hope that this explains it well. Please let us know if my assumptions are incorrect. The developers working on AFR have been added to CC on this bug, and they will be able to answer in more detail.

We might want to place a description like this in our documentation on https://gluster.readthedocs.org/ too, but I'm leaving that for others to do (feel free to copy/paste).

Comment 3 Pranith Kumar K 2016-06-22 10:19:39 UTC
"gluster volume set <volname> read-hash-mode 1" will serve reads based on gfid hash which will distribute the reads.

