Bug 1278565

Summary: samba-vfs-glusterfs reads all from a single brick in gluster replica
Product: [Community] GlusterFS
Reporter: Adam Neale <aneale>
Component: replicate
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Version: 3.6.6
CC: atalur, bugs, kdhananj, madam, pkarampu, ravishankar
Keywords: Triaged
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-06-22 10:19:39 UTC
Type: Bug

Attachments:
testparm output for samba configuration used

Description Adam Neale 2015-11-05 19:49:24 UTC
Created attachment 1090321 [details]
testparm output for samba configuration used

Description of problem:

Using samba-vfs-glusterfs with a gluster volume set to replica 4, all read calls (multiple reads to/from multiple files) are served from a single brick. When the same test is run over a FUSE mount instead, reads are spread across all replicas, resulting in higher ops/s.

samba-vfs-glusterfs:
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
Brick: lzreptest001:/export/brick2/lzone
Brick: lzreptest002:/export/brick1/lzone
Brick: lzreptest002:/export/brick2/lzone
      1.57      51.21 us      21.00 us    2974.00 us          13815        READ

Fuse mount:
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
      0.95     159.62 us      22.00 us   14080.00 us           7887        READ
Brick: lzreptest001:/export/brick2/lzone
      0.94     170.46 us      20.00 us    2985.00 us           7490        READ
Brick: lzreptest002:/export/brick1/lzone
      0.88     267.29 us      26.00 us   17699.00 us           7853        READ
Brick: lzreptest002:/export/brick2/lzone
      0.90     258.96 us      25.00 us   53134.00 us           7903        READ

Version-Release number of selected component (if applicable):
glusterfs-3.6.6-1.el6.x86_64
glusterfs-api-3.6.6-1.el6.x86_64
glusterfs-cli-3.6.6-1.el6.x86_64
glusterfs-fuse-3.6.6-1.el6.x86_64
glusterfs-geo-replication-3.6.6-1.el6.x86_64
glusterfs-libs-3.6.6-1.el6.x86_64
glusterfs-server-3.6.6-1.el6.x86_64

samba-4.1.17-4.el6rhs.x86_64
samba-client-4.1.17-4.el6rhs.x86_64
samba-common-4.1.17-4.el6rhs.x86_64
samba-libs-4.1.17-4.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-4.el6rhs.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Create a cluster with 2 servers and 4 bricks, replica 4 (see the sketch after this list)
2. Set up Samba with the gluster VFS module
3. Turn on gluster profiling for the volume
4. Mount the Samba share on a remote box and run "filebench" with the fileserver profile
5. Observe the results in the gluster profile output
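For reference, a minimal sketch of the setup commands, assuming the volume name and brick paths shown in the profile output above:

# Create and start a replica-4 volume across the two servers,
# then enable profiling so per-brick READ counts can be compared.
gluster volume create lzone replica 4 \
    lzreptest001:/export/brick1/lzone lzreptest001:/export/brick2/lzone \
    lzreptest002:/export/brick1/lzone lzreptest002:/export/brick2/lzone
gluster volume start lzone
gluster volume profile lzone start

And a sketch of the kind of share definition vfs_glusterfs expects; the option values here are illustrative assumptions, not the reporter's exact configuration (that is in the attached testparm output):

[lzone]
    # path is relative to the root of the gluster volume
    path = /
    vfs objects = glusterfs
    glusterfs:volume = lzone
    glusterfs:volfile_server = localhost
    # recommended with vfs_glusterfs, since there is no kernel-level file
    kernel share modes = no
    read only = no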

Actual results:
Only a single brick used for READ calls.
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
Brick: lzreptest001:/export/brick2/lzone
Brick: lzreptest002:/export/brick1/lzone
Brick: lzreptest002:/export/brick2/lzone
      1.57      51.21 us      21.00 us    2974.00 us          13815        READ

Expected results:
All bricks used for READ calls.
gluster volume profile lzone info | egrep "Brick|READ$"
Brick: lzreptest001:/export/brick1/lzone
      0.95     159.62 us      22.00 us   14080.00 us           7887        READ
Brick: lzreptest001:/export/brick2/lzone
      0.94     170.46 us      20.00 us    2985.00 us           7490        READ
Brick: lzreptest002:/export/brick1/lzone
      0.88     267.29 us      26.00 us   17699.00 us           7853        READ
Brick: lzreptest002:/export/brick2/lzone
      0.90     258.96 us      25.00 us   53134.00 us           7903        READ

Additional info:

Comment 1 Michael Adam 2015-11-13 13:33:11 UTC
Samba does not do anything with bricks.
Might be a libgfapi bug.

Comment 2 Niels de Vos 2015-11-13 15:06:23 UTC
On a replicated volume, READs are served by the brick whose reply to the LOOKUP arrived first (decided by the AFR xlator). To explain the reported behaviour, it is important to know where the volume/share is mounted, and what server was used for mounting.

For example, I guess you have a setup like this:
- 4 gluster storage servers
    - one of these storage servers exports the volume over Samba
- 1 client system (not on a storage server)

AFR takes care of the replication and talks to the brick processes running on all the storage servers. This is all done on the Gluster client side (FUSE mount, or vfs_glusterfs/libgfapi). When a file is opened for reading, a LOOKUP is done as the first step; it is sent to all the bricks in the volume.

When mounting the Gluster volume over FUSE on a client system, AFR runs on the client side too. All storage servers are connected equally over a network. Whichever storage server replies to the LOOKUP first will be used to READ from. The first replies will come more or less randomly from the different storage servers, and the load caused by READ procedures is distributed relatively evenly.
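For illustration, a FUSE mount from a separate client; the server named in the mount only supplies the volfile, after which AFR connects to every brick directly:

# AFR runs on the client; all four bricks are equally remote, so the
# first LOOKUP replies (and therefore the READs) spread across replicas.
mount -t glusterfs lzreptest001:/lzone /mnt/lzone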

If the Gluster client (vfs_glusterfs/libgfapi) is running locally on a Gluster server, the brick on 'localhost' will most of the time be the quickest to reply to the LOOKUP. The other bricks sit across a network connection and will normally need more time to reply. In the "gluster volume profile" output you will see that the server running Samba handles most of the READ requests.
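One way to see this from the profile data already being collected, reusing the egrep from the report with LOOKUP added:

# LOOKUPs should still be counted on every brick, while the READs
# collapse onto the brick local to the Samba server.
gluster volume profile lzone info | egrep "Brick|LOOKUP$|READ$"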


I hope this explains it well. Please let us know if my assumptions are incorrect. The developers working on AFR have been added to the CC of this bug and will be able to answer in more detail.

We might want to place a description like this in our documentation on https://gluster.readthedocs.org/ too, but I'm leaving that for others to do (feel free to copy/paste).

Comment 3 Pranith Kumar K 2016-06-22 10:19:39 UTC
"gluster volume set <volname> read-hash-mode 1" will serve reads based on gfid hash which will distribute the reads.