I'm seeing a problem on my fairly fresh RHEL gluster install. Smells to me like a parallelism problem on the server.

If I mount a gluster volume via NFS (using glusterd's built-in NFS server, not the kernel's nfs-kernel-server) and read a directory from multiple clients *in parallel*, I get inconsistent results across the clients. Some files are missing from the directory listing, some may be present twice! Exactly which files (or directories!) are missing/duplicated varies each time, but I can very consistently reproduce the behaviour.

You can see a screenshot here: http://imgur.com/JU8AFrt

The reproduction steps are:

* clusterssh to each NFS client
* unmount /gv0 (to clear cache)
* mount /gv0 [1]
* ls -al /gv0/common/apache-jmeter-2.9/bin (which is where I first noticed this)

Here's the rub: if, instead of doing the 'ls' in parallel, I do it in series, it works just fine (consistent, correct results everywhere). But hitting the gluster server from multiple clients at the same time causes problems. I can still stat() and open() the files missing from the directory listing; they just don't show up in an enumeration. Mounting gv0 as a gluster client filesystem works just fine.

Details of my setup:

2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit, glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)
4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit, glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for testing)

gv0 volume information is below. Bricks are 400GB SSDs with ext4 [2]. The common network is 10GbE; replication between servers happens over a direct 10GbE link. I will be testing on xfs/btrfs/zfs eventually, but for now I'm on ext4.

Also attached is my chatlog from asking about this in #gluster.

[1]: fstab line is:
fearless1:/gv0 /gv0 nfs defaults,sync,tcp,wsize=8192,rsize=8192 0 0

[2]: yes, I've turned off dir_index to avoid That Bug. I've run the d_off test; results are here: http://pastebin.com/zQt5gZnZ

----

gluster> volume info gv0

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: off
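The shared-cursor failure mode that the fix further down this report describes is easy to demonstrate outside Gluster. Below is a hypothetical standalone C demo (not GlusterFS code): two threads enumerate one directory through a single shared fd, each seeking back to "its" remembered offset before every getdents64() call, which is roughly what the NFS anonymous-fd path does. Run against a large directory, the two entry counts typically disagree with each other and with the real count.

/*
 * Hypothetical standalone demo -- NOT GlusterFS code -- of two readers
 * sharing one directory cursor. Each thread tries to enumerate the
 * directory independently, remembering "its" offset and seeking back
 * before every getdents64() call, but because the seek position lives in
 * the one shared fd, the threads drag the cursor out from under each
 * other, so individual listings can miss or duplicate entries.
 *
 * Build: gcc -O2 -pthread readdir-race.c -o readdir-race
 * Run:   ./readdir-race /some/large/directory   (compare the two counts)
 */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

static int shared_fd;                     /* one fd shared by both threads */

static void *enumerate(void *name)
{
    off_t my_off = 0;                     /* this thread's idea of the cursor */
    char  buf[4096];
    long  count = 0;
    int   spins = 0;

    while (spins++ < 100000) {            /* guard against pathological ping-pong */
        /* Seek to where *we* think we left off ... */
        if (lseek(shared_fd, my_off, SEEK_SET) < 0)
            break;
        /* ... but the other thread may seek/read in between these two calls. */
        long n = syscall(SYS_getdents64, shared_fd, buf, sizeof(buf));
        if (n <= 0)
            break;
        for (long pos = 0; pos < n; count++) {
            struct dirent64 *d = (struct dirent64 *)(buf + pos);
            my_off = d->d_off;            /* remember the last entry's offset */
            pos += d->d_reclen;
        }
    }
    printf("thread %s saw %ld entries\n", (char *)name, count);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 2 || (shared_fd = open(argv[1], O_RDONLY | O_DIRECTORY)) < 0) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    pthread_t a, b;
    pthread_create(&a, NULL, enumerate, "A");
    pthread_create(&b, NULL, enumerate, "B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    close(shared_fd);
    return 0;
}

Running the two loops one after the other (the serial 'ls' case above) gives correct, consistent results, because nothing else moves the cursor between iterations.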
*** Bug 948088 has been marked as a duplicate of this bug. ***
*** Bug 948087 has been marked as a duplicate of this bug. ***
REVIEW: http://review.gluster.org/4963 (posix: fix dangerous "sharing" of fd in readdir between two requests) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur)
COMMIT: http://review.gluster.org/4963 committed in release-3.4 by Anand Avati (avati)
------
commit 5ac55756cd923e4bb1e5b5df50aeaf198d5531b7
Author: Anand Avati <avati>
Date:   Wed Apr 3 16:31:07 2013 -0700

    posix: fix dangerous "sharing" of fd in readdir between two requests

    posix_fill_readdir() is a multi-step function which performs many
    readdir() calls, and expects the directory cursor to have not "seeked
    away" elsewhere between two successive iterations.

    Usually this is not a problem as each opendir() from an application
    has its own backend fd, and there is nobody else to "seek away" the
    directory cursor.

    However in case of NFS's use of anonymous fd, the same fd_t is shared
    between all NFS readdir requests, and two readdir loops can be
    executing in parallel on the same dir dragging away the cursor in a
    chaotic manner.

    The fix in this patch is to lock on the fd around the loop. Another
    approach could be to reimplement posix_fill_readdir() with a single
    getdents() call, but that's for another day.

    Change-Id: Ia42e9c7fbcde43af4c0d08c20cc0f7419b98bd3f
    BUG: 948086
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/4774
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-on: http://review.gluster.org/4963
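In other words, the committed change holds the fd's lock across the whole multi-step readdir loop, so a second request arriving on the same (anonymous) fd cannot seek the shared directory cursor away between two iterations of the first. Below is a schematic sketch of that pattern only; fd_ctx_t, fill_readdir and the field names are illustrative, not the actual posix_fill_readdir() code.

/*
 * Schematic sketch of the fix pattern described above, not the actual
 * GlusterFS patch: the entire readdir loop is serialized on a per-fd
 * lock, so concurrent requests on the same fd run one after the other
 * instead of interleaving. Names are illustrative only.
 */
#include <dirent.h>
#include <pthread.h>
#include <string.h>

typedef struct {
    DIR             *dir;   /* backend directory stream behind this fd */
    pthread_mutex_t  lock;  /* per-fd lock, analogous to locking the fd_t */
} fd_ctx_t;

/* Collect up to max_names entries (roughly size bytes), resuming at offset. */
static int fill_readdir(fd_ctx_t *ctx, long offset, size_t size,
                        char *names[], int max_names)
{
    int    count  = 0;
    size_t filled = 0;

    pthread_mutex_lock(&ctx->lock);       /* hold the fd for the WHOLE loop */

    seekdir(ctx->dir, offset);            /* position the shared cursor once */

    while (filled < size && count < max_names) {
        struct dirent *entry = readdir(ctx->dir);
        if (entry == NULL)
            break;                        /* end of directory */
        /* Nobody else can seekdir() this stream while we iterate. */
        names[count++] = strdup(entry->d_name);
        filled += strlen(entry->d_name) + 1;
    }

    pthread_mutex_unlock(&ctx->lock);     /* cursor may move again now */
    return count;
}

The alternative mentioned in the commit message, fetching the whole chunk with a single getdents() call, would remove the window between iterations entirely rather than serialize around it.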