Bug 1070539
| Summary: | Very slow Samba Directory Listing when many files or sub-directories | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Jeff Byers <jbyers> |
| Component: | gluster-smb | Assignee: | Ira Cooper <ira> |
| Status: | CLOSED EOL | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.4.2 | CC: | bengland, bugs, gluster-bugs, ira, jarrpa, mpillai, pb, vagarwal, zab |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| : | 1397179 (view as bug list) | | |
| Last Closed: | 2015-10-07 13:49:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jeff Byers
2014-02-27 03:46:36 UTC
I didn't yet have the time to investigate it as thoroughly as Jeff did, but we're experiencing the same behavior with the setup at our institution.

I have seen this as well and have narrowed it down to performing a stat call. You can see this by stracing ls calls: unalias your ls call (\ls) and omit all options to ls. This effectively tells ls to simply do a readdir, which is very fast, even from gluster. Using ls with options like --color (a common default alias, FYI) or -l tells ls to stat everything it finds, to determine what type of thing it is or to get additional data about each thing. That stat call is apparently incredibly expensive within gluster. (A sketch of this comparison appears after this thread.) Because of this, we have re-architected the systems that tie into the datastore to avoid performing stat calls whenever possible. CAVEAT: bypassing stat calls means that you will not perform dynamic healing, as gluster, I believe, ties into stat calls in order to check replica consistency.

Thanks! Happy to hear that you found something. If I understood you correctly, the change you've proposed, to avoid performing stat calls whenever possible, does not affect gluster in distributed mode, right?

No, my environment is a distributed, triple-replicated volume spanning 24 raided bricks across 4 nodes. All told, 56 TB usable. We have a custom map-reduce implementation that makes heavy use of gluster while avoiding stat calls. I haven't seen any issue with it.

I see. But for a simple gluster setup in distributed mode, without any replication, there would be no dynamic healing, if I understood it correctly.

I would imagine so, yes. A distribute-only volume has no ability to heal. A dev should answer this, though, as I do not know what the stat call hooks gluster uses actually do in a distribute-only volume. I am only assuming the hooks not only exist but also do something, because avoiding them helps directory listing performance.

Added Manoj and Ira to cc list. So does this customer need to use ACLs? This may be part of the reason that it's so slow. Gluster implemented a READDIRPLUS FOP that was intended to speed up precisely this case. However, I don't think READDIRPLUS includes ACL info and extended attr info (can a developer please confirm?). So if CIFS requires that ACL info or extended attributes be read before the listing can be completed, then you still have the same problem we had before READDIRPLUS, namely >= 1 round trip per file. By the way, READDIRPLUS does not return many files in one round trip, certainly nowhere near as many as it needs to in this case. But I don't think that's the cause of this problem.

To confirm the analysis, could someone get a tcpdump file from the SMB server with

```
# tcpdump -i any -w /tmp/a.tcpdump -s 9000 -c 100000
# gzip /tmp/a.tcpdump
```

and post it in this bz as an attachment or in Red Hat's FTP dropbox site? Did any of the above tests turn off ACLs?

Another way to confirm it is to use profiling commands in Gluster while you are running a browser test:

```
gluster volume profile your-volume start
gluster volume profile your-volume info > /tmp/junk.tmp
for pass in `seq 1 20` ; do \
  sleep 5 ; \
  gluster volume profile your-volume info ; \
done > gvp.log
```

And attach gvp.log to this bz. I have a python script that can reduce gvp.log to a spreadsheet which can show rates for Gluster RPC call types over time (a rough stand-in appears below); we can then see more about how efficiently Gluster handled this workload and where the bottlenecks might be.
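A minimal sketch of the strace comparison described in the thread, assuming the volume is FUSE-mounted at the hypothetical path /mnt/glustervol and bigdir holds the large directory:

```
# Readdir-only listing: bypass any ls alias and pass no options,
# so ls does little more than a readdir on the directory.
strace -c -f \ls /mnt/glustervol/bigdir > /dev/null

# Stat-heavy listing: -l and --color make ls stat each entry,
# which on a Gluster mount costs network round trips.
strace -c -f ls -l --color=always /mnt/glustervol/bigdir > /dev/null

# Compare the "calls" column for lstat/stat/getxattr between the two
# summaries; the second run should show one or more per directory entry.
```

If the second run's wall time grows with the entry count while the first stays nearly flat, the per-entry stat path is the bottleneck, matching the analysis above.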
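The reduction script mentioned above is not attached to this bug. As a rough, hypothetical stand-in, the per-interval call counts for the listing-related FOPs can be pulled out of gvp.log with grep, assuming the usual `gluster volume profile <vol> info` output in which each sample prints an "Interval N Stats" section containing a "No. of calls"/"Fop" table:

```
# Hypothetical stand-in for the reduction script: print each sampling
# interval header plus the rows for FOPs that dominate directory listing.
grep -E 'Interval [0-9]+ Stats|READDIRP|LOOKUP|GETXATTR|STAT' gvp.log
```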
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases: the last two releases before 3.7 are still maintained, at the moment 3.6 and 3.5. This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs". If there is no response by the end of the month, this bug will get closed automatically.

AFAIK this problem has not been fixed, but a fix is feasible if the SMB plugin requests the xattrs it needs in the READDIRPLUS FOP. Apparently the READDIRPLUS FOP does support fetching additional xattrs: http://www.gluster.org/community/documentation/index.php/Features/composite-operations#READDIRPLUS_used_to_prefetch_xattrs

GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release, please reopen this and change the version, or open a new bug.