Bug 1219637 - Gluster small-file creates do not scale with brick count
Summary: Gluster small-file creates do not scale with brick count
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Shyamsundar
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1220064
 
Reported: 2015-05-07 20:44 UTC by Shyamsundar
Modified: 2016-06-16 12:59 UTC
CC: 9 users

Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Clone Of: 1156637
Clones: 1220064
Environment:
Last Closed: 2016-06-16 12:59:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Shyamsundar 2015-05-07 20:44:52 UTC
+++ This bug was initially created as a clone of Bug #1156637 +++

Description of problem:

Gluster small-file creates show negative scalability with brick count: past a certain point, adding bricks reduces aggregate create throughput. This prevents Gluster from achieving reasonable small-file create performance with

a) JBOD (just a bunch of disks) configurations and
b) high server counts

How reproducible:

Every time.

Steps to Reproduce:
1.  Create a Gluster volume with 2, 4, 8, 16, 32 ... bricks (easy to do with JBOD)
2.  Run the smallfile benchmark or a similar workload from all clients (over a glusterfs mount, for example)
3.  Measure throughput per brick (see the sketch below)
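
A minimal sketch of these steps, assuming two servers named server1/server2 with JBOD bricks under /bricks and the smallfile benchmark on each client (all names, paths and smallfile flags here are illustrative and may differ by version):

  # create a plain distribute volume, one brick per disk; repeat with 2, 4, 8, ... bricks
  gluster volume create smalltest server1:/bricks/b1 server1:/bricks/b2 \
                                  server2:/bricks/b1 server2:/bricks/b2
  gluster volume start smalltest

  # on every client: mount the volume and drive a small-file create workload
  mount -t glusterfs server1:/smalltest /mnt/smalltest
  python smallfile_cli.py --operation create --threads 4 --files 10000 \
      --file-size 64 --top /mnt/smalltest

  # per-brick throughput = aggregate files/sec reported by smallfile / number of bricks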

Note: Dan Lambright and I were able to take this testing out to 84 bricks using virtual machines, with a single disk drive as the brick for each VM and 3 GB of RAM + 2 CPU cores per VM.  We made sure replicas were on different physical machines.  See the article below for details.

Actual results:

At some point throughput levels off and then starts to decline as the brick count is increased.  However, with the volume parameter cluster.lookup-unhashed set to off instead of the default of on, throughput continues to increase, though perhaps not linearly.

A dangerous workaround is "gluster v set your-volume cluster.lookup-unhashed off", but if you do this you may lose data, because files that are not on their hashed subvolume (for example after an add-brick, before a rebalance completes) can be missed.
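
For reference, applying and later reverting that workaround looks roughly like this (volume name is a placeholder):

  # dangerous: skips the broadcast LOOKUP, so files not on their hashed subvolume can be missed
  gluster volume set your-volume cluster.lookup-unhashed off

  # revert to the safe default
  gluster volume reset your-volume cluster.lookup-unhashed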

Expected results:

Throughput should scale linearly with brick count, assuming number of bricks/server is small.

Additional info:

https://mojo.redhat.com/people/bengland/blog/2014/04/30/gluster-scalability-test-results-using-virtual-machine-servers

(Red Hat internal; for Red-Hat-external folks it is available upon request.)

Gluster volume profile output shows that without this tuning, the LOOKUP FOP starts to dominate the call count and eventually the %-latency as well.  For example, with just 2 servers and 6 RAID6 bricks/server in a 1-replica volume, we get something like this:

Interval 2 Stats:
   Block Size:              65536b+ 
 No. of Reads:                    0 
No. of Writes:                 4876 
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           4881      FORGET
      0.00       0.00 us       0.00 us       0.00 us           4876     RELEASE
      0.08      46.11 us      18.00 us     208.00 us            160      STATFS
      0.44      37.75 us      14.00 us     536.00 us           1081        STAT
      1.54      29.12 us       6.00 us    1070.00 us           4876       FLUSH
      8.44     160.01 us      80.00 us     935.00 us           4876       WRITE
     14.74     279.62 us     126.00 us    2729.00 us           4877      CREATE
     74.76     100.29 us      33.00 us    2708.00 us          68948      LOOKUP
 
    Duration: 10 seconds
   Data Read: 0 bytes
Data Written: 319553536 bytes

The number of LOOKUP FOPs is approximately 14 times the number of CREATE FOPs (68948 / 4877 ≈ 14.1), which makes sense because there are 12 DHT subvolumes and DHT checks every one of them for an existing file with that name before it issues the CREATE.  However, this shouldn't be necessary if the DHT layout hasn't changed since volume creation or the last rebalance.
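
The per-FOP statistics above come from Gluster's volume profiling support, gathered roughly like this (volume name is a placeholder):

  # enable per-brick FOP accounting, then dump cumulative and per-interval stats
  gluster volume profile your-volume start
  gluster volume profile your-volume info
  gluster volume profile your-volume stop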

Jeff Darcy has written a patch at https://review.gluster.org/#/c/7702/ to try to make cluster.lookup-unhashed=auto a safe default, so that we don't have to do exhaustive per-file LOOKUPs on every brick unless the layout has changed; in that case we can get back to a good state by doing a rebalance (did I capture the behavior correctly?).
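
If the patch behaves as described, the intended admin workflow would be roughly the following (volume and brick names are placeholders; this is my reading of the change, not its documentation):

  # let DHT skip the broadcast LOOKUP while the layout is known to be unchanged
  gluster volume set your-volume cluster.lookup-unhashed auto

  # after a layout change, a full rebalance (not just fix-layout) brings the
  # directory state back in sync so the broadcast LOOKUP can be skipped again
  gluster volume add-brick your-volume server3:/bricks/b1
  gluster volume rebalance your-volume start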

Comment 1 Anand Avati 2015-05-07 20:46:18 UTC
REVIEW: http://review.gluster.org/7702 (dht: make lookup-unhashed=auto do something actually useful) posted (#8) for review on master by Shyamsundar Ranganathan (srangana)

Comment 2 Anand Avati 2015-05-08 19:16:32 UTC
REVIEW: http://review.gluster.org/7702 (dht: make lookup-unhashed=auto do something actually useful) posted (#9) for review on master by Shyamsundar Ranganathan (srangana)

Comment 3 Anand Avati 2015-05-10 13:17:46 UTC
COMMIT: http://review.gluster.org/7702 committed in master by Vijay Bellur (vbellur) 
------
commit 4eaaf5188fe24a4707dc2cf2934525083cf8e64f
Author: Jeff Darcy <jdarcy>
Date:   Wed May 7 19:31:30 2014 +0000

    dht: make lookup-unhashed=auto do something actually useful
    
    The key concept here is to determine whether a directory is "clean" by
    comparing its last-known-good topology to the current one for the
    volume.  These are stored as "commit hashes" on the directory and the
    volume root respectively.  The volume's commit hash changes whenever a
    brick is added or removed, and a fix-layout is done.  A directory's
    commit hash changes only when a full rebalance (not just fix-layout)
    is done on it.  If all bricks are present and have a directory
    commit hash that matches the volume commit hash, then we can assume
    that every file is in its "proper" place. Therefore, if we look for
    a file in that proper place and don't find it, we can assume it's not
    on any other subvolume and *safely* skip the global (broadcast to all)
    lookup.
    
    Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3
    BUG: 1219637
    Signed-off-by: Jeff Darcy <jdarcy>
    Signed-off-by: Shyam <srangana>
    Reviewed-on: http://review.gluster.org/7702
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System
    Reviewed-by: Vijay Bellur <vbellur>

Comment 4 Nagaprasad Sathyanarayana 2015-10-25 15:05:58 UTC
The fix for this BZ is already present in a GlusterFS release. A clone of this BZ was fixed in a GlusterFS release and has been closed. Hence this mainline BZ is being closed as well.

Comment 5 Niels de Vos 2016-06-16 12:59:33 UTC
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

