Bug 1156637 - Gluster small-file creates do not scale with brick count
Summary: Gluster small-file creates do not scale with brick count
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Shyamsundar
QA Contact: Ben Turner
URL:
Whiteboard:
Depends On: 1220064
Blocks: 1202842
 
Reported: 2014-10-24 21:22 UTC by Ben England
Modified: 2015-07-29 04:36 UTC
CC: 9 users

Fixed In Version: glusterfs-3.7.1-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned To: 1219637
Environment:
Last Closed: 2015-07-29 04:36:40 UTC
Embargoed:




Links
Red Hat Product Errata RHSA-2015:1495 (normal, SHIPPED_LIVE)
Summary: Important: Red Hat Gluster Storage 3.1 update
Last Updated: 2015-07-29 08:26:26 UTC

Description Ben England 2014-10-24 21:22:49 UTC
Description of problem:

Gluster small-file creates have negative scalability with brick count: past a point, adding bricks reduces throughput. This prevents Gluster from achieving reasonable small-file create performance with

a) JBOD (just a bunch of disks) configurations and
b) high server counts

Version-Release number of selected component (if applicable):

RHSS 3.0 = glusterfs*-3.6.0.28-1.el6rhs

The client version does not matter.

How reproducible:

Every time.

Steps to Reproduce:
1. Create a Gluster volume with 2, 4, 8, 16, 32, ... bricks (easy to do with JBOD).
2. Run smallfile or a similar workload from all clients (glusterfs mounts, for example); see the sketch after this list.
3. Measure throughput per brick.
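
For illustration, a minimal sketch of one way to set this up. The host names, brick paths, volume name, and smallfile parameters are placeholders, not taken from the original test:

  # Hypothetical 2-server, 12-brick pure-distribute volume (bash expands b{1..6})
  gluster volume create testvol \
      server1:/bricks/b{1..6}/data server2:/bricks/b{1..6}/data
  gluster volume start testvol

  # On each client: mount the volume and run the smallfile benchmark
  mount -t glusterfs server1:/testvol /mnt/testvol
  python smallfile_cli.py --operation create --threads 8 \
      --files 10000 --file-size 64 --top /mnt/testvol/smf

  # Throughput per brick = aggregate files/sec divided by brick count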

Note: Dan Lambright and I were able to take this testing out to 84 bricks using virtual machines, with a single disk drive as the brick for each VM and 3 GB of RAM plus 2 CPU cores per VM. We made sure replicas were on different physical machines. See the article linked under Additional info for details.

Actual results:

At some point throughput levels off and then declines as brick count increases. However, with the Gluster volume parameter cluster.lookup-unhashed set to off instead of its default value of on, throughput continues to increase, though perhaps not linearly.

A dangerous workaround is "gluster v set your-volume cluster.lookup-unhashed off", but if you do this you may lose data.
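
For reference, applying and later reverting the workaround looks like this (the volume name is a placeholder):

  # Dangerous: skips the exhaustive per-brick lookup; a stale layout can
  # then route creates to the wrong brick, which is how data can be lost
  gluster volume set your-volume cluster.lookup-unhashed off
  # Revert to the safe default
  gluster volume set your-volume cluster.lookup-unhashed on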

Expected results:

Throughput should scale linearly with brick count, assuming the number of bricks per server is small.

Additional info:

https://mojo.redhat.com/people/bengland/blog/2014/04/30/gluster-scalability-test-results-using-virtual-machine-servers

(Internal link; available upon request for Red-Hat-external folks.)

Gluster volume profile output shows that without this tuning, LOOKUP FOP starts to dominate calls and eventually %latency as well.  For example, with just 2 servers and 6 RAID6 bricks/server in a 1-replica volume, we get something like this:

Interval 2 Stats:
   Block Size:              65536b+ 
 No. of Reads:                    0 
No. of Writes:                 4876 
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           4881      FORGET
      0.00       0.00 us       0.00 us       0.00 us           4876     RELEASE
      0.08      46.11 us      18.00 us     208.00 us            160      STATFS
      0.44      37.75 us      14.00 us     536.00 us           1081        STAT
      1.54      29.12 us       6.00 us    1070.00 us           4876       FLUSH
      8.44     160.01 us      80.00 us     935.00 us           4876       WRITE
     14.74     279.62 us     126.00 us    2729.00 us           4877      CREATE
     74.76     100.29 us      33.00 us    2708.00 us          68948      LOOKUP
 
    Duration: 10 seconds
   Data Read: 0 bytes
Data Written: 319553536 bytes
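
For reference, output like the above comes from the volume profile facility (volume name is a placeholder):

  gluster volume profile your-volume start
  # ... run the small-file workload ...
  gluster volume profile your-volume info   # prints cumulative and interval stats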

The number of LOOKUP FOPs (68948) is approximately 14 times the number of CREATE FOPs (4877). This makes sense: the volume has 12 DHT subvolumes, and DHT checks every one of them for an existing file of that name before it issues a CREATE, so each create costs roughly 12 fan-out LOOKUPs plus a couple of LOOKUPs on the path itself (68948 / 4877 ≈ 14.1). However, this should not be necessary if the DHT layout has not changed since volume creation or the last rebalance.

Jeff Darcy has written a patch at https://review.gluster.org/#/c/7702/ that tries to make cluster.lookup-unhashed=auto a safe default: we skip the exhaustive per-file LOOKUPs on every brick unless the layout has changed, and in that case we can get back to the fast state by doing a rebalance (did I capture the behavior correctly?).
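
Assuming the patch is merged as described, enabling the proposed behavior would presumably look like this (the auto value is per the patch under review, not something in a shipped release at the time of this comment):

  # 'auto' skips the exhaustive per-brick LOOKUP unless the directory
  # layout has changed since the last rebalance
  gluster volume set your-volume cluster.lookup-unhashed auto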

Comment 2 Ben England 2014-11-18 12:18:42 UTC
Sadly, there is another piece to the scalability puzzle for SMB/RHS. When a small-file create is done there, we see a GETXATTR for user.glusterfs.get_real_filename:my-filename.type sent to all servers. A protocol trace and annotation are available as the .tcpdump.gz and .odp files, respectively, at:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/insignia/

1-GbE.tcpdump.gz - the tcpdump, viewable with Wireshark
day-in-the-life-smb-gluster.odp - the annotation

I am guessing that this is how we avoid doing a READDIRPLUS per file to handle Windows case-insensitivity: ask each brick to check for us. There is no way to know which brick the file is on if we are case-insensitive, since consistent hashing does not map the filename to upper or lower case before hashing. Could this be a solution: a volume option that maps filenames to a single case before hashing? IMHO, a lot of users would not care if they were prevented from creating files FOO.BAR and foo.bar in the same directory, even Linux users, and would consider this a small price to pay for being able to use shared storage.
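
As a rough illustration of the case-folding idea (cksum stands in for DHT's real hash function; this is not how the code works today):

  # Fold the name before hashing so FOO.BAR and foo.bar would land on
  # the same DHT subvolume; cksum is only a stand-in for the DHT hash
  name="FOO.BAR"
  folded=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
  printf '%s' "$folded" | cksum   # identical output for FOO.BAR and foo.bar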

Comment 3 Shyamsundar 2015-04-06 14:23:08 UTC
Current work is to get the patch posted by Jeff merged ( https://review.gluster.org/#/c/7702/ ). This addresses the problem as posted in the description.

I think we may need to fork comment #2 into its own bug, as it does not fall under the same DHT lookup issue.

Comment 5 Shyamsundar 2015-05-10 14:45:30 UTC
Upstream 3.7 has this feature merged, so RHGS 3.1 should get it when upstream is pulled downstream for the release.

Comment 10 Ben England 2015-05-15 11:33:46 UTC
Any backwards compatibility issues here?  For example, if you upgrade server first, but not all clients get upgraded, what will happen?  Or if you upgrade clients first, will this be ok?

Comment 11 Shyamsundar 2015-05-15 12:57:15 UTC
@Ben, bug #1221747 captures the backward-compatibility issues that you detail in comment #10; work is in progress upstream to address them.

Comment 12 Amit Chaurasia 2015-06-12 06:21:06 UTC
While a rebalance is in progress, stat on the files returns errors and the files are not accessible.

They come back to normal when the rebalance is finished. Also, I saw that one file was corrupted.

This seems to be an issue and needs to be looked into, unless it is documented behavior.

Marking this bug as failed.

Comment 13 Amit Chaurasia 2015-06-15 09:23:08 UTC
This BZ is being superseded by bug 1222053, as the lookup-unhashed option is being deprecated.

Will execute the same test cases with the new lookup-optimize option.

Comment 14 Shyamsundar 2015-06-15 14:00:47 UTC
The feature/fix provided for this bug is under the lookup-optimize switch, not the older lookup-unhashed switch. As a result, please verify against that option (as noted in comment #13). Moving this back to QE.

The feature also requires documentation; refer to this commit for the same: http://review.gluster.org/#/c/11109/
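
For verification, enabling the shipped switch should look like this (volume name is a placeholder):

  # lookup-optimize replaces the deprecated lookup-unhashed tuning
  gluster volume set your-volume cluster.lookup-optimize on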

Comment 17 Ben Turner 2015-07-15 19:05:03 UTC
Verified on glusterfs-3.7.1-9.el6rhs.x86_64.  The numbers are:

3.0.4 - 1480 files / sec
3.1 default - 1600 files / sec
3.1 + cluster.lookup-optimize - 2003 files / sec

Comment 18 Ben Turner 2015-07-15 22:18:21 UTC
Just to note, the average with client and server event threads set to 4 and lookup-optimize enabled:

2146 files / second
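
For reference, that tuning maps to these volume options (volume name is a placeholder; option names as in glusterfs 3.7):

  gluster volume set your-volume client.event-threads 4
  gluster volume set your-volume server.event-threads 4
  gluster volume set your-volume cluster.lookup-optimize on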

Comment 20 errata-xmlrpc 2015-07-29 04:36:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

