Bug 1044693 - [Scale] gluster volume command can take minutes at 84-server (or is it 84-brick) count
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dusmant
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-12-18 20:14 UTC by Matt Mahoney
Modified: 2016-02-18 00:15 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-03 17:22:44 UTC
Embargoed:


Attachments
tcpdump on virtual server where command was issued (2.04 MB, application/octet-stream)
2014-04-27 11:04 UTC, Ben England

Description Matt Mahoney 2013-12-18 20:14:56 UTC
Description of problem:
Importing a 128-brick cluster consisting of 16 hosts with 8 bricks per host takes >30 seconds.

Version-Release number of selected component (if applicable):
cb12

How reproducible:
Always

Steps to Reproduce:
1. From the gluster CLI, create a 16-host cluster (via peer probe).
2. Create two volumes with 64 bricks each (8 hosts, 8 bricks per host); see the sketch below.
3. From the UI, create a new cluster using Cluster Import and measure how long the import takes (start timing when clicking OK in the Add Host dialog; stop when all hosts transition to the Up state).
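
A minimal sketch of the CLI side of steps 1-2 (hostnames host01..host16 and brick paths /bricks/b1..b8 are hypothetical):

# Step 1: probe the other 15 hosts from host01 to form the trusted pool.
for h in $(seq -f 'host%02g' 2 16); do
    gluster peer probe $h
done

# Step 2: build a 64-brick list over hosts 1-8 (8 bricks each) and create
# the first volume; repeat with hosts 9-16 for the second volume.
bricks=""
for h in $(seq -f 'host%02g' 1 8); do
    for b in $(seq 1 8); do
        bricks="$bricks $h:/bricks/b$b/vol1"
    done
done
gluster volume create vol1 $bricks
gluster volume start vol1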

Actual results:


Expected results:


Additional info:

Comment 2 Ben England 2014-04-27 10:59:08 UTC
I reproduced this bug in a cluster consisting of 84 virtual servers, 1 brick/server, and it took 70 seconds per "add-brick" command.

Comment 3 Ben England 2014-04-27 11:04:51 UTC
Created attachment 890202 [details]
tcpdump on virtual server where command was issued

To see the problem, apply the wireshark filter "gluster.cli" and look at the time difference between the ADD_BRICK Call and ADD_BRICK Reply messages.

Next, use the filter "glusterd.mgmt" to watch how glusterd implements the add-brick request.  It is not clear why this takes so long.
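
The same measurement can be scripted with tshark (a sketch; dump.pcap stands in for the attached capture):

# Print gluster CLI frames with timestamps relative to the start of the
# capture; the gap between the ADD_BRICK Call and its Reply is the
# command latency seen by the CLI.
tshark -r dump.pcap -t r -Y 'gluster.cli'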

Comment 4 Anand Avati 2014-04-30 22:50:38 UTC
It will be interesting to see if http://review.gluster.org/7370 helps this issue.

Comment 5 Ben England 2014-05-09 14:08:17 UTC
If someone can get me a Gluster build with the patch in comment 4, I can deploy and try it right away.

I reassigned this bug from RHSC to the gluster server component because it happens with just the "gluster volume" command and has nothing to do with RHSC.

BTW, it takes a full minute or more to execute a "gluster volume stop" on a non-existent volume.

What's worse, you can't execute any other gluster command from anywhere else while glusterd is busy (a sketch of both problems follows).  We *have* to fix this to make Gluster a truly scalable filesystem, and specifically to support the RHS limit of 64 servers (which could mean far more than 64 bricks).
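
A quick shell sketch of both problems (novol and the existing volume scale are hypothetical names):

# Takes a minute or more even though the volume does not exist
# (--mode=script suppresses the confirmation prompt).
time gluster --mode=script volume stop novol

# Run from another node while the stop is still in flight; it fails
# immediately with glusterd's cluster-lock error:
#   volume set: failed: Another transaction is in progress. Please try again after sometime.
gluster volume set scale cluster.lookup-unhashed off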

Comment 6 Ben England 2014-05-09 20:13:26 UTC
I timed a gluster volume create command with 84 bricks, 1 brick/server, and it consistently took 2 minutes. I then see errors in subsequent commands on that volume and in mounts.  The question is whether the server count or the brick count is causing the problem; if it is the latter, the problem is much more serious.  For example, with a 60-drives-per-server config we could easily see 84-brick volumes with 84/5 ~= 16 servers, well within published RHS support limits.  I'll try to find out.

To try to make the 84-server case work, I:

- re-peer before each volume create
- clear out the log files in /var/log/glusterfs before each volume create (shutting glusterd down beforehand and starting it afterward) so there is no disk-space exhaustion
- put a 1-second delay between each "gluster peer probe" command
- put a 10-second delay before each "gluster volume" command

The last two items simulate what happens when a person types the commands.

I'm timing the gluster volume commands so we can see how elapsed time varies with brick count; that data will go into the bz.  A condensed sketch of the driver loop follows, with the actual trace after it.
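
A condensed sketch of the driver loop (SERVERS is the 84-address list, elided here; 172.17.50.61 is the node where commands are issued):

HEAD=172.17.50.61
# Probe each peer with a 1-second delay, as a human operator would.
for s in $SERVERS; do
    ssh $HEAD gluster peer probe $s
    sleep 1
done
# Build the 84-brick list, pause 10 seconds, and time the create.
bricks=""
for s in $SERVERS; do bricks="$bricks $s:/data/gv0/brick1/scale"; done
sleep 10
time ssh $HEAD gluster volume create scale replica 2 $bricks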

....

+ ssh 172.17.50.61 gluster peer probe 172.17.50.57
peer probe: success
+ sleep 1
+ sleep 10
+ ssh 172.17.50.61 gluster volume create scale replica 2 172.17.50.61:/data/gv0/brick1/scale 172.17.50.76:/data/gv0/brick1/scale 172.17.50.106:/data/gv0/brick1/scale 172.17.50.2:/data/gv0/brick1/scale 172.17.50.16:/data/gv0/brick1/scale 172.17.50.31:/data/gv0/brick1/scale 172.17.50.46:/data/gv0/brick1/scale 172.17.50.62:/data/gv0/brick1/scale 172.17.50.77:/data/gv0/brick1/scale 172.17.50.107:/data/gv0/brick1/scale 172.17.50.3:/data/gv0/brick1/scale 172.17.50.17:/data/gv0/brick1/scale 172.17.50.32:/data/gv0/brick1/scale 172.17.50.47:/data/gv0/brick1/scale 172.17.50.63:/data/gv0/brick1/scale 172.17.50.78:/data/gv0/brick1/scale 172.17.50.108:/data/gv0/brick1/scale 172.17.50.4:/data/gv0/brick1/scale 172.17.50.18:/data/gv0/brick1/scale 172.17.50.33:/data/gv0/brick1/scale 172.17.50.48:/data/gv0/brick1/scale 172.17.50.64:/data/gv0/brick1/scale 172.17.50.79:/data/gv0/brick1/scale 172.17.50.109:/data/gv0/brick1/scale 172.17.50.5:/data/gv0/brick1/scale 172.17.50.19:/data/gv0/brick1/scale 172.17.50.34:/data/gv0/brick1/scale 172.17.50.49:/data/gv0/brick1/scale 172.17.50.65:/data/gv0/brick1/scale 172.17.50.80:/data/gv0/brick1/scale 172.17.50.110:/data/gv0/brick1/scale 172.17.50.6:/data/gv0/brick1/scale 172.17.50.20:/data/gv0/brick1/scale 172.17.50.35:/data/gv0/brick1/scale 172.17.50.50:/data/gv0/brick1/scale 172.17.50.66:/data/gv0/brick1/scale 172.17.50.81:/data/gv0/brick1/scale 172.17.50.111:/data/gv0/brick1/scale 172.17.50.7:/data/gv0/brick1/scale 172.17.50.21:/data/gv0/brick1/scale 172.17.50.36:/data/gv0/brick1/scale 172.17.50.51:/data/gv0/brick1/scale 172.17.50.67:/data/gv0/brick1/scale 172.17.50.82:/data/gv0/brick1/scale 172.17.50.112:/data/gv0/brick1/scale 172.17.50.8:/data/gv0/brick1/scale 172.17.50.22:/data/gv0/brick1/scale 172.17.50.37:/data/gv0/brick1/scale 172.17.50.52:/data/gv0/brick1/scale 172.17.50.68:/data/gv0/brick1/scale 172.17.50.83:/data/gv0/brick1/scale 172.17.50.113:/data/gv0/brick1/scale 172.17.50.9:/data/gv0/brick1/scale 172.17.50.23:/data/gv0/brick1/scale 172.17.50.38:/data/gv0/brick1/scale 172.17.50.53:/data/gv0/brick1/scale 172.17.50.69:/data/gv0/brick1/scale 172.17.50.84:/data/gv0/brick1/scale 172.17.50.114:/data/gv0/brick1/scale 172.17.50.10:/data/gv0/brick1/scale 172.17.50.24:/data/gv0/brick1/scale 172.17.50.39:/data/gv0/brick1/scale 172.17.50.54:/data/gv0/brick1/scale 172.17.50.70:/data/gv0/brick1/scale 172.17.50.85:/data/gv0/brick1/scale 172.17.50.115:/data/gv0/brick1/scale 172.17.50.11:/data/gv0/brick1/scale 172.17.50.25:/data/gv0/brick1/scale 172.17.50.40:/data/gv0/brick1/scale 172.17.50.55:/data/gv0/brick1/scale 172.17.50.71:/data/gv0/brick1/scale 172.17.50.86:/data/gv0/brick1/scale 172.17.50.116:/data/gv0/brick1/scale 172.17.50.12:/data/gv0/brick1/scale 172.17.50.26:/data/gv0/brick1/scale 172.17.50.41:/data/gv0/brick1/scale 172.17.50.56:/data/gv0/brick1/scale 172.17.50.72:/data/gv0/brick1/scale 172.17.50.87:/data/gv0/brick1/scale 172.17.50.117:/data/gv0/brick1/scale 172.17.50.13:/data/gv0/brick1/scale 172.17.50.27:/data/gv0/brick1/scale 172.17.50.42:/data/gv0/brick1/scale 172.17.50.57:/data/gv0/brick1/scale

real    2m0.126s
user    0m0.008s
sys     0m0.001s
+ sleep 10
+ ssh 172.17.50.61 gluster volume set scale cluster.lookup-unhashed off allow-insecure on
volume set: failed: Another transaction is in progress. Please try again after sometime.

>>>>>******** SEE ERROR IN PREVIOUS LINE************* <<<<<

real    0m0.581s
user    0m0.008s
sys     0m0.001s
+ sleep 10
+ ssh 172.17.50.61 gluster volume start scale

real    2m0.121s
user    0m0.007s
sys     0m0.003s
+ sleep 10

And then the mount command fails because allow-insecure on was never applied (the "volume set" above failed).

Comment 7 Ben England 2014-05-12 19:51:13 UTC
https://mojo.redhat.com/people/bengland/blog/2014/04/30/gluster-scalability-test-results-using-virtual-machine-servers

Graph 3 there documents the response time of the gluster volume command vs. the number of servers;

it looks exactly like an O(N^2) curve.  Specifically, I used the spreadsheet formula

R(N) = R(14)*N*N/(14*14)

where N is the number of servers and R(N) is the response time measured for that number of servers; it predicted exactly the response times that I observed.
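
Worked example using the 84-server measurement from comment 6: R(84) = (84/14)^2 * R(14) = 36 * R(14), so the observed 120-second volume create implies R(14) ~= 3.3 seconds.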

Can someone apply the patch in comment 4 to Denali and get me a build of it, please?  One possible explanation: glusterd writes configuration data proportional to the number of bricks, and with O_SYNC each write is effectively an fsync() to disk.  If only one glusterd is writing at a time, the total work is (N servers) x (configuration writes proportional to N bricks), which would explain the O(N^2) behavior.
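
One way to test the write-count hypothesis (a sketch; attach to glusterd on one server while a volume command runs elsewhere):

# Count write/fsync syscalls made by glusterd during one volume create;
# repeating at different brick counts shows how the count grows.
strace -f -c -e trace=write,fsync -p $(pidof glusterd) &
STRACE_PID=$!
gluster volume create ...    # the command under test (arguments elided)
kill -INT $STRACE_PID        # strace detaches and prints its summary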

Comment 8 Ben England 2014-05-15 00:00:33 UTC
Avati suggested that the problem was the shared storage used to back the KVM guests' system disks, which in this case is a single RAID1 pair of disks on the KVM host.  So I reformatted the per-KVM-guest disk as an LVM PV (inside the guest) and created 3 logical volumes (sketched below):

- a 1-GB LV for /var/lib/glusterd
- a 25-GB LV for /var/log/glusterfs (so the system disk in the guest doesn't fill up)
- the remaining 900+ GB for the XFS filesystem containing the brick directory
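
A minimal sketch of that layout, assuming the guest's data disk shows up as /dev/vdb (hypothetical device name):

# Carve the guest disk into a PV/VG and the three LVs described above.
pvcreate /dev/vdb
vgcreate gluster /dev/vdb
lvcreate -L 1G -n glusterd gluster
lvcreate -L 25G -n logs gluster
lvcreate -l 100%FREE -n brick gluster

# Format each LV (XFS assumed here) and mount it; stopping glusterd and
# migrating the existing contents of /var/lib/glusterd and
# /var/log/glusterfs are omitted here.
for lv in glusterd logs brick; do mkfs -t xfs /dev/gluster/$lv; done
mount /dev/gluster/glusterd /var/lib/glusterd
mount /dev/gluster/logs /var/log/glusterfs
mount /dev/gluster/brick /data/gv0/brick1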

The result is an under-6-second response time even at the 84-server volume size.  Response time still rises with server count, but more slowly than O(N) (see the blog post in the previous comment); it is certainly no longer O(N^2).

Comment 9 Vivek Agarwal 2015-12-03 17:22:44 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you asked us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.

