Description of problem:
Importing a 128-brick cluster consisting of 16 hosts with 8 bricks per host takes >30 seconds.

Version-Release number of selected component (if applicable): cb12

How reproducible: Always

Steps to Reproduce:
1. From the gluster CLI, create a 16-host cluster (via peer probe).
2. Create two volumes with 64 bricks each (8 hosts, 8 bricks per host).
3. From the UI, create a new cluster using Cluster Import. Measure how long it takes to import the cluster (start when clicking OK in the Add Host dialog; stop when all hosts transition to the Up state).

Actual results:
Expected results:
Additional info:
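The reproduction steps above can be sketched as a shell script. Host names and brick paths are hypothetical; the commands are built into strings rather than executed, so the layout can be inspected before running it against a real cluster.

```shell
#!/bin/bash
# Hypothetical 16-host cluster (assumed names).
HOSTS=(node{1..16}.example.com)

# Step 1: peer probe the other 15 hosts from the first one.
probe_cmds=()
for h in "${HOSTS[@]:1}"; do
    probe_cmds+=("gluster peer probe $h")
done

# Step 2: build a 64-brick volume-create command
# (8 consecutive hosts, 8 bricks per host).
make_volume() {
    local name=$1 start=$2 h b
    local bricks=()
    for h in "${HOSTS[@]:$start:8}"; do
        for b in {1..8}; do
            bricks+=("$h:/data/brick$b/$name")
        done
    done
    printf 'gluster volume create %s %s\n' "$name" "${bricks[*]}"
}

vol1_cmd=$(make_volume vol1 0)   # hosts 1-8
vol2_cmd=$(make_volume vol2 8)   # hosts 9-16

echo "generated ${#probe_cmds[@]} probe commands and 2 volume-create commands"
```

On a real cluster each generated string would be run (or piped to sh) from one of the peers, and the import timed from the UI as described in step 3.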
I reproduced this bug in a cluster consisting of 84 virtual servers, 1 brick/server, and it took 70 seconds per "add-brick" command.
Created attachment 890202 [details]
tcpdump on virtual server where command was issued

To see the problem, use the Wireshark filter "gluster.cli" and look at the difference in time between the ADD_BRICK Call and ADD_BRICK Reply messages. Next, use the filter "glusterd.mgmt" and you can watch how glusterd implements the request to add a brick. It is not clear why this takes so long.
Will be interesting to see if http://review.gluster.org/7370 helps this issue.
If someone can get me a Gluster build with the patch in comment 4, I can deploy it and try it right away.

I reassigned this bug from RHSC to the gluster server component because the problem occurs with just the "gluster volume" command; it has nothing to do with RHSC.

BTW, it takes a full minute or more to execute a "gluster volume stop" on a non-existent volume. What's worse, you can't execute any other gluster command from anywhere else while it's busy. We *have* to fix this to make Gluster a truly scalable filesystem, and specifically to support the RHS limit of 64 servers (which could mean far more than 64 bricks).
I timed a gluster volume create command with 84 bricks (1 brick per server), and it consistently took 2 minutes. I then see errors in subsequent commands on that volume and in mounts.

The question is: is it the server count or the brick count that is causing the problem? If the latter, it is much more serious. For example, with a 60-drives-per-server config, we could easily see 84-brick volumes with 84/5 ~= 16 servers, well within published RHS support limits. I'll try to find out.

To try to make the 84-server case work I:
- re-peer before each volume create
- clear out the log files in /var/log/glusterfs before each volume create (shutting down glusterd beforehand and starting it afterwards), so there is no disk-space exhaustion
- put a 1-second delay between each "gluster peer probe" command
- put a 10-second delay before each "gluster volume" command

The last two items simulate what happens when a person types the commands. I'm timing the gluster volume commands so we can see how elapsed time varies with brick count. That will go into the bz.
+ ssh 172.17.50.61 gluster peer probe 172.17.50.57
peer probe: success
+ sleep 1
+ sleep 10
+ ssh 172.17.50.61 gluster volume create scale replica 2 172.17.50.61:/data/gv0/brick1/scale 172.17.50.76:/data/gv0/brick1/scale 172.17.50.106:/data/gv0/brick1/scale 172.17.50.2:/data/gv0/brick1/scale 172.17.50.16:/data/gv0/brick1/scale 172.17.50.31:/data/gv0/brick1/scale 172.17.50.46:/data/gv0/brick1/scale 172.17.50.62:/data/gv0/brick1/scale 172.17.50.77:/data/gv0/brick1/scale 172.17.50.107:/data/gv0/brick1/scale 172.17.50.3:/data/gv0/brick1/scale 172.17.50.17:/data/gv0/brick1/scale 172.17.50.32:/data/gv0/brick1/scale 172.17.50.47:/data/gv0/brick1/scale 172.17.50.63:/data/gv0/brick1/scale 172.17.50.78:/data/gv0/brick1/scale 172.17.50.108:/data/gv0/brick1/scale 172.17.50.4:/data/gv0/brick1/scale 172.17.50.18:/data/gv0/brick1/scale 172.17.50.33:/data/gv0/brick1/scale 172.17.50.48:/data/gv0/brick1/scale 172.17.50.64:/data/gv0/brick1/scale 172.17.50.79:/data/gv0/brick1/scale 172.17.50.109:/data/gv0/brick1/scale 172.17.50.5:/data/gv0/brick1/scale 172.17.50.19:/data/gv0/brick1/scale 172.17.50.34:/data/gv0/brick1/scale 172.17.50.49:/data/gv0/brick1/scale 172.17.50.65:/data/gv0/brick1/scale 172.17.50.80:/data/gv0/brick1/scale 172.17.50.110:/data/gv0/brick1/scale 172.17.50.6:/data/gv0/brick1/scale 172.17.50.20:/data/gv0/brick1/scale 172.17.50.35:/data/gv0/brick1/scale 172.17.50.50:/data/gv0/brick1/scale 172.17.50.66:/data/gv0/brick1/scale 172.17.50.81:/data/gv0/brick1/scale 172.17.50.111:/data/gv0/brick1/scale 172.17.50.7:/data/gv0/brick1/scale 172.17.50.21:/data/gv0/brick1/scale 172.17.50.36:/data/gv0/brick1/scale 172.17.50.51:/data/gv0/brick1/scale 172.17.50.67:/data/gv0/brick1/scale 172.17.50.82:/data/gv0/brick1/scale 172.17.50.112:/data/gv0/brick1/scale 172.17.50.8:/data/gv0/brick1/scale 172.17.50.22:/data/gv0/brick1/scale 172.17.50.37:/data/gv0/brick1/scale 172.17.50.52:/data/gv0/brick1/scale 172.17.50.68:/data/gv0/brick1/scale 172.17.50.83:/data/gv0/brick1/scale 172.17.50.113:/data/gv0/brick1/scale 172.17.50.9:/data/gv0/brick1/scale 172.17.50.23:/data/gv0/brick1/scale 172.17.50.38:/data/gv0/brick1/scale 172.17.50.53:/data/gv0/brick1/scale 172.17.50.69:/data/gv0/brick1/scale 172.17.50.84:/data/gv0/brick1/scale 172.17.50.114:/data/gv0/brick1/scale 172.17.50.10:/data/gv0/brick1/scale 172.17.50.24:/data/gv0/brick1/scale 172.17.50.39:/data/gv0/brick1/scale 172.17.50.54:/data/gv0/brick1/scale 172.17.50.70:/data/gv0/brick1/scale 172.17.50.85:/data/gv0/brick1/scale 172.17.50.115:/data/gv0/brick1/scale 172.17.50.11:/data/gv0/brick1/scale 172.17.50.25:/data/gv0/brick1/scale 172.17.50.40:/data/gv0/brick1/scale 172.17.50.55:/data/gv0/brick1/scale 172.17.50.71:/data/gv0/brick1/scale 172.17.50.86:/data/gv0/brick1/scale 172.17.50.116:/data/gv0/brick1/scale 172.17.50.12:/data/gv0/brick1/scale 172.17.50.26:/data/gv0/brick1/scale 172.17.50.41:/data/gv0/brick1/scale 172.17.50.56:/data/gv0/brick1/scale 172.17.50.72:/data/gv0/brick1/scale 172.17.50.87:/data/gv0/brick1/scale 172.17.50.117:/data/gv0/brick1/scale 172.17.50.13:/data/gv0/brick1/scale 172.17.50.27:/data/gv0/brick1/scale 172.17.50.42:/data/gv0/brick1/scale 172.17.50.57:/data/gv0/brick1/scale
real 2m0.126s
user 0m0.008s
sys 0m0.001s
+ sleep 10
+ ssh 172.17.50.61 gluster volume set scale cluster.lookup-unhashed off allow-insecure on
volume set: failed: Another transaction is in progress. Please try again after sometime.
>>>>> SEE ERROR IN PREVIOUS LINE <<<<<
real 0m0.581s
user 0m0.008s
sys 0m0.001s
+ sleep 10
+ ssh 172.17.50.61 gluster volume start scale
real 2m0.121s
user 0m0.007s
sys 0m0.003s
+ sleep 10

And then the mount command fails because "allow-insecure on" was not set.
Graph 3 at https://mojo.redhat.com/people/bengland/blog/2014/04/30/gluster-scalability-test-results-using-virtual-machine-servers documents the response time of the gluster volume command vs. the number of servers, and it looks exactly like an O(N^2) curve. Specifically, I used the spreadsheet formula R(N) = R(14)*N*N/(14*14), where N is the number of servers and R(14) is the response time measured at 14 servers, and it predicted exactly the response times that I observed at larger N.

Can someone apply the patch in comment 4 to Denali and get me a build of it, please?

One hypothesis that would explain the behavior: glusterd writes configuration data proportional to the number of bricks in the volume, and with O_SYNC each write is effectively an fsync() to disk. If each of the N glusterds does O(N) synchronous writes and only one glusterd is writing at a time, the total time is O(N^2).
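A small check of the O(N^2) model above. The 14-server baseline value r14 below is hypothetical; substitute your own measurement.

```python
# Model from the blog post: R(N) = R(14) * N**2 / 14**2,
# where R(14) is the measured response time at 14 servers.

def predicted_response(n_servers: int, r14: float = 3.3) -> float:
    """Predicted 'gluster volume' command response time in seconds.

    r14 is a hypothetical measured baseline at 14 servers.
    """
    return r14 * n_servers**2 / 14**2

# Going from 14 to 84 servers multiplies N by 6, so the model predicts
# a 36x slowdown: consistent with ~2-minute volume creates at 84 servers
# if the 14-server time was a few seconds.
ratio = predicted_response(84) / predicted_response(14)
```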
Avati suggested that the problem was the shared storage backing the KVM guests' system disks, which in this case is a single RAID1 pair of disks on the KVM host. So I reformatted the per-KVM-guest disk as an LVM PV (inside the guest) and created 3 logical volumes:
- a 1-GB LV for /var/lib/glusterd
- a 25-GB LV for /var/log/glusterfs (so the system disk in the guest doesn't fill up)
- the remaining 900+ GB for the XFS filesystem containing the brick directory

The result is an under-6-second response time even at 84-server volume size. Response time is still going up, but not even O(N) (see the blog post in the previous comment); it's certainly not increasing O(N^2).
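The layout above can be sketched as a dry-run script. The device name /dev/vdb and the volume-group name "gvg" are hypothetical; run() records and echoes each command instead of executing it, so the plan can be reviewed (change its body to "$@" to apply it for real).

```shell
#!/bin/bash
# Dry-run: record and print each command instead of executing it.
plan=()
run() { plan+=("$*"); echo "+ $*"; }

run pvcreate /dev/vdb                    # hypothetical per-guest disk
run vgcreate gvg /dev/vdb
run lvcreate -L 1G -n glusterd gvg       # small LV for /var/lib/glusterd
run lvcreate -L 25G -n logs gvg          # /var/log/glusterfs, so the root disk can't fill up
run lvcreate -l 100%FREE -n brick gvg    # remaining space for the brick directory
for lv in glusterd logs brick; do
    run mkfs.xfs "/dev/gvg/$lv"
done
```

The point of the split is that glusterd's O_SYNC configuration writes land on a dedicated LV instead of contending with brick and log I/O on the shared RAID1 pair.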
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you asked us to review is now End of Life; please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.