Description of problem:
[Volume Scale Testing for Aplo] Volume start failed with "Error : Request timed out" after successfully creating and starting around 290 gluster volumes using heketi-cli.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5.el7rhgs.x86_64
glusterfs-server-3.7.9-5.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a gluster cluster of 4 RHGS 3.1.3 nodes with glusterd.service MemoryLimit=32G.
2. Using heketi-cli, create and start around 300 gluster volumes in a loop (an expanded version of this loop, with per-volume failure logging, is sketched after these steps):
   for i in {1..300}; do heketi-cli volume create --name=vol$i --size=10 --durability="replicate" --replica=3; done
3. Check the command output and the heketi logs.
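A sketch of the step-2 loop with basic failure logging added, so the first failing volume (vol291 in this run) is easy to identify. The log file path and the exit-status handling are illustrative additions for this sketch and were not part of the original test:

#!/bin/bash
# Sketch: same create loop as step 2, but record per-volume failures.
# The log path below is an assumption made for this sketch.
LOG=/tmp/heketi-scale-test.log

for i in {1..300}; do
    if ! heketi-cli volume create --name=vol$i --size=10 \
            --durability="replicate" --replica=3 >>"$LOG" 2>&1; then
        echo "vol$i: create/start failed, see $LOG" >&2
    fi
done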
Actual results:
The first 290 volumes were created and started successfully in all three volume scale tests run so far. From the 291st volume onwards, however, operations start failing randomly. In the latest test, vol291 was created successfully but failed to start.

From heketi logs:
######
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster volume start vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was: ()]: Stdout [Error : Request timed out
May 18 19:01:22 dhcp42-197 heketi: ]: Stderr []
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume stop vol291 force] on 10.70.42.222:22: Err[Process exited with: 1. Reason was: ()]: Stdout []: Stderr [volume stop: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:146: Unable to stop volume vol291: Process exited with: 1. Reason was: ()
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume delete vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was: ()]: Stdout []: Stderr [volume delete: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol291: Process exited with: 1. Reason was: ()
######

From gluster logs:
######
[2016-05-18 18:51:32.119103] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:32.119140] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
[2016-05-18 18:51:32.305741] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:48.789082] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:48.844338] E [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x426) [0x7fc7e9e61006] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:55.458451] I [socket.c:3508:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2016-05-18 18:51:55.458465] E [rpcsvc.c:1314:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
[2016-05-18 18:51:55.458487] E [MSGID: 106430] [glusterd-utils.c:474:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2016-05-18 18:53:14.606645] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_commit_hook+0x195) [0x7fc7e9dccdc5] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=vol292 --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
[2016-05-18 18:53:14.606823] I [MSGID: 106062] [glusterd-brick-ops.c:1276:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2016-05-18 18:51:32.305769] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
######

Expected results:
Gluster should support scaling to this number of volumes; volume creation and start should continue to succeed.
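For anyone debugging a similar "Another transaction is in progress" failure: the UUID in the "Lock for vol291 held by ..." log message identifies the node holding the stale mgmt_v3 lock. A minimal sketch of how to map that UUID to a cluster node (the UUID value is taken from the log above; run the commands on any node in the cluster):

#!/bin/bash
# Map the lock-holder UUID from the glusterd log to a cluster node.
LOCK_OWNER=4b494dd7-09e6-4d7d-8834-218534548912
echo "looking for lock owner: $LOCK_OWNER"

# UUID of the local node.
grep '^UUID' /var/lib/glusterd/glusterd.info

# Hostnames and UUIDs of the remaining peers; compare against LOCK_OWNER.
gluster peer status | grep -E '^(Hostname|Uuid)'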
The RCA for this is the same as for BZ 1336267.
http://review.gluster.org/#/c/14849/ fixes this issue too.
Upstream mainline: http://review.gluster.org/14849
Upstream 3.8: http://review.gluster.org/14860

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
The reported issue seems to be fixed in glusterfs-3.8.4: I was able to scale gluster volumes using heketi-cli even beyond 300 volumes without any issues.

############
# gluster --version
glusterfs 3.8.4 built on Feb 20 2017 03:15:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

# heketi-cli volume list | wc -l
500

# gluster volume list | wc -l
500

# gluster volume info vol500
Volume Name: vol500
Type: Replicate
Volume ID: 72c9be57-3284-4e12-8449-84889a782c23
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.163:/var/lib/heketi/mounts/vg_814d6fa82429363cf09aaafe5ba7d850/brick_1f6b1df64468e950ee64aebe57f498d9/brick
Brick2: 10.70.46.150:/var/lib/heketi/mounts/vg_60357bad265972bb79fd3155e6d473fa/brick_d14fc99faef482004d606480d405df7d/brick
Brick3: 10.70.47.161:/var/lib/heketi/mounts/vg_893e765b342697a5086bf56d58332501/brick_3f4e9c68a58d6435ab377d85d08f90ff/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
############
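The verification above can also be scripted end to end. A minimal sketch, assuming the same heketi-cli/gluster setup as in the test above (the volume count and size are illustrative):

#!/bin/bash
# Verification sketch: create 500 volumes via heketi-cli, then confirm that
# heketi and gluster report the same number of volumes.
for i in {1..500}; do
    heketi-cli volume create --name=vol$i --size=10 \
        --durability="replicate" --replica=3 || { echo "vol$i failed" >&2; exit 1; }
done

heketi_count=$(heketi-cli volume list | wc -l)
gluster_count=$(gluster volume list | wc -l)
echo "heketi: $heketi_count volumes, gluster: $gluster_count volumes"
[ "$heketi_count" -eq "$gluster_count" ] && echo "volume counts match"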
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html