Bug 1337477 - [Volume Scale] Volume start failed with "Error : Request timed out" after successfully creating & starting around 290 gluster volumes using heketi-cli
Summary: [Volume Scale] Volume start failed with "Error : Request timed out" after suc...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: RHGS 3.2.0
Assignee: Atin Mukherjee
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks: 1351522
TreeView+ depends on / blocked
 
Reported: 2016-05-19 09:43 UTC by Prasanth
Modified: 2017-03-23 05:31 UTC (History)
10 users (show)

Fixed In Version: glusterfs-3.8.4-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 05:31:54 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC
Red Hat Bugzilla 1336267 None None None Never
Red Hat Bugzilla 1337836 None None None Never

Internal Links: 1336267 1337836

Description Prasanth 2016-05-19 09:43:32 UTC
Description of problem:

[Volume Scale Testing for Aplo] 
Volume start failed with "Error : Request timed out" after successfully creating & starting around290 gluster volumes using heketi-cli

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5.el7rhgs.x86_64
glusterfs-server-3.7.9-5.el7rhgs.x86_64

How reproducible: Always


Steps to Reproduce:
1. A gluster cluster of 4 RHGS 3.1.3 nodes having glusterd.service MemoryLimit=32G
2. Using heketi-cli, try to create and start around 300 gluster volumes in a loop
for i in {1..300}; do heketi-cli volume create --name=vol$i --size=10 --durability="replicate" --replica=3; done
3. Check for command output and heketi logs

Actual results: Creation of 290 volumes was successful without any issues in all the 3 volume scale testing done so far. However, it was noticed that from the 291th volume onwards, it starts failing randomly. In the latest testing, vol291 was created successfully but failed to start. 

From Heketi logs:

######
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster volume start vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout [Error : Request timed out
May 18 19:01:22 dhcp42-197 heketi: ]: Stderr []
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume stop vol291 force] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume stop: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:146: Unable to stop volume vol291: Process exited with: 1. Reason was:  ()
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume delete vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume delete: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol291: Process exited with: 1. Reason was:  ()
######

From gluster logs:

######
[2016-05-18 18:51:32.119103] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:32.119140] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
[2016-05-18 18:51:32.305741] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:48.789082] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:48.844338] E [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x426) [0x7fc7e9e61006] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:55.458451] I [socket.c:3508:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2016-05-18 18:51:55.458465] E [rpcsvc.c:1314:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
[2016-05-18 18:51:55.458487] E [MSGID: 106430] [glusterd-utils.c:474:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2016-05-18 18:53:14.606645] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_commit_hook+0x195) [0x7fc7e9dccdc5] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=vol292 --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
[2016-05-18 18:53:14.606823] I [MSGID: 106062] [glusterd-brick-ops.c:1276:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2016-05-18 18:51:32.305769] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
######

Expected results: Scaling of volumes should be supported by gluster and volume creation should work fine

Comment 2 Atin Mukherjee 2016-05-20 12:33:12 UTC
RCA for this goes same as BZ 1336267

Comment 5 Atin Mukherjee 2016-07-08 04:22:16 UTC
http://review.gluster.org/#/c/14849/ fixes this issue too.

Comment 7 Atin Mukherjee 2016-09-17 14:26:45 UTC
Upstream mainline : http://review.gluster.org/14849
Upstream 3.8 : http://review.gluster.org/14860

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 13 Prasanth 2017-02-28 07:06:22 UTC
The reported issue seems to be fixed in glusterfs-3.8.4 and I was able to scale gluster volumes using heketi-cli even beyond 300 volumes without any issues.

############
# gluster --version
glusterfs 3.8.4 built on Feb 20 2017 03:15:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.


# heketi-cli volume list |wc -l
500


# gluster volume list |wc -l
500


# gluster volume info vol500
 
Volume Name: vol500
Type: Replicate
Volume ID: 72c9be57-3284-4e12-8449-84889a782c23
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.163:/var/lib/heketi/mounts/vg_814d6fa82429363cf09aaafe5ba7d850/brick_1f6b1df64468e950ee64aebe57f498d9/brick
Brick2: 10.70.46.150:/var/lib/heketi/mounts/vg_60357bad265972bb79fd3155e6d473fa/brick_d14fc99faef482004d606480d405df7d/brick
Brick3: 10.70.47.161:/var/lib/heketi/mounts/vg_893e765b342697a5086bf56d58332501/brick_3f4e9c68a58d6435ab377d85d08f90ff/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
############

Comment 17 errata-xmlrpc 2017-03-23 05:31:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html


Note You need to log in before you can comment on or make changes to this bug.