1337477 – [Volume Scale] Volume start failed with "Error : Request timed out" after successfully creating & starting around 290 gluster volumes using heketi-cli

Bug 1337477 - [Volume Scale] Volume start failed with "Error : Request timed out" after successfully creating & starting around 290 gluster volumes using heketi-cli

Summary: [Volume Scale] Volume start failed with "Error : Request timed out" after suc...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.2.0
Assignee:	Atin Mukherjee
QA Contact:	Prasanth
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1351522
TreeView+	depends on / blocked

Reported:	2016-05-19 09:43 UTC by Prasanth
Modified:	2020-03-11 15:07 UTC (History)
CC List:	10 users (show)
Fixed In Version:	glusterfs-3.8.4-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-03-23 05:31:54 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1336267	unspecified	CLOSED	[scale]: Bricks not started after node reboot.	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1337836	unspecified	CLOSED	[Volume Scale] heketi-cli should not attempt to stop and delete a volume as soon as it receives a CLI timeout (120sec) b...	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHSA-2017:0486	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update	2017-03-23 09:18:45 UTC

Internal Links: 1336267 1337836

Description Prasanth 2016-05-19 09:43:32 UTC

Description of problem:

[Volume Scale Testing for Aplo] 
Volume start failed with "Error : Request timed out" after successfully creating & starting around290 gluster volumes using heketi-cli

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5.el7rhgs.x86_64
glusterfs-server-3.7.9-5.el7rhgs.x86_64

How reproducible: Always


Steps to Reproduce:
1. A gluster cluster of 4 RHGS 3.1.3 nodes having glusterd.service MemoryLimit=32G
2. Using heketi-cli, try to create and start around 300 gluster volumes in a loop
for i in {1..300}; do heketi-cli volume create --name=vol$i --size=10 --durability="replicate" --replica=3; done
3. Check for command output and heketi logs

Actual results: Creation of 290 volumes was successful without any issues in all the 3 volume scale testing done so far. However, it was noticed that from the 291th volume onwards, it starts failing randomly. In the latest testing, vol291 was created successfully but failed to start. 

From Heketi logs:

######
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster volume start vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout [Error : Request timed out
May 18 19:01:22 dhcp42-197 heketi: ]: Stderr []
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume stop vol291 force] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume stop: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:146: Unable to stop volume vol291: Process exited with: 1. Reason was:  ()
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume delete vol291] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume delete: vol291: failed: Another transaction is in progress for vol291. Please try again after sometime.
May 18 19:01:22 dhcp42-197 heketi: ]
May 18 19:01:22 dhcp42-197 heketi: [sshexec] ERROR 2016/05/18 19:01:22 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol291: Process exited with: 1. Reason was:  ()
######

From gluster logs:

######
[2016-05-18 18:51:32.119103] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:32.119140] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
[2016-05-18 18:51:32.305741] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30) [0x7fc7e9e60340] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980) [0x7fc7e9e60270] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7) [0x7fc7e9e64a67] ) 0-management: Lock for vol291 held by 4b494dd7-09e6-4d7d-8834-218534548912
[2016-05-18 18:51:48.789082] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:48.844338] E [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(+0x1015f2) [0x7fc7e9e615f2] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x426) [0x7fc7e9e61006] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol291 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-05-18 18:51:55.458451] I [socket.c:3508:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2016-05-18 18:51:55.458465] E [rpcsvc.c:1314:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
[2016-05-18 18:51:55.458487] E [MSGID: 106430] [glusterd-utils.c:474:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2016-05-18 18:53:14.606645] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_op_commit_hook+0x195) [0x7fc7e9dccdc5] -->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x4c5) [0x7fc7e9e610a5] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fc7f52c84e5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=vol292 --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
[2016-05-18 18:53:14.606823] I [MSGID: 106062] [glusterd-brick-ops.c:1276:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2016-05-18 18:51:32.305769] E [MSGID: 106119] [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire lock for vol291
######

Expected results: Scaling of volumes should be supported by gluster and volume creation should work fine

Comment 2 Atin Mukherjee 2016-05-20 12:33:12 UTC

RCA for this goes same as BZ 1336267

Comment 5 Atin Mukherjee 2016-07-08 04:22:16 UTC

http://review.gluster.org/#/c/14849/ fixes this issue too.

Comment 7 Atin Mukherjee 2016-09-17 14:26:45 UTC

Upstream mainline : http://review.gluster.org/14849
Upstream 3.8 : http://review.gluster.org/14860

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 13 Prasanth 2017-02-28 07:06:22 UTC

The reported issue seems to be fixed in glusterfs-3.8.4 and I was able to scale gluster volumes using heketi-cli even beyond 300 volumes without any issues.

############
# gluster --version
glusterfs 3.8.4 built on Feb 20 2017 03:15:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.


# heketi-cli volume list |wc -l
500


# gluster volume list |wc -l
500


# gluster volume info vol500
 
Volume Name: vol500
Type: Replicate
Volume ID: 72c9be57-3284-4e12-8449-84889a782c23
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.163:/var/lib/heketi/mounts/vg_814d6fa82429363cf09aaafe5ba7d850/brick_1f6b1df64468e950ee64aebe57f498d9/brick
Brick2: 10.70.46.150:/var/lib/heketi/mounts/vg_60357bad265972bb79fd3155e6d473fa/brick_d14fc99faef482004d606480d405df7d/brick
Brick3: 10.70.47.161:/var/lib/heketi/mounts/vg_893e765b342697a5086bf56d58332501/brick_3f4e9c68a58d6435ab377d85d08f90ff/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
############

Comment 17 errata-xmlrpc 2017-03-23 05:31:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Note You need to log in before you can comment on or make changes to this bug.