Bug 1337836

Summary: [Volume Scale] heketi-cli should not attempt to stop and delete a volume as soon as it receives a CLI timeout (120sec) but instead wait until the frame-timeout of 600sec
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Prasanth <pprakash>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: Prasanth <pprakash>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amainkar, asrivast, madam, pprakash, rcyriac, rhinduja, rhs-bugs, vbellur
Target Milestone: ---
Keywords: Reopened
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aplo
Fixed In Version: glusterfs-3.8.4-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 05:32:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1351522    

Description Prasanth 2016-05-20 08:09:55 UTC
Description of problem:

On processing a volume create request, heketi-cli should not attempt to stop and delete a volume as soon as it receives a CLI timeout (120 sec) but should instead wait until the frame-timeout of 600 sec.

Version-Release number of selected component (if applicable):
# heketi --version
Heketi v1.3.0-dev-12-g2f52279
glusterfs-3.7.9-5.el7rhgs.x86_64
glusterfs-server-3.7.9-5.el7rhgs.x86_64

How reproducible: Always


Steps to Reproduce:
1. Set up a gluster cluster of 3 RHGS 3.1.3 nodes with glusterd.service MemoryLimit=32G
2. Using heketi-cli, create and start around 300 gluster volumes in a loop (a minimal loop sketch follows this list)
3. Check the command output and the heketi and gluster logs
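
For illustration, a repro loop along the following lines can be used. This is only a sketch: the heketi endpoint, volume size and replica count are placeholders for the test setup, not the exact values used in this run.

#####
#!/bin/bash
# Create ~300 small volumes in a loop via heketi-cli and record any failures.
# heketi itself issues gluster volume create/start for each request (see logs below).
# HEKETI_CLI_SERVER, --size and --replica are placeholders for the test setup.
export HEKETI_CLI_SERVER=http://heketi.example.com:8080
for i in $(seq 1 300); do
    heketi-cli volume create --size=1 --replica=3 \
        || echo "volume request $i failed" >> /tmp/heketi-scale-failures.log
done
#####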


Actual results: Even though the volume creation succeeds on the gluster nodes, glusterd somehow fails to start the volume and the CLI eventually times out after 120 sec, while glusterd keeps processing the request in the background until the frame-timeout of 600 sec. However, on receiving the "Request timed out" message after 120 sec, heketi-cli immediately tries to stop and delete this volume, followed by umount --> lvremove --> rmdir, etc. Since the gluster volume stop itself fails because it is unable to acquire the lock for that volume, all the subsequent actions fail as well.

See logs from Heketi-cli:

#######
May 20 13:01:06 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:06 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster volume start vol296] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout [Error : Request timed out
May 20 13:01:06 dhcp42-197 heketi: ]: Stderr []
May 20 13:01:06 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:06 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume stop vol296 force] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume stop: vol296: failed: Another transaction is in progress for vol296. Please try again after sometime.
May 20 13:01:06 dhcp42-197 heketi: ]
May 20 13:01:06 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:06 /src/github.com/heketi/heketi/executors/sshexec/volume.go:146: Unable to stop volume vol296: Process exited with: 1. Reason was:  ()
May 20 13:01:07 dhcp42-197 heketi: [negroni] Started GET /queue/1899fe9e4701655e57691aa9dae27385
May 20 13:01:07 dhcp42-197 heketi: [negroni] Completed 200 OK in 82.184µs
May 20 13:01:08 dhcp42-197 heketi: [negroni] Started GET /queue/1899fe9e4701655e57691aa9dae27385
May 20 13:01:08 dhcp42-197 heketi: [negroni] Completed 200 OK in 111.284µs
May 20 13:01:08 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:08 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo gluster --mode=script volume delete vol296] on 10.70.42.222:22: Err[Process exited with: 1. Reason was:  ()]: Stdout []: Stderr [volume delete: vol296: failed: Another transaction is in progress for vol296. Please try again after sometime.
May 20 13:01:08 dhcp42-197 heketi: ]
May 20 13:01:08 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:08 /src/github.com/heketi/heketi/executors/sshexec/volume.go:158: Unable to delete volume vol296: Process exited with: 1. Reason was:  ()
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick a20b78834b0ae628ab21a867fbc3906b
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick a4d290acfc7bdd8cb4d32537ca734509
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick baabad7c391b6cc6d6918fd3d3d71820
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick 83749ea4e93e0d9ab11b9c3cb37ef8a2
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick 928a0fc2285f7e9a3c4a4cf12e6967b6
May 20 13:01:08 dhcp42-197 heketi: [heketi] INFO 2016/05/20 13:01:08 Deleting brick ae5390d0bceea4dc0cdd32aa91c5499f
May 20 13:01:08 dhcp42-197 heketi: [sshexec] ERROR 2016/05/20 13:01:08 /src/github.com/heketi/utils/ssh/ssh.go:155: Failed to run command [sudo umount /var/lib/heketi/mounts/vg_5013101181241c140bd579f226cfd3d1/brick_baabad7c391b6cc6d6918fd3d3d71820] on 10.70.43.158:22: Err[Process exited with: 32. Reason was:  ()]: Stdout []: Stderr [umount: /var/lib/heketi/mounts/vg_5013101181241c140bd579f226cfd3d1/brick_baabad7c391b6cc6d6918fd3d3d71820: target is busy.
#######

This ends up in an inconsistent cluster/volume state, as the same volume is seen as "Created" on the gluster side with some bricks missing, but the heketi db has NO information about this volume at all.

#####
[root@dhcp43-158 ~]# gluster vol info vol296
 
Volume Name: vol296
Type: Distributed-Replicate
Volume ID: 7b500717-3406-4a29-be63-0deaffa18bbc
Status: Created
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.222:/var/lib/heketi/mounts/vg_0974b8b4fdf60a1bc086fa3c03a4cf24/brick_a20b78834b0ae628ab21a867fbc3906b/brick
Brick2: 10.70.42.85:/var/lib/heketi/mounts/vg_65f837f8e26d1afb8b2050194afd0203/brick_a4d290acfc7bdd8cb4d32537ca734509/brick
Brick3: 10.70.43.158:/var/lib/heketi/mounts/vg_5013101181241c140bd579f226cfd3d1/brick_baabad7c391b6cc6d6918fd3d3d71820/brick
Brick4: 10.70.42.222:/var/lib/heketi/mounts/vg_13f8363b53cac7068787f533316cd884/brick_83749ea4e93e0d9ab11b9c3cb37ef8a2/brick
Brick5: 10.70.42.85:/var/lib/heketi/mounts/vg_3589dc63d585700961b1e0f2d2be4801/brick_928a0fc2285f7e9a3c4a4cf12e6967b6/brick
Brick6: 10.70.43.158:/var/lib/heketi/mounts/vg_84d8689748532a237c0815823c5c6294/brick_ae5390d0bceea4dc0cdd32aa91c5499f/brick
Options Reconfigured:
performance.readdir-ahead: on
#####

Expected results: heketi-cli should not try to clean up the volume immediately but instead wait for the frame-timeout to happen to avoid ending up in this situation.
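
In other words, something along the lines of the following sketch on the heketi side. The 600-second budget, the poll interval, and using `gluster volume info` to check the state are assumptions for illustration, not the actual fix.

#####
#!/bin/bash
# Sketch: after `gluster volume start` times out on the CLI (120 sec), keep
# polling the volume state until the 600 sec frame-timeout before deciding
# whether a clean-up (stop/delete, umount, lvremove, rmdir) is really needed.
VOL=vol296
DEADLINE=$(( $(date +%s) + 600 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
    if gluster volume info "$VOL" 2>/dev/null | grep -q '^Status: Started'; then
        echo "$VOL eventually started; skipping clean-up"
        exit 0
    fi
    sleep 30
done
echo "$VOL still not started after the frame-timeout; clean-up can proceed" >&2
exit 1
#####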

Comment 2 Luis Pabón 2016-05-24 18:44:57 UTC
https://github.com/heketi/heketi/issues/348

Comment 7 Luis Pabón 2016-06-16 00:34:18 UTC
Hi, I have been taking a look at this bug, but the code does not wait for 120 secs; it waits for 10 mins max on a command. If the timeout is from the gluster command coming back after 120 secs, then the bug seems to be in gluster, right?

Comment 9 Luis Pabón 2016-07-06 16:37:04 UTC
We do not use sshexec in OpenShift.  We now use kubeexec.  I'll close until this shows up again.

Comment 11 Luis Pabón 2016-07-07 11:29:24 UTC
Please retry using the latest.

Comment 12 Luis Pabón 2016-07-19 15:28:32 UTC
This issue is due to the gluster CLI returning from the command before the command is completed.   It returns with an error code to Heketi, so Heketi thinks something bad went wrong.

The CLI is supposed to wait until the command is completed and return success or failure.

Comment 14 Atin Mukherjee 2016-07-19 17:32:51 UTC
And I will reply to Luis's query on comment 12 once I am in front of my laptop.

Comment 16 Atin Mukherjee 2016-07-20 04:07:15 UTC
(In reply to Luis Pabón from comment #12)
> This issue is due to the gluster CLI returning from the command before the
> command is completed.   It returns with an error code to Heketi, so Heketi
> thinks something bad went wrong.
> 
> The CLI is supposed to wait until the command is completed and return
> success or failure.

Ideally yes, it is supposed to, but the question is how long? Will it wait forever if it doesn't get a response back from GlusterD? There are various reasons why GlusterD may take time processing the request: GlusterD could be deadlocked, or an RPC timeout could occur, in which case the RPC callback is only hit after 600 secs, i.e. the default frame-timeout value. With all these considerations, it was decided that the average wait time for the CLI shouldn't be more than 120 secs, and that it should time out even if GlusterD is still processing the request. IMHO we cannot have the CLI wait indefinitely for GlusterD to respond. If you have any other suggestion please do let us know.

Now the good news is that this issue is already fixed upstream with http://review.gluster.org/14849; the RCA is the same as for BZ 1336267, BZ 1337477 & BZ 1337495, and the fix is targeted for the RHGS 3.2 release.

I have a question for Prasanth/Luis.

"Actual results: Even though the volume creation is successful in the gluster nodes, glusterd somehow fails to start the volume and eventually CLI times out after 120sec. However, glusterd still tries to process this request in the background until the frame-timeout of 600sec happens. However, on receiving the "Request timed out" message after 120sec, heketi-cli immediately tries to stop and delete this volume followed by umount --> lvremove --> rmdir, etc. Since gluster volume stop itself fails as it is Unable to acquire lock for that volume, all the subsequent actions are also going to fail."

If volume stop fails because it could not acquire the lock, why does heketi-cli go on to delete the volume and perform the other clean-ups? That will always lead to inconsistency across the layers. Shouldn't heketi do this only when the CLI succeeds?

Comment 17 Atin Mukherjee 2016-07-20 04:09:02 UTC
Since the patch is already posted and merged upstream, moving this bug to POST state.

Comment 19 Prasanth 2016-07-20 10:06:32 UTC
(In reply to Atin Mukherjee from comment #16)

> I've a question for Prashanth/Luis. 
> 
> "Actual results: Even though the volume creation is successful in the
> gluster nodes, glusterd somehow fails to start the volume and eventually CLI
> times out after 120sec. However, glusterd still tries to process this
> request in the background until the frame-timeout of 600sec happens.
> However, on receiving the "Request timed out" message after 120sec,
> heketi-cli immediately tries to stop and delete this volume followed by
> umount --> lvremove --> rmdir, etc. Since gluster volume stop itself fails
> as it is Unable to acquire lock for that volume, all the subsequent actions
> are also going to fail."
> 
> if volume stop fails since it failed to acquire the lock why heketi cli goes
> for deleting the volume followed by the other clean ups? That will always
> lead to inconsistency across the layers. Shouldn't heketi do this only when
> CLI succeeds?

Atin, 

You are right, heketi should NOT do a full clean-up in this situation, as it directly leads to inconsistency between the data in the heketi db and gluster. That is the exact reason I logged this bug, and I've put the same thoughts in the expected results as well.

So I believe it's a good question for Luis to explain the current behaviour! :)

Comment 22 Luis Pabón 2016-08-09 18:51:12 UTC
I think the problem here is that this CLI command looks like the only one Heketi uses that could be asynchronous. We either need a synchronous command that blocks until the request is complete, or we need a method/command from GlusterD to get the status of this command once it times out.

The second option must be well documented and tested.

Comment 23 Luis Pabón 2016-08-09 18:53:08 UTC
(In reply to Luis Pabón from comment #22)
> I think the problem here is that this CLI command looks like is the only one
> that Heketi uses which could be Asynchronous.  We either need a synchronous
> command which blocks until the request is complete, or we need a
> method/command from GlusterD to get status for this command once it times
> out.
> 
> The second option must be well documented and tested.

Also, timing out makes the behaviour inconsistent. Either the command should always be asynchronous, or it should always be synchronous. If we need a new gluster command to make it always asynchronous or always synchronous, that would be fine.
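
For the second option, the caller would roughly have to distinguish a timed-out request (outcome unknown, query glusterd) from a definitive failure (safe to clean up). A hedged sketch follows; the error-string match and the follow-up status query are assumptions, not an agreed interface.

#####
#!/bin/bash
# Sketch: only clean up on a definitive failure; on "Request timed out" the
# outcome is unknown, so query glusterd for the volume state first.
VOL=vol296
if out=$(sudo gluster volume start "$VOL" 2>&1); then
    echo "$VOL started"
elif echo "$out" | grep -qi 'Request timed out'; then
    # Outcome unknown: re-check the state (e.g. with the polling loop shown
    # earlier) instead of stopping/deleting the volume immediately.
    gluster volume info "$VOL"
else
    echo "definitive failure for $VOL, clean-up is safe: $out" >&2
fi
#####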

Comment 24 Atin Mukherjee 2016-09-17 14:24:31 UTC
Upstream mainline : http://review.gluster.org/14849
Upstream 3.8 : http://review.gluster.org/14860

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 28 Prasanth 2017-02-28 07:34:21 UTC
The reported issue seems to be fixed in glusterfs-3.8.4 and I was able to scale gluster volumes using heketi-cli even beyond 300 volumes without any issues.

######################
# gluster --version
glusterfs 3.8.4 built on Feb 20 2017 03:15:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.


# heketi-cli volume list |wc -l
500


# gluster volume list |wc -l
500


# gluster peer status
Number of Peers: 2

Hostname: dhcp46-150.lab.eng.blr.redhat.com
Uuid: c4bdf1ad-04ab-4301-b9fe-f144272079ef
State: Peer in Cluster (Connected)

Hostname: 10.70.47.163
Uuid: fcd44049-a3b9-4f85-851c-79915812cf3f
State: Peer in Cluster (Connected)


# gluster volume info vol291
 
Volume Name: vol291
Type: Replicate
Volume ID: 48eeed36-e50e-429b-b474-10e4e336ffca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.46.150:/var/lib/heketi/mounts/vg_1dae8c1c2feb1c16338a0440f64bcfed/brick_6f278e35a09ea7cff70b008784cb99c1/brick
Brick2: 10.70.47.161:/var/lib/heketi/mounts/vg_cf48e4fe475f69149d157bbfae86db75/brick_5cc9403c371cff1f2c4b6504bca5f2e9/brick
Brick3: 10.70.47.163:/var/lib/heketi/mounts/vg_94e77f6c32ac54b0c819ceee0899981f/brick_a6fb045df6a79ee353031f9da40309c5/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


# gluster volume info vol500
 
Volume Name: vol500
Type: Replicate
Volume ID: 72c9be57-3284-4e12-8449-84889a782c23
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.163:/var/lib/heketi/mounts/vg_814d6fa82429363cf09aaafe5ba7d850/brick_1f6b1df64468e950ee64aebe57f498d9/brick
Brick2: 10.70.46.150:/var/lib/heketi/mounts/vg_60357bad265972bb79fd3155e6d473fa/brick_d14fc99faef482004d606480d405df7d/brick
Brick3: 10.70.47.161:/var/lib/heketi/mounts/vg_893e765b342697a5086bf56d58332501/brick_3f4e9c68a58d6435ab377d85d08f90ff/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

# gluster pool list
UUID                                    Hostname                                State
c4bdf1ad-04ab-4301-b9fe-f144272079ef    dhcp46-150.lab.eng.blr.redhat.com       Connected 
fcd44049-a3b9-4f85-851c-79915812cf3f    10.70.47.163                            Connected 
a614a3d8-478f-48d4-8542-e6d9c3b526ad    localhost                               Connected 
######################

Comment 31 errata-xmlrpc 2017-03-23 05:32:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html