Description of problem:
The background of the problem is captured at https://bugzilla.redhat.com/show_bug.cgi?id=1649651#c16 and this has resulted in an upstream heketi issue: https://github.com/heketi/heketi/issues/1426

In certain cases when there is back pressure on glusterd to process many incoming requests, having heketi assume a command has failed after a default timeout of 120 seconds is not reasonable, as that can lead to inconsistency between heketi and glusterd.

Version-Release number of selected component (if applicable):
OCS 3.9/3.10/3.11

How reproducible:
Always

Steps to Reproduce:
1. Have a workload with frequent PV claim/delete requests; make sure the cluster has ~1000 volumes with cluster.max-bricks-per-process set to 0.
2. Kill a brick.
3. glusterd processes the disconnect event and keeps the thread busy (due to BZ 1649651).

Actual results:
PVCs go into Pending state.

Expected results:
PVCs should not go into Pending state.

Additional info:
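For context on the behaviour being verified: the heketi logs later in this report show heketi invoking the gluster CLI with an explicit --timeout=600, so the CLI itself waits up to 10 minutes instead of heketi abandoning the call after its earlier 120-second window. A minimal sketch of that invocation (the volume name is simply the one from the happy-path log, used here for illustration):

# As seen in the verification logs below: the fixed heketi build runs gluster
# commands with a 600-second CLI timeout. If glusterd still has not answered
# by then, the CLI prints "Error : Request timed out" and exits non-zero,
# which heketi then treats as a failure.
gluster --mode=script --timeout=600 volume start vol_0a1b153d7a147ca7f35a73430a6813cc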
Verified and works fine. Below are the steps followed to verify the bug.

rpm -qa | grep heketi
heketi-client-8.0.0-5.el7rhgs.x86_64

[root@dhcp47-30 ~]# oc rsh glusterfs-storage-l6kcb
sh-4.2# rpm -qa | grep glusterfs
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
sh-4.2# rpm -qa | grep heketi
heketi-8.0.0-5.el7rhgs.x86_64
heketi-client-8.0.0-5.el7rhgs.x86_64

[root@dhcp47-30 ~]# oc version
oc v3.11.51
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-30.lab.eng.blr.redhat.com:8443
openshift v3.11.51
kubernetes v1.11.0+d4cacc0

[root@dhcp47-30 ~]# rpm -qa | grep openshift-ansible
openshift-ansible-playbooks-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-docs-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-roles-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-3.11.51-2.git.0.51c90a3.el7.noarch

Steps followed to verify the bug:
=====================================

First test (Happy Path):
================================
1) Created a hook script called /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the contents below:
   #!/bin/bash
   sleep 6m
2) Gave execute permissions to the script by running:
   chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Ran the command to send a PVC request.
4) Once the sleep started executing, went ahead and deleted the script (/var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh) from all the nodes.
5) The volume started after 6 minutes and the PVC got bound successfully.

The above test proves that things work fine when the script sleeps for less than 10 minutes. (A consolidated sketch of the hook script and the claim request used in both tests appears after the negative-path logs below.)

Below is what is seen in cmd_history.log:
===============================================
[2018-12-20 11:28:25.101574]  : volume create vol_0a1b153d7a147ca7f35a73430a6813cc replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_25a06bd6ec93abc6afda09878a4923b5/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_33a82abe5766f00400f105d7e21b68ae/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_96419185b8f90d3dce9ec0008865e0c9/brick : SUCCESS
[2018-12-20 11:28:25.723433]  : volume set vol_0a1b153d7a147ca7f35a73430a6813cc server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 11:34:28.432923]  : volume start vol_0a1b153d7a147ca7f35a73430a6813cc : SUCCESS

PVC got bound successfully after 6 minutes:
===========================================
testfinal2   Bound   pvc-58e9b0a1-044a-11e9-a8ba-005056a5190f   11Gi   RWO   glusterfs-storage   7m

[root@dhcp47-30 ~]# heketi-cli volume list
Id:0a1b153d7a147ca7f35a73430a6813cc Cluster:d094ae7b7c68d5e3b6289ea6bd057bf1 Name:vol_0a1b153d7a147ca7f35a73430a6813cc

Second test (Negative Path):
=============================
1) Created a hook script called /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the contents below:
   #!/bin/bash
   sleep 11m
2) Gave execute permissions to the script by running:
   chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Ran the command to send a PVC request.
4) Once the sleep started executing, went ahead and deleted the script (/var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh) from all the nodes.
5) Now the heketi request fails, with the gluster CLI failing due to the timeout.

Below are the logs from cmd_history.log:
===========================================
[2018-12-20 13:15:36.415818]  : volume create vol_426c53a028247b2d8fb6d3d9fec50993 replica 3 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_61aa821dacdcd93a0d0f7ee21da71528/brick 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_756deb6d3603b98eb6081ba75154b761/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_e5bfc4265768107546c2860fd4c0acf2/brick : SUCCESS
[2018-12-20 13:15:37.146529]  : volume set vol_426c53a028247b2d8fb6d3d9fec50993 server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 13:26:39.547713]  : volume stop vol_426c53a028247b2d8fb6d3d9fec50993 force : FAILED : Another transaction is in progress for vol_426c53a028247b2d8fb6d3d9fec50993. Please try again after sometime.
[2018-12-20 13:26:39.842920]  : volume delete vol_426c53a028247b2d8fb6d3d9fec50993 : FAILED : Another transaction is in progress for vol_426c53a028247b2d8fb6d3d9fec50993. Please try again after sometime.
[2018-12-20 13:26:40.879972]  : volume start vol_426c53a028247b2d8fb6d3d9fec50993 : SUCCESS
[2018-12-20 13:26:44.704826]  : volume stop vol_426c53a028247b2d8fb6d3d9fec50993 force : SUCCESS
[2018-12-20 13:26:44.936905]  : volume delete vol_426c53a028247b2d8fb6d3d9fec50993 : SUCCESS

Logs from heketi:
=================================
[kubeexec] ERROR 2018/12/20 13:25:37 heketi/pkg/remoteexec/kube/exec.go:85:kube.ExecCommands: Failed to run command [gluster --mode=script --timeout=600 volume start vol_426c53a028247b2d8fb6d3d9fec50993] on [pod:glusterfs-storage-wtbjp c:glusterfs ns:glusterfs (from host:dhcp46-100.lab.eng.blr.redhat.com selector:glusterfs-node)]: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

Since the hook scripts were deleted, the PVC gets bound successfully when another volume request is sent from heketi after that.
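As referenced in the test steps above, here is a consolidated sketch of the delay-injection setup and the claim request. The hook script path, the sleep durations, the chmod step, and the glusterfs-storage storage class come from this report; the PVC name, namespace, and exact size are illustrative assumptions.

# Run on every gluster node (or inside every glusterfs-storage pod) to delay
# 'volume start' by sleeping in a pre-start hook; use 'sleep 11m' for the
# negative test, then delete the script again once the sleep has started.
cat > /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh <<'EOF'
#!/bin/bash
sleep 6m
EOF
chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh

# Send the PVC request that drives heketi to create and start a volume.
oc create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testfinal2          # illustrative; any PVC name works
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 11Gi
EOF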
=====================================================================================================================================
cmd_history.log
=========================
[2018-12-20 13:27:21.970233]  : volume create vol_d78c1b54ec7efaf9686f19104843dd43 replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_ce4109beff40cc4d7cbc9eea2850b686/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_1f270c2e3066bb417f0c86c2106c04f1/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_ca5335b9b4988e41ffc00a91ff2cb89c/brick : SUCCESS
[2018-12-20 13:27:22.574492]  : volume set vol_d78c1b54ec7efaf9686f19104843dd43 server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 13:27:26.118807]  : volume start vol_d78c1b54ec7efaf9686f19104843dd43 : SUCCESS

Heketi logs:
==============================
[kubeexec] DEBUG 2018/12/20 13:27:21 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume create vol_d78c1b54ec7efaf9686f19104843dd43 replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_ce4109beff40cc4d7cbc9eea2850b686/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_1f270c2e3066bb417f0c86c2106c04f1/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_ca5335b9b4988e41ffc00a91ff2cb89c/brick] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume create: vol_d78c1b54ec7efaf9686f19104843dd43: success: please start the volume to access data
]: Stderr []
[kubeexec] DEBUG 2018/12/20 13:27:22 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume set vol_d78c1b54ec7efaf9686f19104843dd43 server.tcp-user-timeout 42] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume set: success
]: Stderr []
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 145.564µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 119.577µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 112.304µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 105.138µs
[kubeexec] DEBUG 2018/12/20 13:27:26 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume start vol_d78c1b54ec7efaf9686f19104843dd43] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume start: vol_d78c1b54ec7efaf9686f19104843dd43: success
]: Stderr []
[heketi] INFO 2018/12/20 13:27:26 Create Volume succeeded
[asynchttp] INFO 2018/12/20 13:27:26 asynchttp.go:292: Completed job 0204b33b33566ba06f5255c5ae791f85 in 11.367713729s
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 303 See Other in 203.516µs
[negroni] Started GET /volumes/d78c1b54ec7efaf9686f19104843dd43
[negroni] Completed 200 OK in 3.772417ms
[negroni] Started GET /clusters/d094ae7b7c68d5e3b6289ea6bd057bf1
[negroni] Completed 200 OK in 298.465µs
[negroni] Started GET /nodes/063e6982e906f98a8068a061cf38a7f3
[negroni] Completed 200 OK in 1.666358ms
[negroni] Started GET /nodes/b0f088580a1a78323b1e2db89d680ba5
[negroni] Completed 200 OK in 1.562843ms
[negroni] Started GET /nodes/e77ca377cf1f407c4c2d446787ed0e53

PVC gets bound:
============================
[root@dhcp47-30 ~]# oc get pvc testelevnm
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
testelevnm   Bound    pvc-527dcea1-0459-11e9-a8ba-005056a5190f   11Gi       RWO            glusterfs-storage   36m

Hence, moving this bug to the verified state. PVC create requests have been done for a file volume.
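For completeness, a minimal sketch of how the result can be cross-checked from the master node. The PVC name, pod name, and volume name are the ones from the outputs above; the cmd_history.log path is assumed to be the standard glusterd log location.

# Confirm the claim is bound.
oc get pvc testelevnm

# Confirm the delayed start on the gluster side: in the happy-path run the
# 'volume start' entry lands ~6 minutes after 'volume set' for the same volume.
oc rsh glusterfs-storage-l6kcb grep vol_0a1b153d7a147ca7f35a73430a6813cc /var/log/glusterfs/cmd_history.log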
Updated the doc text. Kindly review it for technical accuracy.
Updated the doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0286