Description of problem:
The background of the problem is captured at https://bugzilla.redhat.com/show_bug.cgi?id=1649651#c16 and this has resulted in an upstream heketi issue: https://github.com/heketi/heketi/issues/1426

In certain cases when there is back pressure on glusterd to process many incoming requests, having heketi assume a command has failed after a default timeout of 120 seconds is not reasonable, as that can lead to inconsistency between heketi and glusterd.

Version-Release number of selected component (if applicable):
OCS 3.9/3.10/3.11

How reproducible:
Always

Steps to Reproduce:
1. Have a workload with frequent PV claim/delete requests; make sure the cluster has ~1000 volumes with cluster.max-bricks-per-process set to 0.
2. Kill a brick.
3. glusterd processes the disconnect event and keeps the thread busy (due to BZ 1649651).

Actual results:
PVCs go into Pending state.

Expected results:
PVCs should not go into Pending state.

Additional info:
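For context on the behaviour being verified: the heketi logs later in this report show heketi invoking the gluster CLI with an explicit --timeout=600, so the CLI itself waits up to 10 minutes instead of heketi abandoning the call after its earlier 120-second window. A minimal sketch of that invocation (the volume name is simply the one from the happy-path log, used here for illustration):

# As seen in the verification logs below: the fixed heketi build runs gluster
# commands with a 600-second CLI timeout. If glusterd still has not answered
# by then, the CLI prints "Error : Request timed out" and exits non-zero,
# which heketi then treats as a failure.
gluster --mode=script --timeout=600 volume start vol_0a1b153d7a147ca7f35a73430a6813cc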
Verified and works fine. Below are the steps followed to verify the bug.

rpm -qa | grep heketi
heketi-client-8.0.0-5.el7rhgs.x86_64

[root@dhcp47-30 ~]# oc rsh glusterfs-storage-l6kcb
sh-4.2# rpm -qa | grep glusterfs
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
sh-4.2# rpm -qa | grep heketi
heketi-8.0.0-5.el7rhgs.x86_64
heketi-client-8.0.0-5.el7rhgs.x86_64

[root@dhcp47-30 ~]# oc version
oc v3.11.51
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-30.lab.eng.blr.redhat.com:8443
openshift v3.11.51
kubernetes v1.11.0+d4cacc0

[root@dhcp47-30 ~]# rpm -qa | grep openshift-ansible
openshift-ansible-playbooks-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-docs-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-roles-3.11.51-2.git.0.51c90a3.el7.noarch
openshift-ansible-3.11.51-2.git.0.51c90a3.el7.noarch

Steps followed to verify the bug:
=====================================

First test (Happy Path):
================================
1) Created a hook script called /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the contents below:
   #!/bin/bash
   sleep 6m
2) Gave execute permissions to the script by running:
   chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Ran the command to send a PVC request.
4) Once the sleep started executing, went ahead and deleted the script (/var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh) from all the nodes.
5) The volume started after 6 minutes and the PVC got bound successfully.

The above test proves that things work fine when the script sleeps for less than 10 minutes. (A consolidated sketch of the hook script and the claim request used in both tests appears after the negative-path logs below.)

Below is what is seen in cmd_history.log:
===============================================
[2018-12-20 11:28:25.101574]  : volume create vol_0a1b153d7a147ca7f35a73430a6813cc replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_25a06bd6ec93abc6afda09878a4923b5/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_33a82abe5766f00400f105d7e21b68ae/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_96419185b8f90d3dce9ec0008865e0c9/brick : SUCCESS
[2018-12-20 11:28:25.723433]  : volume set vol_0a1b153d7a147ca7f35a73430a6813cc server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 11:34:28.432923]  : volume start vol_0a1b153d7a147ca7f35a73430a6813cc : SUCCESS

PVC got bound successfully after 6 minutes:
===========================================
testfinal2   Bound   pvc-58e9b0a1-044a-11e9-a8ba-005056a5190f   11Gi   RWO   glusterfs-storage   7m

[root@dhcp47-30 ~]# heketi-cli volume list
Id:0a1b153d7a147ca7f35a73430a6813cc Cluster:d094ae7b7c68d5e3b6289ea6bd057bf1 Name:vol_0a1b153d7a147ca7f35a73430a6813cc

Second test (Negative Path):
=============================
1) Created a hook script called /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the contents below:
   #!/bin/bash
   sleep 11m
2) Gave execute permissions to the script by running:
   chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Ran the command to send a PVC request.
4) Once the sleep started executing, went ahead and deleted the script (/var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh) from all the nodes.
5) Now the heketi request fails, with the gluster CLI failing due to the timeout.

Below are the logs from cmd_history.log:
===========================================
[2018-12-20 13:15:36.415818]  : volume create vol_426c53a028247b2d8fb6d3d9fec50993 replica 3 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_61aa821dacdcd93a0d0f7ee21da71528/brick 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_756deb6d3603b98eb6081ba75154b761/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_e5bfc4265768107546c2860fd4c0acf2/brick : SUCCESS
[2018-12-20 13:15:37.146529]  : volume set vol_426c53a028247b2d8fb6d3d9fec50993 server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 13:26:39.547713]  : volume stop vol_426c53a028247b2d8fb6d3d9fec50993 force : FAILED : Another transaction is in progress for vol_426c53a028247b2d8fb6d3d9fec50993. Please try again after sometime.
[2018-12-20 13:26:39.842920]  : volume delete vol_426c53a028247b2d8fb6d3d9fec50993 : FAILED : Another transaction is in progress for vol_426c53a028247b2d8fb6d3d9fec50993. Please try again after sometime.
[2018-12-20 13:26:40.879972]  : volume start vol_426c53a028247b2d8fb6d3d9fec50993 : SUCCESS
[2018-12-20 13:26:44.704826]  : volume stop vol_426c53a028247b2d8fb6d3d9fec50993 force : SUCCESS
[2018-12-20 13:26:44.936905]  : volume delete vol_426c53a028247b2d8fb6d3d9fec50993 : SUCCESS

Logs from heketi:
=================================
[kubeexec] ERROR 2018/12/20 13:25:37 heketi/pkg/remoteexec/kube/exec.go:85:kube.ExecCommands: Failed to run command [gluster --mode=script --timeout=600 volume start vol_426c53a028247b2d8fb6d3d9fec50993] on [pod:glusterfs-storage-wtbjp c:glusterfs ns:glusterfs (from host:dhcp46-100.lab.eng.blr.redhat.com selector:glusterfs-node)]: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

Since the hook scripts were deleted, the PVC gets bound successfully when another volume request is sent from heketi after that.
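As referenced in the test steps above, here is a consolidated sketch of the delay-injection setup and the claim request. The hook script path, the sleep durations, the chmod step, and the glusterfs-storage storage class come from this report; the PVC name, namespace, and exact size are illustrative assumptions.

# Run on every gluster node (or inside every glusterfs-storage pod) to delay
# 'volume start' by sleeping in a pre-start hook; use 'sleep 11m' for the
# negative test, then delete the script again once the sleep has started.
cat > /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh <<'EOF'
#!/bin/bash
sleep 6m
EOF
chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh

# Send the PVC request that drives heketi to create and start a volume.
oc create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testfinal2          # illustrative; any PVC name works
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 11Gi
EOF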
=====================================================================================================================================
cmd_history.log
=========================
[2018-12-20 13:27:21.970233]  : volume create vol_d78c1b54ec7efaf9686f19104843dd43 replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_ce4109beff40cc4d7cbc9eea2850b686/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_1f270c2e3066bb417f0c86c2106c04f1/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_ca5335b9b4988e41ffc00a91ff2cb89c/brick : SUCCESS
[2018-12-20 13:27:22.574492]  : volume set vol_d78c1b54ec7efaf9686f19104843dd43 server.tcp-user-timeout 42 : SUCCESS
[2018-12-20 13:27:26.118807]  : volume start vol_d78c1b54ec7efaf9686f19104843dd43 : SUCCESS

Heketi logs:
==============================
[kubeexec] DEBUG 2018/12/20 13:27:21 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume create vol_d78c1b54ec7efaf9686f19104843dd43 replica 3 10.70.47.20:/var/lib/heketi/mounts/vg_18eed6ccd3a036f7880bf0c7fe5323d6/brick_ce4109beff40cc4d7cbc9eea2850b686/brick 10.70.47.18:/var/lib/heketi/mounts/vg_2e334ee3073ebfc5848b7a156b9f2110/brick_1f270c2e3066bb417f0c86c2106c04f1/brick 10.70.46.100:/var/lib/heketi/mounts/vg_de28a78b08ef4501123e295e876f75e1/brick_ca5335b9b4988e41ffc00a91ff2cb89c/brick] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume create: vol_d78c1b54ec7efaf9686f19104843dd43: success: please start the volume to access data
]: Stderr []
[kubeexec] DEBUG 2018/12/20 13:27:22 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume set vol_d78c1b54ec7efaf9686f19104843dd43 server.tcp-user-timeout 42] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume set: success
]: Stderr []
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 145.564µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 119.577µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 112.304µs
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 200 OK in 105.138µs
[kubeexec] DEBUG 2018/12/20 13:27:26 heketi/pkg/remoteexec/kube/exec.go:81:kube.ExecCommands: Ran command [gluster --mode=script --timeout=600 volume start vol_d78c1b54ec7efaf9686f19104843dd43] on [pod:glusterfs-storage-v5tk9 c:glusterfs ns:glusterfs (from host:dhcp47-20.lab.eng.blr.redhat.com selector:glusterfs-node)]: Stdout [volume start: vol_d78c1b54ec7efaf9686f19104843dd43: success
]: Stderr []
[heketi] INFO 2018/12/20 13:27:26 Create Volume succeeded
[asynchttp] INFO 2018/12/20 13:27:26 asynchttp.go:292: Completed job 0204b33b33566ba06f5255c5ae791f85 in 11.367713729s
[negroni] Started GET /queue/0204b33b33566ba06f5255c5ae791f85
[negroni] Completed 303 See Other in 203.516µs
[negroni] Started GET /volumes/d78c1b54ec7efaf9686f19104843dd43
[negroni] Completed 200 OK in 3.772417ms
[negroni] Started GET /clusters/d094ae7b7c68d5e3b6289ea6bd057bf1
[negroni] Completed 200 OK in 298.465µs
[negroni] Started GET /nodes/063e6982e906f98a8068a061cf38a7f3
[negroni] Completed 200 OK in 1.666358ms
[negroni] Started GET /nodes/b0f088580a1a78323b1e2db89d680ba5
[negroni] Completed 200 OK in 1.562843ms
[negroni] Started GET /nodes/e77ca377cf1f407c4c2d446787ed0e53

PVC gets bound:
============================
[root@dhcp47-30 ~]# oc get pvc testelevnm
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
testelevnm   Bound    pvc-527dcea1-0459-11e9-a8ba-005056a5190f   11Gi       RWO            glusterfs-storage   36m

Hence, moving this bug to the verified state. PVC create requests have been done for a file volume.
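For completeness, a minimal sketch of how the result can be cross-checked from the master node. The PVC name, pod name, and volume name are the ones from the outputs above; the cmd_history.log path is assumed to be the standard glusterd log location.

# Confirm the claim is bound.
oc get pvc testelevnm

# Confirm the delayed start on the gluster side: in the happy-path run the
# 'volume start' entry lands ~6 minutes after 'volume set' for the same volume.
oc rsh glusterfs-storage-l6kcb grep vol_0a1b153d7a147ca7f35a73430a6813cc /var/log/glusterfs/cmd_history.log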
Updated the doc text. Kindly review it for technical accuracy.
Updated the doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0286