Bug 1519549

Summary: Heketi timeouts leading to inconsistent CNS state
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Matthew Robson <mrobson>
Component: heketi Assignee: Raghavendra Talur <rtalur>
Status: CLOSED CURRENTRELEASE QA Contact: Rahul Hinduja <rhinduja>
Severity: high Docs Contact:
Priority: unspecified    
Version: cns-3.6 CC: ekuric, hchiramm, jlee, kaushal, madam, mrobson, rhs-bugs, rtalur, storage-qa-internal, tcarlin, vinug
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-19 16:46:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1622458    

Description Matthew Robson 2017-11-30 21:22:49 UTC
Description of problem:

New CNS 3.6 deployment on OpenShift with 8.5TB of SSD across 3 DL380 servers.

Tried to provision and then delete 100 PVCs as a test to see how it worked.

Took about 8 minutes to create all 100 PVCs.

Attempting to delete them resulted in many timeouts and failures from Heketi, leaving the PVCs, PVs, and volumes on the underlying gluster nodes in an inconsistent state.

[kubeexec] ERROR 2017/11/28 23:57:53 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume start vol_bf0b00ea4aa6cd521cd8e4a3d7e595ea] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

[kubeexec] ERROR 2017/11/28 23:59:55 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume create vol_30376745b87d535239df2b9f6d02875c replica 3 192.168.111.162:/var/lib/heketi/mounts/vg_8c85c4358d1e65cae0de9d1993c0e128/brick_4cd75155be0d3fffdd8161a3aa5c4ef8/brick 192.168.111.161:/var/lib/heketi/mounts/vg_a7d773c52baf36bad3c11a1bc042ba04/brick_6815ce072d4d81a06dabe62c24536456/brick 192.168.111.160:/var/lib/heketi/mounts/vg_fb6689c3ac59eae0c0f04553964f7d2b/brick_bee25721d66af16ce37fceecac110ea7/brick] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

There are 83 "Unable to delete volume" ERRORs in the heketi log:

[kubeexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_efe32f4754b399b152e62c2615e8ecce] on glusterfs-ck5bv: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume efe32f4754b399b152e62c2615e8ecce: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[kubeexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_1ef511b38ba18ee92c2e7ac2beb49e58] on glusterfs-19k5w: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume 1ef511b38ba18ee92c2e7ac2beb49e58: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
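
To see which volumes were left behind, the failures can be tallied straight from the heketi log. The following is a minimal sketch, not a Heketi tool: the regular expression is inferred only from the [sshexec] lines quoted above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the "[sshexec] ... Unable to delete volume vol_<id>:" lines quoted
# above. The pattern is inferred from this excerpt, not from Heketi's source.
DELETE_FAILURE = re.compile(
    r"^\[sshexec\] ERROR .*Unable to delete volume (vol_[0-9a-f]+):"
)


def failed_deletes(log_lines):
    """Return a Counter mapping volume name -> number of failed delete attempts."""
    failures = Counter()
    for line in log_lines:
        match = DELETE_FAILURE.match(line)
        if match:
            failures[match.group(1)] += 1
    return failures
```

Passing an open log file (any iterable of lines) to failed_deletes() gives one entry per volume. Only the [sshexec] line is counted, so the [kubeexec] and [heketi] lines that repeat the same failure do not inflate the total.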


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-18.4.el7.x86_64

Heketi: rhgs3/rhgs-volmanager-rhel7:3.3.0-362

How reproducible:

Created 100 PVCs.

Tried to delete the 100 PVCs; the deletes failed.

Can no longer provision anything.

Steps to Reproduce:
1. Create 100 PVCs and wait for the volumes and PVs to bind.
2. Delete those 100 PVCs and wait for the PVs and volumes to be cleaned up.
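
For anyone trying to reproduce this, the 100 claims can be generated in one manifest. This is a rough sketch under assumptions: the storage class name "glusterfs-storage", the 1Gi size, the "cns-test" name prefix, and the ReadWriteOnce access mode are placeholders and were not taken from the affected environment.

```python
import json


def build_pvc_list(count, storage_class, size="1Gi", prefix="cns-test"):
    """Build a Kubernetes v1 List containing `count` PersistentVolumeClaims."""
    items = []
    for index in range(count):
        items.append({
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "metadata": {"name": f"{prefix}-{index:03d}"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                # Placeholder: use the storage class backed by Heketi.
                "storageClassName": storage_class,
                "resources": {"requests": {"storage": size}},
            },
        })
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # "glusterfs-storage" is an assumed storage class name.
    print(json.dumps(build_pvc_list(100, "glusterfs-storage"), indent=2))
```

Redirecting the output to a file and running `oc create -f` on it should create the claims; `oc delete -f` on the same file issues the 100 deletes in one go, which is the step that hit the timeouts here.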

Actual results:

There were many timeouts and failures that left PVs and volumes behind even though the PVCs were removed.

New PVCs can no longer be provisioned; they just sit in a Pending state.

Expected results:

Creates and deletes should work, and if they do fail, there should be a mechanism to retry or to keep the cluster in a consistent state.
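
To illustrate the kind of retry mechanism meant here, below is a generic retry-with-exponential-backoff sketch. It is not how Heketi handles failed operations; the function name, the defaults, and the injectable sleep parameter are all illustrative assumptions.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    between attempts and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last failure to the caller.
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Blind retries like this are only safe when the wrapped operation is idempotent. A volume delete that timed out may still have completed on the gluster side, so the caller would need to check the actual state before retrying.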

Additional info:

Comment 21 Michael Adam 2018-09-19 16:46:59 UTC
This was the main topic of the stability fixes that went into CNS 3.9 and OCS 3.10. This would not happen any more with the latest release.