Description of problem:
New CNS 3.6 deployment on OpenShift with 8.5 TB of SSD across 3 DL380 servers. I tried to provision and then delete 100 PVCs as a test to see how it worked. It took about 8 minutes to create all 100 PVCs. Attempting to delete them resulted in many timeouts and failures from Heketi, leaving an inconsistent state of PVCs, PVs and volumes on the underlying gluster nodes.

[kubeexec] ERROR 2017/11/28 23:57:53 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume start vol_bf0b00ea4aa6cd521cd8e4a3d7e595ea] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []
[kubeexec] ERROR 2017/11/28 23:59:55 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume create vol_30376745b87d535239df2b9f6d02875c replica 3 192.168.111.162:/var/lib/heketi/mounts/vg_8c85c4358d1e65cae0de9d1993c0e128/brick_4cd75155be0d3fffdd8161a3aa5c4ef8/brick 192.168.111.161:/var/lib/heketi/mounts/vg_a7d773c52baf36bad3c11a1bc042ba04/brick_6815ce072d4d81a06dabe62c24536456/brick 192.168.111.160:/var/lib/heketi/mounts/vg_fb6689c3ac59eae0c0f04553964f7d2b/brick_bee25721d66af16ce37fceecac110ea7/brick] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []

There are 83 "Unable to delete volume" ERRORs in the heketi log:

[kubeexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_efe32f4754b399b152e62c2615e8ecce] on glusterfs-ck5bv: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume efe32f4754b399b152e62c2615e8ecce: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[kubeexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_1ef511b38ba18ee92c2e7ac2beb49e58] on glusterfs-19k5w: Err[command terminated with exit code 1]: Stdout [Error : Request timed out ]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume 1ef511b38ba18ee92c2e7ac2beb49e58: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-18.4.el7.x86_64
Heketi: rhgs3/rhgs-volmanager-rhel7:3.3.0-362

How reproducible:
Created 100 PVCs, then tried to delete all 100; the deletes failed. Can no longer provision anything.

Steps to Reproduce:
1. Create 100 PVCs and wait for the volumes and PVs to bind.
2. Delete those 100 PVCs and wait for the PVs and volumes to be cleaned up.
3.

Actual results:
There were many timeouts and failures that left PVs and volumes behind while the PVCs were removed. New PVCs can no longer be created; they just sit in a Pending state.

Expected results:
Creates and deletes should work, and if they do fail, there should be a mechanism to retry or to keep the cluster in a consistent state.

Additional info:
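For anyone triaging a similar log, the following is a minimal sketch of how the failed deletes could be tallied per volume. It is not part of Heketi; the function name count_delete_failures is made up for illustration, and it assumes the app-level "Failed to delete volume <id>:" line shown in the excerpt is logged once per failed delete request.

```python
import re
import sys
from collections import Counter

# Matches the app-level failure line from the log excerpt, e.g.
#   ... app_volume.go:280: Failed to delete volume efe32f47...: Unable to ...
# Assumption: Heketi logs one such line per failed delete request.
DELETE_FAILURE = re.compile(r"Failed to delete volume ([0-9a-f]+):")


def count_delete_failures(log_text):
    """Return a Counter mapping volume ID -> number of failed delete attempts."""
    return Counter(DELETE_FAILURE.findall(log_text))


if __name__ == "__main__":
    # Read the heketi log from stdin and print one line per affected volume.
    failures = count_delete_failures(sys.stdin.read())
    for vol_id, attempts in sorted(failures.items()):
        print(vol_id, attempts)
    print("volumes with failed deletes:", len(failures))
```

One way to feed it is to pipe the output of oc logs for the heketi pod into the script. Volumes that appear in this list but still exist on the gluster nodes are candidates for manual cleanup.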
This problem was the main focus of the stability fixes that went into CNS 3.9 and OCS 3.10. It should no longer occur with the latest release.