1519549 – Heketi timeouts leading to inconsistent CNS state

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	heketi
Sub Component:
Version:	cns-3.6
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Raghavendra Talur
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1622458

Reported:	2017-11-30 21:22 UTC by Matthew Robson
Modified:	2021-03-11 16:28 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-09-19 16:46:59 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1521027	0	high	CLOSED	Volume mismatch between OpenShift PVs, Heketi, and gluster	2021-06-10 13:50:31 UTC

Internal Links: 1521027

Description Matthew Robson 2017-11-30 21:22:49 UTC

Description of problem:

New CNS 3.6 deployment on OpenShift with 8.5TB of SSD across 3 DL380 servers.

Tried to provision and then delete 100 PVCs as a test to see how it worked.

Took about 8 minutes to create all 100 PVCs.

Attempting to delete them resulted in many timeouts and failures from Heketi, leaving the PVCs, PVs and volumes on the underlying gluster nodes in an inconsistent state.

[kubeexec] ERROR 2017/11/28 23:57:53 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume start vol_bf0b00ea4aa6cd521cd8e4a3d7e595ea] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

[kubeexec] ERROR 2017/11/28 23:59:55 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume create vol_30376745b87d535239df2b9f6d02875c replica 3 192.168.111.162:/var/lib/heketi/mounts/vg_8c85c4358d1e65cae0de9d1993c0e128/brick_4cd75155be0d3fffdd8161a3aa5c4ef8/brick 192.168.111.161:/var/lib/heketi/mounts/vg_a7d773c52baf36bad3c11a1bc042ba04/brick_6815ce072d4d81a06dabe62c24536456/brick 192.168.111.160:/var/lib/heketi/mounts/vg_fb6689c3ac59eae0c0f04553964f7d2b/brick_bee25721d66af16ce37fceecac110ea7/brick] on glusterfs-5913l: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []

There are 83 "Unable to delete volume" ERRORs in the heketi log:

[kubeexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_efe32f4754b399b152e62c2615e8ecce] on glusterfs-ck5bv: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[heketi] ERROR 2017/11/29 00:01:21 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume efe32f4754b399b152e62c2615e8ecce: Unable to delete volume vol_efe32f4754b399b152e62c2615e8ecce: Unable to execute command on glusterfs-ck5bv:
[kubeexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [gluster --mode=script volume delete vol_1ef511b38ba18ee92c2e7ac2beb49e58] on glusterfs-19k5w: Err[command terminated with exit code 1]: Stdout [Error : Request timed out
]: Stderr []
[sshexec] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/executors/sshexec/volume.go:141: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:483: Unable to delete volume: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
[heketi] ERROR 2017/11/29 00:01:24 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:280: Failed to delete volume 1ef511b38ba18ee92c2e7ac2beb49e58: Unable to delete volume vol_1ef511b38ba18ee92c2e7ac2beb49e58: Unable to execute command on glusterfs-19k5w:
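
The excerpt above shows each failed delete logged several times (once by kubeexec, once by sshexec, twice by heketi). The following is a minimal sketch of one way to pull the affected volume names out of such a log, assuming the line format matches the excerpt; it matches only the [sshexec] lines so each failure is counted once. The function name is hypothetical, and the report does not say which lines the count of 83 was taken from.

```python
import re

# Matches the [sshexec] "Unable to delete volume" lines as they appear in the
# excerpt above. Assumption: other heketi builds may format these differently.
DELETE_FAILURE = re.compile(
    r"^\[sshexec\] ERROR \S+ \S+ \S+ "
    r"Unable to delete volume (?P<volume>vol_[0-9a-f]+): "
    r"Unable to execute command on (?P<pod>[^\s:]+):"
)


def failed_deletes(log_lines):
    """Return {gluster volume name: pod name} for failed deletes.

    A volume that failed more than once appears only once in the result,
    mapped to the pod named in its last failure.
    """
    failures = {}
    for line in log_lines:
        match = DELETE_FAILURE.match(line)
        if match:
            failures[match.group("volume")] = match.group("pod")
    return failures
```

Feeding the lines of the heketi log to failed_deletes() and taking len() of the result gives the number of distinct volumes whose delete failed.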


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-18.4.el7.x86_64

Heketi: rhgs3/rhgs-volmanager-rhel7:3.3.0-362

How reproducible:

Created 100 PVCs.

Tried to delete the 100 PVCs; the deletes failed.

Can no longer provision anything.

Steps to Reproduce:
1. Create 100 PVCs and wait for the volumes and PVs to bind
2. Delete those 100 PVCs and wait for the PVs and volumes to be cleaned up
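
For anyone trying to repeat the steps above, here is a minimal sketch that builds the 100 PVC manifests as a single JSON List. The storage class name, claim size, access mode and claim name prefix are all assumptions; the report does not state which values were used.

```python
import json


def pvc_manifest(index, storage_class="glusterfs-storage", size="1Gi"):
    """Build one PersistentVolumeClaim manifest.

    Assumptions (not from the report): the storage class name, the
    requested size, the access mode and the "cns-test-" name prefix.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"cns-test-{index:03d}"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pvc_list(count=100, **kwargs):
    """Wrap `count` claims in a v1 List so they can be created in one call."""
    items = [pvc_manifest(i, **kwargs) for i in range(1, count + 1)]
    return {"apiVersion": "v1", "kind": "List", "items": items}


def pvc_list_json(count=100, **kwargs):
    """Serialize the List as JSON text."""
    return json.dumps(pvc_list(count, **kwargs), indent=2)
```

Writing the output of pvc_list_json() to a file and passing that file to "oc create -f" covers step 1; passing the same file to "oc delete -f" covers step 2.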

Actual results:

There are a lot of timeouts and failures that left PVs and Volumes behind while the PVCs were removed.

Can no longer create new PVCs; they just sit in a Pending state.
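
To see which volumes were left behind, the three views (OpenShift PVs, Heketi, gluster) can be compared as sets. The sketch below assumes the volume names have already been collected into plain lists, for example from "oc get pv", "heketi-cli volume list" and "gluster volume list". The function and key names are hypothetical. The "vol_" prefix handling follows the log excerpt in this report, where Heketi volume efe32f4754b399b152e62c2615e8ecce is gluster volume vol_efe32f4754b399b152e62c2615e8ecce.

```python
def _volume_id(name):
    """Normalize a gluster volume name ("vol_<id>") to the bare Heketi id."""
    return name[len("vol_"):] if name.startswith("vol_") else name


def find_mismatches(pv_volumes, heketi_volumes, gluster_volumes):
    """Compare three lists of volume names or ids and report what is orphaned.

    Each argument is an iterable of strings; names with and without the
    "vol_" prefix are treated as the same volume.
    """
    pv = {_volume_id(v) for v in pv_volumes}
    heketi = {_volume_id(v) for v in heketi_volumes}
    gluster = {_volume_id(v) for v in gluster_volumes}
    return {
        "pv_without_heketi": sorted(pv - heketi),
        "heketi_without_pv": sorted(heketi - pv),
        "heketi_without_gluster": sorted(heketi - gluster),
        "gluster_without_heketi": sorted(gluster - heketi),
    }
```

An empty list under every key would mean the three layers agree; anything else names the volumes that need manual cleanup.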

Expected results:

Creates and deletes should work, and if they do fail, there should be a mechanism to retry or to keep the cluster in a consistent state.
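
As an illustration of the kind of retry mechanism meant here, the following is a generic retry-with-exponential-backoff sketch. It is not taken from Heketi and does not describe how the problem was eventually fixed; the exception class, function name and default values are all hypothetical.

```python
import time


class RequestTimedOut(Exception):
    """Hypothetical error standing in for a "Request timed out" failure."""


def retry_with_backoff(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or `attempts` calls have failed.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts. Only RequestTimedOut is retried; any other exception, and
    the final timeout, propagate to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except RequestTimedOut:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A command that timed out may still have completed on the gluster side, so a real implementation would also need to check the current state of the volume before repeating a create or delete.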

Additional info:

Comment 21 Michael Adam 2018-09-19 16:46:59 UTC

This was the main topic of the stability fixes that went into CNS 3.9 and OCS 3.10. This would not happen any more with the latest release.
