Description of problem:
In a 4-node setup, while creating and deleting 1 GB PVs in a loop, cleanup of heketi volumes failed. The heketi volume list and the gluster volume list both show 6 volumes, but heketi topology info shows the entire 1 TB of space as occupied.

Version-Release number of selected component (if applicable):
# oc rsh heketi-storage-1-55bw4
sh-4.2# rpm -qa | grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

# rpm -qa | grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

sh-4.2# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-54.8.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.8.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.8.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.8.el7rhgs.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.8.el7rhgs.x86_64
glusterfs-3.8.4-54.8.el7rhgs.x86_64
glusterfs-api-3.8.4-54.8.el7rhgs.x86_64
glusterfs-server-3.8.4-54.8.el7rhgs.x86_64

How reproducible:
1:1

Steps to Reproduce:
1. On a CNS 3.9 setup, initiated PVC creation and deletion simultaneously with the loop below (a sketch of pvc_create.sh follows the steps). gluster v heal was also running in all 4 gluster pods.

while true
do
  for i in {101..150}
  do
    ./pvc_create.sh c$i 1
    sleep 10
  done
  sleep 30
  for i in {101..150}
  do
    oc delete pvc c$i
    sleep 5
  done
done

2. After running this for some time, observed that the number of volumes in heketi is 6 and in gluster it is 7, but heketi-cli topology info shows the entire 999 GB used, with 1 GB bricks on all nodes. Some nodes have free space = 2 GB.
3. Faced the shd crash issue reported separately.
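For reference, pvc_create.sh is not attached to this bug. The following is a minimal sketch of what such a helper typically does, assuming it takes a PVC name and a size in GiB; the script body, access mode, and the storage class name "glusterfs-storage" are assumptions, not copied from this setup.

#!/bin/bash
# pvc_create.sh <pvc-name> <size-in-GiB>
# NOTE: reconstructed sketch; the actual script used for this bug was not attached.
# The storage class name and access mode below are assumptions.
NAME=$1
SIZE=$2

cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NAME}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${SIZE}Gi
  storageClassName: glusterfs-storage
EOF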
----- snip of heketi log --------------------------
Result: [kubeexec] ERROR 2018/07/13 21:26:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [gluster --mode=script volume create cns-vol_glusterfs_c115_60892937-86e3-11e8-aca0-005056a5a62b replica 3 10.70.46.29:/var/lib/heketi/mounts/vg_258cfffb4ca720953c6224286ce775a3/brick_5d0f13c27863545da3dc705aa5c1225b/brick 10.70.46.124:/var/lib/heketi/mounts/vg_7ad51cc79e80230702aebe4a1f67da7e/brick_1dff5c85227341dbaa83a5afcd8e8b4d/brick 10.70.46.210:/var/lib/heketi/mounts/vg_8b5aa693c98fbca5d4f666886869cdad/brick_afa7fd041a4a42ba1ff7534220576680/brick] on glusterfs-storage-pq7l9: Err[command terminated with exit code 1]: Stdout []: Stderr [volume create: cns-vol_glusterfs_c115_60892937-86e3-11e8-aca0-005056a5a62b: failed: Brick: 10.70.46.29:/var/lib/heketi/mounts/vg_258cfffb4ca720953c6224286ce775a3/brick_5d0f13c27863545da3dc705aa5c1225b/brick not available. Brick may be containing or be contained by an existing brick.
]
[kubeexec] DEBUG 2018/07/13 21:26:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:244: Host: dhcp46-124.lab.eng.blr.redhat.com Pod: glusterfs-storage-kfhp6 Command: mount -o rw,inode64,noatime,nouuid /dev/mapper/vg_7ad51cc79e80230702aebe4a1f67da7e-brick_4e3173d5f96d48661c2b2bbaee089717 /var/lib/heketi/mounts/vg_7ad51cc79e80230702aebe4a1f67da7e/brick_4e3173d5f96d48661c2b2bbaee089717
----- snip of heketi log --------------------------

---- snip of heketi topology ----------------
Node Id: f57fd045067b4f11685b36c7c812fa2f
State: online
Cluster Id: 0b33bcaf4015ceace8d12202aac4883a
Zone: 1
Management Hostnames: dhcp47-70.lab.eng.blr.redhat.com
Storage Hostnames: 10.70.47.70
Devices:
    Id:a45588231a0f181cf9c5066c1b2d906e  Name:/dev/sdd  State:online  Size (GiB):999  Used (GiB):999  Free (GiB):0
        Bricks:
            Id:002f0e6048291f5625ca4ec9966b0a96  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_002f0e6048291f5625ca4ec9966b0a96/brick
            Id:011681334874a3d431e7cf005cb0226d  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_011681334874a3d431e7cf005cb0226d/brick
            Id:011ecfa848996351e072df1c4201bbf7  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_011ecfa848996351e072df1c4201bbf7/brick
            Id:013635c4eb0542dda91fa2f6af7115e7  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_013635c4eb0542dda91fa2f6af7115e7/brick
            Id:017bfb8e83fc10c49436298c73221d42  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_017bfb8e83fc10c49436298c73221d42/brick
            Id:01a1b253b024592a2e719e5706c2efe4  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_01a1b253b024592a2e719e5706c2efe4/brick
            Id:01e46ab2d35cc0fec3b5ee36d4b1e6dd  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_01e46ab2d35cc0fec3b5ee36d4b1e6dd/brick
            Id:02918cf4ce08a503964aa008083ed9cc  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_02918cf4ce08a503964aa008083ed9cc/brick
---- snip of heketi topology ----------------

Actual results:
The volume list and topology output do not match: only 6 volumes exist, yet the device reports 999 GiB used out of 999 GiB, with leftover 1 GiB bricks on all nodes.

Expected results:
The volume list and topology output should match, and space from deleted volumes should be freed on the devices.

Additional info:
Will attach logs.
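For illustration only (not output captured from this setup), a rough way to compare heketi's view against gluster's and to look for leftover brick LVs, assuming heketi-cli is pointed at the heketi service (HEKETI_CLI_SERVER, user, and key exported) and reusing the pod and VG names from the log snippet above:

# Compare volume counts as heketi and gluster see them
heketi-cli volume list | wc -l
oc rsh glusterfs-storage-pq7l9 gluster volume list | wc -l

# Device usage and per-brick detail, to compare against the volumes that actually exist
heketi-cli topology info

# Inside a gluster pod: check for brick LVs and mounts left behind after PVC deletion
# (VG name taken from the failed volume create in the log above)
oc rsh glusterfs-storage-pq7l9 lvs vg_258cfffb4ca720953c6224286ce775a3
oc rsh glusterfs-storage-pq7l9 df -h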
Version-Release number of selected component (if applicable):
# oc rsh heketi-storage-1-55bw4
sh-4.2# rpm -qa | grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

Why is the CNS 3.9 build being used?
(In reply to Raghavendra Talur, comment #4) This setup was created as part of Experian hotfix testing, so the CNS 3.9 builds were used before upgrading the setup to the hotfix build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2686