Bug 1661517

Summary: heketi-cli volume list returns Error: Id not found
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: RamaKasturi <knarra>
Component: heketi    Assignee: Michael Adam <madam>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: high Docs Contact:
Priority: high    
Version: ocs-3.11    CC: hchiramm, jmulligan, knarra, kramdoss, madam, rhs-bugs, rtalur, sankarshan, sselvan, storage-qa-internal, suprasad
Target Milestone: ---    Keywords: Regression, ZStream
Target Release: OCS 3.11.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: heketi-8.0.0-7.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-07 03:41:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1644160    

Description RamaKasturi 2018-12-21 12:54:28 UTC
Description of problem:

Running the command heketi-cli volume list does not return all the volumes known to heketi; instead it fails with "Error: Id not found".

A further debugging session with Michael and Madhu suggests that there are stale brick entries in the heketi db, which is why this error is seen.

I hit this issue while trying to verify bug https://bugzilla.redhat.com/show_bug.cgi?id=1651041.

heketi-cli topology info returns the same error as well.

Version-Release number of selected component (if applicable):
heketi-client-8.0.0-5.el7rhgs.x86_64

How reproducible:
Hit it once

Steps to Reproduce:
1) Created a hook script at /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the following contents:

#! /bin/bash

sleep 6m

2) Gave the script execute permissions: chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Sent a PVC request so that heketi receives a volume create request.
4) Repeated this two or three times (a minimal end-to-end sketch of these steps follows below).
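A minimal end-to-end sketch of the reproduction above, assuming glusterd runs directly on the gluster nodes (in converged mode the same hook path exists inside the gluster pods) and that PVCs are provisioned through a glusterfs StorageClass; the StorageClass name and PVC manifest are illustrative, not taken from this report:

# On each gluster node: install a pre-start hook that delays volume start
cat > /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh <<'HOOK'
#! /bin/bash

sleep 6m
HOOK
chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh

# From a machine with oc access: send a PVC request so heketi starts a volume create
# (the StorageClass name "glusterfs-storage" is an assumption for illustration)
cat <<'PVC' | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim-1
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
PVC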

Actual results:
When heketi-cli volume list is run, it does not list all volumes; instead it throws the error "Error: Id not found":

[root@dhcp47-30 ~]# heketi-cli volume list
Id:0a1b153d7a147ca7f35a73430a6813cc    Cluster:d094ae7b7c68d5e3b6289ea6bd057bf1    Name:vol_0a1b153d7a147ca7f35a73430a6813cc
Id:1e1eb700153010aec0ffcc73b8913a06    Cluster:d094ae7b7c68d5e3b6289ea6bd057bf1    Name:vol_1e1eb700153010aec0ffcc73b8913a06
Error: Id not found

The following error is seen in the heketi logs when the above command is executed:

[heketi] INFO 2018/12/21 12:34:52 Periodic health check status: node e77ca377cf1f407c4c2d446787ed0e53 up=true
[heketi] INFO 2018/12/21 12:34:52 Cleaned 0 nodes from health cache
[negroni] Started GET /volumes
[negroni] Completed 200 OK in 1.722351ms
[negroni] Started GET /volumes/0a1b153d7a147ca7f35a73430a6813cc
[negroni] Completed 200 OK in 1.548504ms
[negroni] Started GET /volumes/1e1eb700153010aec0ffcc73b8913a06
[negroni] Completed 200 OK in 1.555116ms
[negroni] Started GET /volumes/426c53a028247b2d8fb6d3d9fec50993
[negroni] Completed 500 Internal Server Error in 661.885µs



Expected results:
heketi-cli volume list should complete successfully and list all volumes without throwing any error.




Additional info:

The above script does not execute simultaneously on the three nodes; it runs one after the other. During this process heketi waits for 10 minutes, and since it does not hear back from the gluster CLI it throws a request timeout error and tries to roll back its operations. Around the same time glusterd finishes creating the volume once the sleeps on all the nodes are done, so the volume deletion fails with "another transaction in progress". Some time later heketi deletes that volume and retries the create by allocating new bricks, but it somehow does not clear the old bricks from the database. From the initial analysis this looked like a race between heketi's delete and the retried volume create operation.
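For reference, one way to look for the suspected stale brick entries is to dump the heketi db (as attached in comment 2) and compare the brick ids referenced by the failing volume against the brick entries that actually exist. A rough sketch, assuming the heketi-cli db dump subcommand is available in this build, jq is installed, and the dump uses the usual volumeentries/brickentries keys (field names may differ between heketi versions):

heketi-cli db dump > /tmp/heketi-db.json

# Brick ids still referenced by the volume whose lookup returns 500 above
jq '.volumeentries["426c53a028247b2d8fb6d3d9fec50993"].Bricks' /tmp/heketi-db.json

# Brick ids that actually have entries in the db; a referenced id missing here is stale
jq '.brickentries | keys' /tmp/heketi-db.json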

Comment 2 RamaKasturi 2018-12-21 13:25:42 UTC
Heketi db dump, heketi logs and sosreports from all the nodes are placed in the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1661517/

Comment 4 RamaKasturi 2018-12-23 07:26:59 UTC
Clearing needinfo on self as Karthick has provided the qa_ack.

Comment 9 RamaKasturi 2019-01-04 04:54:15 UTC
I no longer see heketi-cli volume list returning the error "Id not found" with the heketi-8.0.0-7.el7rhgs.x86_64 version of heketi.

Below are the tests I have executed:
======================================
1) Created a hook script at /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh with the following contents:

#! /bin/bash

sleep 6m

2) Gave the script execute permissions: chmod 777 /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh
3) Sent a PVC request so that heketi receives a volume create request.

With this patch, when the gluster CLI is allowed to run into a timeout, heketi tries to delete and stop the volume, and that fails with the error "Another transaction in progress, please try after some time".

It then tries to create the volume, which fails with:

[heketi] ERROR 2019/01/03 17:09:28 heketi/apps/glusterfs/operations_volume.go:118:glusterfs.(*VolumeCreateOperation).Exec: Error executing create volume: command terminated with exit code 1


When heketi retries the volume create operation, it first tries to delete the volume again; once the volume is deleted, heketi unmounts and removes the bricks from the system and from /var/lib/heketi/fstab.

Once the cleanup for the retried volume create operation is complete, heketi starts allocating new bricks to create the volume again.
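A quick way to confirm this cleanup on a gluster node is to check that the rolled-back bricks left nothing behind in fstab, LVM, or the brick mount directories; a sketch (the exact brick ids depend on the run):

# On each gluster node, after the retried operation has cleaned up:
grep brick_ /var/lib/heketi/fstab             # entries for the removed bricks should be gone
lvs -o lv_name,vg_name | grep brick_          # only bricks of surviving volumes should remain
ls -d /var/lib/heketi/mounts/vg_*/brick_*     # no mount directories for the removed bricks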

Test 2:
================
Removed the sleep hook script from all nodes; the PVC gets created and bound (a removal sketch follows).
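A sketch of that removal, assuming direct ssh access as root to the gluster nodes; the hostnames are taken from the topology output later in this comment:

# Remove the delaying hook script from all three gluster nodes
for node in dhcp46-244 dhcp46-67 dhcp47-169; do
    ssh root@${node}.lab.eng.blr.redhat.com \
        'rm -f /var/lib/glusterd/hooks/1/start/pre/S99kasturi.sh'
done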

Test 3:
===================
Once the gluster CLI ran into a timeout, I deleted the PVC and the sleep from all gluster nodes, and I see that the volume gets created.

Test 4:
====================
Once the gluster CLI ran into a timeout, I deleted the PVC request while keeping the sleep on all gluster nodes. Heketi tries to create the volume but fails to do so, cleans up, and keeps retrying the volume create until the sleep on all the gluster nodes is over; at the end it deletes the volume since there is no pending PVC request.

After all the above tests, I ran the commands below and do not see any error when they are executed:
==================================================================================================================
[root@dhcp47-25 ~]# heketi-cli volume list
Id:1e6420300203ef9300d5fbca5e6f47f7    Cluster:d5b26c1436476c3af318c1ebada7ed27    Name:heketidbstorage
Id:2744591be6e555b510562a230af6625a    Cluster:d5b26c1436476c3af318c1ebada7ed27    Name:vol_2744591be6e555b510562a230af6625a [block]
Id:b58233cbb29ba78b04f6cebb6f267b1a    Cluster:d5b26c1436476c3af318c1ebada7ed27    Name:vol_b58233cbb29ba78b04f6cebb6f267b1a

[root@dhcp47-25 ~]# heketi-cli topology info

Cluster Id: d5b26c1436476c3af318c1ebada7ed27

    File:  true
    Block: true

    Volumes:

	Name: heketidbstorage
	Size: 2
	Id: 1e6420300203ef9300d5fbca5e6f47f7
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Mount: 10.70.46.244:heketidbstorage
	Mount Options: backup-volfile-servers=10.70.46.67,10.70.47.169
	Durability Type: replicate
	Replica: 3
	Snapshot: Disabled

		Bricks:
			Id: 71461930cb0c36ff678701580fb512c1
			Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_71461930cb0c36ff678701580fb512c1/brick
			Size (GiB): 2
			Node: db8c5ff29e4b08a51aa8491de7b05ab4
			Device: f4631b39065b54fdd7f3cf6d425a1e3d

			Id: 7f50b2bea89022b981c541fc0974f833
			Path: /var/lib/heketi/mounts/vg_43d106b3235378719b1851c5afede8e4/brick_7f50b2bea89022b981c541fc0974f833/brick
			Size (GiB): 2
			Node: 64aed92ad76e1647ad7f0322235cea24
			Device: 43d106b3235378719b1851c5afede8e4

			Id: b8f51cce95ab8bacfe6d21f349d562b3
			Path: /var/lib/heketi/mounts/vg_f0ae2af8a1612a7b9e46141a2c653c0a/brick_b8f51cce95ab8bacfe6d21f349d562b3/brick
			Size (GiB): 2
			Node: e378eebea9716132fdbcb46bd4a4ab9a
			Device: f0ae2af8a1612a7b9e46141a2c653c0a


	Name: vol_2744591be6e555b510562a230af6625a
	Size: 100
	Id: 2744591be6e555b510562a230af6625a
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Mount: 10.70.46.244:vol_2744591be6e555b510562a230af6625a
	Mount Options: backup-volfile-servers=10.70.46.67,10.70.47.169
	Durability Type: replicate
	Replica: 3
	Snapshot: Disabled

		Bricks:
			Id: 0e91d2e0ce1c251687bc20153f14ae53
			Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_0e91d2e0ce1c251687bc20153f14ae53/brick
			Size (GiB): 100
			Node: db8c5ff29e4b08a51aa8491de7b05ab4
			Device: f4631b39065b54fdd7f3cf6d425a1e3d

			Id: 28ce2d8638e79731708968ff4e76f685
			Path: /var/lib/heketi/mounts/vg_f0ae2af8a1612a7b9e46141a2c653c0a/brick_28ce2d8638e79731708968ff4e76f685/brick
			Size (GiB): 100
			Node: e378eebea9716132fdbcb46bd4a4ab9a
			Device: f0ae2af8a1612a7b9e46141a2c653c0a

			Id: 9440fa68356aa559f7cea10c209b9e73
			Path: /var/lib/heketi/mounts/vg_5fe1c25d9a33904d9c2a31519ded3bed/brick_9440fa68356aa559f7cea10c209b9e73/brick
			Size (GiB): 100
			Node: 64aed92ad76e1647ad7f0322235cea24
			Device: 5fe1c25d9a33904d9c2a31519ded3bed


	Name: vol_b58233cbb29ba78b04f6cebb6f267b1a
	Size: 6
	Id: b58233cbb29ba78b04f6cebb6f267b1a
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Mount: 10.70.46.244:vol_b58233cbb29ba78b04f6cebb6f267b1a
	Mount Options: backup-volfile-servers=10.70.46.67,10.70.47.169
	Durability Type: replicate
	Replica: 3
	Snapshot: Enabled
	Snapshot Factor: 1.00

		Bricks:
			Id: 2c6bd84aca994f4d56a18892a8939792
			Path: /var/lib/heketi/mounts/vg_5fe1c25d9a33904d9c2a31519ded3bed/brick_2c6bd84aca994f4d56a18892a8939792/brick
			Size (GiB): 6
			Node: 64aed92ad76e1647ad7f0322235cea24
			Device: 5fe1c25d9a33904d9c2a31519ded3bed

			Id: d300c5bffaba00722668b306b9fc23ba
			Path: /var/lib/heketi/mounts/vg_d78bf209ab9dc627dd2b5e1a4b460563/brick_d300c5bffaba00722668b306b9fc23ba/brick
			Size (GiB): 6
			Node: e378eebea9716132fdbcb46bd4a4ab9a
			Device: d78bf209ab9dc627dd2b5e1a4b460563

			Id: f7a775394783a37f6b843d6ffaa2e5d6
			Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_f7a775394783a37f6b843d6ffaa2e5d6/brick
			Size (GiB): 6
			Node: db8c5ff29e4b08a51aa8491de7b05ab4
			Device: f4631b39065b54fdd7f3cf6d425a1e3d


    Nodes:

	Node Id: 64aed92ad76e1647ad7f0322235cea24
	State: online
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Zone: 1
	Management Hostnames: dhcp46-244.lab.eng.blr.redhat.com
	Storage Hostnames: 10.70.46.244
	Devices:
		Id:43d106b3235378719b1851c5afede8e4   Name:/dev/sde            State:online    Size (GiB):2047    Used (GiB):2       Free (GiB):2045    
			Bricks:
				Id:7f50b2bea89022b981c541fc0974f833   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_43d106b3235378719b1851c5afede8e4/brick_7f50b2bea89022b981c541fc0974f833/brick
		Id:5fe1c25d9a33904d9c2a31519ded3bed   Name:/dev/sdd            State:online    Size (GiB):2047    Used (GiB):106     Free (GiB):1941    
			Bricks:
				Id:2c6bd84aca994f4d56a18892a8939792   Size (GiB):6       Path: /var/lib/heketi/mounts/vg_5fe1c25d9a33904d9c2a31519ded3bed/brick_2c6bd84aca994f4d56a18892a8939792/brick
				Id:9440fa68356aa559f7cea10c209b9e73   Size (GiB):100     Path: /var/lib/heketi/mounts/vg_5fe1c25d9a33904d9c2a31519ded3bed/brick_9440fa68356aa559f7cea10c209b9e73/brick

	Node Id: db8c5ff29e4b08a51aa8491de7b05ab4
	State: online
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Zone: 1
	Management Hostnames: dhcp46-67.lab.eng.blr.redhat.com
	Storage Hostnames: 10.70.46.67
	Devices:
		Id:83ce45dd4d4d0ea72fc8929870b59a52   Name:/dev/sde            State:online    Size (GiB):2047    Used (GiB):0       Free (GiB):2047    
			Bricks:
		Id:f4631b39065b54fdd7f3cf6d425a1e3d   Name:/dev/sdd            State:online    Size (GiB):2047    Used (GiB):108     Free (GiB):1939    
			Bricks:
				Id:0e91d2e0ce1c251687bc20153f14ae53   Size (GiB):100     Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_0e91d2e0ce1c251687bc20153f14ae53/brick
				Id:71461930cb0c36ff678701580fb512c1   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_71461930cb0c36ff678701580fb512c1/brick
				Id:f7a775394783a37f6b843d6ffaa2e5d6   Size (GiB):6       Path: /var/lib/heketi/mounts/vg_f4631b39065b54fdd7f3cf6d425a1e3d/brick_f7a775394783a37f6b843d6ffaa2e5d6/brick

	Node Id: e378eebea9716132fdbcb46bd4a4ab9a
	State: online
	Cluster Id: d5b26c1436476c3af318c1ebada7ed27
	Zone: 1
	Management Hostnames: dhcp47-169.lab.eng.blr.redhat.com
	Storage Hostnames: 10.70.47.169
	Devices:
		Id:d78bf209ab9dc627dd2b5e1a4b460563   Name:/dev/sde            State:online    Size (GiB):2047    Used (GiB):6       Free (GiB):2041    
			Bricks:
				Id:d300c5bffaba00722668b306b9fc23ba   Size (GiB):6       Path: /var/lib/heketi/mounts/vg_d78bf209ab9dc627dd2b5e1a4b460563/brick_d300c5bffaba00722668b306b9fc23ba/brick
		Id:f0ae2af8a1612a7b9e46141a2c653c0a   Name:/dev/sdd            State:online    Size (GiB):2047    Used (GiB):102     Free (GiB):1945    
			Bricks:
				Id:28ce2d8638e79731708968ff4e76f685   Size (GiB):100     Path: /var/lib/heketi/mounts/vg_f0ae2af8a1612a7b9e46141a2c653c0a/brick_28ce2d8638e79731708968ff4e76f685/brick
				Id:b8f51cce95ab8bacfe6d21f349d562b3   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_f0ae2af8a1612a7b9e46141a2c653c0a/brick_b8f51cce95ab8bacfe6d21f349d562b3/brick


Moving this to verified.

Comment 10 RamaKasturi 2019-01-04 04:55:08 UTC
@vignesh, Can you please update the bug with your results too?

Comment 11 Sri Vignesh Selvan 2019-01-04 06:23:14 UTC
While executing the attached test case manually, I hit the issue mentioned in comment #5. When retrying with the latest heketi build (heketi-8.0.0-7.el7rhgs.x86_64), I no longer see heketi-cli topology info returning the error "Id not found".

Comment 13 errata-xmlrpc 2019-02-07 03:41:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0286