Bug 1554467 - Volume create requests are satisfied by heketi but communication is lost to the requestor
Summary: Volume create requests are satisfied by heketi but communication is lost to the requestor
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: cns-3.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra Talur
QA Contact: Rachael
URL:
Whiteboard:
Duplicates: 1559834 1572466 1584540 (view as bug list)
Depends On:
Blocks: OCS-3.11.1-devel-triage-done
 
Reported: 2018-03-12 17:58 UTC by Rachael
Modified: 2019-08-13 19:05 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-13 19:05:44 UTC
Embargoed:


Attachments


Links
System: GitHub heketi/heketi pull 1560 (open)
Summary: go-client: disable KeepAlives
Last Updated: 2020-01-29 12:29:08 UTC
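The linked pull request title indicates the go-client fix is to stop reusing HTTP connections, so a retried request cannot be sent down a stale connection to a heketi pod that no longer exists. As a rough, hypothetical illustration only (not the actual diff from pull 1560), disabling keep-alives on a Go HTTP client looks like the following; the server URL is a placeholder:

--snip--

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    // With keep-alives disabled, every request opens a fresh TCP connection
    // instead of reusing a pooled one, so a connection to a restarted heketi
    // pod is never reused.
    client := &http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            DisableKeepAlives: true,
        },
    }

    // Placeholder endpoint; not necessarily a real heketi route.
    resp, err := client.Get("http://heketi.example.com:8080/hello")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}

--/snip--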

Description Rachael 2018-03-12 17:58:24 UTC
Description of problem:
A script was run to create 30 PVCs. While the PVC creation was in progress, the heketi pod was restarted. When the new heketi pod was up and running, the pending PVCs were successfully provisioned. The number of PVCs and PVs was found to be 32 (two PVCs were present before the script was run). However, the number of volumes in the heketi-cli volume list and gluster volume list output was 36 (including heketidbstorage).

[root@dhcp46-214 ~]# oc get pv|grep pvc|wc -l
32

[root@dhcp46-214 ~]# oc get pvc|grep pvc|wc -l
32

[root@dhcp46-214 ~]# heketi-cli volume list|wc -l
36

[root@dhcp46-214 ~]# oc rsh glusterfs-storage-hwz5z
sh-4.2# gluster v list|wc -l
36

It was seen that the volumes for three PVCs (claim16, claim20, claim24) were created by both the old and the new heketi pod.

Version-Release number of selected component (if applicable):
rhgs-volmanager-rhel7:3.3.1-4
rhgs-server-rhel7:3.3.1-7


Steps to Reproduce:
1. Create 30 PVCs
2. While PVCs are being provisioned restart heketi pod
3. Get the number of PVs and number of volumes from heketi-cli output

Actual results:
There is a mismatch between the number of volumes reported by heketi/gluster and the number of PVs.

Expected results:
The number of heketi/gluster volumes should equal the number of PVs (excluding heketidbstorage).

Comment 6 Humble Chirammal 2018-03-13 05:44:13 UTC
From my initial analysis, this does not look like a regression; rather, it is the same scenario we have seen since the start.

In new heketi pod, we can see:

--snip--

[heketi] WARNING 2018/03/12 15:04:02 Ignoring stale pending operations.Server will be running with incomplete/inconsistent state in DB.
[heketi] INFO 2018/03/12 15:04:02 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/03/12 15:04:02 Block: New Block Hosting Volume size 100 GB
[heketi] INFO 2018/03/12 15:04:02 GlusterFS Application Loaded
[heketi] ERROR 2018/03/12 15:04:02 /src/github.com/heketi/heketi/apps/glusterfs/app.go:156: Heketi was terminated while performing one or more operations. Server may refuse to start as long as pending operations are present in the db.

--/snip--

If we roll back these pending operations after a restart, we could get rid of this issue; that may be the behavior in future heketi releases. Another option is to prevent heketi from starting until we manually clear the pending operations.

Another solution would be to make volume create a blocking call, but I don't think that's a good way to solve the issue; we can think about it.
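For illustration only, here is a minimal sketch (hypothetical Go, not heketi's real startup code) of the "refuse to start while pending operations exist, unless explicitly overridden" behavior that the error message above describes:

--snip--

package main

import (
    "errors"
    "fmt"
    "os"
)

// countPendingOps stands in for reading pending-operation entries from the
// heketi db; it is a placeholder, not heketi's actual API.
func countPendingOps() int {
    return 3
}

// checkStartup refuses to start while stale pending operations exist, unless
// the operator explicitly opts in to ignoring them.
func checkStartup(ignoreStale bool) error {
    pending := countPendingOps()
    if pending == 0 {
        return nil
    }
    if ignoreStale {
        fmt.Printf("WARNING: ignoring %d stale pending operations; db state may be inconsistent\n", pending)
        return nil
    }
    return errors.New("pending operations present in db, refusing to start")
}

func main() {
    // Illustrative environment variable name only.
    ignore := os.Getenv("IGNORE_STALE_OPERATIONS") == "true"
    if err := checkStartup(ignore); err != nil {
        fmt.Println("startup aborted:", err)
        os.Exit(1)
    }
    fmt.Println("server starting")
}

--/snip--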

Comment 7 Raghavendra Talur 2018-03-13 05:52:21 UTC
Rachael, please also provide the volume count from the gluster side.



Ideally, we want these things:
1. list of all PV
2. list of volumes from heketi
3. list of volumes from gluster
4. heketi db

Comment 8 Humble Chirammal 2018-03-13 05:56:29 UTC
(In reply to Humble Chirammal from comment #6)
> [...]
> If we roll back these pending operations after a restart, we could get rid
> of this issue; that may be the behavior in future heketi releases. Another
> option is to prevent heketi from starting until we manually clear the
> pending operations.

I could be wrong here; I need to check the logs in detail.

Also, if I understand correctly, we have one more volume, which is the block hosting volume.


So in total: 30 PVCs + 2 existing volumes + 1 heketidbstorage + 1 block hosting volume = 34 volumes.

Or is the block PVC part of these 30 PVC requests, Rachael?

Comment 10 Rachael 2018-03-13 06:46:27 UTC
(In reply to Humble Chirammal from comment #8)
> [...]
> So in total: 30 PVCs + 2 existing volumes + 1 heketidbstorage + 1 block
> hosting volume = 34 volumes.
>
> Or is the block PVC part of these 30 PVC requests, Rachael?

30 PVCs + 1 already existing PVC + 1 block hosting volume + 1 heketidbstorage = 33.

Comment 11 Raghavendra Talur 2018-03-13 13:36:51 UTC
Rachael, we need all the data mentioned below:
1. list of all PV
2. list of volumes from heketi
3. list of volumes from gluster
4. heketi db

Comment 12 Raghavendra Talur 2018-03-13 13:50:23 UTC
Correction: the only thing missing is the heketi db.

Comment 16 Rachael 2018-03-14 13:49:11 UTC
AFAIK, this test already existed, but we did not run it because we were aware of the issue. With the bug fixes that have gone in for this release, we ran it.
I was able to reproduce this issue 2 out of 2 times. I will try to reproduce it again and attach the required logs.

Comment 17 Humble Chirammal 2018-03-15 10:55:25 UTC
To add some more details:

It is pretty clear that the requests for claim16, claim20 and claim24 were processed twice. This is a known scenario: *any client* that keeps retrying volume creation receives an error (mostly 'no route to host' or something of that sort) during the window when heketi is unavailable (between the old heketi pod dying and the new one being spawned). Once the client receives the error it retries, and the retry is picked up by the new heketi.

However, the volume creation that was already in progress may be lucky enough to complete on both the gluster and the heketi side (DB updated), as in your run; at other times it will be interrupted in the middle of the operation.

A permanent solution will be a bit complex and will need a good amount of code changes, or a carefully thought-out design between the components.

A quick mitigation can be done on the heketi side if the volume create request has the 'Name' field set, for example:

Claim16:

dept-qe_glusterfs_claim16_9f9a74ff-2606-11e8-b460-005056a55501

dept-qe_glusterfs_claim16_7137aaf3-2604-11e8-b460-005056a55501

If the first three fields are the same, we can discard the second request. However, this solution is not 'scalable' in the sense that:

*) It only helps the Kubernetes client.
*) It is not applicable if the "Name" field is NOT set.

At the same time, if we are only worried about the Kubernetes client, this is still a viable mitigation.
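For illustration, a minimal Go sketch of that name-based check (hypothetical code, not taken from heketi): it treats two create requests as duplicates when the first three underscore-separated fields of the volume name (here dept-qe, glusterfs and claim16) match, ignoring the trailing UUID.

--snip--

package main

import (
    "fmt"
    "strings"
)

// samePVC reports whether two volume names refer to the same PVC by comparing
// the first three underscore-separated fields and ignoring the trailing UUID.
func samePVC(a, b string) bool {
    fa := strings.SplitN(a, "_", 4)
    fb := strings.SplitN(b, "_", 4)
    if len(fa) < 4 || len(fb) < 4 {
        // Name does not have the four underscore-separated fields seen in
        // the examples above, so the check cannot be applied.
        return false
    }
    return fa[0] == fb[0] && fa[1] == fb[1] && fa[2] == fb[2]
}

func main() {
    existing := "dept-qe_glusterfs_claim16_9f9a74ff-2606-11e8-b460-005056a55501"
    incoming := "dept-qe_glusterfs_claim16_7137aaf3-2604-11e8-b460-005056a55501"
    if samePVC(existing, incoming) {
        fmt.Println("second create request targets the same PVC; discard it")
    }
}

--/snip--

The check is deliberately conservative: any name that does not match the four-field pattern is treated as a distinct request.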

Comment 19 Raghavendra Talur 2018-09-21 07:50:21 UTC
*** Bug 1572466 has been marked as a duplicate of this bug. ***

Comment 21 Raghavendra Talur 2018-10-24 17:59:25 UTC
*** Bug 1584540 has been marked as a duplicate of this bug. ***

Comment 22 Raghavendra Talur 2019-01-23 20:28:50 UTC
*** Bug 1559834 has been marked as a duplicate of this bug. ***

