Bug 1379549

Summary: Bad brick ports are not reused for new bricks after remove/replace of a bad brick.
Product: Red Hat Gluster Storage
Reporter: Byreddy <bsrirama>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED NOTABUG
QA Contact: Byreddy <bsrirama>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, bsrirama, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-27 10:34:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Byreddy 2016-09-27 06:24:12 UTC
Description of problem:
=======================
Once a volume brick goes bad, the port allocated to it is not reused for new bricks added after the bad brick is removed or replaced.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-1


How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a 3-brick Distribute volume on a one-node cluster.
2. Note down the ports allocated to the bricks.
3. Make one of the bricks bad (kill it or use any other tool).
4. Remove the bad brick using the force option of the remove-brick CLI.
5. Add a few new bricks using the add-brick CLI.
6. Check whether the bad brick's port is allocated to any of the newly added bricks (see the command sketch below).
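
A minimal command-line sketch of the steps above, assuming a hypothetical single-node volume named "distvol" with example brick paths (not taken from the original report):

# Steps 1-2: create and start a 3-brick distribute volume, note the brick ports
gluster volume create distvol node1:/bricks/b1 node1:/bricks/b2 node1:/bricks/b3 force
gluster volume start distvol
gluster volume status distvol          # record the Port column for each brick

# Steps 3-4: make one brick bad (see the comments below for how), then remove it with force
gluster volume remove-brick distvol node1:/bricks/b2 force

# Steps 5-6: add new bricks and check whether the freed port is reused
gluster volume add-brick distvol node1:/bricks/b4 node1:/bricks/b5
gluster volume status distvol          # compare the new bricks' ports with the bad brick's old port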


Actual results:
===============
Bad brick ports are not reused for new bricks after remove/replace of a bad brick.

Expected results:
=================
Bad brick ports should be reused for new bricks added after removal of the bad bricks, as per the new logic mentioned in bug https://bugzilla.redhat.com/show_bug.cgi?id=1263090


Additional info:
================
If the bad brick becomes good again, it gets back the same port that was allocated earlier, after a volume restart (stop & start) or a start force.
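
A brief sketch of the recovery behaviour noted above, reusing the hypothetical "distvol" volume from the reproduction sketch:

# Once the bad brick is healthy again, a forced start re-spawns the brick process
gluster volume start distvol force
gluster volume status distvol          # the recovered brick is listed on its earlier port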

Comment 3 Atin Mukherjee 2016-09-27 07:34:52 UTC
It seems that the brick process was not killed gracefully, i.e. by kill -15, hence the pmap signout may not have been initiated and glusterd did not clean up the entry from its portmap table. If that's the case then this is expected. Please confirm.
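
For reference, a sketch of the two shutdown modes being discussed here; the volume name and the PID placeholder are hypothetical:

# Find the brick PID from the volume status output
gluster volume status distvol          # the PID column lists each brick process

# Graceful shutdown: SIGTERM lets the brick send the pmap signout to glusterd
kill -15 <brick-pid>

# Abrupt termination: SIGKILL (or a crash) skips the signout, leaving a stale portmap entry
kill -9 <brick-pid>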

Comment 4 Byreddy 2016-09-27 07:45:29 UTC
(In reply to Atin Mukherjee from comment #3)
> It seems that the brick process was not killed gracefully, i.e. by kill
> -15, hence the pmap signout may not have been initiated and glusterd did
> not clean up the entry from its portmap table. If that's the case then this
> is expected. Please confirm.

Thanks for the info, I will confirm your input.

Comment 5 Byreddy 2016-09-27 09:32:48 UTC
(In reply to Byreddy from comment #4)
> (In reply to Atin Mukherjee from comment #3)
> > It seems that the brick process was not killed gracefully, i.e. by kill
> > -15, hence the pmap signout may not have been initiated and glusterd did
> > not clean up the entry from its portmap table. If that's the case then
> > this is expected. Please confirm.
> 
> Thanks for the info, I will confirm your input.

When I use kill -15 to create a bad brick, I do not see the issue. But when I crash the brick's underlying filesystem (XFS) to create a bad brick (a production scenario), the issue still persists.
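
One possible way to simulate the filesystem crash described above, assuming the brick sits on its own XFS mount; xfs_io's expert-mode shutdown command is used purely as an illustration:

# Force-shut down the XFS filesystem backing the brick; the brick process then
# fails on I/O without ever sending a pmap signout to glusterd
xfs_io -x -c "shutdown -f" /bricks/b2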

Comment 6 Atin Mukherjee 2016-09-27 10:34:17 UTC
This is expected, as in this case the pmap signout is not initiated.

Comment 7 Byreddy 2016-09-27 12:42:02 UTC
(In reply to Atin Mukherjee from comment #6)
> This is expected as in this case pmap signout is not initiated

Before accepting this as NOT A BUG, I want to clarify a few things.

If the port is only reclaimed when the pmap signout is initiated, then how will a bad brick's port ever be reused for a new brick when the brick's underlying filesystem crashes? This is a production scenario, so there is a real chance of it occurring.

Is it correct to keep the bad brick's port entry in the pmap even after the brick is removed? My thinking is that the port entry should be unmapped when we remove a brick.

Comment 8 Atin Mukherjee 2016-09-27 15:13:22 UTC
(In reply to Byreddy from comment #7)
> (In reply to Atin Mukherjee from comment #6)
> > This is expected, as in this case the pmap signout is not initiated.
> 
> Before accepting this as NOT A BUG, I want to clarify a few things.
> 
> If the port is only reclaimed when the pmap signout is initiated, then how
> will a bad brick's port ever be reused for a new brick when the brick's
> underlying filesystem crashes? This is a production scenario, so there is a
> real chance of it occurring.
> 
> Is it correct to keep the bad brick's port entry in the pmap even after the
> brick is removed? My thinking is that the port entry should be unmapped
> when we remove a brick.

The only way GlusterD can clean up the stale port is through pmap signout event processing, and this RPC event can *only* be initiated if a brick process is gracefully shut down. If a brick process goes down abruptly (due to any of multiple factors), the event will not be notified to GlusterD. Even though this can happen in a production setup, the occurrences are rare and, most importantly, there is no functionality impact; it is just that glusterd will not reuse the same port. Hope this clarifies your question.
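
For anyone verifying this explanation, a rough sketch of how the difference can be observed; the volume name, brick paths and PID placeholder are hypothetical, and this only restates what comments 3 and 5 describe:

# Case A: graceful shutdown (pmap signout is sent, so the freed port can be reused)
kill -15 <brick-pid>
gluster volume remove-brick distvol node1:/bricks/b2 force
gluster volume add-brick distvol node1:/bricks/b4
gluster volume status distvol          # a new brick may come up on the freed port

# Case B: abrupt death (no pmap signout, so the stale portmap entry remains)
kill -9 <brick-pid>                    # or crash the brick's underlying filesystem
gluster volume remove-brick distvol node1:/bricks/b2 force
gluster volume add-brick distvol node1:/bricks/b4
gluster volume status distvol          # the new brick gets a fresh port; the old one is skipped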

Comment 9 Byreddy 2016-09-28 04:38:46 UTC
(In reply to Atin Mukherjee from comment #8)
> (In reply to Byreddy from comment #7)
> > (In reply to Atin Mukherjee from comment #6)
> > > This is expected, as in this case the pmap signout is not initiated.
> > 
> > Before accepting this as NOT A BUG, I want to clarify a few things.
> > 
> > If the port is only reclaimed when the pmap signout is initiated, then how
> > will a bad brick's port ever be reused for a new brick when the brick's
> > underlying filesystem crashes? This is a production scenario, so there is
> > a real chance of it occurring.
> > 
> > Is it correct to keep the bad brick's port entry in the pmap even after
> > the brick is removed? My thinking is that the port entry should be
> > unmapped when we remove a brick.
> 
> The only way GlusterD can clean up the stale port is through pmap signout
> event processing, and this RPC event can *only* be initiated if a brick
> process is gracefully shut down. If a brick process goes down abruptly (due
> to any of multiple factors), the event will not be notified to GlusterD.
> Even though this can happen in a production setup, the occurrences are rare
> and, most importantly, there is no functionality impact; it is just that
> glusterd will not reuse the same port. Hope this clarifies your question.

Thanks for your explanation.
Just to highlight the observed behaviour for which this bug was filed: there is no functionality loss with this.