Bug 1560955

Summary: After performing remove-brick followed by add-brick operation, brick went into offline state
Product: Red Hat Gluster Storage
Component: glusterd
Version: rhgs-3.3
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Bala Konda Reddy M <bmekala>
Assignee: Atin Mukherjee <amukherj>
QA Contact: Bala Konda Reddy M <bmekala>
CC: rhinduja, rhs-bugs, rmadaka, storage-qa-internal, vbellur
Fixed In Version: glusterfs-3.12.2-8
Last Closed: 2018-09-04 06:45:40 UTC
Type: Bug
Clones: 1560957 (view as bug list)
Bug Depends On: 1560957
Bug Blocks: 1503137

Description Bala Konda Reddy M 2018-03-27 11:03:56 UTC
Description of problem:
On a three-node cluster, enable brick multiplexing and create a replica 3 volume. Stop glusterd on node 3 and perform a replace-brick on node 1; the replace-brick succeeds. Now start glusterd on node 3 and perform an add-brick (3 bricks) on the volume. The add-brick succeeds, but afterwards one brick on the volume is offline.

Version-Release number of selected component (if applicable):
3.8.4-54.3

How reproducible:
2/2

Steps to Reproduce:
1. Create a replica 3 volume and mount it; start IO.
2. Stop glusterd on one node (N3).
3. Perform a replace-brick operation on node N1.
4. Start glusterd on the node where it was stopped (N3).
5. Add 3 bricks to the volume; perform this operation on node N1.
6. One brick on node N2 is offline.
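
A rough command-level sketch of the above steps (volume name, brick paths and mount point below are illustrative, not taken from the actual setup):

  # enable brick multiplexing cluster-wide, then create and start a replica 3 volume
  gluster volume set all cluster.brick-multiplex on
  gluster volume create testvol replica 3 n1:/bricks/b1 n2:/bricks/b1 n3:/bricks/b1
  gluster volume start testvol
  mount -t glusterfs n1:/testvol /mnt    # then start IO from the mount point

  # on N3
  systemctl stop glusterd

  # on N1, replace a brick while N3's glusterd is down
  gluster volume replace-brick testvol n1:/bricks/b1 n1:/bricks/b1_new commit force

  # on N3
  systemctl start glusterd

  # on N1, add one more replica set and check brick states
  gluster volume add-brick testvol n1:/bricks/b2 n2:/bricks/b2 n3:/bricks/b2
  gluster volume status testvol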

Actual results:
One brick on node N2 is offline.

Expected results:
All bricks should be online in the volume

Additional info:

Comment 2 Atin Mukherjee 2018-03-27 11:11:35 UTC
RCA:

glusterd maintains a boolean flag, 'port_registered', which is used to determine whether a brick has completed its portmap sign-in process. The flag is (re)set on pmap_signin and pmap_signout events. With brick multiplexing, this flag is the identifier for whether the very first brick the process was spawned with has completed its sign-in. However, during a glusterd restart, when a brick is already identified as running, glusterd performs a pmap_registry_bind to ensure its portmap table is updated, but the flag is not set. That is fine in the non-multiplexed case, but it causes an issue here because subsequent brick attaches depend on this flag. With replace-brick, I think this is more visible: the replacement brick is attached first and only then is the old brick brought down, so there is no opportunity for a pmap_signin at that point, since with brick multiplexing only the very first brick goes through pmap_signin.

Comment 5 Rajesh Madaka 2018-03-29 10:13:26 UTC
Facing a similar kind of issue with the remove-brick operation while a node is down.

Not reproducible every time.

Steps to reproduce:
-------------------
-> Create a 3-node cluster: n1, n2, n3.
-> Create a 2x3 distributed-replicate volume Vol1.
-> Add one more replica set to the same volume Vol1 using the add-brick command.
-> Shut down node n2.
-> Perform a remove-brick operation on node n1 (remove-brick fails, as expected).
-> Shut down node n3.
-> Perform a remove-brick operation on node n1 (remove-brick fails).
-> Power on both nodes n2 and n3.
-> Run 'gluster vol status Vol1'.
-> Some of the bricks go offline; which ones is random.
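
The same sequence in gluster CLI terms (brick paths are made up for illustration; Vol1 and the node names are taken from the steps above):

  gluster volume create Vol1 replica 3 n1:/bricks/b1 n2:/bricks/b1 n3:/bricks/b1 n1:/bricks/b2 n2:/bricks/b2 n3:/bricks/b2
  gluster volume start Vol1
  gluster volume add-brick Vol1 n1:/bricks/b3 n2:/bricks/b3 n3:/bricks/b3

  # shut down n2, then on n1 (fails as expected)
  gluster volume remove-brick Vol1 n1:/bricks/b3 n2:/bricks/b3 n3:/bricks/b3 start

  # shut down n3, retry the same remove-brick on n1 (fails again),
  # then power n2 and n3 back on and check
  gluster volume status Vol1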

Comment 6 Atin Mukherjee 2018-03-31 05:52:29 UTC
None of the above steps is a 100% reproducer. Instead, with the following steps this can be reproduced easily:

1. Create and start a volume (with more than one brick).
2. Remove the first brick.
3. Add one more brick; this operation takes a very long time (because of this bug).
4. Check volume status; all bricks except the newly added one report an N/A status.
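
For reference, a minimal sketch of this reproducer with made-up names, assuming brick multiplexing is enabled as in the original setup (here three bricks on a single node, which lets the bricks share one process):

  gluster volume set all cluster.brick-multiplex on
  gluster volume create testvol n1:/bricks/b1 n1:/bricks/b2 n1:/bricks/b3 force
  gluster volume start testvol

  # remove the first brick, then add a new one
  gluster volume remove-brick testvol n1:/bricks/b1 force
  gluster volume add-brick testvol n1:/bricks/b4    # takes unusually long on the affected build

  # all bricks except the newly added one report port N/A
  gluster volume status testvol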

Comment 7 Atin Mukherjee 2018-03-31 11:32:19 UTC
upstream patch : https://review.gluster.org/19800

Comment 8 Atin Mukherjee 2018-04-08 13:02:45 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/134827

Comment 10 Bala Konda Reddy M 2018-04-24 10:18:07 UTC
Build : 3.12.2-8

Performed the steps mentioned in comment 5 and comment 6.
All the bricks are online after performing the add-brick operation.

Hence marking it as verified.

Comment 12 errata-xmlrpc 2018-09-04 06:45:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607