Bug 1639566

Summary: clients that mounted using the IP of a cluster node being peer-detached need to be notified of the peer detach, to avoid inconsistency (such as dht layout overlap and data unavailability)
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED DUPLICATE
QA Contact: Bala Konda Reddy M <bmekala>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: kiyer, rhs-bugs, sankarshan, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-25 02:54:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2018-10-16 06:20:55 UTC
Description of problem:
========================
If you mount a volume using the IP of a node and later detach that node from the trusted pool, the fuse mount stops receiving further notifications. This can lead to inconsistencies such as overlapping dht layouts and files being written to only part of the storage, as below.

Because the data plane is unchanged, I/O keeps succeeding. However, if an operation is now performed that regenerates the volfile, the update never reaches this fuse client: glusterd is no longer running on the detached node, and all management communication to the client, which was routed through that node, no longer happens.
Now do an add-brick and rebalance.
Then create new files: while I/O does not fail from the client's view, the files are not written to the new dht subvol, because the client is unaware of it; instead all files land only on the old dht subvol
(due to layout overlap)
# file: bricks/brick1/rep3/dir3
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0xece13ba823c04c2397f8346265e4880d
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

# file: bricks/brick2/rep3/dir3
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0xece13ba823c04c2397f8346265e4880d
trusted.glusterfs.dht=0xde2bedc400000000000000007ffffffe
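The overlap can be read off the two trusted.glusterfs.dht values above. As a sketch (assuming the common on-disk layout where the value's last two big-endian 32-bit words are the start and stop of the directory's hash range, with the leading words carrying count/commit information):

```shell
# Sketch: pull the (start, stop) hash range out of a 16-byte
# trusted.glusterfs.dht value (32 hex chars after the 0x prefix).
dht_range() {
  local v=${1#0x}                       # strip the 0x prefix
  echo "start=0x${v:16:8} stop=0x${v:24:8}"
}

dht_range 0x000000010000000000000000ffffffff   # brick1: full hash space
dht_range 0xde2bedc400000000000000007ffffffe   # brick2: also starts at 0
```

Both ranges start at 0x00000000, so the layouts overlap instead of partitioning the hash space; the client, still holding the old full-range layout, hashes every new file to the old subvol.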
Below inconsistencies were seen:
1) in one case, EIO was seen post rebalance
2) in all cases there was a dht layout overlap for already-existing dirs
[root@dhcp46-93 glusterfs]# getfattr -d -m . -e hex /bricks/brick*/rep3/dirnew
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/rep3/dirnew
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0x89198c93223c4ac7a9a7c79d593693ab
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

# file: bricks/brick2/rep3/dirnew
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0x89198c93223c4ac7a9a7c79d593693ab
trusted.glusterfs.dht=0xde2bedc400000000000000007ffffffe

3) the above also means that data which was rebalanced and moved to the new dht subvol is no longer visible from the client (data unavailable)

4) in all cases, the storage of the new replica pair is not being utilized

5) possibly (not confirmed), there is a remote chance of data loss if the client is not notified of the change in layouts

Below is what could be done to avoid this:
1) as part of the peer detach commit process, all clients should be notified, and those clients mounted using the IP of the node being detached must be disconnected. Any further I/O from these clients would then fail with, say, a transport endpoint error, which is acceptable.

If the above is too complex to implement in glusterd 1.0, then we should at least do the following to mitigate the problem:
1) throw a warning when a peer is being detached, to inform the admin that all clients connected to this node must first be disconnected (will raise a new bug for this)
2) update the peer detach section in the documentation to state that clients must be remounted with a different IP before detaching the peer (will raise a new doc bug for this)
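As a sketch of mitigation (2), using hypothetical node and volume names (n1..n4, volume rep3): mount clients via a node that will stay in the pool, and list backup volfile servers so the mount does not depend on any single node for management traffic:

```
# /etc/fstab sketch (illustrative names): mount via n1, with n2/n3 as
# backup volfile servers; n4 can then be detached without the client
# losing its management connection
n1:/rep3  /mnt/rep3  glusterfs  defaults,backup-volfile-servers=n2:n3  0 0
```

Already-mounted clients pointing at the node being detached would still need to be unmounted and remounted against a surviving node before the detach.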

Version-Release number of selected component (if applicable):
===================
3.12.2-22

How reproducible:
---------------------
always


Steps to Reproduce:
1. have a 4-node setup (n1, n2, n3, n4)
2. create a 1x3 volume using bricks from n1, n2, n3
3. fuse mount the volume using the IP of n4
4. write some I/O
5. peer detach n4 -> will pass
6. create a new dir, say dir1, and check the backend bricks -> it is created on all bricks
7. similarly create new files and check the backend bricks -> they are created on all bricks
8. do an add-brick (to make it 2x3, again using only bricks from n1..n3) and run a rebalance
9. create files, say f{1..100} -> the creates pass from the client's view without any errors
10. check the backend bricks for all 100 files -> the new dht subvol has no f* files, even though some were supposed to land there; the first dht subvol holds all 100 files, due to the dht overlap, with the client unaware of the mistake
11. create a new dir, say "new-dir", and check the backend -> same problem, the dir is not created on the new dht subvol
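The steps above condense to roughly the following command sequence (illustrative only; node names, brick paths, and the volume name rep3 are assumptions, and the commands require a live 4-node cluster):

```shell
# on any pool node: 1x3 volume from n1..n3, then mount via n4
gluster volume create rep3 replica 3 n1:/bricks/brick1/rep3 \
    n2:/bricks/brick1/rep3 n3:/bricks/brick1/rep3
gluster volume start rep3
mount -t glusterfs n4:/rep3 /mnt/rep3        # client mounts via n4

gluster peer detach n4                        # passes; client keeps doing I/O

# expand to 2x3 using only n1..n3, then rebalance
gluster volume add-brick rep3 replica 3 n1:/bricks/brick2/rep3 \
    n2:/bricks/brick2/rep3 n3:/bricks/brick2/rep3
gluster volume rebalance rep3 start

touch /mnt/rep3/f{1..100}                     # succeeds from the client's view
getfattr -d -m . -e hex /bricks/brick*/rep3   # on servers: overlapping layouts
ls /bricks/brick2/rep3/f* 2>/dev/null         # new subvol: no f* files
```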