Bug 1609243

Summary: Able to mount a dir which is in gfid split-brain, and file creation is successful in the split-brain dir
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijay Avuthu <vavuthu>
Component: fuse
Assignee: Karthik U S <ksubrahm>
Status: CLOSED WONTFIX
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.4
CC: rhs-bugs, storage-qa-internal, vavuthu, vdas
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-09 11:33:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vijay Avuthu 2018-07-27 11:11:45 UTC
Description of problem:

Able to mount subdir which is in gfid split-brain

Version-Release number of selected component (if applicable):

Build Used: glusterfs-3.12.2-14.el7rhgs.x86_64

How reproducible: Always

Steps to Reproduce:

1) Create a 1 x 2 replicate volume and start it
2) Create a gfid split-brain on a directory (say dir1)
3) Mount the subdir which is in split-brain (dir1); a rough shell sketch of these steps follows below
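
For reference, a rough script-style sketch of one way these steps can be carried out. The host names (n1/n2), brick paths, and the exact set of options disabled are assumptions, not taken from this report; the usual trick is to create the same directory while each brick is down in turn, with self-heal switched off so the mismatch is not repaired. Gluster and pkill commands run on the server nodes, mount/mkdir on a client.

# Sketch only: n1/n2 and the brick paths are illustrative.
gluster volume create sb12 replica 2 n1:/bricks/brick4/b0 n2:/bricks/brick4/b1 force
gluster volume start sb12

# Disable the heals (and quorum) that would otherwise prevent or repair the split.
gluster volume set sb12 cluster.self-heal-daemon off
gluster volume set sb12 cluster.entry-self-heal off
gluster volume set sb12 cluster.quorum-type none

mount -t glusterfs n1:/sb12 /mnt/sb12

# Create dir1 with only one brick up, then again with only the other brick up,
# so each brick ends up assigning its own GFID to dir1.
pkill -f '/bricks/brick4/b0'       # take brick b0 down
mkdir /mnt/sb12/dir1               # dir1 is created on b1 only
gluster volume start sb12 force    # bring b0 back
pkill -f '/bricks/brick4/b1'       # take brick b1 down
mkdir /mnt/sb12/dir1               # dir1 is created on b0 with a different GFID
gluster volume start sb12 force

gluster volume heal sb12 info      # dir1 should now be reported in gfid split-brain

# Step 3: the subdir mount that is expected to fail but succeeds.
mount -t glusterfs n1:/sb12/dir1 /mnt/subdir_sb12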

Actual results:

The mount is successful and data can be written into the split-brain dir

Expected results:

Mounting a dir which is in split-brain should not be allowed

Additional info:

> Before mounting, the attributes on the bricks are as below:

N1:

# getfattr -d -m . -e hex /bricks/brick4/b0/dir1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/b0/dir1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sb12-client-1=0x000000000000000400000001
trusted.gfid=0xaa7ed3bba30c43b19e6aa3f184fb28f3
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

# 

N2:

# getfattr -d -m . -e hex /bricks/brick4/b1/dir1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/b1/dir1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sb12-client-0=0x000000000000000400000001
trusted.gfid=0x568f03de21754c8a812ca07277d18b0a
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
# 
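
Note that the two bricks hold different trusted.gfid values for dir1, which is the gfid mismatch itself, and each brick's trusted.afr.* changelog blames the other. Assuming the standard AFR layout of three big-endian 32-bit counters (data, metadata, entry), the pending value can be decoded with a small bash snippet:

# Hex digits of trusted.afr.sb12-client-1 from brick b0 (0x prefix dropped).
xattr=000000000000000400000001
# Split into the data/metadata/entry pending counters (assumed standard AFR layout).
echo "data=$((16#${xattr:0:8})) metadata=$((16#${xattr:8:8})) entry=$((16#${xattr:16:8}))"
# Prints: data=0 metadata=4 entry=1 -> pending metadata/entry operations blamed on the other brick.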

# gluster vol heal sb12 info
Brick 10.70.47.45:/bricks/brick4/b0
/dir1 
/ - Is in split-brain

Status: Connected
Number of entries: 2

Brick 10.70.47.144:/bricks/brick4/b1
<gfid:568f03de-2175-4c8a-812c-a07277d18b0a> 
/ - Is in split-brain

Status: Connected
Number of entries: 2
#

> subdir mount :

# mount -t glusterfs 10.70.47.45:/sb12/dir1 /mnt/subdir_sb12
[root@dhcp35-125 ~]# df -h | grep subdir_sb12
10.70.47.45:sb12/dir1   40G  441M   40G   2% /mnt/subdir_sb12
# cd /mnt/subdir_sb12/
# touch f{1..3}
# ls
f1  f2  f3
# 

From the logs, it looks like it performs healing and clears the afr attributes for the directory when the mount on the subdir is issued:

[2018-07-27 10:37:12.079828] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-sb12-replicate-0: performing metadata selfheal on 00000000-0000-0000-0000-000000000001
[2018-07-27 10:37:12.080325] W [MSGID: 108027] [afr-common.c:2841:afr_discover_done] 0-sb12-replicate-0: no read subvols for /
[2018-07-27 10:37:12.096556] I [MSGID: 108026] [afr-self-heal-common.c:1724:afr_log_selfheal] 0-sb12-replicate-0: Completed metadata selfheal on 00000000-0000-0000-0000-000000000001. sources=[0]  sinks=1 
[2018-07-27 10:37:12.104257] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-sb12-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2018-07-27 10:37:12.121502] I [MSGID: 108026] [afr-self-heal-common.c:1724:afr_log_selfheal] 0-sb12-replicate-0: Completed entry selfheal on 00000000-0000-0000-0000-000000000001. sources= sinks=0 1 

> After the subdir mount, the pending afr attributes on both bricks are cleared:

# getfattr -d -m . -e hex /bricks/brick4/b0/dir1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/b0/dir1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sb12-client-1=0x000000000000000000000000
trusted.gfid=0xaa7ed3bba30c43b19e6aa3f184fb28f3
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

# 

# getfattr -d -m . -e hex /bricks/brick4/b1/dir1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/b1/dir1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sb12-client-0=0x000000000000000000000000
trusted.gfid=0x568f03de21754c8a812ca07277d18b0a
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

#

> Ran the gluster-health-report tool; it reports one error, which is due to a script issue.

SOS Reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/fuse_sub_dir_split_brain/

Comment 2 Vivek Das 2018-08-12 16:24:20 UTC
Moving this bug out of 3.4.0 as this doesn't meet blocker criteria. This can be re-proposed for 3.4.0 if required.

Comment 4 Amar Tumballi 2018-11-19 09:28:29 UTC
Can this be tried with a replica 3 volume? A replica 2 volume type is not officially supported.

Comment 5 Anees Patel 2018-11-28 05:35:37 UTC
Hi, I am able to recreate the issue on a 1 x 3 volume:

1. Create a 1 x 3 volume.

2. Change the quorum options so that a split-brain directory can be created in the replica 3 set (a sketch of plausible quorum settings follows after these steps).

3. The split-brain was created successfully:
# gluster v heal vol_36411cb4f9b145b165212aeaf0ca2588 info
Brick 10.70.47.7:/var/lib/heketi/mounts/vg_1a2cebdd439fca0eb9d5197d6a6ca504/brick_25836af43f1427d9fe24c06feebbb1c7/brick
/ - Is in split-brain

Status: Connected
Number of entries: 1

Brick 10.70.47.108:/var/lib/heketi/mounts/vg_690b7b8be089c66b07c1259811ef6dbc/brick_04c52fc4af911b40730665ef9203304a/brick
/ - Is in split-brain

Status: Connected
Number of entries: 1

Brick 10.70.46.206:/var/lib/heketi/mounts/vg_bb4c74213d62f197a5baed1abad3df73/brick_d2469a52e6c4f94556140cec1de5582e/brick
/ - Is in split-brain

Status: Connected
Number of entries: 1

4. Change the volume quorum option back to auto.

5. Mount the directory on a client:
# mount -t glusterfs 10.70.47.108:vol_36411cb4f9b145b165212aeaf0ca2588/dir1 /mnt/split/

Mount was successful.

6. Write files from the mount point:
# touch f{1..10}
# ls
f1  f10  f2  f3  f4  f5  f6  f7  f8  f9
# echo "Hi" >>f1
# cat f1
Hi

Expected Result:
Mounting a dir which is in split-brain should not be allowed.
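
The exact quorum options used in step 2 are not recorded above; as a reference, one plausible way to relax and later restore client quorum on this replica 3 volume (volume name taken from the heal output above) would be:

# Assumption: client quorum relaxed so the directory can be created while
# bricks are selectively down, then restored afterwards (step 4).
gluster volume set vol_36411cb4f9b145b165212aeaf0ca2588 cluster.quorum-type none
# ... create the gfid split-brain directory, as in the original description ...
gluster volume set vol_36411cb4f9b145b165212aeaf0ca2588 cluster.quorum-type auto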

Comment 7 Anees Patel 2018-11-28 07:33:00 UTC
Gluster fuse-subdir mount is a new feature (GA in 3.4), and the issue was discovered in a 3.4 build.
The issue is also reproducible in the latest build:
# rpm -qa | grep gluster
python2-gluster-3.12.2-29.el7rhgs.x86_64
glusterfs-debuginfo-3.12.2-29.el7rhgs.x86_64
glusterfs-libs-3.12.2-29.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-29.el7rhgs.x86_64
glusterfs-3.12.2-29.el7rhgs.x86_64

Comment 8 Amar Tumballi 2019-03-13 14:02:53 UTC
Was trying to understand what to do in this case.

Technically, there is no way for a client to know that there is a GFID mismatch in this scenario, mainly because the client sees the GFID of this directory as 0x01 (i.e., root). The server would know the GFID, but it has no visibility into what the 'correct' GFID for the directory is, as it does not have the cluster view.


I would like to understand from PM (or anyone else) what the right behavior in this scenario should be.

If the right step is to fail the mount, stating that the file is not in a correct state, then we need to make sure we change the design accordingly.

The quick and dirty fix is to handle this scenario in the mount.glusterfs script by first temp-mounting the full volume and checking the directory there, before continuing with the actual subdir mount (a rough sketch of such a pre-check follows). While this would be good enough to handle 99% of the cases, if one uses the `glusterfs` command directly to mount the subdirectory, the issue would remain.
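
A sketch of what such a pre-check could look like, assuming a throwaway mount of the whole volume is acceptable; the server, volume, and subdir names are the ones from the description, and this is not anything mount.glusterfs currently does:

# Sketch of the proposed pre-check (illustrative, not mount.glusterfs code).
VOLSERVER=10.70.47.45
VOLNAME=sb12
SUBDIR=dir1
TMPMNT=$(mktemp -d)

mount -t glusterfs ${VOLSERVER}:/${VOLNAME} ${TMPMNT}
if ! stat ${TMPMNT}/${SUBDIR} >/dev/null 2>&1; then
    echo "ERROR: ${SUBDIR} is not in a healthy state, refusing subdir mount" >&2
    umount ${TMPMNT} && rmdir ${TMPMNT}
    exit 1
fi
umount ${TMPMNT} && rmdir ${TMPMNT}

# Only now perform the real subdir mount.
mount -t glusterfs ${VOLSERVER}:/${VOLNAME}/${SUBDIR} /mnt/subdir_sb12

A stricter check could also consult `gluster volume heal <volname> info split-brain` on the server side, since a plain stat may succeed once the lookup itself triggers a heal.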

For now, I will wait for some discussion on this here, and then we can take a decision.