Bug 1277924
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary: | Though files are in split-brain able to perform writes to the file | | |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RajeshReddy <rmekala> |
| Component: | replicate | Assignee: | hari gowtham <hgowtham> |
| Status: | CLOSED ERRATA | QA Contact: | Vijay Avuthu <vavuthu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.1 | CC: | amukherj, atumball, hgowtham, ksubrahm, mchangir, nbalacha, nchilaka, pkarampu, ravishankar, rhinduja, rhs-bugs, rkavunga, sanandpa, sheggodu |
| Target Milestone: | --- | Keywords: | Reopened, ZStream |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-12 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1294051 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-04 06:26:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1578823, 1579673, 1579674, 1580344 | | |
| Bug Blocks: | 1294051, 1315140, 1503134 | | |
Description (RajeshReddy, 2015-11-04 11:18:45 UTC)
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Update:
===========
Verified with build: glusterfs-3.12.2-6.el7rhgs.x86_64

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1
6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2
7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

# echo "LAST APPENDING" >>f1
-bash: echo: write error: No such file or directory
#

Changing status to Verified.
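For reference, a minimal shell sketch of the steps above. The volume name "repl2", host names, brick paths, and sleep intervals are placeholders, and "kill gluster" is read here as stopping glusterd and the brick process on that node:

```
# On server1: create and start a 1x2 replicate volume, disable the self-heal daemon
gluster volume create repl2 replica 2 server1:/bricks/brick2/b0 server2:/bricks/brick2/b1 force
gluster volume start repl2
gluster volume set repl2 cluster.self-heal-daemon off

# On the client: mount the volume and keep appending from several sessions
mkdir -p /mnt/repl2
mount -t glusterfs server1:/repl2 /mnt/repl2
for f in f1 f2 f3; do
    while true; do echo "append" >> /mnt/repl2/$f; sleep 1; done &
done

# On server1: take its brick offline for a while, then bring it back
systemctl stop glusterd; pkill glusterfsd    # assumption: this is what "kill gluster" means
sleep 300; systemctl start glusterd

# Immediately on server2: take its brick offline, then bring it back later
systemctl stop glusterd; pkill glusterfsd
sleep 300; systemctl start glusterd

# Each brick now has writes the other missed, so the files are in data split-brain;
# with the fix in place a further append must fail.
echo "LAST APPENDING" >> /mnt/repl2/f1       # expected: Input/output error
```

With cluster.self-heal-daemon off, nothing reconciles the two bricks in between, so once the second brick comes back both copies carry writes the other never saw.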
Set up details
==============
git head: 9710b5edaf152142f548e04304f2ea3c1a290fe9, with dht_writev changed to wait in an infinite loop.

2 x 2 hot tier and 2 x 2 cold tier
Two clients with NFS mount points
All performance xlators turned off
tier-mode is "test"

I created one file on the mount point and started a write operation on it from the two mount points. After the lookups and other fops complete, each write waits in dht_writev because of the infinite loop.

Once the file had completely migrated to the cold tier, I allowed one client, say C1, to write (by letting it come out of the infinite loop). Since the file has moved away from the cached subvol, the write fails, which in turn triggers dht_migration_complete_check_task. As part of this task, the afr subvolume in the hot tier marks its afr readables as zero.

Before it updates the tier cached subvolume, I allowed the second write, from client C2, to go through. It hits the afr subvolume of the hot tier and returns EIO because of the zero readables.

I put a breakpoint on "afr_inode_refresh" to see who is calling in to refresh the inode, and also on afr_inode_read_subvol_set to see what the value of the readable array is when we set it.

Back traces are available from the link: https://pastebin.com/Vz3V260Q

(In reply to Mohammed Rafi KC from comment #32)
> Before it updates the tier cached subvolume, I allowed the second write,
> from client C2, to go through. It hits the afr subvolume of the hot tier
> and returns EIO because of the zero readables.

What is the behavior you are expecting in this scenario? I'll try and see if that is semantically correct in AFR or not.

(In reply to Pranith Kumar K from comment #37)
> What is the behavior you are expecting in this scenario? I'll try and see
> if that is semantically correct in AFR or not.

I looked at the inode refresh logic in AFR. If the lookup that happens as part of an inode refresh fails on one of AFR's children, it marks that child non-readable. In addition, if it fails on all of its children, it unwinds the actual read/write FOP with the errno of the lookup failure (and with op_ret=-1). It does not unconditionally return EIO.

> > I put a breakpoint on "afr_inode_refresh" to see who is calling in to
> > refresh the inode, and also on afr_inode_read_subvol_set to see what the
> > value of the readable array is when we set it.
> >
> > Back traces are available from the link: https://pastebin.com/Vz3V260Q

In the bt, I see that the lower xlator to AFR has failed lookup with ENOTCONN and EIO:

0x00007f6bc8726dbc in afr_inode_refresh_subvol_with_lookup_cbk (frame=0x7f6bb4043e1c, cookie=0x0, this=0x7f6bc401cbb0, op_ret=-1, op_errno=107, inode=0x7f6bb400decc,
0x00007f6bc872712d in afr_inode_refresh_subvol_with_fstat_cbk (frame=0x7f6bb001a45c, cookie=0x1, this=0x7f6bc401cbb0, op_ret=-1, op_errno=2, buf=0x7f6bc939b9c0,
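The op_errno values in those frames can be mapped back to symbolic names from the system headers, which is a quick way to read callbacks like these; the header paths below are the usual glibc/kernel locations and may vary by distribution:

```
# 107 -> ENOTCONN, 2 -> ENOENT (the op_errno values seen in the frames above)
grep -w 107 /usr/include/asm-generic/errno.h
grep -w 2   /usr/include/asm-generic/errno-base.h
```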
> In the bt, I see that the lower xlator to AFR has failed lookup with
> ENOTCONN and EIO:
Typo, I meant to say ENOTCONN and ENOENT.
This bug ( https://bugzilla.redhat.com/show_bug.cgi?id=1326248 ) needs to be verified, as the fix was reverted. If the bug still exists we might have to open a new issue and track it (either mark it as a known regression in tier, or check whether this patch https://review.gluster.org/#/c/20029/1 fixes it).

Update:
============
Build used: glusterfs-3.12.2-14.el7rhgs.x86_64
How reproducible: 5/2

Tried the below scenario:

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point (f1-f6)
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1

# gluster vol heal 12 info
Brick 10.70.47.45:/bricks/brick2/b0
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.144:/bricks/brick2/b1
/f1
/f3
/f4
/f2
/f6
/f5
Status: Connected
Number of entries: 6
#

6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2

# gluster vol heal 12 info
Brick 10.70.47.45:/bricks/brick2/b0
/f5
/f6
/f1
/f2
/f3
Status: Connected
Number of entries: 5

Brick 10.70.47.144:/bricks/brick2/b1
Status: Transport endpoint is not connected
Number of entries: -
#

I believe it should show 6 entries at this point. Below are the attributes for the file that is missing from heal info (the trusted.afr.* values are decoded in the sketch after this update).

From Node 1:

[root@dhcp47-45 ~]# getfattr -d -m . -e hex /bricks/brick2/b0/f4
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/b0/f4
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xad9ebcfad1f34698a27a7025c73e1fbb
trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
[root@dhcp47-45 ~]#

From Node 2:

[root@dhcp47-144 ~]# getfattr -d -m . -e hex /bricks/brick2/b1/f4
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/b1/f4
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.12-client-0=0x000007ec0000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xad9ebcfad1f34698a27a7025c73e1fbb
trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
[root@dhcp47-144 ~]#

7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

[root@dhcp35-125 ~]# echo "TEST" >>/mnt/12/f1
-bash: echo: write error: Input/output error
[root@dhcp35-125 ~]#

> ls hangs on the mount point (on file f4)

[root@dhcp35-125 12]# ls
ls: cannot access f2: Input/output error
ls: cannot access f3: Input/output error

> Ran the Health Report Tool (reported 2 errors due to a report tool issue)

> SOS reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/bug_1277924_hang/
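To make the getfattr output above easier to read, here is a small decoding sketch. It assumes the standard AFR changelog layout, in which each trusted.afr.&lt;volume&gt;-client-&lt;N&gt; value is three big-endian 32-bit counters: pending data, metadata, and entry operations that the brick holds against child &lt;N&gt;. The helper name is hypothetical:

```
# Hypothetical helper: decode a trusted.afr.* changelog value into its three
# counters (pending data / metadata / entry operations).
decode_afr() {
    v=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
}

decode_afr 0x000007ec0000000000000000   # trusted.afr.12-client-0 on brick b1
# -> data=2028 metadata=0 entry=0
```

In the output above, brick b1 holds 2028 pending data operations against client-0 (brick b0), while b0 records nothing against b1 at that point; once both bricks end up accusing each other for the same file, it is in data split-brain and AFR refuses further reads and writes with EIO.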
> ls hangs on the mount point (on file f4)

Karthik, could you take a look at the hang issue on Vijay's set up please? https://code.engineering.redhat.com/gerrit/144275 hasn't made it to a build yet and could be the reason for the hang. If it is not, we need to see if there are some other code paths leading to the hang (unlikely).

I checked the sos-reports and the statedumps of the bricks and clients. It does not seem to be the same issue as https://code.engineering.redhat.com/gerrit/144275, and that should not be the reason, because lookup does not take a lock. When I checked the client statedumps, it looks like it is hung in the write-behind lookup path:

[global.callpool.stack.11]
stack=0x7fa99d4f1370
uid=0
gid=0
pid=10602
unique=1187862
lk-owner=0000000000000000
op=LOOKUP
type=1
cnt=6

[global.callpool.stack.11.frame.1]
frame=0x7fa99d4e1a50
ref_count=0
translator=12-write-behind
complete=0
parent=12-io-cache
wind_from=ioc_lookup
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=ioc_lookup_cbk

[global.callpool.stack.11.frame.2]
frame=0x7fa99d4d5420
ref_count=1
translator=12-io-cache
complete=0
parent=12-quick-read
wind_from=qr_lookup
wind_to=(this->children->xlator)->fops->lookup
unwind_to=qr_lookup_cbk

[global.callpool.stack.11.frame.3]
frame=0x7fa99d4d4a30
ref_count=1
translator=12-quick-read
complete=0
parent=12-md-cache
wind_from=mdc_lookup
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=mdc_lookup_cbk

[global.callpool.stack.11.frame.4]
frame=0x7fa99d4dfb00
ref_count=1
translator=12-md-cache
complete=0
parent=12
wind_from=io_stats_lookup
wind_to=(this->children->xlator)->fops->lookup
unwind_to=io_stats_lookup_cbk

[global.callpool.stack.11.frame.5]
frame=0x7fa99d4dd400
ref_count=1
translator=12
complete=0
parent=fuse
wind_from=fuse_lookup_resume
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=fuse_lookup_cbk

[global.callpool.stack.11.frame.6]
frame=0x7fa99d4f0180
ref_count=1
translator=fuse
complete=0

Vijay is not able to reproduce this issue on the same setup now. @Vijay, since it is not hit always, and this is a separate issue which has nothing to do with the fix (the fix is working as expected), can we move this to Verified and open a new bug for the hang if it is reproducible again?

Update:
==========
> Tried the same scenario (comment #42) several times and was not able to reproduce the hang issue.
> The original issue was that appends to files in split-brain should not succeed.

Scenario:
Verified with build: glusterfs-3.12.2-14.el7rhgs.x86_64

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1
6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2
7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

# echo "LAST APPENDING" >>f1
-bash: f1: Input/output error
#

Changing status to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607