Bug 802417

Summary:

[glusterfs-3.3.0qa27]: gfid change in the backend leading to EIO

Product:

[Community] GlusterFS

Reporter:

Raghavendra Bhat <rabhat>

Component:

replicate

Assignee:

Jeff Darcy <jdarcy>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Severity:

high

Docs Contact:

Priority:

high

Version:

mainline

CC:

enakai, gluster-bugs, jdarcy, joe, shwetha.h.panduranga, vbellur

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

glusterfs-3.4.0

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

890598 (view as bug list)

Environment:

Last Closed:

2013-07-24 17:27:16 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

847645, 890598

Attachments:

Description	Flags
shell script for problem recreation	none

Description Raghavendra Bhat 2012-03-12 13:56:10 UTC

Description of problem:
3 replica pure replicate volume. 3 fuse clients and 3 nfs clients, each executing different tools such as ping_pong, fs-perf-test, rdd etc in a loop.
One if the nfemote volume '/export/mirror'.
[2012-03-12 00:29:44.778256] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-mirror-replicate-0: added root inode
[2012-03-12 00:30:18.823386] W [afr-common.c:1392:afr_conflicting_iattrs] 0-mirror-replicate-0: /glusterfs.git/conftest: gfid differs on subvo
lume 2
[2012-03-12 00:30:18.823441] W [afr-common.c:1392:afr_conflicting_iattrs] 0-mirror-replicate-0: /glusterfs.git/conftest: gfid differs on subvo
lume 2
[2012-03-12 00:30:18.823458] W [afr-common.c:1169:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: /glusterfs.git/conftest: gfid different 
on subvolume
[2012-03-12 00:30:18.823478] I [afr-common.c:1313:afr_launch_self_heal] 0-mirror-replicate-0: background  missing-entry self-heal triggered. p
ath: /glusterfs.git/conftest, reason: lookup detected pending operations
[2012-03-12 00:30:18.824992] W [afr-common.c:1392:afr_conflicting_iattrs] 0-mirror-replicate-0: /glusterfs.git/conftest: gfid differs on subvo
lume 1
[2012-03-12 00:30:18.825428] I [afr-self-heal-common.c:917:afr_sh_missing_entries_done] 0-mirror-replicate-0: split brain found, aborting self
heal of /glusterfs.git/conftest
[2012-03-12 00:30:18.825489] E [afr-self-heal-common.c:2034:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background  missing-entry self
-heal failed on /glusterfs.git/conftest
[2012-03-12 00:30:18.825539] E [nfs3-helpers.c:3618:nfs3_fh_resolve_entry_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /glusterfs.git/conftest: Inp
ut/output error
[2012-03-12 00:30:18.825613] E [nfs3.c:3407:nfs3_remove_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.16.156.18:772) mirror : 8e3506f3-2a4e-4
0f2-b59d-f7a428256f7c
[2012-03-12 00:30:18.825650] W [nfs3-helpers.c:3392:nfs3_log_common_res] 0-nfs-nfsv3: XID: 61f0f4fb, REMOVE: NFS: 5(I/O error), POSIX: 14(Bad 
address)
[2012-03-12 00:31:10.404695] W [nfs3.c:1196:nfs3svc_lookup_cbk] 0-nfs: 44f2f4fb: /in => -1 (Input/output error)
s client was executing, git clone glusterfs, autogen, configure, make and rm -rf glusterfs in a loop. It received EIO after sometime.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:

gfid of a file is different in the backend
Expected results:
gfid if a file should be same in all the backends.

Additional info:

getfattr -d -m . -e hex /export/mirror/glusterfs.git/conftest
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/glusterfs.git/conftest
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x854ca9e231764414b6979619e5ef51a8


 getfattr -d -m . -e hex /export/miror/glusterfs.git/conftest
getfattr: Removing leading '/' from absolute path names
# file: export/miror/glusterfs.git/conftest
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x0edf1d9e2867400d933702d840a03b78


getfattr -d -m . -e hex /export/mirror/glusterfs.git/conftest
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/glusterfs.git/conftest
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x0edf1d9e2867400d933702d840a03b78

Comment 1 Shwetha Panduranga 2012-05-16 12:35:17 UTC

TestCase for re-creating the bug:-
---------------------------------

1. create a replicate volume(1x3. brick0, brick1, brick2)
2. create a fuse mount. create a file with name "file1" from mount point. 
3. brick down brick1 and brick2.
4. delete "file1" . create a new file with name "file1" from mount point(GFID in the back end is changed for the file "file1")
5. bring up brick1 and brick2
6. stat "file1" from mount point 

Actual Result:-
---------------
stat: cannot stat `file1': Input/output error

Comment 2 Vijay Bellur 2012-05-18 12:43:25 UTC

Will fix this post 3.3.0 as the nature of changes needed is significant.

Comment 3 Pranith Kumar K 2012-10-05 06:12:42 UTC

*** Bug 863223 has been marked as a duplicate of this bug. ***

Comment 4 Jeff Darcy 2012-10-05 13:36:21 UTC

If we can estimate difficulty there must already be a fix in mind.  What is it?  If it's supposed to be outcast, then maybe we should add a dependency on bug847671.  I can see how outcast on the directory might prevent the second file from being created (though the current outcast patches don't seem to have gotten that far).  Otherwise, some variant of "tombstones" might offer a solution that's easier to implement.

Comment 5 Etsuji Nakai 2012-12-25 07:32:56 UTC

Hi,

I recreated the problem with RHS2.0 and did some investigation on how the problem happens.

I used the attached shell script to recreate the problem and check pending matrix status changes. And I observed that the problem is caused probably because the pending matrix of the parent directly is wrongly cleared. Details is at the bottom of this note.

Hence I suspected this is caused by the same bug with https://bugzilla.redhat.com/show_bug.cgi?id=863223 and checked the patch http://review.gluster.org/#change,4034 could resolve it.

At least, in my setup, the patch resolved the problem :)


I'm using the latest RHS2.0 packages.

glusterfs-server-3.3.0.5rhs-37.el6rhs.x86_64
glusterfs-fuse-3.3.0.5rhs-37.el6rhs.x86_64
glusterfs-3.3.0.5rhs-37.el6rhs.x86_64


Here's the detailed observation of the problem.

1.
Initially, the parent directory on rhs20-01 has the pending matrix indicating the delete/create operations of file1.
------
On rhs20-01(05:31:07.569459571)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000002
trusted.afr.vol01-client-2=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xbf334f8da251432597c5431ce07a2ce8
------

At the same time, the pending matrix of file1 on rhs20-01 indicates the pending data updates.
------
On rhs20-01(05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000090000000000000000
trusted.afr.vol01-client-2=0x000000090000000000000000
trusted.gfid=0x39ec153775a14042939b8423df52b001
------

On the other hands, xttr of file1 on rhs20-02/rhs20-03 still has the old gfid.
------
On rhs20-02(05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078  <--- old gfid

On rhs20-03(05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078  <--- old gfid
------

2.
Then on rhs20-02, file1 with old gifd is replaced with new gifd. Probably, file1 would be empty.
------
rhs20-01(05:31:08.481237212)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x0000000c0000000100000000
trusted.afr.vol01-client-2=0x0000000b0000000000000000
trusted.gfid=0x39ec153775a14042939b8423df52b001

rhs20-02(05:31:08.481237212)
# file: data/brick01/file1
trusted.gfid=0x39ec153775a14042939b8423df52b001  <--- new gfid

rhs20-03(05:31:08.481237212)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078
------

However, a weird thing happens at this moment. The pending matrix of parent directory on rhs20-01 is wrongly cleared both for rhs20-02 and rhs20-03 though it should have been cleared only for rhs20-02.
------
rhs20-01(05:31:08.481237212)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000 <--- pending entry is wrongly cleared.
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xbf334f8da251432597c5431ce07a2ce8
------

3. Because of this, the old gfid on rhs20-03 will never be renewed. And finally split brain is detected for the conflicting gfid as in the following mnt.log.

----
[2012-12-25 05:31:18.165497] W [afr-common.c:1419:afr_conflicting_iattrs] 0-vol01-replicate-0: /file1: gfid differs on subvolume 2
...
[2012-12-25 05:31:18.166068] D [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-vol01-replicate-0: split brain found, aborting selfheal of /file1
[2012-12-25 05:31:18.166079] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-vol01-replicate-0: background  meta-data data missing-entry self-heal failed on /file1
[2012-12-25 05:31:18.166094] W [fuse-bridge.c:292:fuse_entry_cbk] 0-glusterfs-fuse: 1088: LOOKUP() /file1 => -1 (Input/output error)
----


4. After applying the patch http://review.gluster.org/#change,4034 _both for client and servers_, the pending matrix of parent directly was correctly modified as below.

------
rhs20-01(07:20:14.236322807)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000 <--- Only rhs20-02's entry was cleared.
trusted.afr.vol01-client-2=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x1ee927b1e5f74fdab1678d4bd692b751

rhs20-01(07:20:14.236322807)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x0000000c0000000100000000
trusted.afr.vol01-client-2=0x0000000b0000000000000000
trusted.gfid=0x7b714930e0224dbdb703d847b2dcb9a1

rhs20-02(07:20:14.236322807)
# file: data/brick01/file1
trusted.gfid=0x7b714930e0224dbdb703d847b2dcb9a1  <--- gfid of rhs20-02 was recreated.

rhs20-03(07:20:14.236322807)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x63f9306e313a480eadaebd3681b4103b
------

Comment 6 Etsuji Nakai 2012-12-25 07:33:51 UTC

Created attachment 668761 [details]
shell script for problem recreation

Comment 7 Etsuji Nakai 2012-12-25 11:27:56 UTC

One thing to add. This is caused depending on the timing. In many cases, old gfid is renewd on both children almost at the same time, and the logic bug (clearing all pending matrix entries) does no harm.

To confirm that the patch actulally resolve this issue, I've kept running the reproduction script for several hours, and I've never seen the split brain until now. (repeated more than 1700 trials.)

Comment 8 Pranith Kumar K 2013-01-02 17:25:06 UTC

Etsuji Nakai,
    Yes the fix posted does fix this particular problem but there is one improvement we need to make so that it handles other cases as well. Jeff is going to post the updated fix. Please follow http://review.gluster.org/4034 for the updates on the patch.

Pranith.

Comment 9 Jeff Darcy 2013-01-03 13:29:16 UTC

*** Bug 863223 has been marked as a duplicate of this bug. ***

Comment 10 Jeff Darcy 2013-01-07 16:47:10 UTC

Joe Julian reports (via IRC) that he also hit this problem today.

Comment 11 Joe Julian 2013-01-07 16:51:32 UTC

Reproduced during disaster recovery. Lost the network. Since I had no time to ensure everything was clean and wouldn't have had time to manage any split brains that might have occurred due to the network issue, I punted and just claimed one server to be good and put it online.

Later, when things slowed down a little, I added a second server back in and made sure there was no split-brain. Everything was good.

Today I added the third server back in and encountered this bug.

Comment 12 Vijay Bellur 2013-01-17 02:56:43 UTC

CHANGE: http://review.gluster.org/4034 (replicate: don't clear changelog for un-healed replicas) merged in master by Anand Avati (avati)

Comment 13 Vijay Bellur 2013-02-03 19:46:52 UTC

CHANGE: http://review.gluster.org/4435 (test: Removed "for" loop of check_xattr function.) merged in master by Anand Avati (avati)