Bug 802417
| Field | Value |
|---|---|
| Summary | [glusterfs-3.3.0qa27]: gfid change in the backend leading to EIO |
| Product | [Community] GlusterFS |
| Component | replicate |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | mainline |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Raghavendra Bhat <rabhat> |
| Assignee | Jeff Darcy <jdarcy> |
| CC | enakai, gluster-bugs, jdarcy, joe, shwetha.h.panduranga, vbellur |
| Fixed In Version | glusterfs-3.4.0 |
| Doc Type | Bug Fix |
| Cloned As | 890598 (view as bug list) |
| Last Closed | 2013-07-24 17:27:16 UTC |
| Bug Blocks | 847645, 890598 |
| Attachments | shell script for problem recreation (attachment 668761) |
Description
Raghavendra Bhat
2012-03-12 13:56:10 UTC

Test case for re-creating the bug:
----------------------------------
1. Create a replicate volume (1x3: brick0, brick1, brick2).
2. Create a FUSE mount. Create a file named "file1" from the mount point.
3. Bring down brick1 and brick2.
4. Delete "file1", then create a new file named "file1" from the mount point (the GFID in the backend changes for "file1").
5. Bring up brick1 and brick2.
6. stat "file1" from the mount point.

Actual result:
--------------
stat: cannot stat `file1': Input/output error

Will fix this post 3.3.0, as the nature of the changes needed is significant.

*** Bug 863223 has been marked as a duplicate of this bug. ***

If we can estimate difficulty, there must already be a fix in mind. What is it? If it's supposed to be outcast, then maybe we should add a dependency on bug 847671. I can see how outcast on the directory might prevent the second file from being created (though the current outcast patches don't seem to have gotten that far). Otherwise, some variant of "tombstones" might offer a solution that's easier to implement.

Hi, I recreated the problem with RHS2.0 and did some investigation into how it happens. I used the attached shell script to recreate the problem and to check how the pending-matrix status changes, and I observed that the problem is probably caused by the pending matrix of the parent directory being wrongly cleared. Details are at the bottom of this note. I therefore suspected this is the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=863223 and checked whether the patch http://review.gluster.org/#change,4034 could resolve it. At least in my setup, the patch resolved the problem :)

I'm using the latest RHS2.0 packages:
glusterfs-server-3.3.0.5rhs-37.el6rhs.x86_64
glusterfs-fuse-3.3.0.5rhs-37.el6rhs.x86_64
glusterfs-3.3.0.5rhs-37.el6rhs.x86_64

Here's the detailed observation of the problem.

1. Initially, the parent directory on rhs20-01 has a pending matrix reflecting the delete/create operations on file1.
------
On rhs20-01 (05:31:07.569459571)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000002
trusted.afr.vol01-client-2=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xbf334f8da251432597c5431ce07a2ce8
------

At the same time, the pending matrix of file1 on rhs20-01 indicates pending data updates.

------
On rhs20-01 (05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000090000000000000000
trusted.afr.vol01-client-2=0x000000090000000000000000
trusted.gfid=0x39ec153775a14042939b8423df52b001
------

On the other hand, the xattrs of file1 on rhs20-02/rhs20-03 still carry the old gfid.

------
On rhs20-02 (05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078  <--- old gfid

On rhs20-03 (05:31:07.569459571)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078  <--- old gfid
------

2. Then, on rhs20-02, file1 with the old gfid is replaced with the new gfid. Probably file1 is empty at this point.
------
rhs20-01 (05:31:08.481237212)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x0000000c0000000100000000
trusted.afr.vol01-client-2=0x0000000b0000000000000000
trusted.gfid=0x39ec153775a14042939b8423df52b001

rhs20-02 (05:31:08.481237212)
# file: data/brick01/file1
trusted.gfid=0x39ec153775a14042939b8423df52b001  <--- new gfid

rhs20-03 (05:31:08.481237212)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x9a9248c259a246be94ff897fda3e1078
------

However, a weird thing happens at this moment: the pending matrix of the parent directory on rhs20-01 is wrongly cleared for both rhs20-02 and rhs20-03, though it should have been cleared only for rhs20-02.

------
rhs20-01 (05:31:08.481237212)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000  <--- pending entry is wrongly cleared
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xbf334f8da251432597c5431ce07a2ce8
------

3. Because of this, the old gfid on rhs20-03 is never renewed, and finally split-brain is detected for the conflicting gfid, as in the following mnt.log:

----
[2012-12-25 05:31:18.165497] W [afr-common.c:1419:afr_conflicting_iattrs] 0-vol01-replicate-0: /file1: gfid differs on subvolume 2
[2012-12-25 05:31:18.166068] D [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-vol01-replicate-0: split brain found, aborting selfheal of /file1
[2012-12-25 05:31:18.166079] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-vol01-replicate-0: background meta-data data missing-entry self-heal failed on /file1
[2012-12-25 05:31:18.166094] W [fuse-bridge.c:292:fuse_entry_cbk] 0-glusterfs-fuse: 1088: LOOKUP() /file1 => -1 (Input/output error)
----

4. After applying the patch http://review.gluster.org/#change,4034 _both for clients and servers_, the pending matrix of the parent directory was correctly modified, as below.

------
rhs20-01 (07:20:14.236322807)
# file: data/brick01
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000  <--- only rhs20-02's entry was cleared
trusted.afr.vol01-client-2=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x1ee927b1e5f74fdab1678d4bd692b751

rhs20-01 (07:20:14.236322807)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x0000000c0000000100000000
trusted.afr.vol01-client-2=0x0000000b0000000000000000
trusted.gfid=0x7b714930e0224dbdb703d847b2dcb9a1

rhs20-02 (07:20:14.236322807)
# file: data/brick01/file1
trusted.gfid=0x7b714930e0224dbdb703d847b2dcb9a1  <--- gfid on rhs20-02 was recreated

rhs20-03 (07:20:14.236322807)
# file: data/brick01/file1
trusted.afr.vol01-client-0=0x000000000000000000000000
trusted.afr.vol01-client-1=0x000000000000000000000000
trusted.afr.vol01-client-2=0x000000000000000000000000
trusted.gfid=0x63f9306e313a480eadaebd3681b4103b
------

Created attachment 668761 [details]
shell script for problem recreation
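To make the hex dumps above easier to read: each `trusted.afr.*` changelog value packs three big-endian 32-bit pending counters (data, metadata, entry operations owed to that replica). The helper below is a small illustrative sketch, not part of GlusterFS; the function name is made up for this example. Values like these come from `getfattr -d -m . -e hex <brick-path>` output on the bricks.

```python
import struct

def decode_afr_pending(xattr_hex):
    """Decode a trusted.afr.* changelog value into its three
    big-endian 32-bit pending counters: (data, metadata, entry)."""
    raw = bytes.fromhex(xattr_hex.removeprefix("0x"))
    return struct.unpack(">III", raw)

# Parent directory on rhs20-01: two pending entry operations
# (the delete + re-create of file1) recorded against each down replica.
print(decode_afr_pending("0x000000000000000000000002"))  # (0, 0, 2)

# file1 on rhs20-01: nine pending data operations against the
# down replicas.
print(decode_afr_pending("0x000000090000000000000000"))  # (9, 0, 0)
```

Reading the dumps with this in mind, the bug is visible directly: after the partial heal, the parent directory's entry counter for client-2 should still be non-zero, but it reads all zeroes.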
One thing to add: whether this occurs depends on timing. In many cases the old gfid is renewed on both children almost at the same time, and the logic bug (clearing all pending-matrix entries) does no harm. To confirm that the patch actually resolves this issue, I kept the reproduction script running for several hours and have not seen the split-brain since (more than 1700 trials).

Etsuji Nakai, yes, the fix posted does fix this particular problem, but there is one improvement we need to make so that it handles other cases as well. Jeff is going to post the updated fix. Please follow http://review.gluster.org/4034 for updates on the patch. Pranith.

*** Bug 863223 has been marked as a duplicate of this bug. ***

Joe Julian reports (via IRC) that he also hit this problem today:

"Reproduced during disaster recovery. Lost the network. Since I had no time to ensure everything was clean, and wouldn't have had time to manage any split-brains that might have occurred due to the network issue, I punted, claimed one server to be good, and put it online. Later, when things slowed down a little, I added a second server back in and made sure there was no split-brain. Everything was good. Today I added the third server back in and encountered this bug."

CHANGE: http://review.gluster.org/4034 (replicate: don't clear changelog for un-healed replicas) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4435 (test: Removed "for" loop of check_xattr function.) merged in master by Anand Avati (avati)
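The merged fix's summary ("don't clear changelog for un-healed replicas") can be illustrated with a deliberately simplified model. This is a hypothetical sketch, not the GlusterFS source: the pending matrix row for the parent directory is modeled as a per-client counter dict, and both function names are invented for this example.

```python
# Hypothetical model of the parent directory's pending-matrix row.
# Only replicas that were actually healed may have their counters cleared.

def clear_changelog_buggy(pending, healed):
    """Pre-patch behaviour: clears every pending counter once a heal
    completes, losing the record that some replicas are still stale."""
    return {client: 0 for client in pending}

def clear_changelog_fixed(pending, healed):
    """Post-patch behaviour (http://review.gluster.org/4034): clear
    counters only for replicas that were successfully healed."""
    return {client: (0 if client in healed else count)
            for client, count in pending.items()}

# Pending matrix of the parent dir on rhs20-01 before the heal:
# two entry operations owed to client-1 (rhs20-02) and client-2 (rhs20-03).
pending = {"client-0": 0, "client-1": 2, "client-2": 2}

# Only rhs20-02 (client-1) was healed; rhs20-03 (client-2) still
# holds the old gfid.
print(clear_changelog_buggy(pending, healed={"client-1"}))
# client-2's counter is lost, so its stale gfid is never repaired
# and a later lookup reports gfid split-brain (EIO).

print(clear_changelog_fixed(pending, healed={"client-1"}))
# -> {'client-0': 0, 'client-1': 0, 'client-2': 2}
```

With the fixed behaviour, the surviving counter for client-2 keeps self-heal aware that rhs20-03 still needs its entry (and gfid) repaired, matching the corrected xattr dump in observation 4 above.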