Bug 862838 - Self-heal is unreliable if other volumes are present
Summary: Self-heal is unreliable if other volumes are present
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Jeff Darcy
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-10-03 17:55 UTC by Jeff Darcy
Modified: 2013-07-24 17:22 UTC
CC List: 2 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:22:18 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Testcase execution commands history (34.48 KB, application/octet-stream)
2012-10-05 03:28 UTC, spandura

Description Jeff Darcy 2012-10-03 17:55:49 UTC
This seems to be related to how the self-heal daemon handles stops/starts for a single volume vs. multiple, so it's important that these steps be followed on systems that already have another volume (not directly used in the test) started.

(1) Create a two-brick replicated volume on pristine directories across two machines that are already serving another volume.

(2) Start the volume.

(3) Mount from a third machine and create a subdirectory.

(4) Kill all of the GlusterFS processes on one machine (e.g. killall -r -9 gluster on gfs1)

(5) Create files both at the volume top level and within the subdirectory.

(6) Start glusterd on the node where it had been killed (e.g. gfs1), causing glusterfsd for the missing brick to start along with the NFS/self-heal daemons.

At this point, proactive self-heal should kick in and the files should be replicated from gfs2 (the one that *wasn't* killed) to gfs1.  Not so.  In fact, the only self-healing that seems to happen is from gfs1's *NFS* daemon, because it does a lookup on the volume root.  Even this is partial, creating only a zero-length file in the volume root with a GFID xattr but no AFR xattrs.  On gfs2 the volume-root file shows changelog counts of data=2,meta=1, while the subdirectory file shows only data=1 from before glusterd was restarted on gfs1.
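
For reference, these changelog counts can be read straight off the bricks with getfattr; the brick paths below are the ones used in this test, and the decoding note describes AFR's layout of the trusted.afr value as three big-endian 32-bit counters (data, metadata, entry):

# On each brick server:
getfattr -d -m . -e hex /export/sdb/top /export/sdb/dir/sub

# Each 12-byte trusted.afr.<volume>-client-N value is three 32-bit counters:
#   <data><metadata><entry>
# e.g. 0x000000020000000100000000 -> data=2, meta=1, entry=0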

(7) Use the old-fashioned "find | stat" method on the client to force lookups and (hopefully) self-heals.

Still no luck.  Now the subdirectory file exists on gfs1, but is still zero length and on gfs2 it still has the same data=2,meta=1 changelog as the volume-root file.

(8) Kill glusterfs processes on the *other* server (e.g. gfs2).

At this point attempts to open either file return either the empty files from gfs1 or sometimes ENOENT (even though readdirp/stat do show the file).  Note that this is exactly the kind of data loss that AFR exists to prevent.

(9) Restart glusterd etc. on the second server (e.g. gfs2).

No change.  This is not the state from after step 7; it's the even worse state from step 8, where we get empty files or ENOENT even though the full files are available on gfs2 and glusterfsd is running there.

(10) Unmount and remount on the client, then do "find | stat" again.

Finally, the self-heal from gfs1 to gfs2 happens properly so that changelog flags are cleared and the full file contents are on both servers.


The flag changes are slightly different when rewriting old files vs. creating new ones, but the overall result is the same (stale data is just as bad as missing data).  There seem to be two fundamental problems here.

(a) Self-heal daemon doesn't kick in *at all*.

(b) Self-heal from a client doesn't work without a remount.

Also, "gluster volume heal" simply hangs so it doesn't help either, but that's probably a separate bug.

Comment 1 Jeff Darcy 2012-10-03 21:25:51 UTC
The problem turned out to be something completely unexpected.  Part of the self-heal code was calling synctask_new from code that was already in a synctask.  It would then become self-deadlocked waiting for the new task to complete, because it was itself sitting on the last resource for running that task.  That wouldn't happen with a single volume, because the default synctask-processor count is two so there would in fact be a spare, but with two volumes we'd run out and the entire self-heal daemon would effectively stop.

I've submitted two separate fixes for this.  http://review.gluster.org/4032 fixes this in AFR, by changing _do_crawl_op_on_local_subvols to call afr_syncop_find_child_position directly instead of through synctask_new.  http://review.gluster.org/4031 fixes it in the core, by adding extra code to make sure we have a processor to run the new task even if someone makes this mistake again.

Comment 2 spandura 2012-10-04 03:54:09 UTC
1) Performing "find . | xargs stat" (at step 7) on the client triggers background self-heal. Not sure why "find | xargs stat" is not triggering the self-heal; when we execute background self-heal tests we always perform "find . | xargs stat".

2) Within 10 minutes the self-heal daemon triggers self-heal, but it does not trigger the self-heal immediately when the brick comes online. There is already a bug reported for this (Bug 852741).

Comment 3 Pranith Kumar K 2012-10-04 07:04:31 UTC
Jeff, Shwetha,
   This code path is the result of e8712f36335dd3b8508914f917d74b69a2d751a1.

Pranith.

Comment 4 spandura 2012-10-04 07:15:06 UTC
Pranith, can you please explain why "find . | xargs stat" triggered self-heal and why "find | xargs stat" didn't trigger self-heal?

Comment 5 Jeff Darcy 2012-10-04 12:59:22 UTC
(In reply to comment #2)
> 1) Performing "find . | xargs stat" (at step 7) on the client triggers
> background self-heal. 

The bug is entirely reproducible given two preconditions:

(a) Using code from git master - i.e. not GlusterFS 3.3 or Red Hat Storage 2.0 - prior to the two patches mentioned above

(b) With two or more volumes being exported by each involved server

Is it possible that you're not seeing this because you don't meet those preconditions?  Also, doing "find" on the client wouldn't be background self-heal.  It would be foreground self-heal.

Comment 6 spandura 2012-10-04 16:39:28 UTC
(b) I had two volumes running, a pure-replicate volume and a dis-rep volume, with 2 servers and 1 brick per server for the pure-replicate volume and 2 more servers for the dis-rep volume. But I am still not seeing the issue when "find . | xargs stat" is performed.

Comment 7 Jeff Darcy 2012-10-04 17:38:49 UTC
I branched from current master (6c2eb4f2) and reverted the synctask_new fix (557637f3).  Then I rebuilt, reinstalled, and re-ran the steps above.  Here's the state after step 6.

*** On gfs1 (daemons had been killed and restarted)
# file: export/sdb/top
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40
# export/sdb/dir and export/sdb/dir/sub not present

*** On gfs2 (daemons had run since start of test)
# file: export/sdb/dir
trusted.afr.rep2-client-0=0x000000000000000000000001
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xf2c6ad80ca704f52bd926ad207c0cd69
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
# file: export/sdb/dir/sub
trusted.afr.rep2-client-0=0x000000010000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0x289bd4b244a949d387aac88f824e47c8
# file: export/sdb/top
trusted.afr.rep2-client-0=0x000000020000000100000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40

As you can see, /export/sdb/top had been created as an empty file with no AFR xattrs on gfs1, leaving the copy on gfs2 with a changelog of data=2,meta=1.  This is the result of the entry self-heal on the volume root from gfs1's NFS daemon.  On gfs2, /export/sdb/dir has a changelog of entry=1 and /export/sdb/dir/sub has a changelog of data=1.  This is all as described above.  At this point I did a "find . | xargs stat" on the still-mounted client (gfs4) with the following result.

*** On gfs1
# file: export/sdb/dir
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xf2c6ad80ca704f52bd926ad207c0cd69
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
# file: export/sdb/dir/sub
trusted.gfid=0x289bd4b244a949d387aac88f824e47c8
# file: export/sdb/top
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40

*** On gfs2
# file: export/sdb/dir
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xf2c6ad80ca704f52bd926ad207c0cd69
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
# file: export/sdb/dir/sub
trusted.afr.rep2-client-0=0x000000020000000100000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0x289bd4b244a949d387aac88f824e47c8
# file: export/sdb/top
trusted.afr.rep2-client-0=0x000000020000000100000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40

So /export/sdb/dir got created and seems fine, while /export/sdb/dir/sub got created the same way as /export/sdb/top had previously.  In other words, entry self-heal happened but data self-heal did not.  Finally, I unmounted and remounted the volume on gfs4, then re-ran the find|xargs command.

*** On gfs1
# file: export/sdb/dir
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xf2c6ad80ca704f52bd926ad207c0cd69
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
# file: export/sdb/dir/sub
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0x289bd4b244a949d387aac88f824e47c8
# file: export/sdb/top
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40

*** On gfs2
# file: export/sdb/dir
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xf2c6ad80ca704f52bd926ad207c0cd69
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# file: export/sdb/dir/sub
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0x289bd4b244a949d387aac88f824e47c8

# file: export/sdb/top
trusted.afr.rep2-client-0=0x000000000000000000000000
trusted.afr.rep2-client-1=0x000000000000000000000000
trusted.gfid=0xa42fb2a23c604aafaabdcbdd1cda4d40

Now all of the AFR xattrs are present, all zero, and (not shown above) the contents are correct on both nodes.  In other words, data self-heal finally happened.  I did use gdb to check the self-heal daemon on gfs2, and it did show the expected two threads in self-deadlock trying to call synctask_new from within synctask_wrap.
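
The gdb check was along these lines; the pgrep pattern assumes the self-heal daemon shows up with "glustershd" in its command line:

# On gfs2: attach to the self-heal daemon and dump every thread's backtrace.
# The stuck threads show synctask_new being called from within synctask_wrap.
gdb -batch -ex 'thread apply all bt' -p "$(pgrep -f glustershd)"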

I don't know why you're seeing different behavior on your systems, but I'm also not sure what more I can do to demonstrate that this problem is reproducible without yesterday's fixes.

Comment 8 spandura 2012-10-05 03:28:34 UTC
Created attachment 621915 [details]
Testcase execution commands history

Not sure if I am executing some steps wrongly, hence attaching the history of the test case command execution. Please let us know your thoughts.

Comment 9 Vijay Bellur 2012-10-12 01:08:59 UTC
CHANGE: http://review.gluster.org/4032 (replicate: don't use synctask_new from within a synctask) merged in master by Anand Avati (avati)

Comment 10 Vijay Bellur 2012-10-16 21:24:23 UTC
CHANGE: http://review.gluster.org/4085 (syncop: save and restore THIS from the time of context switch) merged in master by Anand Avati (avati)

