Steps to reproduce:

(1) Set up a volume with replica 3. For this example, the bricks are as follows:
    rep3-client-0 is on gfs1
    rep3-client-1 is on gfs2
    rep3-client-2 is on gfs4

(2) Start the volume and mount a client.

(3) Kill all GlusterFS processes on two of the nodes (e.g. "killall -r -9 gluster" on gfs1 and gfs2).

(4) Create a subdirectory and write a file within it. The changelogs for the file itself on the surviving node (gfs4) are as follows:
    trusted.afr.rep3-client-0=0x000000010000000000000000
    trusted.afr.rep3-client-1=0x000000010000000000000000
    trusted.afr.rep3-client-2=0x000000000000000000000000
For the directory, they are:
    trusted.afr.rep3-client-0=0x000000000000000000000001
    trusted.afr.rep3-client-1=0x000000000000000000000001
    trusted.afr.rep3-client-2=0x000000000000000000000000

(5) Bring glusterd back up on *one* of the other nodes (gfs1). Now the changelogs for the file and directory on gfs4 look like this:
    # file: export/sdc/dir
    trusted.afr.rep3-client-0=0x000000000000000000000000
    trusted.afr.rep3-client-1=0x000000000000000000000000
    trusted.afr.rep3-client-2=0x000000000000000000000000
    trusted.gfid=0x93778dc8eabd4e5fbcde9f116bb0452e

    # file: export/sdc/dir/sub
    trusted.afr.rep3-client-0=0x000000000000000000000000
    trusted.afr.rep3-client-1=0x000000000000000000000000
    trusted.afr.rep3-client-2=0x000000000000000000000000
The changelogs on gfs1 are also zero. The file contents are correct, so gfs1 has been fully healed, but nothing has happened on gfs2 and there is no longer any indication that anything should.

(6) Bring glusterd back up on the other node (gfs2). Nothing happens.

(7) Run "find . | stat" on the client. At this point the directory is created on gfs2 (entry self-heal on the volume root) and the file is created within it (entry self-heal on the directory), but the file is zero length and has no AFR xattrs. Somewhat surprisingly, we find this on gfs1:
    # file: export/sdc/dir/sub
    trusted.afr.rep3-client-1=0x000000010000000100000000

(8) Run the "find . | stat" on gfs4 a second time. No effect.

(9) Unmount and remount on gfs4, then run the "find . | stat" a third time. Now the contents are correct.

This is the same "no data self-heal on client without remount" behavior as observed in bug 862838, but that is not the issue here. Our problems started back at step 5, when the changelogs for afr-client-1 (gfs2) were cleared even though no self-heal had actually occurred there. We need to fix that first, so that proactive self-heal can work with replica 3, before we worry about client-side self-heal.
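For readers following the xattr dumps above: each trusted.afr.<vol>-client-N value is a 12-byte array of three big-endian 32-bit counters recording pending data, metadata, and entry operations against that brick. A minimal decoding sketch (the function name is illustrative, not part of GlusterFS):

```python
import struct

def decode_afr_changelog(hexval: str) -> dict:
    """Split a 12-byte AFR changelog xattr value into its three
    big-endian 32-bit pending-operation counters."""
    raw = bytes.fromhex(hexval.removeprefix("0x"))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# The file's changelog for rep3-client-0 in step 4: one pending data op.
print(decode_afr_changelog("0x000000010000000000000000"))
# -> {'data': 1, 'metadata': 0, 'entry': 0}

# The directory's changelog in step 4: one pending entry op.
print(decode_afr_changelog("0x000000000000000000000001"))
# -> {'data': 0, 'metadata': 0, 'entry': 1}
```

So the all-zero values seen in step 5 mean "nothing pending," which is exactly why clearing the client-1 counters before gfs2 was healed leaves no record that gfs2 is stale.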
*** This bug has been marked as a duplicate of bug 802417 ***
No, this is not a duplicate of 802417. That bug has to do with GFID mismatches on re-created files leading to EIO. This bug has to do with data/changelog discrepancies even on existing files leading to stale data. The fix for this bug does not address the other, and I doubt that the converse would be true either.
http://review.gluster.org/#change,4034
I still don't believe this is a true duplicate of bug 802417, but it's close enough. Marking the duplicate in this direction because preserving the history and follower list on the user-reported bug is more important than doing the same for this one.

*** This bug has been marked as a duplicate of bug 802417 ***