Bug 761902 (GLUSTER-170)

Summary: Auto-heal fails on files that are open()-ed/mmap()-ed
Product: [Community] GlusterFS
Reporter: Gordan Bobic <gordan>
Component: replicate
Assignee: Vikas Gorur <vikas>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: low
Version: 2.0.4
CC: aavati, corentin.chary, gluster-bugs, gordan, pavan, vijay
Hardware: All   
OS: Linux   
Doc Type: Bug Fix

Description Anand Avati 2009-07-27 21:10:11 UTC
Gordan,
  The current replicate translator does not heal files that have open file descriptors. This is a known limitation in glusterfs-2.0.x (mentioned under 'Known issues' in http://www.gluster.org/docs/index.php/Understanding_AFR_Translator). We are working on a fix for this, targeted for September (2.1 release). It would be interesting to know whether this is actually related to the 0-byte issue you are facing. We are checking that possibility.

Avati

Comment 1 Gordan Bobic 2009-07-27 23:57:09 UTC
The system configuration is the same setup as Bug 126 (see there for volume spec files). Currently 2 nodes are online, and the 3rd is a clean, empty new node joining the cluster. This is probably also reproducible with just 1 node online and the 2nd node being the clean, empty node joining the cluster.

When a new, empty node comes online, it cannot auto-heal files on the gluster file system that are currently open and/or mmap-ed on the other nodes. File entries get created, but ls -la shows that on the new node they are 0 bytes. The complete file system resync was initiated using:
# ls -laR /
but this only seems to download the files that are not open or mmap-ed.

Looking at the files that suffer from this, it is striking that they are all listed as open using lsof on the other two nodes.

The way the boot-strap onto gluster root goes is that the initial root mounts the gluster root, and then fires up a modified init chrooted into the directory where it mounted gluster root. Once the gluster root directory is mounted, it tries to fire up /usr/comoonics/sbin/init. Here are the files init depends on:

init          1    root  cwd       DIR               0,19     4096          1 /
init          1    root  rtd       DIR               0,19     4096          1 /
init          1    root  txt       REG               0,19    47057  228797259 /usr/comoonics/sbin/init
init          1    root  mem       REG               0,19   139416  227928927 /lib64/ld-2.5.so
init          1    root  mem       REG               0,19  1713160  227928957 /lib64/libc-2.5.so
init          1    root  mem       REG               0,19    23360  227928999 /lib64/libdl-2.5.so
init          1    root  mem       REG               0,19   247528  227929101 /lib64/libsepol.so.1
init          1    root  mem       REG               0,19    95464  227929095 /lib64/libselinux.so.1
init          1    root   10u     FIFO               0,17               14455 /dev/initctl

And it is confirmed these are what init is linked against:
# ldd /usr/comoonics/sbin/init
        libsepol.so.1 => /lib64/libsepol.so.1 (0x0000003b79a00000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003b79e00000)
        libc.so.6 => /lib64/libc.so.6 (0x000000334ac00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000334b000000)
        /lib64/ld-linux-x86-64.so.2 (0x000000334a800000)

All of these happen to be among the files that have 0 size on the new node (init, and all those libraries under /lib64, plus about 30 other libraries, all of which similarly turn up in lsof on the nodes that are already running). Other files appear to have synced OK.
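The cross-check described above (0-byte files on the new node versus files that lsof reports as open on the running nodes) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, it assumes lsof output in the column layout shown above (TYPE in the fifth column, NAME last), and it will mis-parse paths that contain whitespace.

```python
import os
import stat


def zero_byte_files(root):
    """Yield paths of regular files under root whose size is 0."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path


def open_regular_files(lsof_text):
    """Return the set of NAME values for REG entries in lsof output.

    Assumes the layout COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME.
    """
    paths = set()
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[4] == "REG":
            paths.add(fields[-1])
    return paths


def unhealed_open_files(new_node_root, lsof_text):
    """List 0-byte files under new_node_root that are open elsewhere.

    new_node_root is where the gluster root is mounted on the new node;
    lsof_text is lsof output captured on an already-running node.
    """
    open_paths = open_regular_files(lsof_text)
    hits = []
    for path in zero_byte_files(new_node_root):
        rel = "/" + os.path.relpath(path, new_node_root).replace(os.sep, "/")
        if rel in open_paths:
            hits.append(rel)
    return sorted(hits)
```

Feeding it the lsof output from a running node and the mount point on the new node would list the files that are both open elsewhere and empty locally. Legitimately empty files that happen to be open would show up as well, so the result is a list of suspects rather than proof.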

The problem is that these files didn't self-heal properly. Thus, init fails (as it is not a valid executable).

There is a further chain of dysfunction thereafter in the process of trying to manually get the files onto the new node (touching a file on the existing nodes gives an error like "text file is busy"), but if the problem of files replicating as 0 bytes is fixed, I suspect the rest will fall into place.

In case you are wondering how the 2nd node came to be online, I rsync-ed the underlying file system across to the new node, so this issue didn't arise as the files required to boot were already in place.

Since this clearly affects the replication of files that are open and/or mmap-ed (such as shared libraries) there is a possibility that this may be related to Bug 126 (shared library corruption). And I just checked - /usr/lib64/libglusterfs.so.0.0.0 got corrupted again on this cluster in the past 24 hours. No other libraries got corrupted, which seems interesting.

Comment 2 Gordan Bobic 2009-07-28 04:29:52 UTC
Yes, now that you mention it, it looks like this is the limitation that I'm bumping into.

I guess that means that for now I have to come up with a workaround, such as dumping+restoring all open files on all the running nodes when adding a new node, between when the new node mounts gluster-root and tries to chroot into it.

Comment 3 Vijay Bellur 2009-07-28 05:03:30 UTC
Changing target-milestone to 2.1 as 2.1-mustfix is deprecated.

Comment 4 Gordan Bobic 2009-07-29 04:02:49 UTC
Something has been bothering me about this, and I just figured out what it is. The read shouldn't fail even if self-heal does, provided there is at least one good copy available in the cluster. It would seem that with read-subvolume set, replicate tries to use the local copy even when it cannot be healed, and thus the read fails. read-subvolume should specify a preference, not the exclusion of other nodes.

Now, granted, if self-heal worked on open files, this wouldn't be a problem, but as things are, I would argue this is still a bug on 2.0.x branch since the read shouldn't fail if there is at least one copy available in the cluster regardless of whether the read-subvolume option is set.
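To make the "preference, not exclusion" argument concrete, here is a minimal sketch of the read selection being asked for. It is not the actual AFR code; `Replica`, `needs_heal` and `pick_read_child` are hypothetical names used only to illustrate the behaviour.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Replica:
    name: str
    online: bool
    needs_heal: bool  # True if this copy is stale (e.g. the 0-byte file)


def pick_read_child(replicas: Sequence[Replica],
                    preferred: Optional[str] = None) -> Optional[Replica]:
    """Pick the replica to read from.

    The preferred replica (the read-subvolume setting) is used only if it
    is online and up to date; otherwise any other good copy is used.
    Returns None only when no good copy exists anywhere.
    """
    usable = [r for r in replicas if r.online and not r.needs_heal]
    if not usable:
        return None
    for replica in usable:
        if replica.name == preferred:
            return replica
    return usable[0]
```

With a stale local copy and one good remote copy, `pick_read_child` returns the remote replica instead of failing, which is the behaviour argued for here.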

Comment 5 Anand Avati 2009-11-13 07:13:24 UTC
PATCH: http://patches.gluster.com/patch/2218 in master (protocol/client: whitespace cleanup)

Comment 6 Anand Avati 2009-11-13 07:13:28 UTC
PATCH: http://patches.gluster.com/patch/2219 in master (protoocl/client: file directory reopen support)

Comment 7 Anand Avati 2009-11-13 14:33:20 UTC
PATCH: http://patches.gluster.com/patch/2221 in master (protocol/client: preserve open/create flags in fdctx for reopening)

Comment 8 Anand Avati 2009-11-19 05:53:16 UTC
PATCH: http://patches.gluster.com/patch/2271 in master (Check for other return values as well from call to inode_path.)

Comment 9 Anand Avati 2009-11-24 11:39:55 UTC
PATCH: http://patches.gluster.com/patch/2346 in master (cluster/afr: Set read-child = source regardless of foreground/background self-heal)

Comment 10 Anand Avati 2009-11-24 11:40:00 UTC
PATCH: http://patches.gluster.com/patch/2347 in master (cluster/afr: Hold blocking locks for data self-heal.)

Comment 11 Anand Avati 2009-11-24 11:40:03 UTC
PATCH: http://patches.gluster.com/patch/2349 in master (cluster/afr: Refactored the data self-heal algorithm.)

Comment 12 Anand Avati 2009-11-24 11:40:07 UTC
PATCH: http://patches.gluster.com/patch/2348 in master (cluster/afr: Provide a post-post_op hook in the transaction.)

Comment 13 Anand Avati 2009-11-24 11:40:11 UTC
PATCH: http://patches.gluster.com/patch/2350 in master (cluster/afr: Do self-heal on reopened fds.)

Comment 14 Anand Avati 2009-11-24 11:40:15 UTC
PATCH: http://patches.gluster.com/patch/2351 in master (cluster/afr: Refactored the self-heal interface.)

Comment 15 Anand Avati 2009-11-25 11:03:41 UTC
PATCH: http://patches.gluster.com/patch/2362 in master (cluster/afr: Do self-heal on unopened fds.)

Comment 16 Anand Avati 2009-11-29 10:33:01 UTC
PATCH: http://patches.gluster.com/patch/2413 in master (afr: handle fdctx->pre_op_done handling)

Comment 17 Anand Avati 2009-11-29 14:14:57 UTC
PATCH: http://patches.gluster.com/patch/2416 in master (afr: fix crash in afr_sh_data_close)

Comment 18 Anand Avati 2009-12-01 22:52:53 UTC
PATCH: http://patches.gluster.com/patch/2473 in master (afr: fix fd reference leak)

Comment 19 Anand Avati 2009-12-01 22:52:57 UTC
PATCH: http://patches.gluster.com/patch/2475 in master (afr: remove memcpy of @local contents in afr_local_copy)

Comment 20 Anand Avati 2009-12-02 15:29:29 UTC
PATCH: http://patches.gluster.com/patch/2494 in master (cluster/afr: Fix conditional typo.)

Comment 21 Anand Avati 2009-12-04 07:52:02 UTC
PATCH: http://patches.gluster.com/patch/2551 in master (afr: fix memory leaks)

Comment 22 Anand Avati 2009-12-06 07:30:35 UTC
PATCH: http://patches.gluster.com/patch/2580 in master (afr: fix fd ref leak in self-heal)

Comment 23 Anand Avati 2010-02-22 13:57:16 UTC
*** Bug 167 has been marked as a duplicate of this bug. ***