Bug 762267 (GLUSTER-535)

Summary: stripe entry self-heal
Product: [Community] GlusterFS
Reporter: Amar Tumballi <amarts>
Component: stripe
Assignee: Amar Tumballi <amarts>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: low
Version: mainline
CC: gluster-bugs, jdarcy, rabhat, vraman
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Doc Type: Bug Fix

Description Amar Tumballi 2010-01-12 18:47:42 UTC
Let's review the basic design of stripe before thinking about self-heal.

* Directories are kept in the same order on the backend of all subvolumes.
* All regular files (not symlinks or special files) are created on all subvolumes.
* Symlinks and special files are kept only on the first subvolume.
* The stat structure is always returned from the first subvolume, so that most fields stay the same across two different stat calls (on the same file).


Now, we can heal the directory structure on the backend if a directory is missing on one of the subvolumes.
Also, if a regular file is missing on one of the subvolumes, we can create an empty file there (but we can't heal the data).

The decision to heal (in both cases above) can be made in the 'lookup()' call itself.

If any subvolume returns op_ret == -1 with op_errno == ENOENT in lookup() while other subvolumes return op_ret == 0 and (S_ISREG(st_mode) || S_ISDIR(st_mode)) is true, we need to create the entries on the subvolumes where the lookup failed.


Let me know if anyone has thought more about stripe self-heal, asap. If there are no comments by EOW (eek), I will go ahead and start development of this feature.

Regards,

Comment 1 Jeff Darcy 2010-01-13 10:45:30 UTC
I have a strong interest in this because I'm working on a stripe/dht hybrid right now (create/open/read/write already work and distribute data much better than stripe or dht alone).  Here are some observations.

The stat structure does not come only from the first subvolume.  What actually happens is that stripe_stat sends the stat call to all subvolumes, and the results are collected in stripe_stack_unwind_buf_cbk.  The most important reason for this is to ensure that st_blocks and st_size get the correct values, which depend on the values from all subvolumes.

We can heal the directory if it's missing, but it's not clear whether that really does any good since the files themselves will still be missing.  If reads are attempted to blocks in the lost file, then those reads will fail because they're beyond the (empty) replacement's EOF.  That's slightly unfortunate, but the situation is far worse if the file is extended.  Imagine a file striped as 3x64K, and for some reason the directory on the last subvolume (index 2) is lost.  If a user opens and then writes one byte at 320K, then the replacement file on subvolume 2 will be extended with a hole up to that point.  Subsequent reads from 128K to 192K-1 will fall into the hole and appear to succeed but the read data will be zero.  Recent experience with a similar bug on a different filesystem reinforces the point that most users would consider this a form of data corruption and would prefer that such reads fail completely instead of returning incorrect data.

Comment 2 Anand Avati 2010-01-19 10:26:05 UTC
PATCH: http://patches.gluster.com/patch/2655 in master (stripe entry self heal)