Description of problem:
AFR2 has an oversight in it's "discovery" code path (a code path that figures out which brick in a replica set a client should be "pinned"). This code path is triggered when the generation changes (e.g. brick connects/disconnects from subvol, or on a fresh start of a client of NFS daemon).
In the event a brick is restored into a cluster, say after re-imaging or re-formatting of the file system, the discovery mechanism does not compare the directory contents checksums (as AFR1 did) of all replicas to ensure that even though wise-fool insists the replica is clean (as it would when the replica is completely empty and no changes are made to other replicas to the brick's root dir), it will happily "pin" a client to the empty brick in the subvolume. Subsequent accesses to this sub-volume will return empty results, and "in-flight" NFS operations will spew errors since directories on this sub-volume have now vanished.
It's important to note that in a 2x or 3x replication setup, only 50% or 33% of clients will observe this behavior since they have a 50% or 66% chance respectively to get "pinned" to a sub-volume which is _not_ empty.
Version-Release number of selected component (if applicable):
It is believed all AFR2 implementations of GlusterFS are affected by this issue and it has been confirmed on 3.6.3, 3.6.4 and 3.7.3.
100% reproducible if you turn off "choose-local" and figure out which replica your FUSE mount will typically pick as it's read-subvolume (in my tests it's the 2nd one, however depending on hardware it could be another one). It might also be worth simply specifying the read-subvolume directly and observing the issue this way.
Steps to Reproduce:
1. Create a 3x replicated volume.
2. Turn choose-local off (simplifies debugging).
3. Mount & write some data to the volume.
4. Down & wipe the data on the 2nd brick in the volume (consistent for my tests, though for other systems it might be the 1st brick or the 3rd brick, it's all dependent on what order the RPC replies come).
5. Bring the empty brick back up, but DO NOT run a full self-heal yet.
6. Re-mount the volume and observe the volume now appears empty.
The volume appears empty.
ls'ing the volume should show you the data you previously wrote.
It's clear that running a full heal will remedy this situation, but it seems a sub-optimal user experience that 33-50% of clients will see their data vanish while the heal is taking place. In AFR1 there was a code path that checksum'ed the actual contents of all readdir replies (and merging them if necessary) to ensure clients saw a consistent and correct view of the FS in such situations.
My suggestion is that during inode refreshes (i.e. generation changes), the entry-self-heal code path is forced to ensure these situations are detected and healed.
Created attachment 1067486 [details]
Prove test showing bug w/ patch fixing it.
Uploaded a patch & prove test showing this issue. It's easily demo'd by pinning the read-subvolume using cluster.read-subvolume-index and then wiping this brick out and adding it back in. Subsequent reads will simply be blank.
Oh and patch is based on v3.6.3, I think our AFR xlator is more or less similar, if it doesn't patch clean let me know and I can re-work it. The patch itself is pretty simple so you should be able to figure it out :).
Any update on this? This is a pretty critical bug fix that should probably be in 3.7 asap.
Pranith and I had a discussion about this. We are working towards a solution.
Meanwhile, we plan to provide a workaround script for the issue this week.
Will update the bug with it asap.
@Anuradha and @Pranith, another benefit to approach shown in this patch is bricks will automatically get healed back into the cluster, there is no need to run "full heal" which at our scale is not desirable and a waste of IO resources.
This patch correctly seeds the root dir heals upon the brick re-joining the cluster and from there the SHD's on the replica nodes will heal back all data. There is no need for any human intervention. Beyond this bug, this feature alone is critically important at our scale and I'd encourage it becoming standard.
I like the idea, I will take the patch in. If just after the brick re-joins, for some reason if an entry creation succeeds just on the new brick but fails on the old brick(May be because of EEXIST failure on correct brick), then the heal will happen in reverse direction which will end up deleting the correct content!! In the worst case it will wipe out the entire brick(I haven't seen it happen in the past 4 years I was on afr though, it is a corner case). That is why we documented the replace-brick/disk procedure in the following way(Yet to be merged):
Anuradha and I were thinking of giving you guys a script just to make sure you guys don't hit this case.
In 3.7.x we fixed replace-brick command to do what you are describing in comment-7 in an automatic way without running full heal which also avoids the bug I described above and the readdir bug:
You can see the patches that fixed replace-brick command @bug:
But replace-disk is a bit tricky in the sense that brick contents change in the background without afr/glusterd's knowledge. I am still thinking how best to address this. I have an idea to keep markers on the directories which are in good condition and on absence of those marker it should never become source and has to trigger the heal and do what your patch does. But I still need to work out some details for backward compatibility.
Thanks for the patch!!
I think we will announce about this bug and give history on gluster-users.
So the situation you refer to shouldn't really be possible with AFR2 + generation + discovery code paths.
Upon a new brick re-joining the cluster, all clients will be forced into the discovery code path prior to being able to legitimately read from the brick which rejoined. With my patch they are forced into the entry-self-heal flow for the root directory (only) such that a correct determination can be made. Also, since a entry-lock is required on the root of the volume, only one such client (FUSE, gNFSd etc) will be able to go into this heal flow at any time, and the patch forces a conservative merge of the root directory in this case. In the worst case if somehow there was a difference between two of the existing bricks (which didn't rejoin) and the brick that re-joined, the root amongst all bricks would simply be merged (as this is a split-brain my definition).
In some sense this situation is itself is a special case of split brain, as the root directory by definition is in split-brain since the the brick and it's two partners most likely all have clean AFR change-log attributes. Thus, the conservative merge is prescribed as remedy to this situation, and what is used in this patch.
Oh and on the replace-bick flow: this patch addresses the case where are you aren't replacing the brick...you are re-integrating the brick after it has been paved. Basically the case where the RAID volume was a total loss (and the host has been re-imaged), and it has to be rebuilt by the replicas. So there's no "replace brick" command to give you signal that this is the case, it simply rejoins the cluster empty (via some help external automation).
It would be nice for example for replace brick to allow you to "replace-brick" with the _same_ brick path, this way this case can be handled more elegantly within the Gluster code base itself. Or perhaps something like, "gluster volume rebuild rebuild <volume> <brick>" which would add the volume-id back to the brick directory and kick a rebuild off on the associated sub-volume.
If I understood the patch correctly, it does the conservative merge only at the time of lookup. There is a chance that the lookups would have been served by md-cache and the entry creation could come in before the code path for conservative merge is hit, which could lead to the problem I described. It is interesting that we both are thinking on similar lines :-). Anuradha and I talked about modifying the replace-brick command to allow same path for replace-brick. We will consider about the rebuild command as well. I will update on this bug about the final decision before we start coding.
I've posted a patch upstream which incorporates the replace-brick command change that Pranith mentioned in comment #11. Let me know if you see any problem with this approach.
Patch can be found at : http://review.gluster.org/#/c/12250/
*** Bug 1120991 has been marked as a duplicate of this bug. ***
http://review.gluster.org/#/c/12250/ is now merged into mainline.
1)The Reset Brick command works well to replace the same brick path, which cannot be acheived by replace brick(which asks for different path)
2)reset brick must not accepet different path (for that go for replace brick)
3) in the testcase mentioned by the reporter(comment#1), the testcase fails now too.
However that being an invalid way of doing it, with reset brick command, the data is available by the client(unlike where the data was not seen on mount when deleting from backend brick)
ie, even now deleting from backend manually will make the data seem lost from mount
4) mention hostname instead of IP in the destination (source as IP) while commit force(this method can be used for replacing IPs with hostnames)
5) mention an online brick for commit force(must fail, saying brick is online0
6)healing post reset brick must work
I have run the above cases for replicate volume
Also note that reset brick has been tried on x3 and dist-disperse volume and i see the workflow working as expected
I see all of them passing.
However following were the observations:
1)for a replicate volume(1x2 or 1x3), i had to use commit force
2) we need to make sure the admin is aware(by documentation) and makes sure that there are no heals pending for the brick being reset , also no splitbrains are there on that brick being reset. If there are any it is advisible to resolve them before doing reset brick
3) always make sure the backend brick path is clean from any stale data where reset brick is being completed by commiting (so as to avoid inconsitencies)
I have verified all the above scenarios and found no issues, except below observations
1)use of commit force and commit (without force) is a bit ambiguous
I saw that we need to use commit force if we dont delete the complete brick path, but delete all the content under brick path (which should not be recommended)
2)I saw a case where on a 1x2 volume the reset brick failed and both the bricks were offline ie brick being reset and the healthy brick. I don't know if that healthy brick was offline already. Hence not sure if that was an intermittent or a setup issue. Tried reproducing but of no avail.
Also , we must recommend in documentation to wait for all pending heals to complete and preferably no IOs going on
moving it to verified as the feature works from what it was expected to do
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.