Bug 1256524
Summary: | [RFE] reset brick
---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage
Reporter: | rwareing
Component: | replicate
Assignee: | Ashish Pandey <aspandey>
Status: | CLOSED ERRATA
QA Contact: | Nag Pavan Chilakam <nchilaka>
Severity: | urgent
Priority: | unspecified
Version: | unspecified
CC: | amukherj, aspandey, pkarampu, rcyriac, rhinduja, rhs-bugs, sankarshan, smohan, spandura, sshreyas
Target Milestone: | ---
Keywords: | FutureFeature, RFE
Target Release: | RHGS 3.2.0
Hardware: | All
OS: | All
Fixed In Version: | glusterfs-3.8.4-1
Doc Type: | Bug Fix
Cloned to: | 1266876 (view as bug list)
Last Closed: | 2017-03-23 05:23:04 UTC
Type: | Bug
Bug Blocks: | 1266876, 1277939, 1351503
Description  rwareing  2015-08-24 20:04:55 UTC
Created attachment 1067486 [details]
Prove test showing bug w/ patch fixing it.
Uploaded a patch & prove test showing this issue. It's easily demonstrated by pinning the read-subvolume using cluster.read-subvolume-index, then wiping that brick out and adding it back in. Subsequent reads will simply be blank. The patch is based on v3.6.3; I think our AFR xlator is more or less similar, but if it doesn't apply cleanly, let me know and I can rework it. The patch itself is pretty simple, so you should be able to figure it out :).

Any update on this? This is a pretty critical bug fix that should probably be in 3.7 asap.

Pranith and I had a discussion about this. We are working towards a solution. Meanwhile, we plan to provide a workaround script for the issue this week. Will update the bug with it asap.

@Anuradha and @Pranith, another benefit to the approach shown in this patch is that bricks automatically get healed back into the cluster; there is no need to run a "full heal", which at our scale is not desirable and a waste of IO resources. This patch correctly seeds the root-directory heals when the brick re-joins the cluster, and from there the SHDs on the replica nodes heal back all the data. There is no need for any human intervention. Beyond this bug, this feature alone is critically important at our scale, and I'd encourage it becoming standard.

Richard, I like the idea, and I will take the patch in. However, if just after the brick re-joins, an entry creation succeeds only on the new brick but fails on the old brick (perhaps because of an EEXIST failure on the correct brick), then the heal will happen in the reverse direction, which will end up deleting the correct content! In the worst case it will wipe out the entire brick (I haven't seen it happen in the past 4 years I was on AFR, though; it is a corner case). That is why we documented the replace-brick/disk procedure in the following way (yet to be merged): http://review.gluster.com/8503 . Anuradha and I were thinking of giving you guys a script just to make sure you don't hit this case.
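A hedged sketch of the reproduction described above, assuming a simple replicated test volume (volume name, brick path, and mount point are placeholders; `cluster.read-subvolume-index` is the option named in the report; this approximates the attached prove test rather than reproducing it exactly):

```shell
# Pin all reads to the first subvolume (brick 0 of each replica set)
gluster volume set testvol cluster.read-subvolume-index 0

# Wipe the pinned brick's backing directory out-of-band,
# then force the brick back into the volume
rm -rf /bricks/b0/*
gluster volume start testvol force

# Without the fix, reads through a client mount that land on the
# pinned (now empty) brick simply come back blank
cat /mnt/testvol/file1
```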
In 3.7.x we fixed the replace-brick command to do what you are describing in comment 7 automatically, without running a full heal, which also avoids the bug I described above and the readdir bug. You can see the patches that fixed the replace-brick command at this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1207829

But replace-disk is a bit tricky, in the sense that the brick contents change in the background without AFR's/glusterd's knowledge. I am still thinking how best to address this. I have an idea to keep markers on the directories that are in good condition; in the absence of those markers, a brick should never become a heal source, and instead has to trigger the heal and do what your patch does. But I still need to work out some details for backward compatibility. Thanks for the patch!! I think we will announce this bug and give its history on gluster-users.

Pranith

Hey Pranith,

The situation you refer to shouldn't really be possible with the AFR2 + generation + discovery code paths. When a new brick re-joins the cluster, all clients are forced into the discovery code path before they can legitimately read from the brick that rejoined. With my patch they are forced into the entry-self-heal flow for the root directory (only), such that a correct determination can be made. Also, since an entry-lock is required on the root of the volume, only one such client (FUSE, gNFSd, etc.) can enter this heal flow at any time, and the patch forces a conservative merge of the root directory in this case. In the worst case, if somehow there were a difference between two of the existing bricks (which didn't rejoin) and the brick that re-joined, the root amongst all bricks would simply be merged (as this is a split-brain by definition). In some sense this situation is itself a special case of split-brain, as the root directory is by definition in split-brain: the brick and its two partners most likely all have clean AFR change-log attributes.
Thus, the conservative merge is prescribed as the remedy for this situation, and is what is used in this patch.

Richard

Oh, and on the replace-brick flow: this patch addresses the case where you aren't replacing the brick; you are re-integrating the brick after it has been paved. Basically the case where the RAID volume was a total loss (and the host has been re-imaged), and the brick has to be rebuilt from the replicas. So there's no "replace brick" command to signal that this is the case; the brick simply rejoins the cluster empty (with some help from external automation). It would be nice, for example, for replace-brick to allow you to "replace-brick" with the _same_ brick path, so this case could be handled more elegantly within the Gluster code base itself. Or perhaps something like "gluster volume rebuild <volume> <brick>", which would add the volume-id back to the brick directory and kick off a rebuild on the associated sub-volume.

hi Richard, If I understood the patch correctly, it does the conservative merge only at the time of lookup. There is a chance that the lookups would have been served by md-cache, and the entry creation could come in before the code path for the conservative merge is hit, which could lead to the problem I described. It is interesting that we are both thinking along similar lines :-). Anuradha and I talked about modifying the replace-brick command to allow the same path for replace-brick. We will consider the rebuild command as well. I will update this bug with the final decision before we start coding.

Pranith

hi Richard, I've posted a patch upstream which incorporates the replace-brick command change that Pranith mentioned in comment #11. Let me know if you see any problem with this approach. The patch can be found at: http://review.gluster.org/#/c/12250/

*** Bug 1120991 has been marked as a duplicate of this bug. ***

http://review.gluster.org/#/c/12250/ is now merged into mainline.
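For context, the same-path replacement change merged here was exposed as the `reset-brick` CLI command; a hedged sketch of that workflow as documented for RHGS 3.2 / GlusterFS 3.9+, with host, volume, and brick path as placeholders:

```shell
# Take the brick offline in preparation for re-building its backing store
gluster volume reset-brick testvol server1:/bricks/b0 start

# ... re-create or re-mount the empty brick directory out-of-band ...

# Re-add the *same* brick path; commit force is required in some cases,
# e.g. when only the brick's contents (not the path itself) were removed
gluster volume reset-brick testvol server1:/bricks/b0 server1:/bricks/b0 commit force
```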
QATP:
=======
1) The reset-brick command works well to replace the same brick path, which cannot be achieved with replace-brick (which asks for a different path).
2) reset-brick must not accept a different path (for that, use replace-brick).
3) In the test case mentioned by the reporter (comment #1), the test case still fails. However, that being an invalid way of doing it, with the reset-brick command the data is available to the client (unlike before, where the data was not seen on the mount after deleting from the backend brick). That is, even now, deleting from the backend manually will make the data seem lost from the mount.

QATP, continued:
4) Mention a hostname instead of an IP in the destination (source as IP) while doing commit force (this method can be used for replacing IPs with hostnames).
5) Mention an online brick for commit force (must fail, saying the brick is online).
6) Healing after reset-brick must work.

Validation:
===========
I have run the above cases for a replicate volume. Also note that reset-brick has been tried on x3 and dist-disperse volumes, and I see the workflow working as expected. I see all of them passing. However, the following were the observations:
1) For a replicate volume (1x2 or 1x3), I had to use commit force.
2) We need to make sure the admin is aware (through documentation) and checks that there are no heals pending for the brick being reset, and also that there are no split-brains on that brick. If there are any, it is advisable to resolve them before doing reset-brick.
3) Always make sure the backend brick path is clean of any stale data where reset-brick is being completed by committing (so as to avoid inconsistencies).

I have verified all the above scenarios and found no issues, except the observations below:
1) The use of commit force versus commit (without force) is a bit ambiguous. I saw that we need to use commit force if we don't delete the complete brick path but delete all the content under the brick path (which should not be recommended).
2) I saw a case where, on a 1x2 volume, the reset-brick failed and both bricks were offline, i.e. the brick being reset and the healthy brick. I don't know whether that healthy brick was already offline, hence I am not sure if that was intermittent or a setup issue. I tried reproducing it, but to no avail.

Also, we must recommend in documentation to wait for all pending heals to complete, and preferably to have no I/O going on. Moving this to verified, as the feature does what it was expected to do.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
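The pre-reset checks recommended above (no pending heals, no split-brains on the brick being reset) can be made with the standard heal-status commands before starting; a sketch, with the volume name as a placeholder:

```shell
# Confirm there are no pending heals on the volume
gluster volume heal testvol info

# Confirm there are no files in split-brain
gluster volume heal testvol info split-brain
```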