Bug 871727
| Field | Value |
|---|---|
| Summary | [RHEV-RHS] Bringing down one storage node in a pure replicate volume (1x2) moved one of the VMs to paused state. |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | replicate |
| Version | 2.0 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | spandura |
| Assignee | Pranith Kumar K <pkarampu> |
| QA Contact | SATHEESARAN <sasundar> |
| Docs Contact | |
| CC | bbuckley, bfoster, bmohanra, cmosher, grajaiya, jdarcy, nsathyan, pkarampu, ravishankar, rcyriac, rhs-bugs, rwheeler, sasundar, spandura, ssaha, storage-qa-internal, vagarwal, vbellur |
| Target Milestone | --- |
| Target Release | RHGS 3.1.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | glusterfs-3.7.0-1.el7rhs |
| Doc Type | Bug Fix |
| Doc Text | Previously, when self-heal was triggered by shd, it did not update the read-children. Because of this, if the other brick died, the VMs went into a paused state, as the mount assumed all read-children were down. With this fix, the read-children are repopulated using getxattr and the issue no longer occurs. |
| Story Points | --- |
| Clone Of | |
| | 1095112 (view as bug list) |
| Environment | virt rhev integration |
| Last Closed | 2015-07-29 04:27:54 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 957769, 1095112, 1202842 |
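
As a purely illustrative toy model of the Doc Text above (not GlusterFS/AFR source code; the class, field, and method names are invented for the example), the sketch below shows why a stale read-children cache pauses the VM once the remaining good brick dies, and how a getxattr-style refresh avoids it:

```python
# Toy illustration only of the failure mode described in the Doc Text.
# Names such as Mount, refresh_on_miss and the brick dictionaries are invented.

class Mount:
    def __init__(self, bricks):
        self.bricks = bricks                      # brick name -> {"up": bool, "clean": bool}
        self.readable = self._query_readable()    # cached "read children"

    def _query_readable(self):
        # Stand-in for the getxattr-based query: a brick can serve reads only
        # if it is up and its copy is clean (no pending heal).
        return {b for b, s in self.bricks.items() if s["up"] and s["clean"]}

    def read(self, refresh_on_miss=False):
        live = {b for b in self.readable if self.bricks[b]["up"]}
        if not live and refresh_on_miss:          # the fix: repopulate read-children
            self.readable = self._query_readable()
            live = {b for b in self.readable if self.bricks[b]["up"]}
        return "ok" if live else "EIO -> VM pauses"

bricks = {"brick0": {"up": True, "clean": True},
          "brick1": {"up": True, "clean": False}}   # brick1 still needs heal
m = Mount(bricks)                 # mount caches brick0 as the only read child
bricks["brick1"]["clean"] = True  # shd heals brick1, but the cache is not updated
bricks["brick0"]["up"] = False    # the other brick dies
print(m.read())                       # old behaviour: EIO -> VM pauses
print(m.read(refresh_on_miss=True))   # with the fix: refresh finds brick1, "ok"
```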
Description
spandura
2012-10-31 08:33:51 UTC
Yet to start work on this bug.

How reproducible is this bug in your testing? 100%? 50%? If it is reproducible enough to discern, does it always occur at the same point in the test case? This appears similar to a recent report suspected to be related to selinux, but ultimately not reproducible. I've attempted to reproduce this a couple of times on a local rhev/rhs setup without success. I've run through the entire test case sequence a couple of times, as well as repeated the final recovery step (step 15, boot up a node into recovery while the VMs are being updated) a couple more times independently. One of the latter tests is still running at the moment. I've reproduced some hung task messages, but no paused VMs thus far. I didn't see anything obvious in the logs that would explain the failure, though there is a lot of data (including expected failure output), so it's very possible I've missed something. I'll take a second look when I have a chance, but in the meantime I'd suggest we try to get access to an environment in this state if at all possible.

Per the 31/01 tiger team bug triage meeting, reducing priority because we can resume from the paused state.

Tried to recreate the problem. The test case passed this time with no VMs getting paused.

Dropping blocker tag as per the program meeting on 03/11. Targeting for 2.1.z (Big Bend) U1.

Can we run a round of tests? It's been 7 months since we last ran this test case.

Per triage 12/13, removing from corbett list.

To add to this bug, this issue was filed when server-side quorum and client-side quorum were not yet defaults in the virt profile. From RHSS 2.1 Update 2, we have enabled server-side quorum and client-side quorum in the virt profile, i.e. for virt-store volumes. Client-side quorum has a design constraint that the first brick of the replica group should be up. So, in that case (first brick down) with client-side quorum enabled, the VMs on that volume go to a paused state, failing fault-tolerance; but the failure of the second brick doesn't affect the VMs on that virt-store. Also tested the behavior without quorums enabled: when one of the bricks/nodes goes down, the other replica pair remains available and the App VMs stay up and running healthy.

What is the status of the bug fix for this problem? My customer has just filed support ticket #01079904 for the same problem. Thanks, Jin

More info. Customer is using:
RHEV 3.3.1-0.48.el6ev
RHS glusterfs 3.4.0.59rhs

He has a replicated volume on a two-node RHSS cluster:
gluster-node-0
gluster-node-1

He is using a "GlusterFS" Storage Domain configured as below:
Path: gluster-node-0.example.com:/TCC-RHEV
VFS Type: glusterfs
Mount options: backup-volfile-servers=gluster-node-1.example.com

He reported that the VM got paused when he manually took down the "gluster-node-0" node, but the VM works fine when he manually takes down node "gluster-node-1". In addition, I know that when "PosixFS" is used, RHEV uses the Gluster FUSE client to mount the Gluster volume on RHEV-H, but in this case he is using the "GlusterFS" type Storage Domain. Do the "Mount options" actually do anything in a "GlusterFS" Storage Domain (is it using libgfapi?)? Should the customer use "PosixFS" or "GlusterFS" in RHEV?

Hi Jin Zhou, did the customer enable client-quorum by any chance? If it is enabled, then this behavior is expected. Could you please check the gluster volume info output to confirm the same. Pranith

Customer is using the default "virt" profile, so I think by default client quorum is set to auto, and server quorum is set to "server".

But the part I don't understand is what caused the difference between the failure on gluster-node-0 and on gluster-node-1. I would expect client quorum to be enforced regardless of which brick/node goes down. Why is the VM suspended only when gluster-node-0 goes offline, not gluster-node-1? SATHEESARAN's note above seems to indicate this behavior, but it is not detailed enough for me. Lastly, since we only officially support replica=2 today, what is the rationale for enabling client quorum as "auto"? It seems useless to me, but I could be wrong. Thanks

Hi Jin Zhou, the client-quorum calculation happens the following way: in general, quorum is met when n/2 + 1 bricks of the replica set are available, but if the number of bricks is even and exactly n/2 bricks in the replica set are up, then quorum is met only if the first brick in the set is up. The reason client-quorum is enabled by default is that an image going into split-brain is much worse than losing availability when the first brick goes down. Without any quorum, VMs are accessible when 1) both bricks are up, 2) only the first brick in the replica set goes down, or 3) only the second brick in the replica set goes down. With client quorum, the VMs are accessible in cases 1) and 3). Pranith
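
The client-quorum rule described in the previous comment reduces to a few lines of arithmetic. The sketch below is illustrative only, not GlusterFS code; the function name and the boolean encoding of brick state are invented for the example:

```python
# Toy check of the client-quorum rule for quorum-type "auto", as described above.

def client_quorum_met(bricks_up):
    """bricks_up: one boolean per brick in the replica set, in brick order
    (index 0 is the 'first' brick of the set)."""
    n = len(bricks_up)
    up = sum(bricks_up)
    if up >= n // 2 + 1:               # strict majority of bricks is up
        return True
    if n % 2 == 0 and up == n // 2:    # even set, exactly half up:
        return bricks_up[0]            # the first brick breaks the tie
    return False

# Replica 2 (1x2), as in this bug:
print(client_quorum_met([True, True]))    # both bricks up     -> True  (VMs run)
print(client_quorum_met([False, True]))   # first brick down   -> False (VMs pause)
print(client_quorum_met([True, False]))   # second brick down  -> True  (VMs run)
```

Assuming gluster-node-0 hosts the first brick of the replica pair, this reproduces what the customer saw: taking down gluster-node-0 loses quorum and pauses the VMs, while taking down gluster-node-1 does not.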
Tested with RHGS 3.1 Nightly build (glusterfs-3.7.1-11.el7rhgs) with the following test:
1. Used a replica 2 volume to back the RHEV Data domain.
2. Powered off one of the nodes abruptly and observed that the VMs are still accessible and available.
Marking this bug as VERIFIED.

Hi Pranith, the doc text is updated. Please review the same and share your technical review comments. If it looks ok, then sign off on the same. Regards, Bhavana

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.