Bug 874045
| Summary: | [RHEV-RHS] VM's were not responding when self-heal is in progress | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | spandura |
| Component: | glusterfs | Assignee: | Brian Foster <bfoster> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.0 | CC: | aavati, bfoster, grajaiya, hchiramm, maillistofyinyin, rhinduja, rhs-bugs, rwheeler, sdharane, vbellur |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.3.0.5rhs-40 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 881685 (view as bug list) | Environment: | |
| Last Closed: | 2015-08-10 07:47:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 881685 | | |
Description (spandura, 2012-11-07 11:05:51 UTC)
Additional Info:
================
1) Initially, when the VMs were moved to the paused state (as shown in RHEV-M), ssh to those machines and I/O on them were successful.
2) Rebooted the VMs that were shown as paused in RHEV-M by executing the "reboot" command from within the VM. ssh to those machines then failed, and the VMs have been unresponsive for a long time.

I don't see any errors from FUSE in the gluster mount logs from 7/11/2012. If operations are successful on the VMs that are paused, I wonder whether they are actually paused or RHEV-M is showing them as paused for some other reason.

(In reply to comment #7)
> I don't see any errors from FUSE in the gluster mount logs from 7/11/2012. If
> operations are successful on the VMs that are paused, I wonder whether they
> are actually paused or RHEV-M is showing them as paused for some other reason.

Were the VMs in the "paused" state or the "Not Responding" state? The screenshot says it was the "Not Responding" state and NOT the "paused" state.

The VMs can move to the "Not Responding" state when the RHEV management agent does not receive a response while issuing a control command to the VM through libvirt, or fails to receive a keep-alive from the guest tools. This does not necessarily mean that the VM is down. Federico wants this bug to be recreated and the setup shown to him before we can move forward with this bug. Could you please re-create it and show him the setup? Thanks, Pranith.

Yes, the status (from the screenshots) shows "Not Responding". It was always "Not Responding" and never paused; sorry for the typo.

*** Bug 874734 has been marked as a duplicate of this bug. ***

Just a data point... I ran through a scaled-down (3 and 2 VMs rather than 5 and 3) version of this sequence, since I wanted to spin up a few VMs on our setup here, and didn't reproduce any problems. I repeated the test with all 5 VMs running the test script after a cycle of one of the bricks. The VMs did not pause and I didn't notice any "not responding" messages in the UI, though I did reproduce some hung-task delays in at least one VM in the latter test. My one hypervisor server is pretty loaded at this point, so perhaps I'll try again when I have another hypervisor available to drive more guests.

Brian, assigning this to you for now.

I finally managed to reproduce a behavior that fits the description here. I start a couple of VMs on a 2x2 distributed-replicated volume, kill the glusterfsd processes on one node, wait a bit and restart glusterd on that node. Self-heal begins in the client, and after a few minutes I run a sync and find the guest non-responsive (hung-task messages ensue) until the self-heal completes. I hacked in a flush bypass and still reproduced the behavior, but did not reproduce it if data-self-heal is disabled in the client or if I kill glustershd immediately after restarting glusterd. I observe the following state:

- Self-heal starts in the client.
- A (pid=-1, start=0, len=0) lock request appears and is blocked on the self-heal. Given that a self-heal is already in progress on the client, I attribute this lock request to glustershd.
- The guest issues a write transaction, the lock for which conflicts with the blocked lock above and ultimately pends until the self-heal completes.
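To make the interaction above easier to follow, here is a minimal, hypothetical model of a lock table in which an incoming request pends behind blocked (queued) requests as well as granted ones. Nothing below is GlusterFS source; `Lock`, `LockTable`, and the queueing policy are simplified assumptions intended only to show why a guest write can end up waiting behind a blocked full-file (0-0) lock.

```python
# Illustrative model only: names and policy are assumptions, not GlusterFS code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lock:
    owner: str
    start: int
    length: int  # 0 means "to end of file", so (start=0, len=0) covers the whole file

    def conflicts(self, other: "Lock") -> bool:
        self_end = float("inf") if self.length == 0 else self.start + self.length
        other_end = float("inf") if other.length == 0 else other.start + other.length
        return self.start < other_end and other.start < self_end


@dataclass
class LockTable:
    granted: List[Lock] = field(default_factory=list)
    blocked: List[Lock] = field(default_factory=list)  # FIFO queue of waiting requests

    def request(self, lock: Lock) -> str:
        # A new request pends if it conflicts with a granted lock *or* with an
        # already-blocked request, so a normal write can queue up behind a
        # blocked whole-file (0-0) lock request.
        conflicts = [l for l in self.granted + self.blocked if l.conflicts(lock)]
        if conflicts:
            self.blocked.append(lock)
            return f"{lock.owner}: blocked behind {[c.owner for c in conflicts]}"
        self.granted.append(lock)
        return f"{lock.owner}: granted"


table = LockTable()
print(table.request(Lock("in-progress self-heal", 0, 0)))    # whole-file lock, granted
print(table.request(Lock("competing self-heal", 0, 0)))      # whole-file lock, blocked
print(table.request(Lock("guest write transaction", 4096, 131072)))  # pends behind the blocked 0-0 lock
```

Under this model, checking new requests only against granted locks (one of the options discussed below) would let the guest write be granted immediately instead of queueing behind the blocked low-priority request.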
A few ways I can think of to resolve this problem, in order of increasing complexity:

- Don't pend normal-priority locks on blocked low-priority locks. This introduces the potential for starvation of the low-priority lock, but right now I think the only user is AFR, which might be reasonably safe.
- After a previous discussion with Pranith indicating that the purpose of the 0-0 lock is to exclude multiple self-heals, we could move the 0-0 lock into a separate lock domain (i.e., from the volume name to the volume name + "-sh") and hold it for the duration of the self-heal. This could work so long as we can handle backwards compatibility (self-heals from older clients) correctly.
- Find a way to use non-blocking locks in glustershd (e.g., skip files we can't lock for a later pass).

(In reply to comment #15)
> - Don't pend normal-priority locks on blocked low-priority locks. This
>   introduces the potential for starvation of the low-priority lock, but right
>   now I think the only user is AFR, which might be reasonably safe.
> - After a previous discussion with Pranith indicating that the purpose of the
>   0-0 lock is to exclude multiple self-heals, we could move the 0-0 lock into
>   a separate lock domain (i.e., from the volume name to the volume name +
>   "-sh") and hold it for the duration of the self-heal. This could work so
>   long as we can handle backwards compatibility (self-heals from older
>   clients) correctly.
> - Find a way to use non-blocking locks in glustershd (e.g., skip files we
>   can't lock for a later pass).

The last approach sounds like the best compromise. However, if it is not feasible to make glustershd's locks purely non-blocking, then only glustershd's INITIAL lock to acquire the full range must be made low priority (and not a blanket pid=-1; for example, we would not want a range lock, which also has pid=-1, to be low priority). In any case it would be best if glustershd leaves a file that has an existing lock for the next iteration.

Thanks, Avati. After some trouble trying to manufacture this failure locally and a bit more debugging in the RHEV setup, I have to slightly amend the description in comment #14. glustershd and the client do race for the self-heal, but glustershd actually gets the lock and proceeds with the self-heal. The client blocks on the full lock request, and subsequently the guest seems to lock up until glustershd completes. I think the potential trylock solution still holds by doing the trylock in the client rather than glustershd. I'm testing a change that takes this approach in read/write-triggered self-heals (which is where the already-running VM use case leads to this situation) and incorporates Avati's point in comment #16 to do so only on the initial lock attempt.

The aforementioned changes are posted here:

http://review.gluster.org/#change,4257
http://review.gluster.org/#change,4258

I have also included for review a prospective change to make afr_flush() non-transactional, though this might still be subject to open issues:

http://review.gluster.org/#change,4261

CHANGE: http://review.gluster.org/4261 (afr: make flush non-transactional) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4257 (afr: support self-heal data trylock mechanism) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4258 (afr: use data trylock mode in read/write self-heal trigger paths) merged in master by Anand Avati (avati)
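As a rough illustration of the trylock idea behind the merged changes (attempt the initial full-range data lock non-blockingly and leave an already-locked file for a later pass, rather than queueing a blocking request that normal I/O can pile up behind), here is a minimal sketch. It uses plain POSIX advisory locks as a stand-in for AFR's lock calls; `try_self_heal` and `heal_data` are hypothetical names, and none of this is the actual AFR implementation.

```python
# Minimal sketch only: POSIX advisory locks stand in for AFR's locking here.
import errno
import fcntl
import tempfile


def heal_data(f) -> None:
    """Placeholder for the actual data self-heal work done under the lock."""


def try_self_heal(path: str) -> bool:
    """Attempt a non-blocking whole-file lock; defer the file if it is busy."""
    with open(path, "rb+") as f:
        try:
            # Non-blocking exclusive lock: the analogue of the trylock mode.
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            if err.errno in (errno.EACCES, errno.EAGAIN):
                # Someone else already holds the lock: leave this file for a
                # later pass instead of queueing a blocking request behind it.
                return False
            raise
        try:
            heal_data(f)
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
    return True


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("healed" if try_self_heal(tmp.name) else "deferred to next pass")
```

The key design point this sketches is the fallback path: when the lock is contended, the heal attempt returns immediately and the file is revisited later, so foreground I/O never ends up waiting behind a blocked self-heal lock request.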