Description of problem:
=======================
In a pure replicate volume (1x2), while self-heal is in progress, the VMs are moved to paused state.

[11/07/12 - 16:08:48 root@rhs-client6 ~]# gluster v info replicate

Volume Name: replicate
Type: Replicate
Volume ID: 19270a9d-a664-4344-8adb-a4ff1909f7f6
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-client6.lab.eng.blr.redhat.com:/disk0
Brick2: rhs-client7.lab.eng.blr.redhat.com:/disk0
Options Reconfigured:
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: INFO
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
cluster.eager-lock: enable
storage.linux-aio: enable

Version-Release number of selected component (if applicable):
=============================================================
[11/07/12 - 10:46:54 root@rhs-client6 ~]# gluster --version
glusterfs 3.3.0rhsvirt1 built on Oct 28 2012 23:50:59

[11/07/12 - 10:48:07 root@rhs-client6 ~]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-8.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

How reproducible:
=================
Intermittent

Steps to Reproduce:
===================
1. Create a replicate volume (1x2) with 2 servers and 1 brick on each server. This is the storage for the VMs. Start the volume.
2. Create 5 VMs. Perform the following operations on all the VMs:
   a. rhn_register
   b. yum update
   c. reboot
   d. execute:
      for i in `seq 1 100`; do rm -rf testdir ; mkdir testdir ; cd testdir ; for j in `seq 1 10000` ; do dd if=/dev/urandom of=file.$j bs=1k count=1024; done ; cd ../ ; done
3. Power off one storage node.
4. Create 3 new VMs. Start 1 VM and perform the following operations on it:
   a. rhn_register
   b. yum update
   c. reboot
5. On the 2nd newly created VM: start the VM and perform rhn_register.
6. Power on the storage node.

While the storage node comes back online, perform the following:
----------------------------------------------------------------
7. The IO on the first 5 VMs continues.
8. On the VM rebooted in step 4, execute the loop from step 2d:
   for i in `seq 1 100`; do rm -rf testdir ; mkdir testdir ; cd testdir ; for j in `seq 1 10000` ; do dd if=/dev/urandom of=file.$j bs=1k count=1024; done ; cd ../ ; done
9. Run yum update on the VM rhn_registered in step 5.
10. Start the other newly created VM.

Actual results:
===============
A few VMs which were running "dd" in a loop moved to paused state.

Expected results:
=================
The VMs should keep running successfully.
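For convenience, the dd one-liner from steps 2d and 8 can be restated as a standalone script. This is only a hypothetical restatement of the loop already given above; the `stress_loop` function name and its parameters are not part of the original report.

```shell
#!/bin/sh
# stress_loop OUTER INNER
# Repeatedly recreate a directory of 1 MB files filled from /dev/urandom,
# mirroring the loop from steps 2d/8 (originally OUTER=100, INNER=10000).
stress_loop() {
    outer=$1
    inner=$2
    i=1
    while [ "$i" -le "$outer" ]; do
        # Each pass deletes and rebuilds the whole tree, generating a
        # steady mix of unlinks, creates, and 1 MB writes.
        rm -rf testdir && mkdir testdir && cd testdir || return 1
        j=1
        while [ "$j" -le "$inner" ]; do
            dd if=/dev/urandom of="file.$j" bs=1k count=1024 2>/dev/null
            j=$((j + 1))
        done
        cd ..
        i=$((i + 1))
    done
}
```

Run as, e.g., `stress_loop 100 10000` from a directory on the gluster mount to match the original workload.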
Additional info:
================
1) Initially, when the VMs were moved to paused state (as shown in RHEVM), ssh to those machines and IO on those machines were successful.
2) Performed a reboot of the VMs that were moved to paused state (as shown in RHEVM) by executing the "reboot" command from within the VM. ssh to those machines failed. The VMs have not been responding for a long time.
I don't see any errors from fuse in the gluster mount logs on 7th/11/2012. If the operations are successful on the VMs that are paused, I wonder whether they are actually paused or RHEV-M is showing them as paused for some other reason.
(In reply to comment #7) > I don't see any errors in the logs from gluster mounts on 7th/11/2012 from > fuse. If the operations are successful on the VMs that are paused, I wonder > if they are actually paused or the rhev-m is showing it as paused for some > reason. Were the VMs in "paused" state or "Not Responding" state? The screenshot says it was "Not Responding" and NOT "paused". A VM can move to "Not Responding" state when the RHEV management agent did not receive a response while trying to issue a control command to the VM via libvirt, or failed to receive a keep-alive from the guest tools. This does not necessarily mean that the VM is down.
Federico wants this bug to be recreated and the setup to be shown to him before we can move forward with this bug. Could you please re-create the issue and show him the setup? Thanks, Pranith.
Yes, in status (from the screenshots) it shows "Not Responding". It was always "Not Responding" and never paused. Sorry for the typo.
*** Bug 874734 has been marked as a duplicate of this bug. ***
Just a data point... I ran through a scaled-down version of this sequence (3 and 2 VMs rather than 5 and 3), since I wanted to spin up a few VMs on our setup here, and didn't reproduce any problems. I repeated the test with all 5 VMs running the test script after a cycle of one of the bricks. The VMs did not pause and I didn't notice any "Not Responding" messages in the UI, though I did reproduce some hung-task delays in at least one VM in the latter test. My one hypervisor server is pretty loaded at this point, so perhaps I'll try again when I have another hypervisor available to drive more guests.
Brian, Assigning this to you for now.
I finally managed to reproduce a behavior that fits the description here. I start a couple of VMs on a 2x2 dist-rep volume, kill the glusterfsd's on one node, wait a bit and restart glusterd on that node. Self-heal begins in the client, and after a few minutes I run a sync and find the guest non-responsive (hung task messages ensue) until the self-heal completes. I hacked in a flush bypass and still reproduced the behavior, but did not reproduce it if data-self-heal is disabled in the client or if I kill glustershd immediately after restarting glusterd. I observe the following state:

- Self-heal starts in the client.
- A (pid=-1, start=0, len=0) lock request appears and is blocked on the self-heal. Given that a self-heal is already in progress on the client, I attribute this lock request to glustershd.
- The guest issues a write transaction, whose lock conflicts with the blocked lock above and ultimately pends until the self-heal completes.
A few ways I can think of to resolve this problem, in order of increasing complexity:

- Don't pend normal-priority locks on blocked low-priority locks. This introduces the potential for starvation of the low-priority lock, but right now I think the only user is afr, which might be reasonably safe.
- After a previous discussion with Pranith indicating that the purpose of the 0-0 lock is to exclude multiple self-heals, we could move the 0-0 lock into a separate lock domain (i.e., from the volume name to the volume name + "-sh") and hold it for the duration of the self-heal. This could work so long as we can handle backwards compatibility (self-heals from older clients) correctly.
- Find a way to use non-blocking locks in glustershd (e.g., skip files we can't lock for a later pass).
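The non-blocking idea in the third option can be illustrated outside gluster with ordinary POSIX file locks: flock(1) with -n returns immediately instead of queuing behind the current holder, which is exactly the property a trylock-based self-heal would rely on to skip busy files. This is only an illustrative sketch with a throwaway lock file, not gluster code or its inodelk machinery.

```shell
lockfile=$(mktemp)

# Holder: grab an exclusive lock and sit on it for a while,
# standing in for a long-running self-heal holding the full-file lock.
flock -x "$lockfile" -c 'sleep 2' &
holder=$!
sleep 0.2   # give the holder a moment to actually acquire the lock

# Trylock while held: -n fails immediately instead of pending behind
# the holder, so the caller can skip this file for a later pass.
if flock -xn "$lockfile" -c 'true'; then
    first=acquired
else
    first=busy
fi
echo "while held: $first"

wait "$holder"

# After the holder releases, the same non-blocking attempt succeeds.
if flock -xn "$lockfile" -c 'true'; then
    second=acquired
else
    second=busy
fi
echo "after release: $second"
rm -f "$lockfile"
```

The contrast with the bug's behavior is that a plain `flock -x` in place of the second call would simply block for the holder's full duration, which is the pending-lock pile-up described in comment #14.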
(In reply to comment #15) > - Don't pend normal priority locks on blocked low-priority locks. This > introduces the potential for starvation of the low priority lock, but right > now I think the only user is afr, which might be reasonably safe. > - After a previous discussion with Pranith indicating the purpose of the 0-0 > lock is to exclude multiple self-heals, we could move the 0-0 lock into a > separate lock domain (i.e., from the volume name to the volume name + "-sh") > and hold it for the duration of the self-heal. This could work so long as we > can handle backwards compatibility (self-heals from older clients) correctly. > - Find a way to use non-blocking locks in glustershd (e.g., skip files we > can't lock for a later pass). The last approach sounds like the best compromise. However, if it is not feasible to make glustershd locks purely non-blocking, then only glustershd's INITIAL lock to acquire the full range should be made low priority (and not a blanket pid=-1; e.g., we would not want a range lock, which also has pid=-1, to be of low priority). In any case it would be best if glustershd leaves a file with an existing lock for the next iteration.
Thanks Avati. After some trouble trying to manufacture this failure locally and a bit more debugging in the rhev setup, I have to slightly amend the description in comment #14. glustershd and the client do race for the self-heal, but glustershd actually gets the lock and proceeds with the self-heal. The client blocks on the full lock request and subsequently the guest seems to lock up until glustershd completes. I think the potential trylock solution still holds by doing the trylock in the client rather than glustershd. I'm testing a change that takes this approach in read/write triggered self-heals (which is where the already running vm use case leads to this situation) and incorporates Avati's point in comment #16 to do so only on the initial lock attempt.
The aforementioned changes are posted here:

http://review.gluster.org/#change,4257
http://review.gluster.org/#change,4258

I have also included for review a prospective change to make afr_flush() non-transactional, though this might still be subject to open issues:

http://review.gluster.org/#change,4261
CHANGE: http://review.gluster.org/4261 (afr: make flush non-transactional) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4257 (afr: support self-heal data trylock mechanism) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4258 (afr: use data trylock mode in read/write self-heal trigger paths) merged in master by Anand Avati (avati)