Bug 874045

Summary:	[RHEV-RHS] VM's were not responding when self-heal is in progress
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	spandura
Component:	glusterfs	Assignee:	Brian Foster <bfoster>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Rahul Hinduja <rhinduja>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	2.0	CC:	aavati, bfoster, grajaiya, hchiramm, maillistofyinyin, rhinduja, rhs-bugs, rwheeler, sdharane, vbellur
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.3.0.5rhs-40	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	881685 (view as bug list)		Environment:
Last Closed:	2015-08-10 07:47:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	881685

Description spandura 2012-11-07 11:05:51 UTC

Description of problem:
=========================
In a pure replicate-volume (1x2) , during self-heal in progress, the VM's are moved to paused state. 

[11/07/12 - 16:08:48 root@rhs-client6 ~]# gluster v info replicate
 
Volume Name: replicate
Type: Replicate
Volume ID: 19270a9d-a664-4344-8adb-a4ff1909f7f6
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-client6.lab.eng.blr.redhat.com:/disk0
Brick2: rhs-client7.lab.eng.blr.redhat.com:/disk0
Options Reconfigured:
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: INFO
performance.quick-read: disable
performance.io-cache: disable
performance.stat-prefetch: disable
performance.read-ahead: disable
cluster.eager-lock: enable
storage.linux-aio: enable


Version-Release number of selected component (if applicable):
================================================================
[11/07/12 - 10:46:54 root@rhs-client6 ~]# gluster --version
glusterfs 3.3.0rhsvirt1 built on Oct 28 2012 23:50:59

[11/07/12 - 10:48:07 root@rhs-client6 ~]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-8.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch


How reproducible:
==================
Intermittent 


Steps to Reproduce:
=================
1.Create a distribute-replicate volume (1x2) with 2 servers and 1 brick on each server. This is the storage for the VM's. start the volume.

2.Create 5 VM's, Perform following operations on all the VM's. 

  a. rhn_register
  b. yum update
  c. reboot
  d. execute "for i in `seq 1 100`; do rm -rf testdir ; mkdir testdir ; cd testdir ; for j in `seq 1 10000` ; do dd if=/dev/urandom of=file.$j bs=1k count=1024; done ; cd ../ ; done "

3. Poweroff one storage node. 
 
4. Create 3 new VM's. start 1 VM. On the VM perform the following operation. 

  a. rhn_register
  b. yum update
  c. reboot

5. On 2nd newly created VM's. start the VM and perform rhn_register

6. poweron the storage node

While the storage node comes online perform the following:-
---------------------------------------------------------
7. The IO's on first 5 VM's continues. 

8. execute the code on rebooted VM in step 4 : "for i in `seq 1 100`; do rm -rf testdir ; mkdir testdir ; cd testdir ; for j in `seq 1 10000` ; do dd if=/dev/urandom of=file.$j bs=1k count=1024; done ; cd ../ ; done "

9. yum update on the rhn_registered VM in step5. 

10. start the other newly created VM. 


Actual results:
================
Few VM's which were running "dd" in loop moved to paused state.

Expected results:
================
VM's should be successfully running.

Comment 4 spandura 2012-11-07 11:40:41 UTC

Additional Info:-
================
1) Initially when the VM's were moved to paused state (as shown in RHEVM), ssh to those machines and IO on those machines were successful

2) Performed Reboot of the VM's which were moved to paused state (as shown in RHEVM) by executing the "reboot" command from the VM. ssh to those machines failed. 

The VM's are not responding from a long time.

Comment 7 Pranith Kumar K 2012-11-12 06:50:59 UTC

I don't see any errors in the logs from gluster mounts on 7th/11/2012 from fuse. If the operations are successful on the VMs that are paused, I wonder if they are actually paused or the rhev-m is showing it as paused for some reason.

Comment 8 Humble Chirammal 2012-11-14 11:41:04 UTC

(In reply to comment #7)
> I don't see any errors in the logs from gluster mounts on 7th/11/2012 from
> fuse. If the operations are successful on the VMs that are paused, I wonder
> if they are actually paused or the rhev-m is showing it as paused for some
> reason.

Were VMs in "paused" state or "Not responding" state ? Screen-shot says it was "Not responding" state and NOT "paused" state.

The VMs can move to "not responding" state when the RHEV Management agent did not receive response when trying to issue a control command over the VM using libvirt or failed to receive keep alive from the guest tools. This does not mean that the VM is necessary down.

Comment 9 Pranith Kumar K 2012-11-15 03:43:07 UTC

Federico wants this bug to be recreated and the setup to be shown to him before we can move forward with this bug. Could you please re-create and show the setup to him.

Thanks
Pranith.

Comment 10 spandura 2012-11-15 05:01:06 UTC

Yes, in status(from snapshots) it shows "Not Responding" . It was always not responding and never paused. sorry for the typo.

Comment 11 Pranith Kumar K 2012-11-15 07:56:28 UTC

*** Bug 874734 has been marked as a duplicate of this bug. ***

Comment 12 Brian Foster 2012-11-16 17:23:20 UTC

Just a data point... I ran through a scaled down (3 and 2 VMs rather than 5 and 3) version of this sequence since I wanted to spin up a few VMs on our setup here and didn't reproduce any problems. I repeated the test with all 5 VMs running the test script after a cycle of one of the bricks.

The VMs did not pause and I didn't notice any not responding messages in the UI, though I did reproduce some hung task delays in at least one VM in the latter test. My one hypervisor server is pretty loaded at this point, so perhaps I'll try again when I have another hypervisor available to drive more guests.

Comment 13 Vijay Bellur 2012-11-26 08:38:11 UTC

Brian, 

Assigning this to you for now.

Comment 14 Brian Foster 2012-11-28 17:29:19 UTC

I finally managed to reproduce a behavior that fits the description here. I start a couple VMs in a 2x2 dist-rep volume, kill the glusterfsd's on one node, wait a bit and restart glusterd on that node.

self-heal begins in the client and after I few minutes I run a sync and find the guest non-responsive (hung task messages ensue) until the self-heal completes. I hacked in a flush bypass and still reproduced the behavior, but did not reproduce if data-self-heal is disabled in the client or if I kill glustershd immediately after restarting glusterd.

I observe is the following state:
- self-heal starts in the client.
- A (pid=-1, start=0, len=0) lock request appears and is blocked on the self-heal. Given that a self-heal is already in progress on the client, I attribute this lock request to glustershd.
- The guest issues a write transaction, the lock for which conflicts with the blocked lock above and ultimately pends until the self-heal completes.

Comment 15 Brian Foster 2012-11-28 17:46:25 UTC

A few ways I can think of to resolve this problem in order of increasing complexity:

- Don't pend normal priority locks on blocked low-priority locks. This introduces the potential for starvation of the low priority lock, but right now I think the only user is afr, which might be reasonably safe.
- After a previous discussion with Pranith indicating the purpose of the 0-0 lock is to exclude multiple self-heals, we could move the 0-0 lock into a separate lock domain (i.e., from the volume name to the volume name + "-sh") and hold it for the duration of the self-heal. This could work so long as we can handle backwards compatibility (self-heals from older clients) correctly.
- Find a way to use non-blocking locks in glustershd (e.g., skip files we can't lock for a later pass).

Comment 16 Anand Avati 2012-11-29 19:38:41 UTC

(In reply to comment #15)

> - Don't pend normal priority locks on blocked low-priority locks. This
> introduces the potential for starvation of the low priority lock, but right
> now I think the only user is afr, which might be reasonably safe.
> - After a previous discussion with Pranith indicating the purpose of the 0-0
> lock is to exclude multiple self-heals, we could move the 0-0 lock into a
> separate lock domain (i.e., from the volume name to the volume name + "-sh")
> and hold it for the duration of the self-heal. This could work so long as we
> can handle backwards compatibility (self-heals from older clients) correctly.
> - Find a way to use non-blocking locks in glustershd (e.g., skip files we
> can't lock for a later pass).

The last approach sounds like the best compromise. However, if it is not feasible to make glustershd locks purely non-blocking, then only glustershd's INITIAL lock to acquire full range must be made low priority (and not a blanket pid=-1 - for e.g we would not want range lock which also has pid=-1 to be of low priority). In any case it would be best if glustershd leaves the file having an existing lock for the next iteration.

Comment 17 Brian Foster 2012-12-03 14:26:28 UTC

Thanks Avati. After some trouble trying to manufacture this failure locally and a bit more debugging in the rhev setup, I have to slightly amend the description in comment #14. glustershd and the client do race for the self-heal, but glustershd actually gets the lock and proceeds with the self-heal. The client blocks on the full lock request and subsequently the guest seems to lock up until glustershd completes.

I think the potential trylock solution still holds by doing the trylock in the client rather than glustershd. I'm testing a change that takes this approach in read/write triggered self-heals (which is where the already running vm use case leads to this situation) and incorporates Avati's point in comment #16 to do so only on the initial lock attempt.

Comment 18 Brian Foster 2012-12-03 21:00:40 UTC

The aforementioned changes are posted here:

http://review.gluster.org/#change,4257
http://review.gluster.org/#change,4258

I have also included for review a prospective change to make afr_flush() non-transactional, though this might be still subject to open issues:

http://review.gluster.org/#change,4261

Comment 19 Vijay Bellur 2012-12-04 22:43:10 UTC

CHANGE: http://review.gluster.org/4261 (afr: make flush non-transactional) merged in master by Anand Avati (avati)

Comment 20 Vijay Bellur 2012-12-04 22:45:17 UTC

CHANGE: http://review.gluster.org/4257 (afr: support self-heal data trylock mechanism) merged in master by Anand Avati (avati)

Comment 21 Vijay Bellur 2012-12-04 22:45:35 UTC

CHANGE: http://review.gluster.org/4258 (afr: use data trylock mode in read/write self-heal trigger paths) merged in master by Anand Avati (avati)