983145 – [RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.0 volume, used as VM image store on RHEV, leads to paused VM

Bug 983145 - [RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.0 volume, used as VM image store on RHEV, leads to paused VM

Summary: [RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.0 volume, u...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterfs
Sub Component:
Version:	2.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	RHGS 2.1.2
Assignee:	Bug Updates Notification Mailing List
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-07-10 15:38 UTC by Rejy M Cyriac
Modified:	2015-05-15 18:20 UTC (History)
CC List:	7 users (show)
Fixed In Version:	glusterfs-3.4.0.44.1u2rhs-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:	virt rhev integration
Last Closed:	2014-02-25 07:33:02 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2014:0208	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #2	2014-02-25 12:20:30 UTC

Description Rejy M Cyriac 2013-07-10 15:38:56 UTC

Description of problem:
In a RHEV+RHS environment, with distribute-replicate RHS 2.0+ volume used as image store, run of remove-brick operation intermittently leads to paused VM, recoverable only after a forced shutdown of the VM.

Version-Release number of selected component (if applicable):
glusterfs-server-3.3.0.11rhs-1.el6rhs.x86_64

RHEVM: 3.2 (3.2.0-11.37.el6ev)
RHS; 2.0+ (3.3.0.11rhs-1.el6rhs.x86_64)
Hypervisor: RHEL6.4 & RHEVH6.4 with glusterfs-3.3.0.11rhs-1.el6.x86_64 and glusterfs-fuse-3.3.0.11rhs-1.el6.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1.add distribute-replicate volume to RHEV as Posix compliant FS storage domain
2.create and run VMS on the storage domain
3.perform remove-brick operation start
4.access VM function till remove-brick status shows completed
5.VM may go into paused state, with message "VM <VM-name> has paused due to unknown storage error."
6.the VM is recoverable after forced shutdown of VM, which may lead to loss of data not synced

Actual results:

during remove-brick operation, intermittently the VM gets into paused state, which is recoverable after forced shutdown of VM, but may lead to loss of data not synced


Expected results:

Functioning of VM should not be impacted during the remove-brick operation.

Additional info:

This issue was initially reported in BZ 923555 for RHS 2.0+. But that BZ has now evolved to handling another issue, also caused by remove-brick operation, but valid only on RHS 2.1 , and where, as a result the VMs get corrupted.

So this BZ has been opened to deal with the original issue afresh on RHS 2.0+, which is still reproducible, and leads to intermittent instances of paused VMs, and may lead to loss of data not synced

Comment 3 Scott Haines 2013-09-23 19:47:22 UTC

Targeting for 2.1.z (Big Bend) U1.

Comment 4 Amar Tumballi 2013-11-26 07:17:05 UTC

https://code.engineering.redhat.com/gerrit/#/c/16039/ should fix this. Can we have a run of tests for this with glusterfs-3.4.0.44.1u2rhs build?

Comment 5 Gowrishankar Rajaiyan 2013-12-10 16:25:29 UTC

Clearing the needinfo flag since this bug is now ON_QA for verification.

Comment 6 SATHEESARAN 2014-01-20 08:40:23 UTC

Tested with glusterfs-3.4.0.57rhs-1.el6rhs and rhevm IS32.2

All below operation are performed from RHEVM UI

1. Created a GlusterFS Data center (3.3 compatibility )
2. Created a gluster enabled cluster (3.3 compatibility)
3. Added 4 RHSS Node, one by one, to the above created cluster 
4. Once all the RHSS Nodes are up in UI, create a distribute replicate volume of type 6X2
5. Optimize the volume for virt-store
6. Started the volume
7. Create a Data domain using the above created volume
8. Create 2 App VMs with root disk of size 30GB
9. Install the App VMs with RHEL 6.5
10. Run "dd" command in loop inside these vms
(i.e) dd if=/dev/urandom of=/home/file$i bs=1024k count=1000
11. From one of the RHSS Node (gluster cli), start remove brick with data migration
(i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> start
12. Check for the status of remove brick operation
(i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> status
This migration should get completed
13. Commit the bricks once the rebalance is completed
(i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> commit
Now the volume has become 5X2
14. Repeated step 11, step 12, step 13 till the volume becomes 2X2 (i.e) repeat the remove-brick 3 more times

App VMs are healthy
Rebooted the App VMs multiple times and again they are healthy

Comment 8 errata-xmlrpc 2014-02-25 07:33:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

Note You need to log in before you can comment on or make changes to this bug.