Bug 983145 - [RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.0 volume, used as VM image store on RHEV, leads to paused VM
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware: All Linux
Priority: medium  Severity: medium
Target Milestone: ---
Target Release: RHGS 2.1.2
Assigned To: Bug Updates Notification Mailing List
QA Contact: SATHEESARAN
Keywords: ZStream
Depends On:
Blocks:
Reported: 2013-07-10 11:38 EDT by Rejy M Cyriac
Modified: 2015-05-15 14:20 EDT

See Also:
Fixed In Version: glusterfs-3.4.0.44.1u2rhs-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
virt rhev integration
Last Closed: 2014-02-25 02:33:02 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Rejy M Cyriac 2013-07-10 11:38:56 EDT
Description of problem:
In a RHEV+RHS environment with a distribute-replicate RHS 2.0+ volume used as the image store, running a remove-brick operation intermittently causes a VM to pause. The VM can be recovered only by a forced shutdown.

Version-Release number of selected component (if applicable):
glusterfs-server-3.3.0.11rhs-1.el6rhs.x86_64

RHEVM: 3.2 (3.2.0-11.37.el6ev)
RHS: 2.0+ (3.3.0.11rhs-1.el6rhs.x86_64)
Hypervisor: RHEL6.4 & RHEVH6.4 with glusterfs-3.3.0.11rhs-1.el6.x86_64 and glusterfs-fuse-3.3.0.11rhs-1.el6.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Add a distribute-replicate volume to RHEV as a POSIX-compliant FS storage domain.
2. Create and run VMs on the storage domain.
3. Start a remove-brick operation on the volume.
4. Keep using the VMs until the remove-brick status shows completed.
5. A VM may go into paused state, with the message "VM <VM-name> has paused due to unknown storage error."
6. The VM is recoverable only after a forced shutdown, which may lead to loss of unsynced data.

Actual results:

During the remove-brick operation, a VM intermittently goes into paused state. It is recoverable after a forced shutdown, but this may lead to loss of unsynced data.


Expected results:

Functioning of VM should not be impacted during the remove-brick operation.

Additional info:

This issue was initially reported in BZ 923555 for RHS 2.0+. That BZ has since evolved to handle another issue, also caused by the remove-brick operation but valid only on RHS 2.1, in which the VMs get corrupted.

This BZ has therefore been opened to deal with the original issue afresh on RHS 2.0+. It is still reproducible, leads to intermittent instances of paused VMs, and may lead to loss of unsynced data.
Comment 3 Scott Haines 2013-09-23 15:47:22 EDT
Targeting for 2.1.z (Big Bend) U1.
Comment 4 Amar Tumballi 2013-11-26 02:17:05 EST
https://code.engineering.redhat.com/gerrit/#/c/16039/ should fix this. Can we have a run of tests for this with glusterfs-3.4.0.44.1u2rhs build?
Comment 5 Gowrishankar Rajaiyan 2013-12-10 11:25:29 EST
Clearing the needinfo flag since this bug is now ON_QA for verification.
Comment 6 SATHEESARAN 2014-01-20 03:40:23 EST
Tested with glusterfs-3.4.0.57rhs-1.el6rhs and rhevm IS32.2

All the operations below were performed from the RHEVM UI unless noted otherwise

1. Created a GlusterFS data center (3.3 compatibility)
2. Created a gluster-enabled cluster (3.3 compatibility)
3. Added 4 RHSS nodes, one by one, to the above cluster
4. Once all the RHSS nodes were up in the UI, created a distribute-replicate volume of type 6X2
5. Optimized the volume for virt-store
6. Started the volume
7. Created a data domain using the above volume
8. Created 2 App VMs with a root disk of size 30GB
9. Installed RHEL 6.5 on the App VMs
10. Ran the "dd" command in a loop inside these VMs
(i.e.) dd if=/dev/urandom of=/home/file$i bs=1024k count=1000
11. From one of the RHSS nodes (gluster CLI), started remove-brick with data migration
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> start
12. Checked the status of the remove-brick operation
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> status
The migration should run to completion
13. Committed the remove-brick once the rebalance had completed
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> commit
The volume was now 5X2
14. Repeated steps 11, 12 and 13 until the volume became 2X2 (i.e.) repeated the remove-brick 3 more times
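
For anyone re-running this, steps 11 to 13 can be wrapped roughly as below. This is only a sketch, not the exact procedure used in the test: the function name remove_pair and the GLUSTER and POLL_INTERVAL variables are made up for illustration, the status check assumes the CLI output contains the words "in progress", "completed" or "failed", and --mode=script is assumed to be available to skip the commit confirmation prompt.

```shell
#!/bin/sh
# Hypothetical helper: remove one replica pair, wait for migration, then commit.
GLUSTER=${GLUSTER:-gluster}            # CLI binary; overridable for dry runs
POLL_INTERVAL=${POLL_INTERVAL:-30}     # seconds between status checks

remove_pair() {
    vol=$1 brick1=$2 brick2=$3

    # Step 11: start remove-brick with data migration
    "$GLUSTER" volume remove-brick "$vol" "$brick1" "$brick2" start || return 1

    # Step 12: poll the status until migration is done on all nodes
    while :; do
        status=$("$GLUSTER" volume remove-brick "$vol" "$brick1" "$brick2" status) || return 1
        case $status in
            *failed*)        echo "migration failed, not committing" >&2; return 1 ;;
            *"in progress"*) sleep "$POLL_INTERVAL" ;;
            *completed*)     break ;;
            *)               sleep "$POLL_INTERVAL" ;;   # e.g. not started yet
        esac
    done

    # Step 13: commit only after migration has completed
    "$GLUSTER" --mode=script volume remove-brick "$vol" "$brick1" "$brick2" commit
}
```

Calling remove_pair <vol-name> <brick1> <brick2> once per replica pair, four times in total, corresponds to taking the volume from 6X2 down to 2X2. The function refuses to commit if any node reports a failed migration, since committing at that point could drop data that was never moved.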

The App VMs remained healthy
Rebooted the App VMs multiple times, and they were still healthy
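
The write load from step 10 can be generated inside each VM with a loop along these lines. Again a sketch only: the function name write_files and its arguments are hypothetical, and the number of files written during the original test is not recorded here, so the count is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT files of SIZE_MB MiB of random data into DIR.
write_files() {
    dir=$1 count=$2 size_mb=${3:-1000}   # 1000 x 1024k blocks, as in step 10
    i=1
    while [ "$i" -le "$count" ]; do
        # Same dd invocation as step 10, with the target directory parameterised
        dd if=/dev/urandom of="$dir/file$i" bs=1024k count="$size_mb" 2>/dev/null || return 1
        i=$((i + 1))
    done
}
```

For example, write_files /home 10 would write /home/file1 through /home/file10, each about 1GB, which keeps the VM disk busy while remove-brick migrates data underneath it.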
Comment 8 errata-xmlrpc 2014-02-25 02:33:02 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
