Red Hat Bugzilla – Bug 983145
[RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.0 volume, used as VM image store on RHEV, leads to paused VM
Last modified: 2015-05-15 14:20:44 EDT
Description of problem:
In a RHEV+RHS environment where a distribute-replicate RHS 2.0+ volume is used as the image store, running a remove-brick operation intermittently causes a VM to pause. The VM is recoverable only after a forced shutdown.
Version-Release number of selected component (if applicable):
RHEVM: 3.2 (3.2.0-11.37.el6ev)
RHS: 2.0+ (22.214.171.124rhs-1.el6rhs.x86_64)
Hypervisor: RHEL6.4 & RHEVH6.4 with glusterfs-126.96.36.199rhs-1.el6.x86_64 and glusterfs-fuse-188.8.131.52rhs-1.el6.x86_64
Steps to Reproduce:
1. Add a distribute-replicate volume to RHEV as a POSIX-compliant FS storage domain.
2. Create and run VMs on the storage domain.
3. Start a remove-brick operation.
4. Keep using the VMs until the remove-brick status shows completed.
5. A VM may go into the paused state with the message "VM <VM-name> has paused due to unknown storage error."
6. The VM is recoverable after a forced shutdown, which may lead to loss of unsynced data.
Actual results:
During the remove-brick operation, a VM intermittently goes into the paused state. It is recoverable after a forced shutdown, but this may lead to loss of unsynced data.
Expected results:
VM operation should not be affected during the remove-brick operation.
This issue was initially reported in BZ 923555 for RHS 2.0+. That BZ has since evolved to track a different issue, also caused by the remove-brick operation but valid only on RHS 2.1, in which the VMs get corrupted.
This BZ has therefore been opened to track the original issue afresh on RHS 2.0+. It is still reproducible, leads to intermittent instances of paused VMs, and may lead to loss of unsynced data.
Targeting for 2.1.z (Big Bend) U1.
https://code.engineering.redhat.com/gerrit/#/c/16039/ should fix this. Can we have a run of tests for this with glusterfs-184.108.40.206.1u2rhs build?
Clearing the needinfo flag since this bug is now ON_QA for verification.
Tested with glusterfs-220.127.116.11rhs-1.el6rhs and rhevm IS32.2
All operations below were performed from the RHEVM UI unless noted otherwise.
1. Created a GlusterFS Data center (3.3 compatibility)
2. Created a gluster-enabled cluster (3.3 compatibility)
3. Added 4 RHSS nodes, one by one, to the above cluster
4. Once all the RHSS nodes were up in the UI, created a distribute-replicate volume of type 6X2
5. Optimized the volume for virt-store
6. Started the volume
7. Created a Data domain using the above volume
8. Created 2 App VMs with a root disk of size 30GB
9. Installed RHEL 6.5 on the App VMs
10. Ran the "dd" command in a loop inside these VMs
(i.e.) dd if=/dev/urandom of=/home/file$i bs=1024k count=1000
11. From one of the RHSS nodes (gluster CLI), started remove-brick with data migration
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> start
12. Checked the status of the remove-brick operation
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> status
The migration should complete
13. Committed the bricks once the rebalance had completed
(i.e.) gluster volume remove-brick <vol-name> <brick1> <brick2> commit
The volume is now 5X2
14. Repeated steps 11, 12, and 13 until the volume became 2X2 (i.e., repeated the remove-brick 3 more times)
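For anyone re-running this, the write load from step 10 could be scripted roughly as below. This is only a sketch: the function name write_load and its three parameters are hypothetical, and the default of 10 files is an assumption, since the number of loop iterations is not stated above.

```shell
# Sketch only: sequential write load inside a VM, as in step 10.
# write_load and its parameters are hypothetical; the file count default is assumed.
write_load() {
    target_dir="${1:-/home}"   # directory that receives file1, file2, ...
    file_count="${2:-10}"      # assumed; the step does not state a loop bound
    blocks="${3:-1000}"        # 1000 blocks of 1024k = 1000 MiB per file

    i=1
    while [ "$i" -le "$file_count" ]; do
        dd if=/dev/urandom of="$target_dir/file$i" bs=1024k count="$blocks" || return 1
        i=$((i + 1))
    done
}
```

Calling write_load with no arguments writes to /home with the same dd parameters as step 10.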
App VMs are healthy
Rebooted the App VMs multiple times; they remained healthy
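One remove-brick cycle (steps 11 to 13) could be wrapped as sketched below for repeated runs. The function name remove_brick_cycle and the POLL_INTERVAL variable are hypothetical. The status strings matched ("in progress", "completed", "failed") are an assumption about the status output, and --mode=script is assumed to be available in the installed gluster CLI to skip the commit confirmation prompt.

```shell
# Sketch only: one remove-brick cycle (start, poll status, commit).
# POLL_INTERVAL and the status strings matched below are assumptions.
POLL_INTERVAL="${POLL_INTERVAL:-30}"

remove_brick_cycle() {
    vol="$1"; shift            # remaining arguments are the bricks to remove

    gluster volume remove-brick "$vol" "$@" start || return 1

    # Poll until no node reports "in progress"; stop early on failure.
    while :; do
        status=$(gluster volume remove-brick "$vol" "$@" status) || return 1
        case "$status" in
            *failed*)        echo "data migration failed" >&2; return 1 ;;
            *"in progress"*) sleep "$POLL_INTERVAL" ;;
            *completed*)     break ;;
            *)               sleep "$POLL_INTERVAL" ;;
        esac
    done

    # Assumed flag: --mode=script avoids the interactive y/n prompt on commit.
    gluster --mode=script volume remove-brick "$vol" "$@" commit
}
```

Usage would be remove_brick_cycle <vol-name> <brick1> <brick2>, repeated once per replica pair to be removed. Committing only after every node reports completed matters here, because a commit during migration can drop data that has not yet moved off the bricks.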
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.