Description of problem: In a RHEV+RHS environment, with distribute-replicate RHS 2.0+ volume used as image store, run of remove-brick operation intermittently leads to paused VM, recoverable only after a forced shutdown of the VM. Version-Release number of selected component (if applicable): glusterfs-server-3.3.0.11rhs-1.el6rhs.x86_64 RHEVM: 3.2 (3.2.0-11.37.el6ev) RHS; 2.0+ (3.3.0.11rhs-1.el6rhs.x86_64) Hypervisor: RHEL6.4 & RHEVH6.4 with glusterfs-3.3.0.11rhs-1.el6.x86_64 and glusterfs-fuse-3.3.0.11rhs-1.el6.x86_64 How reproducible: Intermittent Steps to Reproduce: 1.add distribute-replicate volume to RHEV as Posix compliant FS storage domain 2.create and run VMS on the storage domain 3.perform remove-brick operation start 4.access VM function till remove-brick status shows completed 5.VM may go into paused state, with message "VM <VM-name> has paused due to unknown storage error." 6.the VM is recoverable after forced shutdown of VM, which may lead to loss of data not synced Actual results: during remove-brick operation, intermittently the VM gets into paused state, which is recoverable after forced shutdown of VM, but may lead to loss of data not synced Expected results: Functioning of VM should not be impacted during the remove-brick operation. Additional info: This issue was initially reported in BZ 923555 for RHS 2.0+. But that BZ has now evolved to handling another issue, also caused by remove-brick operation, but valid only on RHS 2.1 , and where, as a result the VMs get corrupted. So this BZ has been opened to deal with the original issue afresh on RHS 2.0+, which is still reproducible, and leads to intermittent instances of paused VMs, and may lead to loss of data not synced
Targeting for 2.1.z (Big Bend) U1.
https://code.engineering.redhat.com/gerrit/#/c/16039/ should fix this. Can we have a run of tests for this with glusterfs-3.4.0.44.1u2rhs build?
Clearing the needinfo flag since this bug is now ON_QA for verification.
Tested with glusterfs-3.4.0.57rhs-1.el6rhs and rhevm IS32.2 All below operation are performed from RHEVM UI 1. Created a GlusterFS Data center (3.3 compatibility ) 2. Created a gluster enabled cluster (3.3 compatibility) 3. Added 4 RHSS Node, one by one, to the above created cluster 4. Once all the RHSS Nodes are up in UI, create a distribute replicate volume of type 6X2 5. Optimize the volume for virt-store 6. Started the volume 7. Create a Data domain using the above created volume 8. Create 2 App VMs with root disk of size 30GB 9. Install the App VMs with RHEL 6.5 10. Run "dd" command in loop inside these vms (i.e) dd if=/dev/urandom of=/home/file$i bs=1024k count=1000 11. From one of the RHSS Node (gluster cli), start remove brick with data migration (i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> start 12. Check for the status of remove brick operation (i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> status This migration should get completed 13. Commit the bricks once the rebalance is completed (i.e) gluster volume remove-brick <vol-name> <brick1> <brick2> commit Now the volume has become 5X2 14. Repeated step 11, step 12, step 13 till the volume becomes 2X2 (i.e) repeat the remove-brick 3 more times App VMs are healthy Rebooted the App VMs multiple times and again they are healthy
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html