Bug 1439753

Summary: Application VMs with their disk images on a sharded replica 3 volume are unable to boot after performing rebalance
Product: Red Hat Gluster Storage
Reporter: Rejy M Cyriac <rcyriac>
Component: distribute
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, divya, kdhananj, knarra, rcyriac, rgowdapp, rhinduja, rhs-bugs, sasundar, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.2.0 Async
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-18.1
Doc Type: Bug Fix
Doc Text:
Previously, there was a race between a layout change on the /.shard directory and the creation of shards under it by parallel ongoing I/O operations. This could cause the same shard to exist on multiple subvolumes, with each copy having witnessed different writes from the application. Because neither copy held the complete data, the disk image became corrupted, making the VM unbootable. With this fix, the shard translator sends a LOOKUP on a shard before trying to create it, so that DHT identifies any already existing shard; this ensures there is always exactly one copy of every shard and that writes are always directed to it. VMs now operate correctly when I/O and rebalance operations run in parallel.
Story Points: ---
Clone Of: 1434653
Clones: 1440051 (view as bug list)
Environment:
Last Closed: 2017-06-08 09:34:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1434653
Bug Blocks: 1277939
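
To make the lookup-before-create ordering described in the doc text above concrete, here is a minimal sketch in Python. It is only a POSIX-level analogy, not the actual shard translator code (the real fix lives in GlusterFS and is written in C); the shard directory path and the <gfid>.<n> shard naming used below are assumptions for illustration.

# Illustrative sketch only (not GlusterFS source): a POSIX-level analogy of
# the "LOOKUP before create" ordering described in the doc text above.
# The shard directory and the <gfid>.<n> naming are assumptions.
import os
import tempfile

def open_shard(shard_dir: str, base_gfid: str, shard_num: int) -> int:
    """Open an existing shard if the lookup finds one; create it otherwise.

    Doing the lookup first means an already-existing copy (for example, one
    placed on another subvolume before a layout change) is reused, so every
    write lands on a single copy instead of on a duplicate.
    """
    path = os.path.join(shard_dir, f"{base_gfid}.{shard_num}")
    try:
        os.stat(path)                               # the "LOOKUP" step
        return os.open(path, os.O_WRONLY)           # shard exists: reuse it
    except FileNotFoundError:
        # Create only when the lookup confirms the shard is absent.
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o660)

# Example run against a throwaway directory standing in for /.shard:
demo_dir = tempfile.mkdtemp()
fd = open_shard(demo_dir, "fa50b2a6-0ac1-4b5f-9b9f-5d8e5c7a9e11", 7)
os.write(fd, b"data")
os.close(fd)

The actual fix spans the shard and DHT translators (see the upstream patches linked in the comments below); the sketch only captures the ordering: look up first, and create only on a confirmed miss.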

Comment 1 Atin Mukherjee 2017-04-07 08:39:21 UTC
upstream patch : https://review.gluster.org/#/c/17010/

Comment 3 Atin Mukherjee 2017-04-10 07:00:56 UTC
(In reply to Atin Mukherjee from comment #1)
> upstream patch : https://review.gluster.org/#/c/17010/

One more patch https://review.gluster.org/#/c/17014 is needed.

Comment 6 SATHEESARAN 2017-04-28 03:42:59 UTC
There are a few more patches sent upstream for the fix:

https://review.gluster.org/#/c/17085/

All the discussion about this bug and its fixes is available in the RHGS 3.3.0 bug [1].

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1434653

Comment 13 SATHEESARAN 2017-05-19 02:20:31 UTC
Tested with glusterfs-3.8.4-18.1 with the following tests:

1. Triggered a rebalance operation on the gluster volume while VMs were being installed
2. Triggered a rebalance operation while VMs were under active I/O load (a rough scripted form of this step is sketched below)
3. Rebooted VMs after the rebalance completed
4. Performed remove-brick with data migration while VMs were running with active I/O

With all of the above tests, the VMs remained healthy.
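
For reference, a rough scripted form of test 2 might look like the following. This is only a sketch: the volume name, mount point, background dd load, and polling interval are assumptions, and the actual verification was done against running application VMs rather than this simple loop.

# Rough sketch only: drives a rebalance while write I/O is in flight on the
# mounted volume, then polls until rebalance reports completion. Volume name
# and mount path are assumptions.
import subprocess
import time

VOLUME = "data-vol"          # hypothetical volume name
MOUNT = "/mnt/data-vol"      # hypothetical FUSE mount of the same volume

# Keep some write I/O going on the mounted volume in the background.
io_proc = subprocess.Popen(
    ["dd", "if=/dev/urandom", f"of={MOUNT}/load.img", "bs=1M", "count=4096"]
)

# Trigger rebalance while that I/O is still running (test 2 above).
subprocess.run(["gluster", "volume", "rebalance", VOLUME, "start"], check=True)

# Poll rebalance status until it reports completion.
while True:
    status = subprocess.run(
        ["gluster", "volume", "rebalance", VOLUME, "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    if "completed" in status:
        break
    time.sleep(10)

io_proc.wait()
print("rebalance finished; verify VM and image health before calling it a pass")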

Comment 14 Divya 2017-05-29 09:17:19 UTC
Krutika,

Please review and sign off on the edited doc text.

Comment 15 Krutika Dhananjay 2017-05-29 09:25:25 UTC
Looks good, Divya!

Comment 17 errata-xmlrpc 2017-06-08 09:34:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1418