Created attachment 1213155 [details]
All the logs I could think of, including client logging from the VM

Description of problem:
Starting a rebalance after adding three bricks to a replica 3 volume causes VM image corruption.

Version-Release number of selected component (if applicable): 3.8.4

How reproducible:

Steps to Reproduce:
1. Create a replica 3 sharded volume
2. Copy a KVM image file (qcow2) to the volume
3. Start the VM (using gfapi)
4. Add 3 bricks, converting the volume to 2 x 3 = 6
5. Start a rebalance
6. Start heavy I/O in the VM (I used CrystalDiskMark in the Windows VM)

Actual results:
The VM rapidly crashes and becomes unbootable. "qemu-img check" shows corruption of the image.

Expected results:
The rebalance completes; VMs are unaffected except for increased I/O load.

Additional info:
Heavy I/O after the rebalance seems to be the key; the operation completes OK otherwise. I also tried a DiskMark run before the rebalance, then one after, and both completed OK. But a DiskMark run only after the rebalance crashed within seconds. I believe DiskMark writes to the same areas each time, so it may be new writes after the rebalance that are the problem.
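For reference, the reproduction steps above can be sketched with the standard gluster CLI. The volume name, hostnames, brick paths, and image path below are hypothetical placeholders, not taken from the reporter's setup:

```shell
# Create a replica-3 volume with sharding enabled (names/paths are examples)
gluster volume create testvol replica 3 \
    host1:/bricks/b1 host2:/bricks/b1 host3:/bricks/b1
gluster volume set testvol features.shard on
gluster volume start testvol

# (Copy the qcow2 image onto the volume and start the VM via gfapi here.)

# Expand the volume to a 2 x 3 distributed-replicate layout and rebalance
gluster volume add-brick testvol replica 3 \
    host1:/bricks/b2 host2:/bricks/b2 host3:/bricks/b2
gluster volume rebalance testvol start
gluster volume rebalance testvol status

# After heavy I/O in the guest, check the image for corruption
qemu-img check /path/to/vm-image.qcow2
```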
I also replicated the same issue using the FUSE mount: the image was created using libgfapi, then actually run using the FUSE mount.
Hi Lindsay,

Do you mind trying the same test with write-behind turned off (on libgfapi perhaps, since the issue is more easily recreatable there)?

-Krutika
Do you need the logs for it?
I ran with cache=none and it took much longer, but it crashed with an unbootable disk in the end. One fairly large oddity: on a whim I stopped and restarted the volume, and the VM was bootable and ran OK after that. I'll retest the same with cache=writeback.
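The stop/restart mentioned above is just the standard volume cycle (the volume name here is a placeholder; note that stopping the volume takes it offline for all clients):

```shell
gluster volume stop testvol
gluster volume start testvol
```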
Created attachment 1213482 [details] Logs for the test with no cache set
Sorry, I didn't make myself clear. Could you disable the gluster option named 'performance.write-behind', run the test again, and let us know whether the issue is still seen? Here's how to disable it:

# gluster volume set VOL performance.write-behind off

You can leave the rest of the parameters at their default values - the ones with which you were able to recreate the issue consistently.

-Krutika
OK, did that - it crashed again fairly quickly (in under a minute) with image corruption errors.
Created attachment 1213877 [details] Logs from crash with performance.write-behind=off
Seems like this is a duplicate of bz 1420623, hence marking this as a dependent. Please feel free to remove the dependency if it turns out not to be the case.
REVIEW: https://review.gluster.org/16749 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/16750 (features/shard: Fix EIO error on add-brick) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/16749 committed in release-3.8 by Niels de Vos (ndevos)
------
commit 4ffa06cc4ddd3e30c76ecbbb5c62e50a86dd1a85
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops

    Backport of: https://review.gluster.org/16709

    ... as opposed to adding checks in "common" functions to choose the
    inode to resolve based on local->fop, which is rather ugly and prone
    to errors.

    Change-Id: Ib26d3dd5a7ae43cd27839752bdae2cce56d73e8a
    BUG: 1387878
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16749
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
COMMIT: https://review.gluster.org/16750 committed in release-3.8 by Niels de Vos (ndevos)
------
commit 8462ab0fed6ff2f875909ba8fda146b72d535c2a
Author: Krutika Dhananjay <kdhananj>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick

    Backport of: https://review.gluster.org/14419

    DHT seems to link inode during lookup even before initializing inode
    ctx with layout information, which comes after directory healing.

    Consider two parallel writes. As part of the first write, shard sends
    lookup on .shard which in its return path would cause DHT to link
    .shard inode. Now at this point, when a second write is wound,
    inode_find() of .shard succeeds and as a result of this, shard goes to
    create the participant shards by issuing MKNODs under .shard. Since
    the layout is yet to be initialized, mknod fails in dht call path with
    EIO, leading to VM pauses.

    The fix involves shard maintaining a flag to denote whether a fresh
    lookup on .shard completed one network trip. If it didn't, all
    inode_find()s in fop path will be followed by a lookup before
    proceeding with the next stage of the fop.

    Big thanks to Raghavendra G and Pranith Kumar K for the RCA and
    subsequent inputs and feedback on the patch.

    Change-Id: I66a7adf177e338a7691f441f199dde7c2b90c292
    BUG: 1387878
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16750
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.10, please open a new bug report.

glusterfs-3.8.10 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-March/000068.html
[2] https://www.gluster.org/pipermail/gluster-users/