Bug 1387878 - Rebalance after add bricks corrupts files
Summary: Rebalance after add bricks corrupts files
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: sharding
Version: 3.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1420623 1426508 1426512
Blocks:
 
Reported: 2016-10-22 22:52 UTC by Lindsay Mathieson
Modified: 2017-03-18 10:52 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-18 10:52:09 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
All the logs I could think of, including client logging from the VM (56.06 KB, application/zip)
2016-10-22 22:52 UTC, Lindsay Mathieson
Logs for the test with no cache set (92.10 KB, application/zip)
2016-10-24 13:30 UTC, Lindsay Mathieson
Logs from crash with performance.write-behind=off (45.30 KB, application/zip)
2016-10-25 10:56 UTC, Lindsay Mathieson

Description Lindsay Mathieson 2016-10-22 22:52:20 UTC
Created attachment 1213155 [details]
All the logs I could think of, including client logging from the VM

Description of problem: Starting a rebalance after adding three bricks to a replica 3 sharded volume causes VM image corruption.


Version-Release number of selected component (if applicable): 3.8.4


How reproducible:


Steps to Reproduce:
1. Create a replica 3 sharded volume
2. Copy a KVM Image file (qcow2) to the volume
3. Start the VM (using gfapi)
4. Add 3 bricks, converting the volume to 2 x 3 = 6 (distributed-replicate)
5. Start a Rebalance
6. Start heavy I/O in the VM (I used CrystalDiskMark in the Windows VM)
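
For reference, a rough sketch of the gluster CLI equivalent of the steps above; the volume name, hosts and brick paths are hypothetical, not taken from this report:

# gluster volume create testvol replica 3 host1:/bricks/b1 host2:/bricks/b1 host3:/bricks/b1
# gluster volume set testvol features.shard on
# gluster volume start testvol
  (copy the qcow2 image onto the volume and start the VM)
# gluster volume add-brick testvol host1:/bricks/b2 host2:/bricks/b2 host3:/bricks/b2
# gluster volume rebalance testvol start
# gluster volume rebalance testvol status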


Actual results: The VM rapidly crashes and becomes unbootable. "qemu-img check" shows corruption of the image.
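
For reference, the qcow2 consistency check can be run along these lines; the image path here is hypothetical:

# qemu-img check /mnt/testvol/images/vm-disk.qcow2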


Expected results: Rebalance completes; VMs unaffected except for increased I/O load.


Additional info: Heavy I/O after the rebalance seems to be the key; the operation completes OK otherwise.

I also tried a DiskMark run before the rebalance and then another one after, and that completed OK too. But a DiskMark run only after the rebalance crashed within seconds.

I believe DiskMark writes to the same areas each time, so maybe it is new writes after the rebalance that are the problem.

Comment 1 Lindsay Mathieson 2016-10-23 00:19:47 UTC
I also replicated the same issue using the FUSE mount: the image was created using libgfapi, then actually run using the FUSE mount.
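
For context, the FUSE mount referred to here is a standard glusterfs client mount, e.g. (hypothetical host and volume names):

# mount -t glusterfs host1:/testvol /mnt/testvol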

Comment 2 Krutika Dhananjay 2016-10-24 10:33:12 UTC
Hi Lindsay,

Do you mind trying the same test with write-behind turned off (on libgfapi perhaps since the issue is more easily recreatable there)?

-Krutika

Comment 3 Lindsay Mathieson 2016-10-24 10:56:33 UTC
Do you need the logs for it?

Comment 4 Lindsay Mathieson 2016-10-24 12:10:56 UTC
I ran with cache=none and it took much longer, but it crashed with an unbootable disk in the end.

One fairly large oddity - on a whim I stopped and restarted the volume. The VM was bootable and ran ok after that.

I'll retest the same with cache=writeback.
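
For context, these are the QEMU disk cache modes; with gfapi the disk is attached roughly like this (hypothetical host, volume and image names):

# qemu-system-x86_64 ... -drive file=gluster://host1/testvol/images/vm-disk.qcow2,format=qcow2,if=virtio,cache=none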

Comment 5 Lindsay Mathieson 2016-10-24 13:30:50 UTC
Created attachment 1213482 [details]
Logs for the test with no cache set

Comment 6 Krutika Dhananjay 2016-10-24 14:18:05 UTC
Sorry, I didn't make myself clear.

Could you disable the gluster option named 'performance.write-behind', run the test again, and let us know whether the issue is still seen?

Here's how you disable it:

# gluster volume set VOL performance.write-behind off
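
If useful, the current value can be confirmed before and after the run with, for example:

# gluster volume get VOL performance.write-behind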

You can leave the rest of the parameters at their default values - the ones with which you were able to recreate the issue consistently.

-Krutika

Comment 7 Lindsay Mathieson 2016-10-25 10:55:46 UTC
OK, did that - it crashed again fairly quickly (in under a minute) with image corruption errors.

Comment 8 Lindsay Mathieson 2016-10-25 10:56:41 UTC
Created attachment 1213877 [details]
Logs from crash with performance.write-behind=off

Comment 9 Raghavendra G 2017-02-09 06:18:27 UTC
This seems to be a duplicate of bz 1420623, hence marking this bug as a dependent. Please feel free to remove the dependency if that turns out not to be the case.

Comment 10 Worker Ant 2017-02-24 05:58:05 UTC
REVIEW: https://review.gluster.org/16749 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

Comment 11 Worker Ant 2017-02-24 05:58:10 UTC
REVIEW: https://review.gluster.org/16750 (features/shard: Fix EIO error on add-brick) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

Comment 12 Worker Ant 2017-03-10 22:44:50 UTC
COMMIT: https://review.gluster.org/16749 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 4ffa06cc4ddd3e30c76ecbbb5c62e50a86dd1a85
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops
    
            Backport of: https://review.gluster.org/16709
    
    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based on local->fop, which is rather ugly and prone to errors.
    
    Change-Id: Ib26d3dd5a7ae43cd27839752bdae2cce56d73e8a
    BUG: 1387878
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16749
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>

Comment 13 Worker Ant 2017-03-10 22:47:09 UTC
COMMIT: https://review.gluster.org/16750 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 8462ab0fed6ff2f875909ba8fda146b72d535c2a
Author: Krutika Dhananjay <kdhananj>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick
    
            Backport of: https://review.gluster.org/14419
    
    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.
    
    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.
    
    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.
    
    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.
    
    Change-Id: I66a7adf177e338a7691f441f199dde7c2b90c292
    BUG: 1387878
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16750
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>

Comment 14 Niels de Vos 2017-03-18 10:52:09 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.10, please open a new bug report.

glusterfs-3.8.10 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-March/000068.html
[2] https://www.gluster.org/pipermail/gluster-users/

