Bug 1440635
| Summary: | Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Krutika Dhananjay <kdhananj> |
| Component: | distribute | Assignee: | Krutika Dhananjay <kdhananj> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.8 | CC: | amukherj, bugs, kdhananj, knarra, ndevos, rcyriac, rgowdapp, rhinduja, rhs-bugs, sasundar, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Reopened, Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.12 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1440051 | Environment: | |
| Last Closed: | 2017-05-29 04:58:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1426508, 1440051, 1440637 | | |
| Bug Blocks: | 1431410, 1440754 | | |
Description
Krutika Dhananjay
2017-04-10 07:26:19 UTC
REVIEW: https://review.gluster.org/17019 (features/shard: Fix vm corruption upon fix-layout) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

REVIEW: https://review.gluster.org/17020 (features/shard: Initialize local->fop in readv) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

COMMIT: https://review.gluster.org/17020 committed in release-3.8 by jiffin tony Thottan (jthottan)

------

commit d5d599abaa598062885abc7ad8226faf26d11e64
Author: Krutika Dhananjay <kdhananj>
Date: Mon Apr 10 11:04:31 2017 +0530

features/shard: Initialize local->fop in readv

Backport of: https://review.gluster.org/17014

Change-Id: I4d2f0a3f533009038d48579db5a8a2a048b77ca1
BUG: 1440635
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17020
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
CentOS-regression: Gluster Build System <jenkins.org>

REVIEW: https://review.gluster.org/17019 (features/shard: Fix vm corruption upon fix-layout) posted (#2) for review on release-3.8 by Krutika Dhananjay (kdhananj)

COMMIT: https://review.gluster.org/17019 committed in release-3.8 by jiffin tony Thottan (jthottan)

------

commit d71ec72b981d110199c3376f39f91b704241975c
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 6 18:10:41 2017 +0530

features/shard: Fix vm corruption upon fix-layout

Backport of: https://review.gluster.org/17010

Shard's writev implementation, as part of identifying which participant shards are not yet in memory, first sends an MKNOD on these shards and, upon an EEXIST error, looks the shards up before proceeding with the writes.

The VM corruption was caused when the following happened:
1. DHT had n subvolumes initially.
2. Upon add-brick + fix-layout, the layout of .shard changed, although the existing shards under it were yet to be migrated to their new hashed subvolumes.
3. During this time, there were writes on the VM falling in regions of the file whose corresponding shards already existed under .shard.
4. The sharding xlator sent MKNOD on these shards, creating them in their new hashed subvolumes even though shard blocks with valid data already existed for this region.
5. All subsequent writes were wound on these newly created copies.

The net outcome is that neither copy of the shard held the correct data, which left the affected VMs unbootable.

FIX: For want of better alternatives in DHT, the fix changes shard fops to do a LOOKUP before the MKNOD and, upon EEXIST error, perform another lookup.

Change-Id: I1a5d3515b42e2e5583c407d1b4aff44d7ce472eb
BUG: 1440635
RCA'd-by: Raghavendra Gowdappa <rgowdapp>
Reported-by: Mahdi Adnan <mahdi.adnan>
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17019
CentOS-regression: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: jiffin tony Thottan <jthottan>
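To make the corruption mechanism above concrete, here is a minimal sketch of the lookup-before-create ordering the fix describes. It models the idea with plain POSIX calls on a local file rather than the actual shard xlator FOPs; `resolve_shard` and the demo path are illustrative assumptions, not names from the patch.

```c
/* Minimal sketch, NOT the actual shard xlator code: the
 * lookup-before-create ordering from the fix, modelled with
 * POSIX stat()/open() on a local file. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Resolve a shard block: prefer finding an existing file over
 * creating a fresh one, so a layout change cannot trick us into
 * creating a second, empty copy of a block that already holds data. */
static int resolve_shard(const char *path)
{
    struct stat st;

    /* 1. LOOKUP first: if the shard already exists anywhere, use it. */
    if (stat(path, &st) == 0)
        return 0;               /* found existing shard */
    if (errno != ENOENT)
        return -1;              /* real lookup failure */

    /* 2. Only on ENOENT attempt the create (MKNOD in gluster). */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd >= 0) {
        close(fd);
        return 0;               /* created fresh shard */
    }

    /* 3. EEXIST here means someone raced us into creating it;
     *    look the shard up again instead of treating it as fatal. */
    if (errno == EEXIST)
        return stat(path, &st);

    return -1;
}

int main(void)
{
    if (resolve_shard("/tmp/demo-shard.1") == 0)
        puts("shard resolved");
    return 0;
}
```

The essential design point is that an existing shard always wins over creating a new one, so a changed layout can no longer produce a second, empty copy of a block that already holds valid data.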
Fix is not yet complete, as there are still issues around this use case. Moving the bug status back to POST.

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.11, please open a new bug report.

glusterfs-3.8.11 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/packaging/2017-April/000289.html
[2] https://www.gluster.org/pipermail/gluster-users/

I'm moving this bug back to the ASSIGNED state, as Satheesaran is seeing a VM pause issue post rebalance. On first look at the logs, it seems DHT is looking up the VM image in the wrong subvolume, leading to fop failure with ENOENT, which qemu acts on by pausing the VM.

-Krutika

REVIEW: https://review.gluster.org/17121 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

COMMIT: https://review.gluster.org/17121 committed in release-3.8 by Niels de Vos (ndevos)

------

commit ba17362ea9eb642614a69c4f8a6ea2c2648cb5d8
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 20 10:08:02 2017 +0530

cluster/dht: Pass the req dict instead of NULL in dht_attr2()

Backport of: https://review.gluster.org/17085

This bug was causing VMs to pause during rebalance. When qemu winds down a STAT, shard fills the trusted.glusterfs.shard.file-size attribute in the req dict; but upon detecting that the file has undergone migration, DHT winds its follow-up STAT fop without this dict. As a result, shard doesn't find the value for this key in the unwind path and fails the STAT with EINVAL. The same bug exists in other fops too; those are also fixed in this patch. (A minimal sketch of this bug class appears at the end of this report.)

Change-Id: I56273b1a65347dabd38bc6bdd12d618f68287a00
BUG: 1440635
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17121
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Raghavendra G <rgowdapp>
CentOS-regression: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>

REVIEW: https://review.gluster.org/17148 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

REVIEW: https://review.gluster.org/17148 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#2) for review on release-3.8 by Krutika Dhananjay (kdhananj)

COMMIT: https://review.gluster.org/17148 committed in release-3.8 by Niels de Vos (ndevos)

------

commit 5dbe4fa649b8c486b2abdba660a53f7ae1198ef0
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 27 11:53:24 2017 +0530

cluster/dht: Pass the correct xdata in fremovexattr fop

Backport of: https://review.gluster.org/17126

Change-Id: Id84bc87e48f435573eba3b24d3fb3c411fd2445d
BUG: 1440635
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17148
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Niels de Vos <ndevos>

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.12, please open a new bug report.

glusterfs-3.8.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-May/000072.html
[2] https://www.gluster.org/pipermail/gluster-users/
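As referenced in the dht_attr2() commit above, here is a minimal, self-contained sketch of that bug class: when a fop is retried on the file's new location after migration, the caller's original request dict (xdata) must be forwarded rather than NULL, or upper translators never get their requested keys back. All types and function names below are illustrative assumptions, not the actual DHT code.

```c
/* Minimal sketch, not the actual DHT code: why passing NULL
 * instead of the original request dict on a post-migration retry
 * loses keys that upper translators (e.g. shard) asked for. */
#include <stdio.h>

typedef struct { const char *requested_key; } dict_t;

/* Server side: fills only the values the caller asked for via xdata. */
static const char *stat_on_subvol(const dict_t *xdata)
{
    if (xdata && xdata->requested_key)
        return xdata->requested_key;  /* e.g. shard's file-size key */
    return NULL;                      /* nothing requested, nothing filled */
}

/* Retry after detecting migration. The buggy version did the
 * equivalent of stat_on_subvol(NULL); the fix forwards the dict. */
static const char *stat_retry(const dict_t *orig_xdata)
{
    return stat_on_subvol(orig_xdata);  /* fixed: req dict, not NULL */
}

int main(void)
{
    dict_t req = { "trusted.glusterfs.shard.file-size" };
    const char *filled = stat_retry(&req);
    /* Without the fix, filled would be NULL: shard would then fail
     * the STAT with EINVAL, and qemu would pause the VM. */
    printf("filled key: %s\n", filled ? filled : "(missing)");
    return 0;
}
```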