Bug 1440051

Summary: Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance

Product: [Community] GlusterFS
Component: distribute
Reporter: Krutika Dhananjay <kdhananj>
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: mainline
Hardware: x86_64
OS: Linux
Keywords: Triaged
CC: amukherj, bugs, kdhananj, knarra, rcyriac, rgowdapp, rhinduja, rhs-bugs, sasundar, storage-qa-internal, vnosov
Fixed In Version: glusterfs-3.11.0
Clone Of: 1439753
Bug Blocks: 1440635, 1440637
Last Closed: 2017-05-30 18:49:22 UTC
Type: Bug

Description Krutika Dhananjay 2017-04-07 08:36:08 UTC
+++ This bug was initially created as a clone of Bug #1439753 +++

+++ This bug was initially created as a clone of Bug #1434653 +++

Description of problem:
-----------------------
5 VM disk images are created on the FUSE-mounted sharded replica 3 volume of type 1x3. 5 VMs are installed, rebooted and are up. 3 more bricks are added to this volume to make it 2x3. After performing rebalance, some weird errors were observed that prevented logging in to these VMs. When these VMs were rebooted, they were unable to boot, which means that the VM disks were corrupted.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-3.8.10

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a sharded replica 3 volume
2. Optimize the volume for the virt store use case (gluster volume set <vol> group virt) and start the volume
3. FUSE-mount the volume on another RHEL 7.3 server (used as the hypervisor)
4. Create a few disk images of 10GB each
5. Start the VMs, install OS (RHEL 7.3) and reboot
6. When the VMs are up post installation, add 3 more bricks to the volume
7. Start rebalance process

Actual results:
---------------
VMs showed some errors on the console, which prevented logging in.
Post rebalance, when the VMs are rebooted, they are unable to boot, with the boot prompt showing messages related to XFS inode corruption

Expected results:
-----------------
VM disks should not get corrupted.

--- Additional comment from SATHEESARAN on 2017-03-21 23:20:28 EDT ---

Setup Information
------------------


3. Volume info
--------------
# gluster volume info
 
Volume Name: trappist1
Type: Distributed-Replicate
Volume ID: 30e12835-0c21-4037-9f83-5556f3c637b6
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Brick4: server3:/gluster/brick2/b2 --> new brick added
Brick5: server1:/gluster/brick2/b2 --> new brick added
Brick6: server2:/gluster/brick2/b2 --> new brick added
Options Reconfigured:
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

4. sharding related info
-------------------------
Sharding is enabled on this volume with shard-block-size set to 4MB (the default)
[ granular-entry-heal enabled ]
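
For reference, a minimal sketch of how a write offset maps to a shard file name of the form <base-gfid>.<index> under .shard (Python, illustrative only; the real logic lives in the shard translator):

# Illustrative model of shard naming: block 0 is the base file itself, while
# block N (N >= 1) is stored as ".shard/<base-gfid>.N" on whichever subvolume
# that name hashes to. Assumes the 4MB shard-block-size noted above.
SHARD_BLOCK_SIZE = 4 * 1024 * 1024  # 4MB

def shard_name(base_gfid, offset):
    index = offset // SHARD_BLOCK_SIZE
    # Block 0 lives in the base file; higher blocks live under .shard
    return "<base file>" if index == 0 else f".shard/{base_gfid}.{index}"

# Example: the 1397th 4MB block of a VM image (an offset of roughly 5.5GiB)
print(shard_name("702cd056-84d5-4c83-9232-cca363f2b3a7", 1397 * SHARD_BLOCK_SIZE))
# -> .shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397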

5. Hypervisor details
----------------------
Host: rhs-client15.lab.eng.blr.redhat.com
mountpoint: /mnt/repvol

6.Virtual machine details
--------------------------
There are 5 virtual machines running on this host, namely vm1, vm2, vm3, vm4 and vm5, with their disk images on the FUSE-mounted gluster volume

[root@rhs-client15 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 6     vm1                            running
 7     vm2                            running
 8     vm3                            running
 9     vm4                            running
 10    vm5                            running


I have tested again with all the application VMs powered off. All VMs booted healthy. The following are the test steps:

1. Created a sharded replica 3 volume and optimized the volume for the virt store use case
2. Created 5 VM image files on the fuse mounted gluster volume
3. Created 5 Application VMs with the above created VM images and installed OS ( RHEL7.3 ). Rebooted the VMs post OS installation.
4. Checked the health of all the VMs ( all VMs are healthy )
5. Powered off all the application VMs
6. Added 3 more bricks to convert 1x3 replicate volume to 2x3 distribute-replicate volume 
7. Initiated rebalance
8. Post rebalance has completed, started all the VMs. ( All VMs booted up healthy ) 

So, it's the running VMs that are getting affected by the rebalance operation.

--- Additional comment from Raghavendra G on 2017-03-26 21:27:18 EDT ---

Conversation over mail:

> ​Raghu,
>
> In one of my test iteration, fix-layout itself caused corruption with VM
> disk.
> It happened only once, when I tried twice after that it never happened

One test is good enough to prove that we are dealing with at least one corruption issue that is not the same as bz 1376757.

We need more analysis to figure out the RCA.

>
> Thanks,
> Satheesaran S ( sas )​

--- Additional comment from SATHEESARAN on 2017-03-27 04:12:12 EDT ---

I have run the test with the following combination:
- Turning off strict-o-direct and enabling remote-dio
I could still observe that the VM disks are getting corrupted.

I also did another test with sharding turned off; this issue was not seen.

--- Additional comment from Nithya Balachandran on 2017-03-29 23:28:52 EDT ---

Hi,

Is the system on which the issue was hit still available?

Thanks,
Nithya

--- Additional comment from Raghavendra G on 2017-04-01 00:52:31 EDT ---

Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by the current write.
2. Look for the inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue an mknod for the corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.

Now, imagine a write which falls on an existing "shard_file". For the sake of discussion, let's consider a distribute of three subvols - s1, s2, s3:

1. "shard_file" hashes to subvolume s2 and is present on s2
2. A subvolume s4 is added and a fix-layout is initiated. The layout of ".shard" is fixed to include s4 and the hash ranges are changed.
3. A write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in the itable after a graph switch and features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and mknod(shard_file) on s3 succeeds. But the shard_file is already present on s2.

So, we have two files on two different subvols of dht representing the same shard, and this will lead to corruption.
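
A minimal, illustrative model of this race (Python; the hash function and subvolume names are placeholders, not the actual DHT layout logic):

# Toy model of the duplicate-shard race: DHT places a name on the subvolume its
# hash falls into; after add-brick + fix-layout the hash ranges change, so a
# blind mknod for an already-existing shard can land on a different subvolume.
import hashlib

def hashed_subvol(name, subvols):
    # Stand-in for DHT's hashing of a name over the parent directory's layout.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return subvols[h % len(subvols)]

shard = "shard_file"
old_subvol = hashed_subvol(shard, ["s1", "s2", "s3"])        # where the shard was created
new_subvol = hashed_subvol(shard, ["s1", "s2", "s3", "s4"])  # where a blind mknod lands post fix-layout
print("created on:", old_subvol, "| mknod after fix-layout lands on:", new_subvol)
# Whenever these differ, two copies of the same shard exist on different
# subvols and subsequent writes diverge between them.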

To prove the above hypothesis we need to look for one or more files (say "shard_file") in .shard present in more than one subvolume of dht. IOW, more than one subvolume of dht should have the file "/.shard/shard_file".

@Sas,

Is the setup still available? If yes, can you please take a look? Or if you can give me login details, I'll take a look. If the setup is not available, can you recreate the issue one more time so that I can take a look?

regards,
Raghavendra

--- Additional comment from Krutika Dhananjay on 2017-04-03 04:13:44 EDT ---

Whatever Raghavendra suspected in comment #12 is what we observed on sas' setup just now.

Following are the duplicate shards that exist on both subvolumes of DHT:

[root@dhcp37-65 tmp]# cat /tmp/shards-replicate-1 | sort | uniq -c | grep -v "1 "              
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1397
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1864
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.487
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.552
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.7
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.487
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.509
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.521
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.7
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1397
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1398
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.576
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.7
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1397
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1398
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1867
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1868
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.2
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.487
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.552
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.576
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.941
      2 ede69d31-f048-41b7-9173-448c7046d537.1397
      2 ede69d31-f048-41b7-9173-448c7046d537.1398
      2 ede69d31-f048-41b7-9173-448c7046d537.487
      2 ede69d31-f048-41b7-9173-448c7046d537.552
      2 ede69d31-f048-41b7-9173-448c7046d537.576
      2 ede69d31-f048-41b7-9173-448c7046d537.7

Worse yet, the md5sums of the two copies differ.

For instance,

On replicate-0:
[root@dhcp37-65 tmp]# md5sum /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
1e86d0a097c724965413d07af71c0809  /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

On replicate-1:
[root@dhcp37-85 tmp]# md5sum /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
e72cc949c7ba9b76d350a77be932ba3f  /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
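
For reference, a rough sketch of this kind of cross-subvolume comparison (Python; assumes one listing file of shard names per DHT subvolume, like the /tmp/shards-replicate-* files above; file names are placeholders):

# Given one file per DHT subvolume containing the shard names present on that
# subvolume (e.g. the output of `ls .shard` on one brick of each replica set),
# report the names that appear on more than one subvolume.
import sys
from collections import defaultdict

seen = defaultdict(set)
for listing in sys.argv[1:]:                  # e.g. shards-replicate-0 shards-replicate-1
    with open(listing) as f:
        for name in (line.strip() for line in f):
            if name:
                seen[name].add(listing)

for name, subvols in sorted(seen.items()):
    if len(subvols) > 1:
        print(name, "->", ", ".join(sorted(subvols)))

Each name reported this way can then be md5summed on each brick, as above, to confirm whether the two copies have diverged.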

Raghavendra will be sending out a fix in DHT for this issue.

--- Additional comment from SATHEESARAN on 2017-04-03 04:23:01 EDT ---

(In reply to Raghavendra G from comment #12)

> @Sas,
> 
> Is the setup still available? If yes, can you please take a look? Or if you
> can give me login details, I'll take a look. If the setup is not available,
> can you recreate the issue one more time so that I can take a look?
> 
> regards,
> Raghavendra

I have already shared the setup details in the mail.
Let me know, if you need anything more

--- Additional comment from Raghavendra G on 2017-04-04 00:44:56 EDT ---

The fix itself is fairly simple:

In all entry fops - create, mknod, symlink, open with O_CREAT, link, rename, mkdir, etc. - we have to do the following:

Check whether the volume commit hash is equal to the commit hash on the parent inode:
1. If yes, proceed with the dentry fop.
2. Else,
   a. initiate a lookup(frame, this, loc). IOW, wind the lookup on the location structure passed as an arg to DHT (not directly to its subvols).
   b. Once all lookups initiated in "a." are complete, resume the dentry fop.
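
A minimal sketch of this control flow (Python, purely illustrative; the actual change lives in DHT's entry-fop code paths):

# Illustrative control flow for the proposed DHT-side fix: before an entry fop
# (mknod, create, link, ...), compare the volume commit hash with the commit
# hash cached on the parent inode; on mismatch, refresh the parent's layout via
# a lookup and only then resume the fop. Names here are placeholders.
def dht_entry_fop(parent, volume, resume_fop, lookup):
    if parent.commit_hash == volume.commit_hash:
        return resume_fop()      # layout is current; proceed with the dentry fop
    lookup(parent)               # layout may be stale (e.g. after add-brick +
                                 # fix-layout); refresh it via a lookup first
    return resume_fop()          # then resume the dentry fop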

For the scope of this bug, it's sufficient to fix dht_mknod. But for completeness' sake (and to avoid similar bugs in other codepaths [2]) I would prefer to fix all codepaths. So more codepaths are affected, and hence more testing is needed.

[1] is another VM corruption issue during rebalance.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1276062
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1286127

Comment 1 Worker Ant 2017-04-07 08:36:38 UTC
REVIEW: https://review.gluster.org/17010 (features/shard: Fix vm corruption upon fix-layout) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2017-04-07 18:52:03 UTC
COMMIT: https://review.gluster.org/17010 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 99c8c0b03a3368d81756440ab48091e1f2430a5f
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 6 18:10:41 2017 +0530

    features/shard: Fix vm corruption upon fix-layout
    
    shard's writev implementation, as part of identifying
    presence of participant shards that aren't in memory,
    first sends an MKNOD on these shards, and upon EEXIST error,
    looks up the shards before proceeding with the writes.
    
    The VM corruption was caused when the following happened:
    1. DHT had n subvolumes initially.
    2. Upon add-brick + fix-layout, the layout of .shard changed
       although the existing shards under it were yet to be migrated
       to their new hashed subvolumes.
    3. During this time, there were writes on the VM falling in regions
       of the file whose corresponding shards were already existing under
       .shard.
    4. Sharding xl sent MKNOD on these shards, now creating them in their
       new hashed subvolumes although there already exist shard blocks for
       this region with valid data.
    5. All subsequent writes were wound on these newly created copies.
    
    The net outcome is that both copies of the shard didn't have the correct
    data. This caused the affected VMs to be unbootable.
    
    FIX:
    For want of better alternatives in DHT, the fix changes shard fops to do
    a LOOKUP before the MKNOD and upon EEXIST error, perform another lookup.
    
    Change-Id: I8a2e97d91ba3275fbc7174a008c7234fa5295d36
    BUG: 1440051
    RCA'd-by: Raghavendra Gowdappa <rgowdapp>
    Reported-by: Mahdi Adnan <mahdi.adnan>
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17010
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
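
Conceptually, the change to the shard write path can be pictured as follows (Python sketch of the control flow only; the function names are illustrative, not the translator's actual API):

# Before the fix: writev issued mknod directly for shards missing from the
# itable, so a stale layout could create a second copy on a new subvolume.
# After the fix: lookup first; mknod only if the shard truly doesn't exist,
# and on EEXIST fall back to another lookup.
def resolve_shard(shard, lookup, mknod):
    found = lookup(shard)        # always resolve through the current layout first
    if found is not None:
        return found
    try:
        return mknod(shard)      # genuinely absent: create it
    except FileExistsError:      # raced with another fop/client creating it
        return lookup(shard)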

Comment 3 Worker Ant 2017-04-10 06:00:02 UTC
REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Worker Ant 2017-04-10 06:56:41 UTC
REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 5 Worker Ant 2017-04-11 13:32:22 UTC
COMMIT: https://review.gluster.org/17014 committed in master by Vijay Bellur (vbellur) 
------
commit 594a7c6a187cf780bd666e7343c39a2d92fc67ef
Author: Krutika Dhananjay <kdhananj>
Date:   Mon Apr 10 11:04:31 2017 +0530

    features/shard: Initialize local->fop in readv
    
    Change-Id: I9008ca9960df4821636501ae84f93a68f370c67f
    BUG: 1440051
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17014
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 6 Worker Ant 2017-04-20 04:43:40 UTC
REVIEW: https://review.gluster.org/17085 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 7 Worker Ant 2017-04-20 04:47:54 UTC
REVIEW: https://review.gluster.org/17086 (mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 8 Worker Ant 2017-04-20 04:55:55 UTC
REVIEW: https://review.gluster.org/17087 (cluster/dht: Do not sync xattrs between src and dst twice during rebalance) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 9 Worker Ant 2017-04-20 04:59:58 UTC
REVIEW: https://review.gluster.org/17086 (mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 10 Worker Ant 2017-04-21 10:00:33 UTC
REVIEW: https://review.gluster.org/17085 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 11 Worker Ant 2017-04-24 04:13:00 UTC
COMMIT: https://review.gluster.org/17085 committed in master by Raghavendra G (rgowdapp) 
------
commit d60ca8e96bbc16b13f8f3456f30ebeb16d0d1e47
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 20 10:08:02 2017 +0530

    cluster/dht: Pass the req dict instead of NULL in dht_attr2()
    
    This bug was causing VMs to pause during rebalance. When qemu winds
    down a STAT, shard fills the trusted.glusterfs.shard.file-size attribute
    in the req dict which DHT doesn't wind its STAT fop with upon detecting
    the file has undergone migration. As a result shard doesn't find the
    value to this key in the unwind path, causing it to fail the STAT
    with EINVAL.
    
    Also, the same bug exists in other fops too, which is also fixed in
    this patch.
    
    Change-Id: Id7823fd932b4e5a9b8779ebb2b612a399c0ef5f0
    BUG: 1440051
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17085
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 12 Worker Ant 2017-04-27 06:24:10 UTC
REVIEW: https://review.gluster.org/17126 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 13 Worker Ant 2017-04-28 04:27:17 UTC
COMMIT: https://review.gluster.org/17126 committed in master by Raghavendra G (rgowdapp) 
------
commit ab88f655e6423f51e2f2fac9265ff4d4f5c3e579
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 27 11:53:24 2017 +0530

    cluster/dht: Pass the correct xdata in fremovexattr fop
    
    Change-Id: Id84bc87e48f435573eba3b24d3fb3c411fd2445d
    BUG: 1440051
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17126
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 14 Worker Ant 2017-04-30 08:35:50 UTC
COMMIT: https://review.gluster.org/17086 committed in master by Raghavendra G (rgowdapp) 
------
commit ef60a29703f520c5bd06467efc4a0d0a33552a06
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 20 10:17:07 2017 +0530

    mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times
    
    Change-Id: Ibd8e1c6172812951092ff6097ba4bed943051b7c
    BUG: 1440051
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17086
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Bhat <raghavendra>

Comment 15 Shyamsundar 2017-05-30 18:49:22 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/