+++ This bug was initially created as a clone of Bug #1440051 +++
+++ This bug was initially created as a clone of Bug #1439753 +++
+++ This bug was initially created as a clone of Bug #1434653 +++

Description of problem:
-----------------------
5 VM disk images are created on the fuse-mounted sharded replica 3 volume of type 1x3. 5 VMs are installed, rebooted and are up. 3 more bricks are added to this volume to make it 2x3. After performing rebalance, observed some weird errors that did not allow logging in to these VMs. When these VMs were rebooted, they were unable to boot, which means that the VM disks are corrupted.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.10

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a sharded replica 3 volume
2. Optimize the volume for the virt store use case ( gluster volume set <vol> group virt ) and start the volume
3. Fuse mount the volume on another RHEL 7.3 server ( used as hypervisor )
4. Create a few disk images of size 10GB each
5. Start the VMs, install the OS (RHEL 7.3) and reboot
6. When the VMs are up post installation, add 3 more bricks to the volume
7. Start the rebalance process

Actual results:
---------------
VMs showed some errors on the console, which prevented logging in. Post rebalance, when the VMs are rebooted, they are unable to boot, with the boot prompt showing messages related to XFS inode corruption.

Expected results:
-----------------
VM disks should not get corrupted.

--- Additional comment from SATHEESARAN on 2017-03-21 23:20:28 EDT ---

Setup Information
------------------

3. Volume info
--------------
# gluster volume info

Volume Name: trappist1
Type: Distributed-Replicate
Volume ID: 30e12835-0c21-4037-9f83-5556f3c637b6
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Brick4: server3:/gluster/brick2/b2 --> new brick added
Brick5: server1:/gluster/brick2/b2 --> new brick added
Brick6: server2:/gluster/brick2/b2 --> new brick added
Options Reconfigured:
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

4. Sharding related info
-------------------------
Sharding is enabled on this volume with shard-block-size set to 4MB ( which is the default ) [ granular-entry-heal enabled ]

5. Hypervisor details
----------------------
Host: rhs-client15.lab.eng.blr.redhat.com
Mountpoint: /mnt/repvol

6. Virtual machine details
--------------------------
There are 5 virtual machines running on this host, namely vm1, vm2, vm3, vm4 and vm5, with their disk images on the fuse-mounted gluster volume.

[root@rhs-client15 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 6     vm1                            running
 7     vm2                            running
 8     vm3                            running
 9     vm4                            running
 10    vm5                            running

I have tested again with all the application VMs powered off. All VMs could boot healthy.
The following are the test steps:
1. Create a sharded replica 3 volume and optimize the volume for the virt store use case
2. Create 5 VM image files on the fuse-mounted gluster volume
3. Create 5 application VMs with the above VM images and install the OS ( RHEL 7.3 ). Reboot the VMs post OS installation.
4. Check the health of all the VMs ( all VMs are healthy )
5. Power off all the application VMs
6. Add 3 more bricks to convert the 1x3 replicate volume to a 2x3 distribute-replicate volume
7. Initiate rebalance
8. Once rebalance has completed, start all the VMs ( all VMs booted up healthy )

So, it's the running VMs that are getting affected by the rebalance operation.

--- Additional comment from Raghavendra G on 2017-03-26 21:27:18 EDT ---

Conversation over mail:

> Raghu,
>
> In one of my test iterations, fix-layout itself caused corruption of a VM
> disk.
> It happened only once; when I tried twice after that it never happened.

One test is good enough to prove that we are dealing with at least one corruption issue that is not the same as bz 1376757. We need more analysis to figure out the RCA.

>
> Thanks,
> Satheesaran S ( sas )

--- Additional comment from SATHEESARAN on 2017-03-27 04:12:12 EDT ---

I have run the test with the following combination:
- Turning off strict-o-direct, and enabling remote-dio

I could still observe that VM disks are getting corrupted. I also did another test with sharding turned off; this issue was not seen.

--- Additional comment from Nithya Balachandran on 2017-03-29 23:28:52 EDT ---

Hi,

Is the system on which the issue was hit still available?

Thanks,
Nithya

--- Additional comment from Raghavendra G on 2017-04-01 00:52:31 EDT ---

Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by the current write.
2. Look for inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue an mknod for the corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.

Now, imagine a write which falls on an existing "shard_file". For the sake of discussion let's consider a distribute of three subvols - s1, s2, s3:

1. "shard_file" hashes to subvolume s2 and is present on s2.
2. Add a subvolume s4 and initiate a fix-layout. The layout of ".shard" is fixed to include s4 and the hash ranges are changed.
3. A write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in the itable after a graph switch and features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and mknod (shard_file) on s3 succeeds. But the shard_file is already present on s2.

So, we have two files on two different subvols of dht representing the same shard, and this will lead to corruption.

To prove the above hypothesis we need to look for one or more files (say "shard_file") in .shard present in more than one subvolume of dht. IOW, more than one subvolume of dht should have the file "/.shard/shard_file".

@Sas,

Is the setup still available? If yes, can you please take a look? Or if you can give me the login details, I'll take a look. If the setup is not available, can you recreate the issue one more time so that I can take a look?

regards,
Raghavendra

--- Additional comment from Krutika Dhananjay on 2017-04-03 04:13:44 EDT ---

Whatever Raghavendra suspected in comment #12 is what we observed on sas' setup just now.
Following are the duplicate shards that exist on both subvolumes of DHT:

[root@dhcp37-65 tmp]# cat /tmp/shards-replicate-1 | sort | uniq -c | grep -v "1 "
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1397
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1864
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.487
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.552
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.7
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.487
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.509
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.521
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.7
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1397
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1398
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.576
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.7
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1397
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1398
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1867
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1868
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.2
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.487
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.552
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.576
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.941
      2 ede69d31-f048-41b7-9173-448c7046d537.1397
      2 ede69d31-f048-41b7-9173-448c7046d537.1398
      2 ede69d31-f048-41b7-9173-448c7046d537.487
      2 ede69d31-f048-41b7-9173-448c7046d537.552
      2 ede69d31-f048-41b7-9173-448c7046d537.576
      2 ede69d31-f048-41b7-9173-448c7046d537.7

Worse yet, the md5sums of the two copies differ. For instance,

On replicate-0:

[root@dhcp37-65 tmp]# md5sum /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
1e86d0a097c724965413d07af71c0809  /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

On replicate-1:

[root@dhcp37-85 tmp]# md5sum /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
e72cc949c7ba9b76d350a77be932ba3f  /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

Raghavendra will be sending out a fix in DHT for this issue.

--- Additional comment from SATHEESARAN on 2017-04-03 04:23:01 EDT ---

(In reply to Raghavendra G from comment #12)
> @Sas,
>
> Is the setup still available? If yes, can you please take a look? Or if you
> can give me login details, I'll take a look. If the setup is not available,
> can you recreate the issue one more time so that I can take a look?
>
> regards,
> Raghavendra

I have already shared the setup details in the mail. Let me know if you need anything more.

--- Additional comment from Raghavendra G on 2017-04-04 00:44:56 EDT ---

The fix itself is fairly simple. In all entry fops - create, mknod, symlink, open with O_CREAT, link, rename, mkdir etc. - we have to:

Check whether the volume commit hash is equal to the commit hash on the parent inode.
1. If yes, proceed with the dentry fop.
2. Else,
   a. initiate a lookup(frame, this, loc). IOW, wind the lookup on the location structure passed as an arg to DHT (not directly to its subvols).
   b. Once all lookups initiated in "a." are complete, resume the dentry fop.

For the scope of this bug it's sufficient to fix dht_mknod. But for completeness' sake (and to avoid similar bugs in other codepaths [2]) I would prefer to fix all codepaths. So, the codepaths affected are more, and hence more testing is needed. [1] is another VM corruption issue during rebalance.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1276062
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1286127

--- Additional comment from Worker Ant on 2017-04-07 04:36:38 EDT ---

REVIEW: https://review.gluster.org/17010 (features/shard: Fix vm corruption upon fix-layout) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-04-07 14:52:03 EDT ---

COMMIT: https://review.gluster.org/17010 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 99c8c0b03a3368d81756440ab48091e1f2430a5f
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 6 18:10:41 2017 +0530

    features/shard: Fix vm corruption upon fix-layout

    shard's writev implementation, as part of identifying presence of
    participant shards that aren't in memory, first sends an MKNOD on
    these shards, and upon EEXIST error, looks up the shards before
    proceeding with the writes.

    The VM corruption was caused when the following happened:
    1. DHT had n subvolumes initially.
    2. Upon add-brick + fix-layout, the layout of .shard changed although
       the existing shards under it were yet to be migrated to their new
       hashed subvolumes.
    3. During this time, there were writes on the VM falling in regions
       of the file whose corresponding shards were already existing under
       .shard.
    4. Sharding xl sent MKNOD on these shards, now creating them in their
       new hashed subvolumes although there already exist shard blocks for
       this region with valid data.
    5. All subsequent writes were wound on these newly created copies.

    The net outcome is that both copies of the shard didn't have the
    correct data. This caused the affected VMs to be unbootable.

    FIX:
    For want of better alternatives in DHT, the fix changes shard fops to
    do a LOOKUP before the MKNOD and upon EEXIST error, perform another
    lookup.

    Change-Id: I8a2e97d91ba3275fbc7174a008c7234fa5295d36
    BUG: 1440051
    RCA'd-by: Raghavendra Gowdappa <rgowdapp>
    Reported-by: Mahdi Adnan <mahdi.adnan>
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17010
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Worker Ant on 2017-04-10 02:00:02 EDT ---

REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-04-10 02:56:41 EDT ---

REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
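To check a volume for the double-shard condition described in the RCA and in the fix-layout commit above, a listing like the one shown earlier in this bug can be generated by comparing shard basenames across the two distribute subvolumes. A minimal sketch, assuming the brick paths from this setup and a node that carries one brick of each replica set (adjust the paths to the actual brick layout):

# List shard basenames per distribute subvolume (paths are from this setup).
find /gluster/brick1/b1/.shard -type f -printf '%f\n' | sort > /tmp/shards-replicate-0
find /gluster/brick2/b2/.shard -type f -printf '%f\n' | sort > /tmp/shards-replicate-1

# Any basename appearing in both lists is a shard present on two DHT
# subvolumes; the copies can then be compared with md5sum as shown above.
sort /tmp/shards-replicate-0 /tmp/shards-replicate-1 | uniq -d

Note that zero-byte DHT link files (carrying the trusted.glusterfs.dht.linkto xattr) can legitimately appear on a second subvolume, so the contents and xattrs of both copies should still be checked before concluding corruption.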
REVIEW: https://review.gluster.org/17019 (features/shard: Fix vm corruption upon fix-layout) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17020 (features/shard: Initialize local->fop in readv) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17020 committed in release-3.8 by jiffin tony Thottan (jthottan)
------
commit d5d599abaa598062885abc7ad8226faf26d11e64
Author: Krutika Dhananjay <kdhananj>
Date:   Mon Apr 10 11:04:31 2017 +0530

    features/shard: Initialize local->fop in readv

    Backport of: https://review.gluster.org/17014

    Change-Id: I4d2f0a3f533009038d48579db5a8a2a048b77ca1
    BUG: 1440635
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17020
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
REVIEW: https://review.gluster.org/17019 (features/shard: Fix vm corruption upon fix-layout) posted (#2) for review on release-3.8 by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17019 committed in release-3.8 by jiffin tony Thottan (jthottan)
------
commit d71ec72b981d110199c3376f39f91b704241975c
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 6 18:10:41 2017 +0530

    features/shard: Fix vm corruption upon fix-layout

    Backport of: https://review.gluster.org/17010

    shard's writev implementation, as part of identifying presence of
    participant shards that aren't in memory, first sends an MKNOD on
    these shards, and upon EEXIST error, looks up the shards before
    proceeding with the writes.

    The VM corruption was caused when the following happened:
    1. DHT had n subvolumes initially.
    2. Upon add-brick + fix-layout, the layout of .shard changed although
       the existing shards under it were yet to be migrated to their new
       hashed subvolumes.
    3. During this time, there were writes on the VM falling in regions
       of the file whose corresponding shards were already existing under
       .shard.
    4. Sharding xl sent MKNOD on these shards, now creating them in their
       new hashed subvolumes although there already exist shard blocks for
       this region with valid data.
    5. All subsequent writes were wound on these newly created copies.

    The net outcome is that both copies of the shard didn't have the
    correct data. This caused the affected VMs to be unbootable.

    FIX:
    For want of better alternatives in DHT, the fix changes shard fops to
    do a LOOKUP before the MKNOD and upon EEXIST error, perform another
    lookup.

    Change-Id: I1a5d3515b42e2e5583c407d1b4aff44d7ce472eb
    BUG: 1440635
    RCA'd-by: Raghavendra Gowdappa <rgowdapp>
    Reported-by: Mahdi Adnan <mahdi.adnan>
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17019
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
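For whoever verifies the backport: the trigger is essentially the add-brick + rebalance sequence from the steps at the top of this report, performed while the VMs are running and writing. A condensed CLI sketch, reusing the volume name, hostnames and brick paths from this setup purely as examples (sharding is enabled explicitly here for clarity; the virt group may already enable it depending on the version):

gluster volume create trappist1 replica 3 server1:/gluster/brick1/b1 server2:/gluster/brick1/b1 server3:/gluster/brick1/b1
gluster volume set trappist1 group virt
gluster volume set trappist1 features.shard on
gluster volume start trappist1

# On the hypervisor: fuse-mount the volume and create/install the VMs there.
mount -t glusterfs server1:/trappist1 /mnt/repvol

# With the VMs up and actively writing, expand the volume and rebalance.
gluster volume add-brick trappist1 server3:/gluster/brick2/b2 server1:/gluster/brick2/b2 server2:/gluster/brick2/b2
gluster volume rebalance trappist1 start
gluster volume rebalance trappist1 status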
Fix is not yet complete as there are still issues around this use case. Moving the bug status back to POST.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.11, please open a new bug report.

glusterfs-3.8.11 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/packaging/2017-April/000289.html
[2] https://www.gluster.org/pipermail/gluster-users/
I'm moving this bug back to ASSIGNED state as Satheesaran is seeing a VM pause issue post rebalance. On first look at the logs, it seems like DHT is looking up the VM image in the wrong subvolume, leading to the fop failing with ENOENT, which qemu acts on by pausing the VM.

-Krutika
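As a quick check of this symptom from the hypervisor side, the paused state and the reason reported by qemu/libvirt can be confirmed with virsh; a small sketch, assuming the VM names used earlier in this report:

virsh list --all             # affected guests show up in the "paused" state
virsh domstate --reason vm1  # prints the state together with the pause reason (expected to be an I/O error here)
virsh resume vm1             # resume the guest once the underlying failure is resolved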
REVIEW: https://review.gluster.org/17121 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17121 committed in release-3.8 by Niels de Vos (ndevos)
------
commit ba17362ea9eb642614a69c4f8a6ea2c2648cb5d8
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 20 10:08:02 2017 +0530

    cluster/dht: Pass the req dict instead of NULL in dht_attr2()

    Backport of: https://review.gluster.org/17085

    This bug was causing VMs to pause during rebalance. When qemu winds
    down a STAT, shard fills the trusted.glusterfs.shard.file-size
    attribute in the req dict which DHT doesn't wind its STAT fop with
    upon detecting the file has undergone migration. As a result shard
    doesn't find the value to this key in the unwind path, causing it to
    fail the STAT with EINVAL.

    Also, the same bug exists in other fops too, which is also fixed in
    this patch.

    Change-Id: I56273b1a65347dabd38bc6bdd12d618f68287a00
    BUG: 1440635
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17121
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
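For reference, the trusted.glusterfs.shard.file-size value mentioned in the commit above is stored as an extended attribute on the base file on the bricks and can be inspected there directly. A sketch, with the brick path and the image file name being examples only (actual names will differ):

# Run on a brick that holds the base file of the affected image.
getfattr -d -m . -e hex /gluster/brick1/b1/vm1.img

# Shard-related keys to look for in the output:
#   trusted.glusterfs.shard.file-size   - aggregated size/block count maintained by shard
#   trusted.glusterfs.shard.block-size  - shard block size (4MB on this volume)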
REVIEW: https://review.gluster.org/17148 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17148 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#2) for review on release-3.8 by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17148 committed in release-3.8 by Niels de Vos (ndevos)
------
commit 5dbe4fa649b8c486b2abdba660a53f7ae1198ef0
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Apr 27 11:53:24 2017 +0530

    cluster/dht: Pass the correct xdata in fremovexattr fop

    Backport of: https://review.gluster.org/17126

    Change-Id: Id84bc87e48f435573eba3b24d3fb3c411fd2445d
    BUG: 1440635
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/17148
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.12, please open a new bug report.

glusterfs-3.8.12 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-May/000072.html
[2] https://www.gluster.org/pipermail/gluster-users/