+++ This bug was initially created as a clone of Bug #1439753 +++
+++ This bug was initially created as a clone of Bug #1434653 +++

Description of problem:
-----------------------
Five VM disk images were created on a fuse-mounted, sharded replica 3 volume of type 1x3. Five VMs were installed, rebooted and came up fine. Three more bricks were then added to this volume to convert it to 2x3. After performing rebalance, weird errors appeared that did not allow logging in to these VMs. When these VMs were rebooted, they were unable to boot, which means the VM disks were corrupted.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.10

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a sharded replica 3 volume
2. Optimize the volume for the virt store use case ( gluster volume set <vol> group virt ) and start the volume
3. Fuse mount the volume on another RHEL 7.3 server ( used as hypervisor )
4. Create a few disk images of size 10GB each
5. Start the VMs, install the OS (RHEL 7.3) and reboot
6. When the VMs are up post installation, add 3 more bricks to the volume
7. Start the rebalance process

(A CLI sketch of these steps is included after the Expected results below.)

Actual results:
---------------
The VMs showed errors on the console which prevented logging in. Post rebalance, when the VMs were rebooted, they were unable to boot, with the boot prompt showing messages about XFS inode corruption.

Expected results:
-----------------
VM disks should not get corrupted.
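For reference, here is a minimal CLI sketch of the Steps to Reproduce (a sketch only: the hostnames, brick paths and volume name are taken from the volume info in the next comment, while the mount and qemu-img invocations are illustrative assumptions, not commands recorded in this bug):

# Create a 1x3 sharded replica 3 volume and optimize it for the virt store use case
gluster volume create trappist1 replica 3 \
    server1:/gluster/brick1/b1 server2:/gluster/brick1/b1 server3:/gluster/brick1/b1
gluster volume set trappist1 group virt
gluster volume start trappist1

# On the RHEL 7.3 hypervisor: fuse mount the volume and create the VM disk images
mount -t glusterfs server1:/trappist1 /mnt/repvol
qemu-img create /mnt/repvol/vm1.img 10G    # repeat for vm2..vm5, then install RHEL 7.3 in each VM

# While the VMs are running: expand the volume from 1x3 to 2x3 and rebalance
gluster volume add-brick trappist1 replica 3 \
    server3:/gluster/brick2/b2 server1:/gluster/brick2/b2 server2:/gluster/brick2/b2
gluster volume rebalance trappist1 start
gluster volume rebalance trappist1 status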
--- Additional comment from SATHEESARAN on 2017-03-21 23:20:28 EDT ---

Setup Information
------------------

3. Volume info
--------------
# gluster volume info

Volume Name: trappist1
Type: Distributed-Replicate
Volume ID: 30e12835-0c21-4037-9f83-5556f3c637b6
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Brick4: server3:/gluster/brick2/b2 --> new brick added
Brick5: server1:/gluster/brick2/b2 --> new brick added
Brick6: server2:/gluster/brick2/b2 --> new brick added
Options Reconfigured:
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

4. Sharding-related info
-------------------------
Sharding is enabled on this volume with shard-block-size set to 4MB ( which is the default ) [ granular-entry-heal enabled ]

5. Hypervisor details
----------------------
Host: rhs-client15.lab.eng.blr.redhat.com
Mountpoint: /mnt/repvol

6. Virtual machine details
--------------------------
There are 5 virtual machines running on this host, namely vm1, vm2, vm3, vm4 and vm5, with their disk images on the fuse-mounted gluster volume.

[root@rhs-client15 ~]# virsh list --all
 Id    Name    State
----------------------------------------------------
 6     vm1     running
 7     vm2     running
 8     vm3     running
 9     vm4     running
 10    vm5     running

I have tested again with all the application VMs powered off. All VMs could boot healthy. The following are the test steps:

1. Created a sharded replica 3 volume and optimized the volume for the virt store use case
2. Created 5 VM image files on the fuse-mounted gluster volume
3. Created 5 application VMs with the above VM images and installed the OS ( RHEL 7.3 ). Rebooted the VMs post OS installation.
4. Checked the health of all the VMs ( all VMs were healthy )
5. Powered off all the application VMs
6. Added 3 more bricks to convert the 1x3 replicate volume to a 2x3 distribute-replicate volume
7. Initiated rebalance
8. After rebalance completed, started all the VMs. ( All VMs booted up healthy )

So it is the running VMs that are getting affected by the rebalance operation.

--- Additional comment from Raghavendra G on 2017-03-26 21:27:18 EDT ---

Conversation over mail:

> Raghu,
>
> In one of my test iterations, fix-layout itself caused corruption of a VM
> disk. It happened only once; when I tried twice after that, it never happened.

One test is good enough to prove that we are dealing with at least one corruption issue that is not the same as bz 1376757. We need more analysis to figure out the RCA.

> Thanks,
> Satheesaran S ( sas )

--- Additional comment from SATHEESARAN on 2017-03-27 04:12:12 EDT ---

I have run the test with the following combination:
- strict-o-direct turned off and remote-dio enabled

I could still observe that the VM disks were getting corrupted. I also did another test with sharding turned off; with that, the issue was not seen.

--- Additional comment from Nithya Balachandran on 2017-03-29 23:28:52 EDT ---

Hi,

Is the system on which the issue was hit still available?

Thanks,
Nithya

--- Additional comment from Raghavendra G on 2017-04-01 00:52:31 EDT ---

Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by the current write.
2. Look for the inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue mknod for the corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.

Now, imagine a write which falls on an existing "shard_file". For the sake of discussion, let's consider a distribute volume of three subvols - s1, s2, s3:

1. "shard_file" hashes to subvolume s2 and is present on s2.
2. Add a subvolume s4 and initiate a fix-layout. The layout of ".shard" is fixed to include s4 and the hash ranges change.
3. A write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in the itable after a graph switch, so features/shard issues an mknod.
5. With the new layout of .shard, say "shard_file" now hashes to s3, and mknod (shard_file) on s3 succeeds. But shard_file is already present on s2. So we have two files on two different subvols of DHT representing the same shard, and this leads to corruption.

To prove the above hypothesis we need to look for one or more files (say "shard_file") in .shard present in more than one subvolume of DHT. IOW, more than one subvolume of DHT should have the file "/.shard/shard_file".

@Sas,

Is the setup still available? If yes, can you please take a look? Or if you can give me login details, I'll take a look. If the setup is not available, can you recreate the issue one more time so that I can take a look?

regards,
Raghavendra
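One way to check this is to list .shard on one brick of each DHT subvolume and look for shard names that appear more than once (a sketch only: the exact way the /tmp/shards-replicate-1 file referenced in the next comment was generated is not recorded in this bug; the brick paths are from the volume info above, and passwordless ssh to server1, which carries one brick of each replica set, is assumed):

# Shard names from one brick of the old subvolume (replicate-0) and one of the new subvolume (replicate-1)
ssh server1 'ls /gluster/brick1/b1/.shard'  >  /tmp/shards-replicate-1
ssh server1 'ls /gluster/brick2/b2/.shard'  >> /tmp/shards-replicate-1

# Any shard name with a count > 1 exists on both DHT subvolumes - two copies of the same shard
sort /tmp/shards-replicate-1 | uniq -c | awk '$1 > 1'

# Compare the two copies of one duplicated shard (example name taken from the next comment)
ssh server1 'md5sum /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397'
ssh server1 'md5sum /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397'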
--- Additional comment from Krutika Dhananjay on 2017-04-03 04:13:44 EDT ---

Whatever Raghavendra suspected in comment #12 is what we observed on sas' setup just now.

Following are the duplicate shards that exist on both subvolumes of DHT:

[root@dhcp37-65 tmp]# cat /tmp/shards-replicate-1 | sort | uniq -c | grep -v "1 "
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1397
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1864
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.487
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.552
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.7
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.487
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.509
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.521
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.7
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1397
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1398
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.576
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.7
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1397
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1398
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1867
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1868
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.2
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.487
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.552
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.576
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.941
      2 ede69d31-f048-41b7-9173-448c7046d537.1397
      2 ede69d31-f048-41b7-9173-448c7046d537.1398
      2 ede69d31-f048-41b7-9173-448c7046d537.487
      2 ede69d31-f048-41b7-9173-448c7046d537.552
      2 ede69d31-f048-41b7-9173-448c7046d537.576
      2 ede69d31-f048-41b7-9173-448c7046d537.7

Worse yet, the md5sums of the two copies differ. For instance,

On replicate-0:
[root@dhcp37-65 tmp]# md5sum /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
1e86d0a097c724965413d07af71c0809  /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

On replicate-1:
[root@dhcp37-85 tmp]# md5sum /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
e72cc949c7ba9b76d350a77be932ba3f  /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

Raghavendra will be sending out a fix in DHT for this issue.

--- Additional comment from SATHEESARAN on 2017-04-03 04:23:01 EDT ---

(In reply to Raghavendra G from comment #12)
> @Sas,
>
> Is the setup still available? If yes, can you please take a look? Or if you
> can give me login details, I'll take a look. If the setup is not available,
> can you recreate the issue one more time so that I can take a look?
>
> regards,
> Raghavendra

I have already shared the setup details in the mail. Let me know if you need anything more.

--- Additional comment from Raghavendra G on 2017-04-04 00:44:56 EDT ---

The fix itself is fairly simple. In all entry fops - create, mknod, symlink, open with O_CREAT, link, rename, mkdir etc. - we have to check whether the volume commit hash is equal to the commit hash on the parent inode:

1. If yes, proceed with the dentry fop.
2. Else,
   a. Initiate a lookup(frame, this, loc). IOW, wind the lookup on the location structure passed as an arg to DHT (not directly to its subvols).
   b. Once all lookups initiated in "a." are complete, resume the dentry fop.

For the scope of this bug it is sufficient to fix dht_mknod. But for completeness' sake (and to avoid similar bugs in other codepaths [2]) I would prefer to fix all codepaths. So more codepaths are affected, and hence more testing is needed.

[1] is another VM corruption issue during rebalance.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1276062
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1286127
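As an aside, the layout change on .shard that triggers all of this can be observed directly on the bricks, since DHT stores each directory's hash range in the trusted.glusterfs.dht xattr (a sketch; run on the respective brick hosts, with brick paths taken from the volume info above, and the actual hex values will differ per setup):

# DHT layout of .shard as stored on an old brick (replicate-0) and a newly added brick (replicate-1).
# The xattr value encodes the hash range assigned to that subvolume for this directory;
# after add-brick + fix-layout the old bricks' ranges shrink to make room for the new subvolume.
getfattr -n trusted.glusterfs.dht -e hex /gluster/brick1/b1/.shard
getfattr -n trusted.glusterfs.dht -e hex /gluster/brick2/b2/.shard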
REVIEW: https://review.gluster.org/17010 (features/shard: Fix vm corruption upon fix-layout) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17010 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 99c8c0b03a3368d81756440ab48091e1f2430a5f
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 6 18:10:41 2017 +0530

features/shard: Fix vm corruption upon fix-layout

shard's writev implementation, as part of identifying presence of participant shards that aren't in memory, first sends an MKNOD on these shards, and upon EEXIST error, looks up the shards before proceeding with the writes.

The VM corruption was caused when the following happened:
1. DHT had n subvolumes initially.
2. Upon add-brick + fix-layout, the layout of .shard changed although the existing shards under it were yet to be migrated to their new hashed subvolumes.
3. During this time, there were writes on the VM falling in regions of the file whose corresponding shards were already existing under .shard.
4. Sharding xl sent MKNOD on these shards, now creating them in their new hashed subvolumes although there already exist shard blocks for this region with valid data.
5. All subsequent writes were wound on these newly created copies.

The net outcome is that both copies of the shard didn't have the correct data. This caused the affected VMs to be unbootable.

FIX:
For want of better alternatives in DHT, the fix changes shard fops to do a LOOKUP before the MKNOD and upon EEXIST error, perform another lookup.

Change-Id: I8a2e97d91ba3275fbc7174a008c7234fa5295d36
BUG: 1440051
RCA'd-by: Raghavendra Gowdappa <rgowdapp>
Reported-by: Mahdi Adnan <mahdi.adnan>
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17010
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop in readv) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17014 committed in master by Vijay Bellur (vbellur)
------
commit 594a7c6a187cf780bd666e7343c39a2d92fc67ef
Author: Krutika Dhananjay <kdhananj>
Date: Mon Apr 10 11:04:31 2017 +0530

features/shard: Initialize local->fop in readv

Change-Id: I9008ca9960df4821636501ae84f93a68f370c67f
BUG: 1440051
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17014
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
CentOS-regression: Gluster Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Vijay Bellur <vbellur>
REVIEW: https://review.gluster.org/17085 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17086 (mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17087 (cluster/dht: Do not sync xattrs between src and dst twice during rebalance) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17086 (mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/17085 (cluster/dht: Pass the req dict instead of NULL in dht_attr2()) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17085 committed in master by Raghavendra G (rgowdapp)
------
commit d60ca8e96bbc16b13f8f3456f30ebeb16d0d1e47
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 20 10:08:02 2017 +0530

cluster/dht: Pass the req dict instead of NULL in dht_attr2()

This bug was causing VMs to pause during rebalance. When qemu winds down a STAT, shard fills the trusted.glusterfs.shard.file-size attribute in the req dict which DHT doesn't wind its STAT fop with upon detecting the file has undergone migration. As a result shard doesn't find the value to this key in the unwind path, causing it to fail the STAT with EINVAL.

Also, the same bug exists in other fops too, which is also fixed in this patch.

Change-Id: Id7823fd932b4e5a9b8779ebb2b612a399c0ef5f0
BUG: 1440051
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17085
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Raghavendra G <rgowdapp>
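The trusted.glusterfs.shard.file-size value that shard expects back is persisted as an xattr on the base file on the bricks, so it can be inspected there directly (a sketch; the brick path and image file name are illustrative, not taken from this bug's logs):

# On a brick host: the base file of a sharded file carries the aggregated size and
# block count in the trusted.glusterfs.shard.* xattrs (block-size and file-size).
getfattr -d -m 'trusted.glusterfs.shard' -e hex /gluster/brick1/b1/vm1.img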
REVIEW: https://review.gluster.org/17126 (cluster/dht: Pass the correct xdata in fremovexattr fop) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/17126 committed in master by Raghavendra G (rgowdapp)
------
commit ab88f655e6423f51e2f2fac9265ff4d4f5c3e579
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 27 11:53:24 2017 +0530

cluster/dht: Pass the correct xdata in fremovexattr fop

Change-Id: Id84bc87e48f435573eba3b24d3fb3c411fd2445d
BUG: 1440051
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17126
Smoke: Gluster Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
Reviewed-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/17086 committed in master by Raghavendra G (rgowdapp)
------
commit ef60a29703f520c5bd06467efc4a0d0a33552a06
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 20 10:17:07 2017 +0530

mount/fuse: Replace GF_LOG_OCCASIONALLY with gf_log() to report fop failure at all times

Change-Id: Ibd8e1c6172812951092ff6097ba4bed943051b7c
BUG: 1440051
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: https://review.gluster.org/17086
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Raghavendra Bhat <raghavendra>
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/