Bug 1434653 - Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard: dht-rebalance
Depends On: 1463907
Blocks: Gluster-HC-3 1417151 1439753 RHHI-1.1-Approved-Backlog-BZs
 
Reported: 2017-03-22 02:27 UTC by SATHEESARAN
Modified: 2021-08-30 13:05 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.8.4-25
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1439753 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:33:25 UTC
Embargoed:


Attachments
fuse mount logs from hypervisor (664.38 KB, text/plain), 2017-03-22 03:50 UTC, SATHEESARAN
sosreport from RHGS_Node_1 (7.70 MB, application/x-xz), 2017-03-22 04:23 UTC, SATHEESARAN
sosreport from RHGS_Node_2 (7.70 MB, application/x-xz), 2017-03-22 04:44 UTC, SATHEESARAN
sosreport from RHGS_Node_3 (7.68 MB, application/x-xz), 2017-03-22 04:46 UTC, SATHEESARAN
fuse mount log from the hypervisor (193.55 KB, text/plain), 2017-04-26 16:53 UTC, SATHEESARAN
fuse mount logs from the hypervisor when performing different test (1.66 MB, text/plain), 2017-04-26 16:59 UTC, SATHEESARAN


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1463907 0 unspecified CLOSED Application VMs, with the disk images on replica 3 volume, paused post rebalance 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3002571 0 None None None 2017-04-28 17:56:56 UTC
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Internal Links: 1463907

Description SATHEESARAN 2017-03-22 02:27:08 UTC
Description of problem:
-----------------------
5 VM disk images are created on the fuse-mounted sharded replica 3 volume of type 1x3. 5 VMs are installed, rebooted and are up. 3 more bricks are added to this volume to make it 2x3. After performing rebalance, observed some weird errors that prevented logging in to these VMs. When these VMs were rebooted, they were unable to boot, which means that the VM disks are corrupted.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHEL 7.3 as hypervisor
RHGS 3.1.2 layered install on RHEL 7.3
glusterfs-3.8.4-18.el7rhgs

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a sharded replica 3 volume
2. Optimize the volume for the virt store use case ( gluster volume set <vol> group virt ) and start the volume
3. Fuse mount the volume on another RHEL 7.3 server ( used as hypervisor )
4. Create a few disk images of size 10GB each
5. Start the VMs, install OS (RHEL 7.3) and reboot
6. When the VMs are up post installation, add 3 more bricks to the volume
7. Start rebalance process

Actual results:
---------------
VMs showed some errors on the console, which prevented logging in.
Post rebalance, when the VMs are rebooted, they are unable to boot, with the boot prompt showing messages related to XFS inode corruption

Expected results:
-----------------
VM disks should not get corrupted.

Comment 2 SATHEESARAN 2017-03-22 03:50:32 UTC
Created attachment 1265289 [details]
fuse mount logs from hypervisor

Comment 4 SATHEESARAN 2017-03-22 04:23:40 UTC
Created attachment 1265290 [details]
sosreport from RHGS_Node_1

Comment 5 SATHEESARAN 2017-03-22 04:44:35 UTC
Created attachment 1265292 [details]
sosreport from RHGS_Node_2

Comment 6 SATHEESARAN 2017-03-22 04:46:23 UTC
Created attachment 1265293 [details]
sosreport from RHGS_Node_3

Comment 8 SATHEESARAN 2017-03-22 08:47:41 UTC
I have tested again with all the application VMs powered off; all VMs could boot healthy. The following are the test steps:

1. Create a sharded replica 3 volume and optimized the volume for virt store usecase
2. Created 5 VM image files on the fuse mounted gluster volume
3. Created 5 Application VMs with the above created VM images and installed OS ( RHEL7.3 ). Rebooted the VMs post OS installation.
4. Checked the health of all the VMs ( all VMs are healthy )
5. Powered off all the application VMs
6. Added 3 more bricks to convert 1x3 replicate volume to 2x3 distribute-replicate volume 
7. Initiated rebalance
8. After rebalance completed, started all the VMs. ( All VMs booted up healthy )

So, it's the running VMs that are getting affected by the rebalance operation.

Comment 9 Raghavendra G 2017-03-27 01:27:18 UTC
Conversation over mail:

> ​Raghu,
>
> In one of my test iteration, fix-layout itself caused corruption with VM
> disk.
> It happened only once, when I tried twice after that it never happened

One test is good enough to prove that we are dealing with at least one corruption issue that is not the same as bz 1376757.

We need more analysis to figure out RCA.

>
> Thanks,
> Satheesaran S ( sas )​

Comment 10 SATHEESARAN 2017-03-27 08:12:12 UTC
I ran the test with the following combination:
- Turning off strict-o-direct and enabling remote-dio
I could still observe that VM disks are getting corrupted.

I also ran another test with sharding turned off; this issue was not seen.

Comment 12 Raghavendra G 2017-04-01 04:52:31 UTC
Following is a rough algorithm of shard_writev (an illustrative sketch follows the list):

1. Based on the offset, calculate the shards touched by the current write.
2. Look for inodes corresponding to these shard files in the itable.
3. If one or more inodes are missing from the itable, issue mknod for the corresponding shard files and ignore EEXIST in the cbk.
4. Resume writes on the respective shards.
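
A minimal, self-contained sketch of steps 1-3, assuming a fixed shard block size and modelling the itable as a plain boolean array (SHARD_BLOCK_SIZE, itable and shard_writev_plan are invented here for illustration, not the actual features/shard structures):

<code>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SHARD_BLOCK_SIZE (64ULL * 1024 * 1024)   /* assumed shard size for the example */
#define MAX_SHARDS 4096

/* toy "itable": true means an inode for that shard is already cached */
static bool itable[MAX_SHARDS];

/* For a write of 'len' bytes at 'offset', list the shards it touches and
 * mark which ones would need an mknod (EEXIST ignored), per steps 1-3. */
static void shard_writev_plan(uint64_t offset, uint64_t len)
{
        uint64_t first = offset / SHARD_BLOCK_SIZE;
        uint64_t last  = (offset + len - 1) / SHARD_BLOCK_SIZE;

        for (uint64_t i = first; i <= last; i++) {
                if (!itable[i]) {
                        printf("shard %llu: not in itable, issue mknod\n",
                               (unsigned long long)i);
                        itable[i] = true;   /* cache the inode after creation */
                } else {
                        printf("shard %llu: present, write directly\n",
                               (unsigned long long)i);
                }
        }
}

int main(void)
{
        itable[0] = true;                        /* base shard already known */
        shard_writev_plan(60ULL * 1024 * 1024, 10ULL * 1024 * 1024);
        return 0;
}
</code>

For the 60MB + 10MB write in main(), shard 0 is already cached while shard 1 gets an mknod; that mknod is exactly the step that goes wrong when the layout has changed underneath, as described next.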

Now, imagine a write which falls on an existing "shard_file". For the sake of discussion let's consider a distribute of three subvols - s1, s2, s3

1. "shard_file" hashes to subvolume s2 and is present on s2
2. add a subvolume s4 and initiate a fix-layout. The layout of ".shard" is fixed to include s4 and hash ranges are changed.
3. write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in the itable after a graph switch and features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and mknod (shard_file) on s3 succeeds. But the shard_file is already present on s2.

So, we have two files on two different subvols of dht representing the same shard, and this will lead to corruption.
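
To make the role of the layout change concrete, here is a toy model of step 5. DHT actually hashes names against per-directory layout ranges; the simple hash and modulo mapping below (toy_hash) are purely illustrative assumptions, used only to show that the owning subvolume of an existing shard can change once the range set is recomputed:

<code>
#include <stdio.h>

/* Toy name hash; not DHT's real hashing, only for illustration. */
static unsigned int toy_hash(const char *name)
{
        unsigned int h = 5381;
        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

int main(void)
{
        const char *shard = "702cd056-84d5-4c83-9232-cca363f2b3a7.1397";
        unsigned int h = toy_hash(shard);

        /* 3 subvolumes before add-brick/fix-layout, 4 after */
        printf("before fix-layout: %s -> s%u\n", shard, (h % 3) + 1);
        printf("after  fix-layout: %s -> s%u\n", shard, (h % 4) + 1);
        return 0;
}
</code>

Whenever the "before" and "after" mappings differ, an mknod issued after the fix-layout creates a second, empty copy of the shard on the newly hashed subvolume while the old copy on the original subvolume still holds the data.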

To prove the above hypothesis we need to look for one or more files (say "shard_file") in .shard present in more than one subvolume of dht. IOW, more than one subvolume of dht should have the file "/.shard/shard_file".

@Sas,

Is the setup still available? If yes, can you please take a look? Or if you can give me login details, I'll take a look. If the setup is not available, can you recreate the issue one more time so that I can take a look?

regards,
Raghavendra

Comment 13 Krutika Dhananjay 2017-04-03 08:13:44 UTC
Whatever Raghavendra suspected in comment #12 is what we observed on sas' setup just now.

Following are the duplicate shards that exist on both subvolumes of DHT:

[root@dhcp37-65 tmp]# cat /tmp/shards-replicate-1 | sort | uniq -c | grep -v "1 "              
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1397
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1864
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.487
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.552
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.7
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.487
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.509
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.521
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.7
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1397
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1398
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.576
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.7
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1397
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1398
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1867
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1868
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.2
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.487
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.552
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.576
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.941
      2 ede69d31-f048-41b7-9173-448c7046d537.1397
      2 ede69d31-f048-41b7-9173-448c7046d537.1398
      2 ede69d31-f048-41b7-9173-448c7046d537.487
      2 ede69d31-f048-41b7-9173-448c7046d537.552
      2 ede69d31-f048-41b7-9173-448c7046d537.576
      2 ede69d31-f048-41b7-9173-448c7046d537.7

Worse yet, the md5sums of the two copies differ.

For instance,

On replicate-0:
[root@dhcp37-65 tmp]# md5sum /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
1e86d0a097c724965413d07af71c0809  /gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

On replicate-1:
[root@dhcp37-85 tmp]# md5sum /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
e72cc949c7ba9b76d350a77be932ba3f  /gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

Raghavendra will be sending out a fix in DHT for this issue.

Comment 16 Raghavendra G 2017-04-04 04:44:56 UTC
The fix itself is fairly simple:

In all entry fops - create, mknod, symlink, open with O_CREAT, link, rename, mkdir etc. - we have to do the following (a toy sketch follows the list):

Check whether the volume commit hash is equal to the commit hash on the parent inode.
1. If yes, proceed with the dentry fop
2. else,
   a. initiate a lookup(frame, this, loc). IOW, wind the lookup on the location structure passed as an arg to DHT (not directly to its subvols)
   b. Once all lookups initiated in "a." are complete, resume the dentry fop.
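
A toy sketch of that check, using invented stand-ins (toy_inode, toy_volume, lookup_and_refresh) rather than the real DHT layout structures; it only demonstrates the decision of when to wind a fresh lookup on the parent before resuming the entry fop:

<code>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for the real DHT structures. */
struct toy_inode  { uint32_t layout_commit_hash; };
struct toy_volume { uint32_t commit_hash; };

/* Pretend lookup: refreshes the parent's cached layout (and its commit hash). */
static void lookup_and_refresh(struct toy_inode *parent,
                               const struct toy_volume *vol)
{
        printf("layout stale: winding lookup on parent before the entry fop\n");
        parent->layout_commit_hash = vol->commit_hash;
}

/* Entry-fop path (mknod/create/mkdir/...): step 1 vs step 2 from the list above. */
static void entry_fop(struct toy_inode *parent, const struct toy_volume *vol,
                      const char *name)
{
        if (parent->layout_commit_hash != vol->commit_hash)
                lookup_and_refresh(parent, vol);     /* steps 2a and 2b */

        printf("creating %s using the refreshed layout\n", name);   /* step 1 */
}

int main(void)
{
        struct toy_volume vol    = { .commit_hash = 42 };  /* bumped by add-brick/rebalance */
        struct toy_inode  parent = { .layout_commit_hash = 41 };

        entry_fop(&parent, &vol, "example-shard-file");
        return 0;
}
</code>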

For the scope of this bug it is sufficient to fix dht_mknod. But, for completeness' sake (and to avoid similar bugs in other codepaths [2]) I would prefer to fix all codepaths. So more codepaths are affected, and hence more testing is needed.

[1] is another VM corruption issue during rebalance.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1276062
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1286127

Comment 23 Atin Mukherjee 2017-04-07 08:39:01 UTC
upstream patch : https://review.gluster.org/#/c/17010/

Comment 24 Atin Mukherjee 2017-04-10 07:00:35 UTC
(In reply to Atin Mukherjee from comment #23)
> upstream patch : https://review.gluster.org/#/c/17010/

One more patch https://review.gluster.org/#/c/17014 is needed.

Comment 33 Krutika Dhananjay 2017-04-19 12:35:12 UTC
First off, BIG fat thanks to Satheesaran for willing to recreate the bug as many times as the devs asked him to!!

The VMs paused because of the following error:

<log>
...
...
[2017-04-18 14:31:01.285318] E [shard.c:426:shard_modify_size_and_block_count] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x62f97) [0x7f9829bc3f97] -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xb6f0) [0x7f982994a6f0] -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xaf5d) [0x7f9829949f5d] ) 2-cosmos-shard: Failed to get trusted.glusterfs.shard.file-size for 8152fcad-80c3-4349-b21c-29627c9e3d77
[2017-04-18 14:31:01.285361] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 1382073: STAT() <gfid:8152fcad-80c3-4349-b21c-29627c9e3d77> => -1 (Invalid argument)
...
...
</log>

Basically STAT fop from qemu failed with EINVAL. And that was because shard was looking for the xattr "trusted.glusterfs.shard.file-size" in the STAT response and upon not finding it, it bailed out with EINVAL.

Upon checking the tcpdump output for the FUSE mount, I found that for this particular STAT, the mount process was NOT even sending a request for "trusted.glusterfs.shard.file-size" xattr in the dictionary, for it to receive its value in the response.

We (Pranith, Raghavendra and I) checked the lookup and stat codepaths in shard, DHT, AFR, to see where the key was getting dropped and it turned out to be dht_attr2() (as part of dht_stat()) that had the bug. After realising the file is migrated, and figuring out the new subvolume of DHT where it resides, DHT sends the STAT on the correct sub-volume but it passes NULL instead of the req dict in the call.

<code>
...
...
        if (local->fop == GF_FOP_FSTAT) {
                STACK_WIND (frame, dht_file_attr_cbk, subvol,
                            subvol->fops->fstat, local->fd, NULL);  <--- NULL instead of passing local->xattr_req!
        } else {
                STACK_WIND (frame, dht_file_attr_cbk, subvol,
                            subvol->fops->stat, &local->loc, NULL); <--- NULL instead of passing local->xattr_req!
        }

        return 0;
...
...
</code>

This bug has existed since the time this code was written and is not a regression.
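
For reference, a sketch of how the wind would look with the request dictionary forwarded - this simply substitutes local->xattr_req for the NULLs in the snippet above and is not necessarily the exact upstream patch:

<code>
        if (local->fop == GF_FOP_FSTAT) {
                STACK_WIND (frame, dht_file_attr_cbk, subvol,
                            subvol->fops->fstat, local->fd, local->xattr_req);
        } else {
                STACK_WIND (frame, dht_file_attr_cbk, subvol,
                            subvol->fops->stat, &local->loc, local->xattr_req);
        }

        return 0;
</code>

With the request dictionary forwarded, the request for trusted.glusterfs.shard.file-size reaches the new subvolume and shard can find the key in the STAT response.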

While the fix to THIS particular case of VM pause is simple, it turns out we're not done yet wrt eliminating corruption/pauses as there is a whole class of issues identified in DHT and rebalance that need fixing in terms of:

1. fixing {F}XATTROP fop in DHT to identify files under migration and wind the fop on the dst file too.

2. synchronisation between reading of xattrs from the src file and writing them on the dst file without letting a client modify the xattrs in between.


Raghavendra will be able to comment more on these other issues in DHT.

-Krutika

Comment 34 Raghavendra G 2017-04-20 04:47:57 UTC
(In reply to Krutika Dhananjay from comment #33)
> First off, BIG fat thanks to Satheesaran for willing to recreate the bug as
> many times as the devs asked him to!!
> 
> The VMs paused because of the following error:
> 
> <log>
> ...
> ...
> [2017-04-18 14:31:01.285318] E
> [shard.c:426:shard_modify_size_and_block_count]
> (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x62f97)
> [0x7f9829bc3f97]
> -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xb6f0)
> [0x7f982994a6f0]
> -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xaf5d)
> [0x7f9829949f5d] ) 2-cosmos-shard: Failed to get
> trusted.glusterfs.shard.file-size for 8152fcad-80c3-4349-b21c-29627c9e3d77
> [2017-04-18 14:31:01.285361] W [fuse-bridge.c:767:fuse_attr_cbk]
> 0-glusterfs-fuse: 1382073: STAT()
> <gfid:8152fcad-80c3-4349-b21c-29627c9e3d77> => -1 (Invalid argument)
> ...
> ...
> </log>
> 
> Basically STAT fop from qemu failed with EINVAL. And that was because shard
> was looking for the xattr "trusted.glusterfs.shard.file-size" in the STAT
> response and upon not finding it, it bailed out with EINVAL.
> 
> Upon checking the tcpdump output for the FUSE mount, I found that for this
> particular STAT, the mount process was NOT even sending a request for
> "trusted.glusterfs.shard.file-size" xattr in the dictionary, for it to
> receive its value in the response.
> 
> We (Pranith, Raghavendra and I) checked the lookup and stat codepaths in
> shard, DHT, AFR, to see where the key was getting dropped and it turned out
> to be dht_attr2() (as part of dht_stat()) that had the bug. After realising
> the file is migrated, and figuring out the new subvolume of DHT where it
> resides, DHT sends the STAT on the correct sub-volume but it passes NULL
> instead of the req dict in the call.
> 
> <code>
> ...
> ...
> 12         if (local->fop == GF_FOP_FSTAT) {                                
> 
>  11                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> 
>  10                             subvol->fops->fstat, local->fd, NULL);  <---
> NULL instead of passing local->xattr_req!                   
>   9         } else {                                                        
> 
>   8                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> 
>   7                             subvol->fops->stat, &local->loc, NULL); <---
> NULL instead of passing local->xattr_req!  
>   6         }                                                               
> 
>   5                                                                         
> 
>   4         return 0;                                                       
> 
>   3                                                                          
> ...
> ...
> </code>
> 
> This bug has been existing since the time this code was written and is not a
> regression.
> 
> While the fix to THIS particular case of VM pause is simple, it turns out
> we're not done yet wrt eliminating corruption/pauses as there is a whole
> class of issues identified in DHT and rebalance that need fixing in terms of:
> 
> 1. fixing {F}XATTROP fop in DHT to identify files under migration and wind
> the fop on the dst file too.
> 
> 2. synchronisation between reading of xattrs from the src file and writing
> it on the dst file without letting a client modify the xattr in between.
> 
> 
> Raghavendra will be able to comment more on these other issues in DHT.

Thanks Krutika for the update.

Sharding stores its own metadata as xattrs on files (I assume on base shard). However, during/after rebalance there are races due to which consistency/integrity of xattrs on file _might_ be lost. Krutika has already mentioned some of these above. To add to the list above,

3. Remove copying of xattrs by rebalance process from src to dst at the end of migration. Since xattrs are copied at the beginning of the migration process _and_ any on-going modifications to xattrs to file being migrated are replayed on dst too, recopying xattrs at the end of data migration looks redundant. Since (read src, write dst) is not atomic wrt to modifications from client, we might end up with stale values. This in principle, is similar to point 2 mentioned above.

4. dht_(f)getxattr doesn't handle file migration scenario. It needs to check for ENOENT/ESTALE errors and check for potential migration and get xattrs from the file on new subvol if file is migrated. Not sure this is relevant to shard as it might not use getxattr.

Notes:

1. A fix to problem 2 would require locking from clients while writing xattrs ((f)setxattr, (f)xattrop, (f)removexattr). This would bring down performance on ops touching xattrs _even_ while rebalance is not running.

2. A word of caution is that there seem to be multiple issues to resolve before we can get the use case tracked by this bz working. Note that we've already fixed an issue in fix-layout, and comment #30 seems to indicate an issue that might have an RCA different from dht xattr consistency.

regards,
Raghavendra

> 
> -Krutika

Comment 36 Nithya Balachandran 2017-04-20 15:00:28 UTC
(In reply to Raghavendra G from comment #34)
> (In reply to Krutika Dhananjay from comment #33)
> > First off, BIG fat thanks to Satheesaran for willing to recreate the bug as
> > many times as the devs asked him to!!
> > 
> > The VMs paused because of the following error:
> > 
> > <log>
> > ...
> > ...
> > [2017-04-18 14:31:01.285318] E
> > [shard.c:426:shard_modify_size_and_block_count]
> > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x62f97)
> > [0x7f9829bc3f97]
> > -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xb6f0)
> > [0x7f982994a6f0]
> > -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xaf5d)
> > [0x7f9829949f5d] ) 2-cosmos-shard: Failed to get
> > trusted.glusterfs.shard.file-size for 8152fcad-80c3-4349-b21c-29627c9e3d77
> > [2017-04-18 14:31:01.285361] W [fuse-bridge.c:767:fuse_attr_cbk]
> > 0-glusterfs-fuse: 1382073: STAT()
> > <gfid:8152fcad-80c3-4349-b21c-29627c9e3d77> => -1 (Invalid argument)
> > ...
> > ...
> > </log>
> > 
> > Basically STAT fop from qemu failed with EINVAL. And that was because shard
> > was looking for the xattr "trusted.glusterfs.shard.file-size" in the STAT
> > response and upon not finding it, it bailed out with EINVAL.
> > 
> > Upon checking the tcpdump output for the FUSE mount, I found that for this
> > particular STAT, the mount process was NOT even sending a request for
> > "trusted.glusterfs.shard.file-size" xattr in the dictionary, for it to
> > receive its value in the response.
> > 
> > We (Pranith, Raghavendra and I) checked the lookup and stat codepaths in
> > shard, DHT, AFR, to see where the key was getting dropped and it turned out
> > to be dht_attr2() (as part of dht_stat()) that had the bug. After realising
> > the file is migrated, and figuring out the new subvolume of DHT where it
> > resides, DHT sends the STAT on the correct sub-volume but it passes NULL
> > instead of the req dict in the call.
> > 
> > <code>
> > ...
> > ...
> > 12         if (local->fop == GF_FOP_FSTAT) {                                
> > 
> >  11                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> > 
> >  10                             subvol->fops->fstat, local->fd, NULL);  <---
> > NULL instead of passing local->xattr_req!                   
> >   9         } else {                                                        
> > 
> >   8                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> > 
> >   7                             subvol->fops->stat, &local->loc, NULL); <---
> > NULL instead of passing local->xattr_req!  
> >   6         }                                                               
> > 
> >   5                                                                         
> > 
> >   4         return 0;                                                       
> > 
> >   3                                                                          
> > ...
> > ...
> > </code>
> > 
> > This bug has been existing since the time this code was written and is not a
> > regression.
> > 
> > While the fix to THIS particular case of VM pause is simple, it turns out
> > we're not done yet wrt eliminating corruption/pauses as there is a whole
> > class of issues identified in DHT and rebalance that need fixing in terms of:
> > 
> > 1. fixing {F}XATTROP fop in DHT to identify files under migration and wind
> > the fop on the dst file too.
> > 
> > 2. synchronisation between reading of xattrs from the src file and writing
> > it on the dst file without letting a client modify the xattr in between.
> > 
> > 
> > Raghavendra will be able to comment more on these other issues in DHT.
> 
> Thanks Krutika for the update.
> 
> Sharding stores its own metadata as xattrs on files (I assume on base
> shard). However, during/after rebalance there are races due to which
> consistency/integrity of xattrs on file _might_ be lost. Krutika has already
> mentioned some of these above. To add to the list above,
> 
> 3. Remove copying of xattrs by rebalance process from src to dst at the end
> of migration. Since xattrs are copied at the beginning of the migration
> process _and_ any on-going modifications to xattrs to file being migrated
> are replayed on dst too, recopying xattrs at the end of data migration looks
> redundant. Since (read src, write dst) is not atomic wrt to modifications
> from client, we might end up with stale values. This in principle, is
> similar to point 2 mentioned above.
> 

No, this will break the posix acls - the second copy was done precisely for this reason.


> 4. dht_(f)getxattr doesn't handle file migration scenario. It needs to check
> for ENOENT/ESTALE errors and check for potential migration and get xattrs
> from the file on new subvol if file is migrated. Not sure this is relevant
> to shard as it might not use getxattr.
> 

This is already in place for files- we use the DHT_IATT_IN_XDATA_KEY key to get this data.


> Notes:
> 
> 1. A fix to problem 2 would require locking from clients while writing
> xattrs ((f)setxattr, (f)xattrop, (f)removexattr). This would bring down
> performance on ops touching xattrs _even_ while rebalance is not running.
> 
> 2. A word of caution is that there seems to be multiple issues before we can
> get the use case tracked by this bz working. Note that we've already fixed
> issue in fix-layout and also comment #30 seems to indicate an issue that
> might have an RCA different from dht xattr consistency.
> 
> regards,
> Raghavendra
> 
> > 
> > -Krutika

Comment 39 Krutika Dhananjay 2017-04-21 07:44:33 UTC
(In reply to Nithya Balachandran from comment #36)
> (In reply to Raghavendra G from comment #34)
> > (In reply to Krutika Dhananjay from comment #33)
> > > First off, BIG fat thanks to Satheesaran for willing to recreate the bug as
> > > many times as the devs asked him to!!
> > > 
> > > The VMs paused because of the following error:
> > > 
> > > <log>
> > > ...
> > > ...
> > > [2017-04-18 14:31:01.285318] E
> > > [shard.c:426:shard_modify_size_and_block_count]
> > > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x62f97)
> > > [0x7f9829bc3f97]
> > > -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xb6f0)
> > > [0x7f982994a6f0]
> > > -->/usr/lib64/glusterfs/3.8.4/xlator/features/shard.so(+0xaf5d)
> > > [0x7f9829949f5d] ) 2-cosmos-shard: Failed to get
> > > trusted.glusterfs.shard.file-size for 8152fcad-80c3-4349-b21c-29627c9e3d77
> > > [2017-04-18 14:31:01.285361] W [fuse-bridge.c:767:fuse_attr_cbk]
> > > 0-glusterfs-fuse: 1382073: STAT()
> > > <gfid:8152fcad-80c3-4349-b21c-29627c9e3d77> => -1 (Invalid argument)
> > > ...
> > > ...
> > > </log>
> > > 
> > > Basically STAT fop from qemu failed with EINVAL. And that was because shard
> > > was looking for the xattr "trusted.glusterfs.shard.file-size" in the STAT
> > > response and upon not finding it, it bailed out with EINVAL.
> > > 
> > > Upon checking the tcpdump output for the FUSE mount, I found that for this
> > > particular STAT, the mount process was NOT even sending a request for
> > > "trusted.glusterfs.shard.file-size" xattr in the dictionary, for it to
> > > receive its value in the response.
> > > 
> > > We (Pranith, Raghavendra and I) checked the lookup and stat codepaths in
> > > shard, DHT, AFR, to see where the key was getting dropped and it turned out
> > > to be dht_attr2() (as part of dht_stat()) that had the bug. After realising
> > > the file is migrated, and figuring out the new subvolume of DHT where it
> > > resides, DHT sends the STAT on the correct sub-volume but it passes NULL
> > > instead of the req dict in the call.
> > > 
> > > <code>
> > > ...
> > > ...
> > > 12         if (local->fop == GF_FOP_FSTAT) {                                
> > > 
> > >  11                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> > > 
> > >  10                             subvol->fops->fstat, local->fd, NULL);  <---
> > > NULL instead of passing local->xattr_req!                   
> > >   9         } else {                                                        
> > > 
> > >   8                 STACK_WIND (frame, dht_file_attr_cbk, subvol,           
> > > 
> > >   7                             subvol->fops->stat, &local->loc, NULL); <---
> > > NULL instead of passing local->xattr_req!  
> > >   6         }                                                               
> > > 
> > >   5                                                                         
> > > 
> > >   4         return 0;                                                       
> > > 
> > >   3                                                                          
> > > ...
> > > ...
> > > </code>
> > > 
> > > This bug has been existing since the time this code was written and is not a
> > > regression.
> > > 
> > > While the fix to THIS particular case of VM pause is simple, it turns out
> > > we're not done yet wrt eliminating corruption/pauses as there is a whole
> > > class of issues identified in DHT and rebalance that need fixing in terms of:
> > > 
> > > 1. fixing {F}XATTROP fop in DHT to identify files under migration and wind
> > > the fop on the dst file too.
> > > 
> > > 2. synchronisation between reading of xattrs from the src file and writing
> > > it on the dst file without letting a client modify the xattr in between.
> > > 
> > > 
> > > Raghavendra will be able to comment more on these other issues in DHT.
> > 
> > Thanks Krutika for the update.
> > 
> > Sharding stores its own metadata as xattrs on files (I assume on base
> > shard). However, during/after rebalance there are races due to which
> > consistency/integrity of xattrs on file _might_ be lost. Krutika has already
> > mentioned some of these above. To add to the list above,
> > 
> > 3. Remove copying of xattrs by rebalance process from src to dst at the end
> > of migration. Since xattrs are copied at the beginning of the migration
> > process _and_ any on-going modifications to xattrs to file being migrated
> > are replayed on dst too, recopying xattrs at the end of data migration looks
> > redundant. Since (read src, write dst) is not atomic wrt to modifications
> > from client, we might end up with stale values. This in principle, is
> > similar to point 2 mentioned above.
> > 
> 
> No, this will break the posix acls - the second copy was done precisely for
> this reason.
> 
> 

Does restoration of acls really require a listxattr followed by a bulk setxattr/overwrite of not only the acl xattrs but *all* xattrs associated with the file, which is what rebalance does in its current state?

-Krutika


> > 4. dht_(f)getxattr doesn't handle file migration scenario. It needs to check
> > for ENOENT/ESTALE errors and check for potential migration and get xattrs
> > from the file on new subvol if file is migrated. Not sure this is relevant
> > to shard as it might not use getxattr.
> > 
> 
> This is already in place for files- we use the DHT_IATT_IN_XDATA_KEY key to
> get this data.
> 
> 
> > Notes:
> > 
> > 1. A fix to problem 2 would require locking from clients while writing
> > xattrs ((f)setxattr, (f)xattrop, (f)removexattr). This would bring down
> > performance on ops touching xattrs _even_ while rebalance is not running.
> > 
> > 2. A word of caution is that there seems to be multiple issues before we can
> > get the use case tracked by this bz working. Note that we've already fixed
> > issue in fix-layout and also comment #30 seems to indicate an issue that
> > might have an RCA different from dht xattr consistency.
> > 
> > regards,
> > Raghavendra
> > 
> > > 
> > > -Krutika

Comment 40 Raghavendra G 2017-04-22 07:02:19 UTC
(In reply to Nithya Balachandran from comment #36)
> > 
> > 3. Remove copying of xattrs by rebalance process from src to dst at the end
> > of migration. Since xattrs are copied at the beginning of the migration
> > process _and_ any on-going modifications to xattrs to file being migrated
> > are replayed on dst too, recopying xattrs at the end of data migration looks
> > redundant. Since (read src, write dst) is not atomic wrt to modifications
> > from client, we might end up with stale values. This in principle, is
> > similar to point 2 mentioned above.
> > 
> 
> No, this will break the posix acls - the second copy was done precisely for
> this reason.

I was aware that the second copy was added as a bug fix, but couldn't recall exactly the reason. Thanks for the info. I'll dig into that. Irrespective of the reason the second copy was added, any copy from src to dst by the rebalance process without synchronizing with ongoing modifications from clients is a potential xattr corruption scenario (though the probabilities can vary).

Comment 46 SATHEESARAN 2017-04-26 16:53:13 UTC
Created attachment 1274317 [details]
fuse mount log from the hypervisor

Testcase performed:
0. Created replica 3 sharded volume (1x3) and optimized it for VM image store usecase
1. Created VMs and installed them with RHEL 7.3 os
2. After VM installation is done, added 3 more bricks to make the volume 2x3
3. Triggered rebalance

Comment 47 SATHEESARAN 2017-04-26 16:59:09 UTC
Created attachment 1274330 [details]
fuse mount logs from the hypervisor when performing different test

This test is a little different in that rebalance is triggered while VM installation is in progress (which means there is some I/O happening).

Testcase performed:
0. Created replica 3 sharded volume (1x3) and optimized it for VM image store usecase
1. Created VMs and started installing with RHEL 7.3 os
2. While the installation is in progress, added 3 more bricks to make the volume 2x3
3. Triggered rebalance while VM installation was still in progress ( i.e., I/O happening inside the VM )

Comment 55 SATHEESARAN 2017-06-21 16:13:51 UTC
I tested with glusterfs-3.8.4-28.el7rhgs and I see that one VM went into a paused state.

Currently not sure whether this issue is because of this bug or because of a new bug.

Comment 56 SATHEESARAN 2017-06-22 06:00:58 UTC
Verification of this bug is blocked by the other issue [1], as noted in comment 55.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1463907

Comment 57 SATHEESARAN 2017-08-12 04:17:07 UTC
Tested with an RHGS 3.3.0 interim build ( glusterfs-3.8.4-39.el7rhgs ) with the following steps:
1. Created replica 3 volume (1x3) and optimized it for the vmstore use case
2. Fuse mounted the volume
3. Created 5 VM images on the volume
4. Initiated creation of 5 VMs
5. While VM installation was in progress, added 3 more bricks to convert the 1x3 volume to a 2x3 volume
6. Started rebalance

Observation

VMs are healthy after rebalance completed. Repeated the test for 25 iterations and there were no problems.

Comment 58 SATHEESARAN 2017-08-12 04:24:03 UTC
(In reply to SATHEESARAN from comment #57)
> Tested with RHGS 3.3.0 interim build ( glusterfs-3.8.4-39.elrhgs with the
> followin steps
> 1. created replica 3 volume(1x3) and optimized it for vmstore usecase
> 2. Fuse mounted the volume
> 3. created 5 VM images on the volume
> 4. Initiate creation of 5 VMs
> 5. While VM installation is in progress, add 3 more bricks to convert 1x3
> volume to 2x3 volume
> 6. start rebalance
> 
> Observation
> 
> VMs are healthy post rebalance is completed. Repeated the test for 25
> iterations and there are no problems

As part of this test, also performed a remove-brick operation, with VMs accessing the VM disk images. This too worked well.

Comment 60 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Comment 61 errata-xmlrpc 2017-09-21 04:57:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

