Bug 1530146 - dht_(f)xattrop does not implement migration checks
Summary: dht_(f)xattrop does not implement migration checks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Nithya Balachandran
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1471031 1540224
Blocks: 1498081 1503137 1515434
 
Reported: 2018-01-02 07:36 UTC by Atin Mukherjee
Modified: 2018-09-04 06:42 UTC

Fixed In Version: glusterfs-3.12.2-4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1471031
Environment:
Last Closed: 2018-09-04 06:40:20 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:42:11 UTC)

Description Atin Mukherjee 2018-01-02 07:36:04 UTC
+++ This bug was initially created as a clone of Bug #1471031 +++

Description of problem:

Sharding uses xattrops to maintain the size info on the file. DHT does not currently implement migration phase1/phase2 checks for these operations, which can lead to incorrect xattr values when sharding is enabled.
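
For background, an xattrop with an ADD flag (e.g. GF_XATTROP_ADD_ARRAY64) is a read-modify-write on the brick: the stored xattr value is incremented by the client-supplied delta rather than overwritten, so the result depends on whatever value is already present on that brick. A minimal standalone C model of that semantic (illustrative names and values only, not the actual posix xlator code):

/* Model of xattrop ADD semantics: the brick reads the stored value,
 * adds the caller-supplied delta, and stores the result. The outcome
 * depends on the value already present on that brick, which is what
 * makes the migration races described below possible. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct brick_xattr {
    int64_t shard_size;    /* one brick's copy of a size xattr */
};

/* xattrop(ADD): read-modify-write against this brick's stored value */
static int64_t xattrop_add(struct brick_xattr *brick, int64_t delta)
{
    brick->shard_size += delta;
    return brick->shard_size;
}

int main(void)
{
    struct brick_xattr src = { .shard_size = 100 };

    /* a shard write updates the size via xattrop(ADD), not setxattr */
    xattrop_add(&src, 4096);
    printf("src shard_size = %lld\n", (long long)src.shard_size);
    return 0;
}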

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2017-07-14 05:55:24 EDT ---

REVIEW: https://review.gluster.org/17776 (cluster/dht: Add migration checks to dht_(f)xattrop) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-07-14 06:05:31 EDT ---

REVIEW: https://review.gluster.org/17776 (cluster/dht: Add migration checks to dht_(f)xattrop) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-07-17 01:38:41 EDT ---

REVIEW: https://review.gluster.org/17776 (cluster/dht: Add migration checks to dht_(f)xattrop) posted (#3) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-07-24 08:13:30 EDT ---

REVIEW: https://review.gluster.org/17776 (cluster/dht: Add migration checks to dht_(f)xattrop) posted (#4) for review on master by N Balachandran (nbalacha)

--- Additional comment from Nithya Balachandran on 2017-12-04 01:04:04 EST ---

We may run into issues with dht_migrate_file and the xattrops sent by the shard xlator: because the xattrop is an ADD operation, the wrong value could end up being set.

file  = data file
file' = linkto file

Time                Source                                  Target
-----------------------------------------------------------------------

t0                  file                                   file' 
                    (shard.size = x1)

t1                  file  (listxattr)                      file'
                    (shard.size = x1)

t2                  file                                   file' (setxattr) 
                    shard.size=x1                          shard.size=x1

t3                  file (xattrop (ADD))                   file' 
                    shard.size = x2                        (shard.size = x1)
 [operation not performed on the target as the file is not yet marked PHASE1]

Now, a write + xattrop is performed after the S+T bits have been set.

t4                  file (S+T) (xattrop(ADD))             file'
                    shard.size = x3                       (shard.size = x3')
The operation is now performed on the target as well, but the value will differ because the ADD is applied to a different initial value.


Convert target to data file
t5                   file (S+T)                           file
                     shard.size = x3                      (shard.size = x3')

A write + xattrop is now performed but is sent only to the target file (as it is now a data file on the hashed subvol)

t6                   file (S+T)                           file (xattrop(ADD))
                     shard.size = x3                      shard.size = x4

Copy xattrs again from src to target

t7                   file (S+T) (listxattr)              file (setxattr)
                     shard.size = x3                     shard.size = x3


Convert source to linkto

t8                   file'                               file 
                     shard.size= x3                      (shard.size = x3)



The listxattr+setxattr at t7 would fix the race at t3. However, it introduces its own race and hence possible file corruption. 
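
To make the divergence concrete, the t2-t4 window above can be modeled with plain integers (a standalone sketch with made-up sizes, not GlusterFS code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t src = 100, dst;      /* x1 = 100; sizes are made up */

    dst = src;                   /* t2: listxattr+setxattr copies x1 to dst  */
    src += 4096;                 /* t3: xattrop(ADD) before PHASE1, src only */
    src += 8192;                 /* t4: S+T set, ADD now hits both bricks... */
    dst += 8192;                 /*     ...but dst still has the stale base  */

    /* prints src = 12388 (x3), dst = 8292 (x3'): the copies stay diverged
     * until the t7 listxattr+setxattr overwrites dst, which in turn races
     * with the ADD at t6 */
    printf("src = %lld, dst = %lld\n", (long long)src, (long long)dst);
    return 0;
}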


A possible solution would be to:

1. Set the S+T bits before performing the initial listxattr + setxattr. This does not fix the race where xattrops may reach the dst out of order.
2. Set only the POSIX ACLs in the second setxattr.

Initially I thought the (f)xattrops needed to check only for PHASE2 and that the xattrs could be copied across by the second listxattr+setxattr. However, that leaves a window where the values could go out of sync.

A write could hit the src before the dst linkto has been converted to a data file. After the dst has been converted to a data file but before the listxattr+setxattr, a lookup from somewhere could update the inode ctx to point to the new hashed subvol (dst), and the xattrop would then be sent only to the dst. However, if phase1 checks were not implemented, this ADD would be applied to the wrong value, and the listxattr+setxattr would then overwrite it with the wrong value.



@Krutika, is the above understanding of how xattrops work with shard correct?

--- Additional comment from Raghavendra G on 2017-12-26 00:20:29 EST ---

(In reply to Nithya Balachandran from comment #5)
> [...]
> A possible solution would be to:
>
> 1. Set the S+T bits before performing the initial listxattr + setxattr. This
> does not fix the race where xattrops may reach the dst out of order.

That's correct. This can still lead to a race where the setxattr and the xattrop reach the dst out of order. This is very similar to reads from rebalance racing with writes from a client, discussed in [1] and [2]. The solution discussed in [2], when extended to setxattr and xattrop, can solve this race.

[1] https://github.com/gluster/glusterfs/issues/308
[2] https://github.com/gluster/glusterfs/issues/347

--- Additional comment from Worker Ant on 2017-12-26 00:26:03 EST ---

COMMIT: https://review.gluster.org/17776 committed in master by "N Balachandran" <nbalacha> with a commit message- cluster/dht: Add migration checks to dht_(f)xattrop

The dht_(f)xattrop implementation did not implement
migration phase1/phase2 checks which could cause issues
with rebalance on sharded volumes.
This does not solve the issue where fops may reach the target
out of order.

Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c
BUG: 1471031
Signed-off-by: N Balachandran <nbalacha>
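
Roughly, the patch follows the usual DHT pattern: the fop callback inspects the iatt returned by the brick for the rebalance markers and, depending on the phase, replays or redirects the fop on the migration target. A simplified standalone sketch of that decision logic (the types and helper names below are illustrative stand-ins, not the code in the patch; see the review link above for the real change):

/* PHASE1: the source is still the data file but carries the sticky+sgid
 *         ("S+T") marker bits; the fop must also be replayed on the
 *         migration target so ADDs land on both copies.
 * PHASE2: the source has become a linkto file; the fop must be
 *         redirected to the target. */
#include <stdbool.h>
#include <stdio.h>

struct stat_model {          /* stand-in for the iatt a brick returns */
    bool is_regular;
    bool sticky, sgid;       /* "S+T" rebalance-in-progress marker */
    bool is_linkto;          /* zero-perm linkto file: migration done */
};

static bool migration_phase1(const struct stat_model *st)
{
    return st->is_regular && st->sticky && st->sgid;
}

static bool migration_phase2(const struct stat_model *st)
{
    return st->is_regular && st->is_linkto;
}

/* called from the (f)xattrop callback with the returned stat */
static void xattrop_cbk(const struct stat_model *st)
{
    if (migration_phase1(st))
        printf("phase1: replay xattrop on the migration target too\n");
    else if (migration_phase2(st))
        printf("phase2: redirect xattrop to the migration target\n");
    else
        printf("no migration: unwind the result to the client\n");
}

int main(void)
{
    struct stat_model mid_migration = {
        .is_regular = true, .sticky = true, .sgid = true
    };
    xattrop_cbk(&mid_migration);
    return 0;
}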

Comment 2 Atin Mukherjee 2018-01-02 07:37:02 UTC
upstream patch : https://review.gluster.org/#/c/17776/

Comment 6 Nithya Balachandran 2018-02-22 05:49:20 UTC
No doc text is required for now. Do we have anything in the documentation that states users should not perform rebalance on sharded volumes? If so, we might need doc changes.

Comment 7 Krutika Dhananjay 2018-02-22 10:19:26 UTC
(In reply to Nithya Balachandran from comment #6)
> No doc text is required for now. Do we have anything in the documentation
> that states users should not perform rebalance on sharded volumes? If so,
> we might need doc changes.

I spoke to Sahina about this. It turns out we still don't support scaling out the default 1x3 RHHI volumes (where sharding is enabled by default) through add-brick, and this will continue to be the case for RHHI 2.0 (which corresponds to RHGS 3.4.0) as well.

Comment 8 SATHEESARAN 2018-08-28 12:07:53 UTC
Tested with the RHGS 3.4.0 nightly build (glusterfs-3.12.2-17.el7rhgs) using the following steps:

1. Created a replicate volume with 3 bricks and used it as a RHV storage domain.
2. Created a few VMs with their disks on the storage domain.
3. Ran some I/O on the VMs.
4. Added 3 more bricks.
5. Triggered rebalance.

All looks good post-rebalance.

Comment 10 errata-xmlrpc 2018-09-04 06:40:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

