Bug 1465123

Summary: Fd based fops fail with EBADF on file migration
Product: [Community] GlusterFS
Reporter: Shyamsundar <srangana>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: 3.12
CC: alexander, atumball, bugs, kdhananj, nbalacha, stefano.stagnaro
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-4.1.4
Clone Of: 1465075
Last Closed: 2018-10-08 10:41:59 UTC
Type: Bug
Bug Depends On: 1465075    

Description Shyamsundar 2017-06-26 17:22:30 UTC
+++ This bug was initially created as a clone of Bug #1465075 +++

Description of problem:

In DHT, there is a scenario where fd based fops may be sent on the dst subvolume after the file has been migrated but before the fd has been opened on it. This happens because certain operations update the cached subvol in the dht inode ctx without checking whether an fd is still open only on the original subvol. DHT fd based fops currently rely on phase1/phase2 migration checks to open fds on the dst subvol; in this scenario no such check kicks in, so the fop fails with EBADF.

This is seen with dist-rep volumes.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Create a 2x2 volume.
2. Create a file FILE1. Assume it is created on subvol1. Rename it to NFILE1 so it hashes to subvol2.
3. Open an fd on NFILE1 (on subvol1).
4. Perform a rebalance so the file is migrated to subvol2.
5. On the same mount point, perform a lookup/readdirp so the cached subvol in the inode_ctx of NFILE1 is updated to subvol2.
6. Perform a write on the fd.

The write is sent to subvol2 on an fd which has been opened only on subvol1. Since the migration phase checks don't kick in, the fd is not opened on subvol2 and the fop fails with EBADF.

Actual results:


Expected results:


Additional info:

This is being fixed by having every fd based fop check if the fd has been opened on the cached subvol before winding the fop down.
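
For illustration, a minimal, self-contained C model of that check follows. It is not the actual GlusterFS code: the types and helper names (subvol_t, fd_state_t, open_on_subvol, wind_write, dht_writev_like) are hypothetical stand-ins for the real xlator/fd-ctx machinery, and only the control flow matters.

/*
 * Sketch only: before winding an fd-based fop to the current cached
 * subvol, make sure the fd has actually been opened there.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_SUBVOLS 4

typedef struct { int id; } subvol_t;

typedef struct {
    bool open_on[MAX_SUBVOLS];   /* subvols this fd has been opened on */
} fd_state_t;

/* Hypothetical: open the backend fd on the given subvol. */
static int open_on_subvol(fd_state_t *fd, subvol_t *sv) {
    fd->open_on[sv->id] = true;
    return 0;
}

/* Hypothetical: wind the write to the subvol; without the fix this is
 * where the fop fails with EBADF, because the fd was never opened there. */
static int wind_write(fd_state_t *fd, subvol_t *sv) {
    return fd->open_on[sv->id] ? 0 : -1;   /* -1 models EBADF */
}

/* The fix: an fd-based fop first checks the fd state against the current
 * cached subvol and opens the fd there before winding the fop down. */
static int dht_writev_like(fd_state_t *fd, subvol_t *cached) {
    if (!fd->open_on[cached->id] && open_on_subvol(fd, cached) != 0)
        return -1;
    return wind_write(fd, cached);
}

int main(void) {
    subvol_t subvol1 = {0}, subvol2 = {1};
    fd_state_t fd = {{false}};

    open_on_subvol(&fd, &subvol1);   /* fd opened on the original subvol */
    /* ... file migrated; a lookup updates the cached subvol to subvol2 ... */
    printf("write %s\n",
           dht_writev_like(&fd, &subvol2) == 0 ? "succeeds" : "fails (EBADF)");
    return 0;
}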

--- Additional comment from Worker Ant on 2017-06-26 11:51:07 EDT ---

REVIEW: https://review.gluster.org/17630 (cluster/dht: Check if fd is opened on dst subvol) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 1 Worker Ant 2017-06-26 17:28:59 UTC
REVIEW: https://review.gluster.org/17631 (Revert "Revert "glusterd: disallow rebalance & remove-brick on a sharded volume"") posted (#2) for review on release-3.11 by Shyamsundar Ranganathan (srangana)

Comment 2 Worker Ant 2017-06-27 14:04:59 UTC
COMMIT: https://review.gluster.org/17631 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit c5ca4e8981bcf6a854eb3f708aa96fe6ff1f8e05
Author: Shyamsundar Ranganathan <srangana>
Date:   Mon Jun 26 16:23:45 2017 +0000

    Revert "Revert "glusterd: disallow rebalance & remove-brick on a sharded volume""
    
    This is being reverted as a new bug around rebalance has been uncovered.
    As a result we would like to retain the warning in the code and in the
    release-notes.
    
    The new bug being, https://bugzilla.redhat.com/show_bug.cgi?id=1465075
    
    This reverts commit 9c5403587517b5922cb87bff75033839e96d56ab.
    
    Change-Id: Ifd38ae0a41539aeb67723eb3ee704c18c50571b0
    BUG: 1465123
    Signed-off-by: Shyam <srangana>
    Reviewed-on: https://review.gluster.org/17631
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

Comment 3 Worker Ant 2017-06-28 12:40:46 UTC
REVIEW: https://review.gluster.org/17643 (cluster/dht: Check if fd is opened on dst subvol) posted (#1) for review on release-3.11 by N Balachandran (nbalacha)

Comment 4 Shyamsundar 2017-06-28 18:29:36 UTC
Once this bug is fixed, please use the same bug to revert the change introduced in https://review.gluster.org/17631

Comment 5 Worker Ant 2017-07-03 13:00:55 UTC
COMMIT: https://review.gluster.org/17643 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 26a81d878c6669cfecb81892a4460bd904512196
Author: N Balachandran <nbalacha>
Date:   Mon Jun 26 21:12:56 2017 +0530

    cluster/dht: Check if fd is opened on dst subvol
    
    If an fd is opened on a file, the file is migrated,
    and the cached subvol is updated in the inode_ctx
    before an fd based fop is sent, then the fop is sent
    to the dst subvol on which the fd has not been opened.
    This causes the FOP to fail with EBADF.
    
    Now, every fd based fop will check to see that the fd
    has been opened on the dst subvol before winding it down.
    
    > BUG: 1465075
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/17630
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Reviewed-by: Susant Palai <spalai>
    > CentOS-regression: Gluster Build System <jenkins.org>
    (cherry picked from commit 91db0d47ca267aecfc6124a3f337a4e2f2c9f1e2)
    Change-Id: Id92ef5eb7a5b5226688e2d2868b15e383f5f240e
    BUG: 1465123
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17643
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 6 Worker Ant 2017-07-12 06:54:59 UTC
REVIEW: https://review.gluster.org/17751 (cluster/dht: Fix fd check race) posted (#1) for review on release-3.11 by N Balachandran (nbalacha)

Comment 7 Worker Ant 2017-07-19 11:25:14 UTC
COMMIT: https://review.gluster.org/17751 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 856aa0ff090fa08629493e7848e5c446a93a9f2d
Author: N Balachandran <nbalacha>
Date:   Mon Jul 10 09:38:54 2017 +0530

    cluster/dht: Fix fd check race
    
    There is another race between the cached subvol
    being updated in the inode_ctx and the fd being opened on
    the target.
    
    1. fop1 -> fd1 -> subvol0
    2. file migrated from subvol0 to subvol1 and cached_subvol
       changed to subvol1 in inode_ctx
    3. fop2 -> fd1 -> subvol1 [takes new cached subvol]
    4. fop2 -> checks fd ctx (fd not open on subvol1) -> opens fd1 on subvol1
    5. fop1 -> checks fd ctx (fd not open on subvol0)
       -> tries to open fd1 on subvol0 -> fails with "No such file or directory".
    
    Fix:
    If dht_fd_open_on_dst fails with ENOENT or ESTALE, wind to old subvol
    and let the phase1/phase2 checks handle it.
    
    > BUG: 1465075
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/17731
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Reviewed-by: Amar Tumballi <amarts>
    (cherry picked from commit f7a450c17fee7e43c544473366220887f0534ed7)
    Change-Id: I34f8011574a8b72e3bcfe03b0cc4f024b352f225
    BUG: 1465123
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17751
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    CentOS-regression: Gluster Build System <jenkins.org>
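
As an aside, here is a rough, self-contained model of the fallback described in this commit message: if opening the fd on the target subvol fails with ENOENT or ESTALE, the fop is wound to the old subvol and the phase1/phase2 migration checks take over. This is not the actual dht_fd_open_on_dst()/fop code; all types and helper names below are hypothetical stand-ins.

/* Sketch only: fallback path when the fd cannot be opened on the target. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { int id; bool file_present; } subvol_t;
typedef struct { bool open_on[4]; } fd_state_t;

/* Hypothetical: try to open the fd on the target subvol. A racing fop or
 * migration may mean the file is not there, so ENOENT/ESTALE can come back. */
static int try_open_on_target(fd_state_t *fd, subvol_t *target) {
    if (!target->file_present)
        return -ENOENT;
    fd->open_on[target->id] = true;
    return 0;
}

/* Hypothetical: wind the fop to the given subvol. */
static int wind_fop(fd_state_t *fd, subvol_t *sv) {
    (void)fd;
    printf("fop wound to subvol%d\n", sv->id);
    return 0;
}

static int fd_fop_with_fallback(fd_state_t *fd, subvol_t *old, subvol_t *target) {
    int ret = try_open_on_target(fd, target);

    if (ret == -ENOENT || ret == -ESTALE) {
        /* Don't fail the fop here: wind it to the old subvol and let the
         * phase1/phase2 migration checks re-route it if necessary. */
        return wind_fop(fd, old);
    }
    if (ret < 0)
        return ret;
    return wind_fop(fd, target);
}

int main(void) {
    subvol_t subvol0 = {0, true}, subvol1 = {1, false};
    fd_state_t fd = {{false}};

    fd.open_on[0] = true;   /* fd originally opened on subvol0 */
    return fd_fop_with_fallback(&fd, &subvol0, &subvol1);
}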

Comment 8 Shyamsundar 2017-08-29 14:40:57 UTC
@nithya for 3.12.1 I think we need to revert the glusterd changes to allow rebalance on sharded volumes without the warning. This fix (among others) has been available to users for some time (from the 3.11 and 3.10 update releases as well), and we have not heard back about any new problems.

As a result, I am moving the version on this bug to 3.12 (as 3.11 will be EOL'd when 3.12 releases), and we could target 3.12.1 to remove the warning as stated in comment #4.

Comment 9 Nithya Balachandran 2017-08-30 05:52:05 UTC
I'm concerned about the fxattrops not handling migrating files. That could cause issues with updating the size attribute.

@Krutika, what do you think? We have not managed to hit this in our tests but we know it exists.

Comment 10 Krutika Dhananjay 2017-08-30 06:25:52 UTC
It depends on which copy DHT on the client uses to serve metadata fops.

1. If it is guaranteed that the xattr value on the src and dst copy will eventually be in sync without any out-of-order updates causing them to deviate, then i'm guessing it should be OK.

2. If until migration completes, DHT on the client serves iatts as part of lookup, stat etc from the correct copy (the copy that has witnessed all the xattrops wound by shard as part of ongoing writes), then it is probably ok.

Comment 11 Nithya Balachandran 2017-09-27 04:36:50 UTC
(In reply to Krutika Dhananjay from comment #10)
> It depends on which copy DHT on the client uses to serve metadata fops.
> 
> 1. If it is guaranteed that the xattr value on the src and dst copy will
> eventually be in sync without any out-of-order updates causing them to
> deviate, then i'm guessing it should be OK.

> 
> 2. If until migration completes, DHT on the client serves iatts as part of
> lookup, stat etc from the correct copy (the copy that has witnessed all the
> xattrops wound by shard as part of ongoing writes), then it is probably ok.


What happens if an fxattrop is sent to a migrating file after all xattrs have been copied off the src file? That information would not be copied to the dst as dht_fxattrop does not perform migration checks.

Comment 12 Krutika Dhananjay 2017-09-28 07:44:17 UTC
(In reply to Nithya Balachandran from comment #11)
> (In reply to Krutika Dhananjay from comment #10)
> > It depends on which copy DHT on the client uses to serve metadata fops.
> > 
> > 1. If it is guaranteed that the xattr value on the src and dst copy will
> > eventually be in sync without any out-of-order updates causing them to
> > deviate, then i'm guessing it should be OK.
> 
> > 
> > 2. If until migration completes, DHT on the client serves iatts as part of
> > lookup, stat etc from the correct copy (the copy that has witnessed all the
> > xattrops wound by shard as part of ongoing writes), then it is probably ok.
> 
> 
> What happens if an fxattrop is sent to a migrating file after all xattrs
> have been copied off the src file? That information would not be copied to
> the dst as dht_fxattrop does not perform migration checks.

Do you mean to say that the fxattrop will only be wound on the src file, which will subsequently be unlinked without any further syncing to the destination copy? If so, then it will be a problem. VM operations are very sensitive to incorrect image size. A lot of our vm-corruption/vm-not-booting problems in the past have been because the size of the image read or cached somewhere was wrong.

-Krutika

Comment 13 Nithya Balachandran 2017-09-29 03:39:14 UTC
> Do you mean to say that the fxattrop will only be wound on the src file
> which will subsequently be unlinked without any further syncing to
> destination copy? 
Yes, that is correct. I had started on https://review.gluster.org/#/c/17776/ for this reason.

> If so, then it will be a problem. VM operations are very
> sensitive to incorrect image size. A lot of our vm-corruption/vm-not-booting
> problems in the past have been because the size of the image read or cached
> somewhere was wrong.
> 
> -Krutika

Comment 14 Nithya Balachandran 2018-02-20 04:53:46 UTC
The patch https://review.gluster.org/#/c/19382/ handles fxattrops for migrating files. This should fix all the corruption issues.


Where can I find the warning displayed when attempting to rebalance sharded volumes?

Comment 15 Krutika Dhananjay 2018-02-20 05:06:26 UTC
(In reply to Nithya Balachandran from comment #14)
> The patch https://review.gluster.org/#/c/19382/ handles fxattrops for
> migrating files. This should fix all the corruption issues.
> 
> 
> Where can I find the warning displayed when attempting to rebalance
> sharded volumes?

So this was the original patch on master that introduced the warning - https://review.gluster.org/#/c/17160/

... which I'd reverted @ https://review.gluster.org/#/c/17506/ based on initial feedback from some of the users after they tested some of the fixes.