+++ This bug was initially created as a clone of Bug #1140506 +++

Description of problem:
While a file migration is in progress, any data appended to the file is lost once the migration is over.

How reproducible:
Always

Steps to Reproduce:
1. Create a 54-brick dist-rep volume.
2. Create a big file of size 3 GB from urandom:
   dd if=/dev/urandom of=FILE bs=512M count=6
3. Rename the file to something else so that the subsequent rebalance migrates it:
   mv FILE abc
4. Check the file size before migration.
5. Start a rebalance with force:
   gluster volume rebalance <vol> start force
6. While migration is in progress, append some data to the file using the program attached to the bug (a hedged sketch of such an appender follows the log excerpt below).
7. Check the file size during migration.
8. Check the file size after migration.

Actual results:

Before migration
=======
[root@localhost mnt]# ll
total 3145737
drwxr-xr-x 2 root root 162 Sep 10 15:09 2
-rw-r--r-- 1 root root 24 Sep 10 15:12 f1
-rw-r-Sr-T 1 root root 3221225523 Sep 11 03:06 FILE1
-rwxr-xr-x 1 root root 8139 Sep 10 15:11 slow

After migration
=======
[root@localhost mnt]# ll
total 3145737
drwxr-xr-x 2 root root 162 Sep 10 15:09 2
-rw-r--r-- 1 root root 24 Sep 10 15:12 f1
-rw-r--r-- 1 root root 3221225522 Sep 11 03:06 FILE1
-rwxr-xr-x 1 root root 8139 Sep 10 15:11 slow

Additional info:

[2014-09-11 07:06:26.508878] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33829: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:27.510004] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33831: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:28.510720] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33833: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:29.511543] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33835: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:30.512182] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33837: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:31.517089] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33839: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:32.517822] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33841: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:33.518535] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33843: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:34.519160] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33845: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:35.519623] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33847: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:36.520035] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33849: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:37.520552] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33851: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:37.521722] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 33855: FLUSH() ERR => -1 (Invalid argument)
[2014-09-11 07:07:10.539605] W [client-rpc-fops.c:2761:client3_3_lookup_cbk] 6-dongra-client-4: remote operation failed: No such file or directory. Path: /FILE1 (fcbe026d-b646-4391-9f03-849a381e8a84)
[2014-09-11 07:07:10.539656] W [client-rpc-fops.c:2761:client3_3_lookup_cbk] 6-dongra-client-5: remote operation failed: No such file or directory. Path: /FILE1 (fcbe026d-b646-4391-9f03-849a381e8a84)

Attaching the logs.
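The appending program attached to the bug is not reproduced here; the following is a minimal hedged sketch of such an appender, assuming (consistent with the one-second cadence of the WRITE errors in the log above) that it simply appends a small record every second. The file name and record contents are illustrative.

/* append.c - hedged sketch of an appending writer, standing in for the
 * program attached to the bug. Assumption: it appends one small record
 * per second, matching the 1-second cadence of the WRITE errors in the
 * FUSE log above. Build: cc -o append append.c */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-gluster-mount>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_APPEND);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char record[] = "appended-during-migration\n";
    for (;;) {
        /* Each append should land at the end of the file; bytes written
         * while the file is migrating are what the bug loses. */
        if (write(fd, record, sizeof(record) - 1) < 0) {
            perror("write"); /* EINVAL was observed via fuse_writev_cbk */
            return 1;
        }
        sleep(1);
    }
}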
--- Additional comment from shylesh on 2014-09-12 05:46:52 EDT ---

Just an update on the bug: it is not reproducible on a fresh mount. That is, if rebalance is running for the first time after the mount and data is appended during it, everything works fine. If the same mount persists, a subsequent rebalance with a data append leads to this bug.

2.1u2 had a different issue for the same test case, which is captured in the following bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1059687
https://bugzilla.redhat.com/show_bug.cgi?id=1058569
https://bugzilla.redhat.com/show_bug.cgi?id=1054782

--- Additional comment from Shyamsundar on 2014-09-12 16:18:40 EDT ---

This bug has two code-related issues, split as "Issue 1: Invalid stashed value in inode ctx 1" and "Issue 2: Incorrect Phase 2 cached/hashed determination on open fd". I am detailing Issue 1, and Du will detail the second one.

Issue 1: Invalid stashed value in inode ctx1

Test case to reproduce this:
- Create an nx2 or even nx1 volume
- Mount it over FUSE
- Create a 2 GB file (say FINAL)
1. Rename FINAL to ABCDE
2. Ensure that ABCDE hashes to a different subvolume (for the next rebalance step to work)
3. Run a rebalance force
4. When rebalance has started on ABCDE, start an appending write to ABCDE
5. Check the file's sizes on the bricks
- Repeat steps 1..5 without restarting the mount or remounting

The second time the test is run, the appending write can demonstrate a couple of behaviors:
- dht_writev and dht_writev2 write to the same subvol, which is the old cached subvol (so the new location does not receive the bytes thus written)
- dht_writev2 is called with a cached subvol on which the fd we send is invalid (this is not caught by the application due to write-behind)

Finally, the data is either written to the older location only with no errors surfaced to the application, or the data is not written anywhere, or the first write appends the data to the older location and not to the newer location (i.e. the file's hashed subvolume in this case). (Older is the cached location and newer is the hashed location, so any appending writes not replayed to the newer location are lost once rebalance is done with the file.)

Code problem: dht_migration_complete_check_task (i.e. migration phase 2 detection) never gets called, as the appending writes finish before the file is completely migrated (hence the large file size). Due to this, inode_ctx_reset1 is never called, so we have stashed a subvol that we think future writes should be sent to whenever we detect a rebalance in progress during a FOP (say a write, but this could apply to any other FOP that checks via dht_inode_ctx_get1), and we blindly send the FOP to the returned subvol without opening the fd there. So the issue is that the stashed data in ctx1 should somehow be invalidated (post a rebalance?), otherwise we end up in troubled waters with data loss. The phase 1 check, dht_rebalance_in_progress_check, is not called when there is already data in inode ctx 1, for optimization reasons (so that each write or FOP needing this does not determine it again).

Solution proposed: Stash this ctx1 information on the fd instead, so that its life is the life of the fd. Even if the fd outlives the rebalance (i.e. it remains open after rebalance is complete), this would still work, as the brick retains the open fd, and we will detect Phase 2 of migration in progress/complete when we reuse this fd (unlink does not delete the file until the last fd is closed). A toy model of this lifetime argument follows below.

Other solutions are welcome; otherwise we will go ahead with this one for Issue 1 presented here.
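To make the lifetime argument concrete, here is a small self-contained toy model. None of these types or names are GlusterFS APIs; it only contrasts where the stash lives and how long it survives.

/* toy_stash.c - self-contained toy model of the proposal; all names are
 * illustrative, not GlusterFS code. Point: a dst subvol stashed in the
 * inode ctx outlives the migration that set it, while one stashed in
 * the fd ctx is discarded when the fd is closed. */
#include <stdio.h>
#include <stdlib.h>

struct inode {
    int ctx1_dst_subvol;   /* -1 == unset; survives across opens */
};

struct fd {
    struct inode *inode;
    int ctx_dst_subvol;    /* -1 == unset; dies with the fd */
};

static struct fd *fd_open(struct inode *in)
{
    struct fd *fd = malloc(sizeof(*fd));
    fd->inode = in;
    fd->ctx_dst_subvol = -1;
    return fd;
}

static void fd_close(struct fd *fd) { free(fd); } /* fd ctx vanishes here */

int main(void)
{
    struct inode in = { .ctx1_dst_subvol = -1 };

    /* First rebalance: phase 1 detected, dst subvol 2 stashed in inode
     * ctx1. Phase 2 detection never runs (the appending writes finish
     * first), so ctx1 is never reset. */
    in.ctx1_dst_subvol = 2;

    /* Later, on the same still-mounted inode, the stale inode ctx1 sends
     * a write to subvol 2 even if a second rebalance moved the file. */
    struct fd *fd = fd_open(&in);
    printf("inode-ctx stash says subvol %d (stale)\n", in.ctx1_dst_subvol);

    /* With the proposal the stash lives in the fd: a fresh fd starts
     * unset, forcing the migration phase checks to run again. */
    printf("fd-ctx stash is %d (unset, re-detect migration state)\n",
           fd->ctx_dst_subvol);
    fd_close(fd);
    return 0;
}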
REVIEW: http://review.gluster.org/8912 (cluster/dht: Fix stale subvol cache for files under migration) posted (#1) for review on master by Shyamsundar Ranganathan (srangana)
REVIEW: http://review.gluster.org/8912 (cluster/dht: Fix stale subvol cache for files under migration) posted (#2) for review on master by Shyamsundar Ranganathan (srangana)
REVIEW: http://review.gluster.org/8912 (cluster/dht: Fix stale subvol cache for files under migration) posted (#3) for review on master by Shyamsundar Ranganathan (srangana)
REVIEW: http://review.gluster.org/8912 (cluster/dht: Fix stale subvol cache for files under migration) posted (#4) for review on master by Shyamsundar Ranganathan (srangana)
REVIEW: http://review.gluster.org/8912 (cluster/dht: Fix stale subvol cache for files under migration) posted (#5) for review on master by Shyamsundar Ranganathan (srangana)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#1) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#2) for review on master by Raghavendra G (rgowdapp)
http://review.gluster.org/10805
REVIEW: http://review.gluster.org/10805 (cluster/dht: Don't rely on linkto xattr to find destination subvol during phase 2 of migration.) posted (#2) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#1) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#3) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10805 (cluster/dht: Don't rely on linkto xattr to find destination subvol during phase 2 of migration.) posted (#3) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#2) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#3) for review on master by Raghavendra G (rgowdapp)
COMMIT: http://review.gluster.org/10805 committed in master by Raghavendra G (rgowdapp)
------
commit 4df3ea9ab4d8a1aff98784460983b5f0cb4a9ee9
Author: Raghavendra G <rgowdapp>
Date: Wed May 13 19:56:47 2015 +0530

cluster/dht: Don't rely on linkto xattr to find destination subvol during phase 2 of migration.

The linkto xattr on the source file cannot be relied upon to find where the data file currently resides, since there can be multiple migrations before a client detects phase 2. For example:
* Migration (M1, node1, node2) starts.
* The application writes some data. DHT correctly stores the state in the inode context that phase 1 of migration is in progress.
* Migration M1 completes.
* Migration (M2, node2, node3) is triggered and completed.
* The application resumes writes to the file. DHT identifies it as phase 2 of migration. However, the linkto xattr on node1 points to node2, while the file is on node3. A lookup correctly identifies node3 as the cached subvol.

TBD: When we identify phase 2 of a previous migration (say M1), there might be a migration in progress, say (M3, node3, node4). In this case we need to send writes to both (node3, node4), not just node3. Also, the inode state needs to correctly indicate that it is in phase 1 of migration. I'll send this as a different patch.

Change-Id: I1a861f766258170af2f6c0935468edb6be687b95
BUG: 1142423
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: http://review.gluster.org/10805
Tested-by: NetBSD Build System
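As an aside, the linkto xattr the commit refers to can be inspected directly on a brick. A hedged sketch using getxattr(2) follows; the brick-side path is an assumption, and this must run where the brick filesystem is visible (typically as root on the brick host).

/* linkto.c - hedged sketch: read the DHT linkto xattr from a file on a
 * brick. Build: cc -o linkto linkto.c */
#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char *argv[])
{
    char buf[256];
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-brick>\n", argv[0]);
        return 1;
    }
    /* trusted.glusterfs.dht.linkto names the subvolume the link file
     * points at - exactly the value the commit says can go stale. */
    ssize_t n = getxattr(argv[1], "trusted.glusterfs.dht.linkto",
                         buf, sizeof(buf) - 1);
    if (n < 0) {
        perror("getxattr");
        return 1;
    }
    buf[n] = '\0';
    printf("linkto -> %s\n", buf);
    return 0;
}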
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#4) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#4) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#5) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#5) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#6) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#6) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#7) for review on master by Niels de Vos (ndevos)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#7) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#8) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10834 (cluster/dht: fix incorrect dst subvol info in inode_ctx) posted (#8) for review on master by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/10943 (cluster/dht: pass a destination subvol to fop2 variants to avoid races.) posted (#9) for review on master by Raghavendra G (rgowdapp)
COMMIT: http://review.gluster.org/10943 committed in master by Raghavendra G (rgowdapp)
------
commit b6eda067d2e2a0b56718ea71522f6c7b06a09f13
Author: Raghavendra G <rgowdapp>
Date: Thu May 28 16:03:12 2015 +0530

cluster/dht: pass a destination subvol to fop2 variants to avoid races.

The destination subvol used in the fop2 variants is stored either in inode-ctx1 or in local->cached_subvol. However, it is not guaranteed that a value stored in these locations before invocation of fop2 is still present after the invocation, as these locations are shared among different concurrent operations. So, to preserve the atomicity of "check dst-subvol and invoke fop2 variant if dst-subvol found", we pass the dst-subvol down to the fop2 variant.

This patch also fixes error handling in some fop2 variants.

Change-Id: Icc226228a246d3f223e3463519736c4495b364d2
BUG: 1142423
Signed-off-by: Raghavendra G <rgowdapp>
Reviewed-on: http://review.gluster.org/10943
Tested-by: NetBSD Build System <jenkins.org>
Reviewed-by: N Balachandran <nbalacha>
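The atomicity argument in the commit message can be illustrated with a tiny single-threaded toy (illustrative names only, not the actual DHT code); the "concurrent" clobber of the shared slot is simulated inline to expose the window.

/* toy_race.c - illustrative only; names are not GlusterFS APIs.
 * Shows why "check, then have the callee re-read shared state" is racy
 * while "check, then pass the checked value down" is not. */
#include <stdio.h>

static int shared_dst_subvol = 3;   /* stands in for inode-ctx1 /
                                       local->cached_subvol */

/* racy variant: re-reads the shared slot, which may have changed
 * between the caller's check and this read */
static void fop2_racy(void)
{
    printf("racy fop2 uses subvol %d\n", shared_dst_subvol);
}

/* fixed variant: operates on the exact value the caller checked */
static void fop2_fixed(int dst_subvol)
{
    printf("fixed fop2 uses subvol %d\n", dst_subvol);
}

int main(void)
{
    int checked = shared_dst_subvol;  /* caller checks dst-subvol */
    if (checked != -1) {
        shared_dst_subvol = -1;       /* simulated concurrent op clobbers
                                         the shared slot in the window */
        fop2_racy();                  /* prints -1: the check was wasted */
        fop2_fixed(checked);          /* prints 3: atomicity preserved */
    }
    return 0;
}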
COMMIT: http://review.gluster.org/10834 committed in master by Raghavendra G (rgowdapp)
------
commit 9684b90526d03a15d451e341521d7df44adae73e
Author: Nithya Balachandran <nbalacha>
Date: Tue May 19 23:27:35 2015 +0530

cluster/dht: fix incorrect dst subvol info in inode_ctx

Stash additional information in the inode_ctx to help decide whether the migration information is stale. Stale information can occur if a file was migrated several times but FOPs only detected the P1 migration phase; if no FOP detects the P2 phase, inode ctx1 is never reset.

We now save the src subvol as well as the dst subvol in the inode ctx. The src subvol is the subvol on which the FOP was sent when the migration info was set in the inode ctx. This information is considered stale if:
1. the subvol on which the current FOP is sent is the same as the dst subvol in the ctx, or
2. the subvol on which the current FOP is sent is not the same as the src subvol in the ctx.

This does not handle the case where the same file might have been renamed such that the src subvol is the same but the dst subvol is different. However, that is unlikely to happen very often.

Change-Id: I05a2e9b107ee64750c7ca629aee03b03a02ef75f
BUG: 1142423
Signed-off-by: Nithya Balachandran <nbalacha>
Reviewed-on: http://review.gluster.org/10834
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Raghavendra G <rgowdapp>
Tested-by: Raghavendra G <rgowdapp>
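The two staleness rules above translate directly into a predicate. A minimal self-contained sketch (illustrative names, not the actual DHT code):

/* stale_check.c - hedged sketch of the two staleness rules from the
 * commit message above. */
#include <stdio.h>
#include <string.h>

/* Migration info is stale if the FOP went to the ctx's dst subvol
 * (rule 1) or to a subvol other than the one recorded as src when the
 * info was stashed (rule 2). */
static int mig_info_is_stale(const char *ctx_src, const char *ctx_dst,
                             const char *fop_subvol)
{
    return strcmp(fop_subvol, ctx_dst) == 0 ||
           strcmp(fop_subvol, ctx_src) != 0;
}

int main(void)
{
    /* ctx stashed during a migration node1 -> node2 */
    printf("%d\n", mig_info_is_stale("node1", "node2", "node1")); /* 0: fresh */
    printf("%d\n", mig_info_is_stale("node1", "node2", "node2")); /* 1: rule 1 */
    printf("%d\n", mig_info_is_stale("node1", "node2", "node3")); /* 1: rule 2 */
    return 0;
}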
REVIEW: http://review.gluster.org/11175 (cluster/dht Use additional dst_info in inode_ctx) posted (#1) for review on master by N Balachandran (nbalacha)
REVIEW: http://review.gluster.org/11175 (cluster/dht Use additional dst_info in inode_ctx) posted (#2) for review on master by N Balachandran (nbalacha)
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user