Bug 1140506 - [DHT-REBALANCE]-DataLoss: The data appended to a file during its migration will be lost once the migration is done
Summary: [DHT-REBALANCE]-DataLoss: The data appended to a file during its migration will be lost once the migration is done
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Raghavendra G
QA Contact: Triveni Rao
URL:
Whiteboard: dht-data-loss
Depends On: 1142423 1225809
Blocks: 1202842
 
Reported: 2014-09-11 07:25 UTC by shylesh
Modified: 2016-05-16 04:38 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.1-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1142423
Environment:
Last Closed: 2015-07-29 04:35:53 UTC
Embargoed:


Attachments
mount logs (2.56 MB, application/octet-stream)
2014-09-11 07:27 UTC, shylesh
test program (1.17 KB, text/plain)
2014-09-11 07:40 UTC, shylesh


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:1495 0 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Description shylesh 2014-09-11 07:25:50 UTC
Description of problem:
If data is appended to a file while its migration is in progress, the appended data is lost once the migration completes.

Version-Release number of selected component (if applicable):
3.6.0.28-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 54-brick distributed-replicate volume.
2. Create a large file of about 3 GB from /dev/urandom:
   dd if=/dev/urandom of=FILE bs=512M count=6
3. Rename the file so that the subsequent rebalance migrates it:
   mv FILE abc
4. Check the file size before migration.
5. Start a rebalance with force:
   gluster volume rebalance <vol> start force
6. While the migration is in progress, append some data to the file
   (use the program attached to this bug).
7. Check the file size during migration.
8. Check the file size after migration.
(A consolidated shell sketch of these steps follows below.)
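
For convenience, a consolidated sketch of the steps above, run from the FUSE mount point. The mount path (/mnt) and volume name (testvol) are placeholders, and a plain dd append stands in for the attached test program:

# Reproduce: create, rename, rebalance, append during migration, compare sizes.
cd /mnt                                        # FUSE mount point (placeholder)
dd if=/dev/urandom of=FILE bs=512M count=6     # ~3 GB file
mv FILE abc                                    # rename so the next rebalance migrates it
ls -l abc                                      # size before migration

gluster volume rebalance testvol start force
# While the file is being migrated, append from the mount
# (stand-in for the attached test program):
dd oflag=append conv=notrunc if=/dev/zero of=abc bs=1024 count=100000
ls -l abc                                      # size during migration

gluster volume rebalance testvol status        # wait until "completed"
ls -l abc                                      # size after migration; compare with the above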

Actual results:

The file shrinks across the migration (3221225523 bytes before vs 3221225522 bytes after), showing that data appended during the migration was lost:

Before migration
=======
[root@localhost mnt]# ll
total 3145737
drwxr-xr-x 2 root root        162 Sep 10 15:09 2
-rw-r--r-- 1 root root         24 Sep 10 15:12 f1
-rw-r-Sr-T 1 root root 3221225523 Sep 11 03:06 FILE1
-rwxr-xr-x 1 root root       8139 Sep 10 15:11 slow

after migration
=========
[root@localhost mnt]# ll
total 3145737
drwxr-xr-x 2 root root        162 Sep 10 15:09 2
-rw-r--r-- 1 root root         24 Sep 10 15:12 f1
-rw-r--r-- 1 root root 3221225522 Sep 11 03:06 FILE1
-rwxr-xr-x 1 root root       8139 Sep 10 15:11 slow

 

Additional info:

[2014-09-11 07:06:26.508878] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33829: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:27.510004] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33831: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:28.510720] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33833: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:29.511543] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33835: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:30.512182] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33837: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:31.517089] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33839: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:32.517822] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33841: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:33.518535] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33843: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:34.519160] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33845: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:35.519623] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33847: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:36.520035] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33849: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:37.520552] W [fuse-bridge.c:2238:fuse_writev_cbk] 0-glusterfs-fuse: 33851: WRITE => -1 (Invalid argument)
[2014-09-11 07:06:37.521722] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 33855: FLUSH() ERR => -1 (Invalid argument)
[2014-09-11 07:07:10.539605] W [client-rpc-fops.c:2761:client3_3_lookup_cbk] 6-dongra-client-4: remote operation failed: No such file or directory. Path: /FILE1 (fcbe026d-b646-4391-9f03-849a381e8a84)
[2014-09-11 07:07:10.539656] W [client-rpc-fops.c:2761:client3_3_lookup_cbk] 6-dongra-client-5: remote operation failed: No such file or directory. Path: /FILE1 (fcbe026d-b646-4391-9f03-849a381e8a84)

Attaching the logs.

Comment 1 shylesh 2014-09-11 07:27:17 UTC
Created attachment 936408 [details]
mount logs

Comment 3 shylesh 2014-09-11 07:40:37 UTC
Created attachment 936411 [details]
test program

Comment 7 shylesh 2014-09-12 09:46:52 UTC
Just an update on the bug.

This bug is not reproducible on a fresh mount, i.e. if the rebalance runs for the first time after the mount and data is appended during it, everything works fine.

If the same mount persists, a subsequent rebalance combined with a data append leads to this bug.

2.1u2 had a different issue for the same test case, which is captured in the following bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1059687
https://bugzilla.redhat.com/show_bug.cgi?id=1058569
https://bugzilla.redhat.com/show_bug.cgi?id=1054782

Comment 8 Shyamsundar 2014-09-12 20:18:40 UTC
This bug has two code-related issues, split as "Issue 1: Invalid stashed value in inode ctx 1" and "Issue 2: Incorrect Phase 2 cached/hashed determination on an open fd". I am detailing Issue 1; Du will detail the second one.

Issue 1: Invalid stashed value in inode ctx1

Test case to reproduce this:
- Create an nx2 or even an nx1 volume
- Mount it over FUSE
- Create a 2 GB file (say FINAL)
1- Rename FINAL to ABCDE
2- Ensure that ABCDE hashes to a different subvolume (for the next rebalance step to work)
3- Run a rebalance force
4- When the rebalance has started on ABCDE, start an appending write to ABCDE
5- Check the file's size on the bricks (see the sketch after this list)
- Repeat steps 1..5 without restarting the mount or remounting
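
A minimal sketch of the checks for steps 2 and 5, assuming a FUSE mount at /mnt/glusterfs and brick paths of the form /bricks/brickN/testvol as in the verification run below (both are placeholders). trusted.glusterfs.pathinfo is a virtual xattr exposed on the mount that reports the file's backend location:

# Step 2: confirm where the renamed file is served from on the backend.
getfattr -n trusted.glusterfs.pathinfo /mnt/glusterfs/ABCDE
# After the rename, the new hashed subvolume typically holds a linkto entry
# (mode ---------T) pointing at the cached subvolume.

# Step 5: watch the sizes on the bricks directly while the rebalance runs;
# the migration source carries the -rw-r-Sr-T mode bits and the destination grows.
for i in {0..9}; do ls -ltrh /bricks/brick$i/testvol; done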

The second time the test is run, the appending write can demonstrate a couple of behaviors:
   - dht_write and dht_write2 write to the same subvol, which is the old cached subvol (so the new location never receives the bytes written)
   - dht_write2 is called with a cached subvol for which the fd we send is invalid (this is not caught by the application because of write-behind)

The end result is that either the data is written only to the older location and no error is surfaced to the application, or the data is not written anywhere, or the first write appends the data to the older location but not to the newer location (i.e. the file's hashed subvolume in this case).

(Here "older" is the cached location and "newer" is the hashed location, so any appending writes that are not replayed to the newer location are lost once the rebalance is done with the file.)

Code problem:
dht_migration_complete_check_task (i.e. migration Phase 2 detection) never gets called, because the appending writes finish before the file is completely migrated (hence the large file size). As a result, inode_ctx_reset1 is never called, so we have stashed a subvol that we think future writes should be sent to whenever a FOP detects a rebalance in progress (a write here, but it could be any other FOP that checks via dht_inode_ctx_get1), and we blindly send the FOP without opening an fd on the returned subvol.

So the issue is that the stashed data in ctx 1 needs to be invalidated somehow (post rebalance?), otherwise we end up in troubled waters with data loss.

The Phase 1 migration check, dht_rebalance_in_progress_check, is not called because data is already present in inode ctx 1; this is an optimization so that each write or FOP that needs this information does not have to determine it again.

Solution proposed:
Stash this ctx 1 information on the fd instead, so that its lifetime is the lifetime of the fd. Even if the fd outlives the rebalance (i.e. it remains open after the rebalance is complete), this still works because the brick retains the open fd (unlink does not delete the file until the last fd is closed), and we will detect Phase 2 of the migration as in-progress/complete when we reuse this fd.

Other solutions are welcome; otherwise we will go ahead with this one for Issue 1 presented here.

Comment 13 Amit Chaurasia 2015-06-15 13:43:54 UTC
Verified the bug by creating a file, renaming it so that the hashed and cached copies lie on different subvolumes, running a rebalance, and then appending. The file size reflects the appended data correctly.

The rebalance completed successfully.


[root@dht-rhs-23 test]# mv FILE-1-rename-2 FILE-1-rename-3
[root@dht-rhs-23 test]# ls -ltrh
total 2.2G
-rw-r--r--. 1 root root 2.2G Jun 16 00:18 FILE-1-rename-3
[root@dht-rhs-23 test]# 
[root@dht-rhs-23 test]# gluster v rebalance testvol start force
volume rebalance: testvol: success: Rebalance on testvol has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 2d2d0473-b374-4116-82a4-2dff55a09bbd


Rebalance in progress

[root@dht-rhs-23 test]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             2             0             0          in progress              26.00
                            10.70.47.174                0        0Bytes             0             0             0            completed               0.00
volume rebalance: testvol: success: 
[root@dht-rhs-23 test]# 


Appending the file:

[root@dht-rhs-23 test]# dd oflag=append conv=notrunc if=/dev/zero of=FILE-1-rename-3 bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 105.431 s, 971 kB/s



[root@dht-rhs-23 test]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                1         2.1GB             2             0             0            completed             257.00
                            10.70.47.174                0        0Bytes             0             0             0            completed               0.00
volume rebalance: testvol: success: 
[root@dht-rhs-23 test]# 
[root@dht-rhs-23 test]# 

Rebalance completed.

===

The file now lies on a different subvolume:

[root@dht-rhs-24 ~]# for i in {0..9}; do ls -ltrh /bricks/brick$i/testvol/test; done
total 2.2G
-rw-r-Sr-T. 2 root root 2.2G Jun 16 00:18 FILE-1-rename-3
total 679M
---------T. 2 root root 2.2G Jun 16 00:20 FILE-1-rename-3

===

[root@dht-rhs-24 ~]# ls -l /mnt/glusterfs/test/
total 2245666
-rw-r-Sr-T. 1 root root 2299561984 Jun 16 00:18 FILE-1-rename-3
[root@dht-rhs-24 ~]# 

[root@dht-rhs-24 ~]# ls -l /mnt/glusterfs/test/
total 2252067
-rw-r-Sr-T. 1 root root 2306116608 Jun 16 00:20 FILE-1-rename-3
[root@dht-rhs-24 ~]# ls -l /mnt/glusterfs/test/
total 2299427
-rw-r-Sr-T. 1 root root 2354613248 Jun 16 00:21 FILE-1-rename-3
[root@dht-rhs-24 ~]# ls -l /mnt/glusterfs/test/
total 2345666
-rw-r--r--. 1 root root 2401961984 Jun 16 00:22 FILE-1-rename-3


The file size increased appropriately.
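
As a quick cross-check of the numbers above (assuming only the dd append modified the file), the appended bytes account exactly for the growth from the first to the last mount-side listing:

# 2299561984 bytes (initial ls -l) + 102400000 bytes appended by dd
# should equal the final size reported by ls -l.
echo $((2299561984 + 102400000))   # prints 2401961984, matching the final listing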


Tried this multiple times, and each time the file was appended appropriately. Marking the bug verified.

Comment 15 errata-xmlrpc 2015-07-29 04:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

