Description of problem:
DHT - remove-brick - data loss: while remove-brick with 'start' is in progress, perform rename operations on files, then commit the remove-brick operation after status is 'completed'; a few files are missing. (Not related to bug 963896 - the hash layout is not the problem here.)

Version-Release number of selected component (if applicable):
3.3.0.10rhs-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:

1. Had a cluster of 3 RHS servers and a DHT volume having four bricks, mounted via fuse

[root@rhsauto031 ~]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2
Brick4: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10

[root@localhost rtest]# mount | grep test
glusterfs#rhsauto018.lab.eng.blr.redhat.com:/dist on /mnt/rtest type fuse (rw,default_permissions,allow_other,max_read=131072)

2. Run remove-brick with the start option

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 start
Remove Brick start successful

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com               668     29360128         2054            0  in progress

3. While data migration is in progress, rename files from the mount point

mount point:
[root@localhost rtest]# for i in {1..100}; do mv d1/f$i d1/filenew$i; done
[root@localhost rtest]# for i in {1..100}; do for j in {1..100}; do mv d$j/f$i d$j/filenew$i; done; done
[root@localhost rtest]# ls d25/filenew25
d25/filenew25

4. Verify whether data migration is in progress and check the hash range for the removed brick

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x0130dae00573491ba4b214ac872624e7

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/d25
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/d25
trusted.gfid=0xb5cc8353864e4a2dbdee8afabd099686
trusted.glusterfs.dht=0x00000001000000000000000000000000

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com              6839   1062207488        21350            0  in progress

5. Once migration is completed, perform the commit for remove-brick

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com              7165   1138753536        22257            0    completed

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 commit
Removing brick(s) can result in data loss. Do you want to Continue?
(y/n) y
Remove Brick commit successful

[root@rhsauto038 rpm]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2

6. On the mount point a few files are missing

[root@localhost rtest]# ls d25/filenew25
ls: d25/filenew25: No such file or directory
[this file was present on the mount point; see step 3]

[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

7. Check all bricks (including the removed brick) to find which brick holds this file

[root@rhsauto018 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto031 ~]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto038 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory

removed brick:
[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

Actual results:
files are missing

Expected results:
files should be present if remove-brick status says completed without any failures

Additional info:
more missing files:
[root@localhost rtest]# ls d19/filenew9
ls: d19/filenew9: No such file or directory
[root@localhost rtest]# d24/filenew23
-bash: d24/filenew23: No such file or directory
[root@localhost rtest]# d27/filenew34
-bash: d27/filenew34: No such file or directory
[root@localhost rtest]# d27/filenew35
-bash: d27/filenew35: No such file or directory
[root@localhost rtest]# d27/filenew36
-bash: d27/filenew36: No such file or directory
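As a side note, the trusted.glusterfs.dht value seen on the removed brick in step 4 can be decoded by hand. Below is a minimal sketch (the helper name is hypothetical), assuming the common on-disk encoding of four big-endian 32-bit fields — count, hash type, range start, range end; the all-zero start/end on the decommissioned brick denotes an empty hash range:

```shell
# Hypothetical helper: decode a trusted.glusterfs.dht value as printed by
# getfattr -e hex. Assumes the usual encoding of four big-endian uint32
# fields: count, hash type, range start, range end.
decode_dht_layout() {
    local hex=${1#0x}
    local start=$((16#${hex:16:8}))
    local end=$((16#${hex:24:8}))
    echo "range start=$start end=$end"
}

# Value seen on the removed brick in step 4: start == end == 0,
# i.e. an empty hash range, as expected for a decommissioned brick.
decode_dht_layout 0x00000001000000000000000000000000
```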
Thanks to Venkatesh for the RCA below:

It is most likely a race condition between a rename and rebalance:

1. rename (f25, filenew25) succeeded.
2. lookup (f25) done by rebalance will fail, and the rebalance process skips this failure.
3. Meanwhile filenew25 was never in the list of dentries read by rebalance through readdirp. So, filenew25 never gets migrated.

To corroborate point 2, we found the following in the logs:

[2013-05-30 17:49:45.387823] E [dht-rebalance.c:1155:gf_defrag_migrate_data] 0-dist-dht: /d25/f25 lookup failed

Here rename is moving the file to a decommissioned brick, which means rename is acting on the old layout which included brick4. A possible sequence of events which could've led to this is:

1. lookup triggered as part of rename read the layout of d25, before it was "fixed" by the rebalance process.
2. rebalance process did a "fix-layout" of d25.
3. rebalance readdirp read dentry f25. Since rename is not complete at this time, filenew25 would not be in the list of dentries read by rebalance readdirp.
4. rename (f25, filenew25) succeeded.
5. lookup (f25) by rebalance fails and skips migrating the file.

This can be classified as a stale layout issue since:
1. if rename read the new-layout on disk and
2. rebalance didn't change layout of d25 while rename is in progress

then either:
1. the rebalance process would've picked up the new entry filenew25, or
2. rename (f25, filenew25) would not have hashed filenew25 to brick4.

In both cases we wouldn't have lost the file. The same issue can happen for directories also.

A possible fix is:

A non-rebalance client during rename:
1. locks the layout of src-parent
2. checks whether the layout of src-parent has changed
3. if 2. is true, fails the rename
4. unlocks

The rebalance process during fix-layout of a directory:
1. locks the directory
2. fixes the layout
3. unlocks the directory

regards,
Raghavendra.
Hi Raghavendra,

For the fix to the problem described above, is the second part of the fix required? I think rename can take blocking locks on the parent's layout and, based on the layout it reads, decide whether the rename should proceed.

second part:
The rebalance process during fix-layout of a directory:
1. locks the directory
2. fixes the layout
3. unlocks the directory
@Susant, if rename takes a _blocking_ lock, what is it blocking? In this case it is the layout setting by rebalance, and for that to conflict with the rename, that code path should also take the lock to ensure the _blocking_ behaviour, right?
Susant,

As Shyam pointed out, the rebalance process has to take a lock, either to:
1. block rename while it is doing layout changes, or
2. block itself while rename is in progress.

A small correction to my earlier RCA:

<RCA>
This can be classified as a stale layout issue since:
1. if rename read the new-layout on disk and
2. rebalance didn't change layout of d25 while rename is in progress
</RCA>

Here it should be "or" between 1 and 2 instead of "and". Also, 2 can be made more verbose as below:

2. rebalance didn't "fix" the layout of d25 while rename is in progress, and hence picks up the new entry (filenew25) in readdirp (as readdirp is done _after_ fix-layout).

regards,
Raghavendra.
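The lock ordering proposed above can be illustrated with a toy sketch. This is illustrative only: the real DHT code would take an inodelk on the parent directory inode, whereas here flock(1) on a scratch file stands in for that lock, and all names are hypothetical:

```shell
# Illustrative only: flock(1) on a scratch file stands in for gluster's
# inodelk on the parent directory; function names are hypothetical.
LOCKFILE=$(mktemp)

# Rebalance side: hold the lock exclusively across the whole fix-layout,
# so no rename can read or act on a half-updated layout.
fix_layout() (
    flock -x 9
    echo "fix-layout done"
) 9>"$LOCKFILE"

# Client side: take the same lock, re-check the parent's layout, and fail
# the rename (e.g. with ESTALE) if the layout changed since the lookup.
do_rename() (
    flock -x 9
    echo "layout re-checked, rename done"
) 9>"$LOCKFILE"

fix_layout
do_rename
```

Because both paths contend on the same lock, a rename can no longer interleave with a fix-layout of the same directory, which closes the window described in the RCA.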
The file still exists on the removed brick; it has just not been migrated off it. The following are two approaches we can take to handle this:

1. Include this scenario as a failure to migrate and update the status accordingly. Modify the rebalance status message to ask the sysadmin to check the removed brick for any files that might not have been migrated and ensure that s/he moves them to the volume.

2. Provide a script to crawl the removed brick once the rebalance is complete and move any files found to the volume mount point. This would require additional testing and dev effort.
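A rough sketch of what the crawl script in approach 2 could look like — the paths and skip rules here are assumptions for illustration, not an actual RHS tool. It walks the removed brick, skips gluster's internal .glusterfs tree, and copies each leftover regular file to the same relative path on the mounted volume:

```shell
# Hypothetical sketch of approach 2. BRICK/MOUNT are stand-in scratch paths;
# in the report above they would be /rhs/brick10 and /mnt/rtest.
BRICK=$(mktemp -d)
MOUNT=$(mktemp -d)

# Simulate a file the rebalance left behind on the removed brick.
mkdir -p "$BRICK/d25"
echo data > "$BRICK/d25/filenew25"

# Crawl the brick, skipping gluster's internal .glusterfs directory, and
# copy each leftover regular file to the same relative path on the mount.
# (A real script would also skip DHT linkto stubs: zero-size, sticky-bit
# files carrying a trusted.glusterfs.dht.linkto xattr.)
(
    cd "$BRICK" || exit 1
    find . -name .glusterfs -prune -o -type f -print | while IFS= read -r f; do
        mkdir -p "$MOUNT/$(dirname "$f")"
        cp -p "$f" "$MOUNT/$f"
    done
)
```

Copying through the mount point (rather than directly into a brick) is what lets DHT hash each file onto the correct remaining brick.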
Please review and sign off the edited doc text.
Cancelling need_info as Nithya reviewed and signed off doc text during review meeting.
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1], hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report against the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html