Description of problem:
DHT - remove-brick - data loss: while remove-brick with 'start' is in progress, perform rename operations on files, then commit the remove-brick operation after status is 'completed'; a few files are missing. (Not related to bug 963896 - the hash layout is not the problem here.)

Version-Release number of selected component (if applicable):
3.3.0.10rhs-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:

1. Had a cluster of 3 RHS servers and a DHT volume having four bricks, mounted via fuse

[root@rhsauto031 ~]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2
Brick4: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10

[root@localhost rtest]# mount | grep test
glusterfs#rhsauto018.lab.eng.blr.redhat.com:/dist on /mnt/rtest type fuse (rw,default_permissions,allow_other,max_read=131072)

2. Run remove-brick with the start option

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 start
Remove Brick start successful

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com               668     29360128         2054            0  in progress

3. While data migration is in progress, rename files from the mount point

mount point:
[root@localhost rtest]# for i in {1..100}; do mv d1/f$i d1/filenew$i; done
[root@localhost rtest]# for i in {1..100}; do for j in {1..100}; do mv d$j/f$i d$j/filenew$i; done; done
[root@localhost rtest]# ls d25/filenew25
d25/filenew25

4. Verify whether data migration is in progress and check the hash range for the removed brick

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x0130dae00573491ba4b214ac872624e7

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/d25
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/d25
trusted.gfid=0xb5cc8353864e4a2dbdee8afabd099686
trusted.glusterfs.dht=0x00000001000000000000000000000000

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com              6839   1062207488        21350            0  in progress

5. Once migration is completed, perform the commit for remove-brick

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                                 Node  Rebalanced-files         size      scanned     failures       status
                            ---------       -----------  -----------  -----------  -----------  -----------
                            localhost                 0            0            0            0  not started
    rhsauto031.lab.eng.blr.redhat.com                 0            0            0            0  not started
    rhsauto018.lab.eng.blr.redhat.com              7165   1138753536        22257            0    completed

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 commit
Removing brick(s) can result in data loss. Do you want to Continue?
(y/n) y
Remove Brick commit successful

[root@rhsauto038 rpm]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2

6. On the mount point a few files are missing

[root@localhost rtest]# ls d25/filenew25
ls: d25/filenew25: No such file or directory
[this file was present on the mount point; see step 3]

[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

7. Check all bricks (including the removed brick) to find which brick holds this file

[root@rhsauto018 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto031 ~]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto038 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory

removed brick:
[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

Actual results:
files are missing

Expected results:
files should be present if remove-brick status says completed without any failures

Additional info:
more missing files:
[root@localhost rtest]# ls d19/filenew9
ls: d19/filenew9: No such file or directory
[root@localhost rtest]# d24/filenew23
-bash: d24/filenew23: No such file or directory
[root@localhost rtest]# d27/filenew34
-bash: d27/filenew34: No such file or directory
[root@localhost rtest]# d27/filenew35
-bash: d27/filenew35: No such file or directory
[root@localhost rtest]# d27/filenew36
-bash: d27/filenew36: No such file or directory
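As a side note, the trusted.glusterfs.dht value seen on the removed brick in step 4 can be decoded by hand. Below is a minimal sketch (the helper name is hypothetical), assuming the common on-disk encoding of four big-endian 32-bit fields — count, hash type, range start, range end; the all-zero start/end on the decommissioned brick denotes an empty hash range:

```shell
# Hypothetical helper: decode a trusted.glusterfs.dht value as printed by
# getfattr -e hex. Assumes the usual encoding of four big-endian uint32
# fields: count, hash type, range start, range end.
decode_dht_layout() {
    local hex=${1#0x}
    local start=$((16#${hex:16:8}))
    local end=$((16#${hex:24:8}))
    echo "range start=$start end=$end"
}

# Value seen on the removed brick in step 4: start == end == 0,
# i.e. an empty hash range, as expected for a decommissioned brick.
decode_dht_layout 0x00000001000000000000000000000000
```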
Thanks to Venkatesh for the RCA below:

It is most likely a race condition between a rename and rebalance:

1. rename (f25, filenew25) succeeded.
2. lookup (f25) done by rebalance will fail, and the rebalance process skips this failure.
3. Meanwhile filenew25 was never in the list of dentries read by rebalance through readdirp. So, filenew25 never gets migrated.

To corroborate point 2, we found the following in the logs:

[2013-05-30 17:49:45.387823] E [dht-rebalance.c:1155:gf_defrag_migrate_data] 0-dist-dht: /d25/f25 lookup failed

Here rename is moving the file to a decommissioned brick, which means rename is acting on the old layout which included brick4. A possible sequence of events which could've led to this is:

1. lookup triggered as part of rename read the layout of d25, before it was "fixed" by the rebalance process.
2. rebalance process did a "fix-layout" of d25.
3. rebalance readdirp read dentry f25. Since rename is not complete at this time, filenew25 would not be in the list of dentries read by rebalance readdirp.
4. rename (f25, filenew25) succeeded.
5. lookup (f25) by rebalance fails and skips migrating the file.

This can be classified as a stale layout issue since:
1. if rename read the new-layout on disk and
2. rebalance didn't change layout of d25 while rename is in progress

then either:
1. the rebalance process would've picked up the new entry filenew25, or
2. rename (f25, filenew25) would not have hashed filenew25 to brick4.

In both cases we wouldn't have lost the file. The same issue can happen for directories also.

A possible fix is:

A non-rebalance client during rename:
1. locks the layout of src-parent
2. checks whether the layout of src-parent has changed
3. if 2. is true, fails the rename
4. unlocks

The rebalance process during fix-layout of a directory:
1. locks the directory
2. fixes the layout
3. unlocks the directory

regards,
Raghavendra.
Hi Raghavendra,

For the fix to the problem described above, is the second part of the fix required? I think rename can take blocking locks on the parent's layout and, based on the layout it reads, decide whether the rename should proceed.

second part:
The rebalance process during fix-layout of a directory:
1. locks the directory
2. fixes the layout
3. unlocks the directory
@Susant, if rename takes a _blocking_ lock, what is it blocking? In this case it is the layout setting by rebalance, and for that to conflict with the rename, that code path should also take the lock to ensure the _blocking_ behaviour, right?
Susant,

As Shyam pointed out, the rebalance process has to take a lock, either to:
1. block rename while it is doing layout changes, or
2. block itself while rename is in progress.

A small correction to my earlier RCA:

<RCA>
This can be classified as a stale layout issue since:
1. if rename read the new-layout on disk and
2. rebalance didn't change layout of d25 while rename is in progress
</RCA>

Here it should be "or" between 1 and 2 instead of "and". Also, 2 can be made more verbose as below:

2. rebalance didn't "fix" the layout of d25 while rename is in progress, and hence picks up the new entry (filenew25) in readdirp (as readdirp is done _after_ fix-layout).

regards,
Raghavendra.
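The lock ordering proposed above can be illustrated with a toy sketch. This is illustrative only: the real DHT code would take an inodelk on the parent directory inode, whereas here flock(1) on a scratch file stands in for that lock, and all names are hypothetical:

```shell
# Illustrative only: flock(1) on a scratch file stands in for gluster's
# inodelk on the parent directory; function names are hypothetical.
LOCKFILE=$(mktemp)

# Rebalance side: hold the lock exclusively across the whole fix-layout,
# so no rename can read or act on a half-updated layout.
fix_layout() (
    flock -x 9
    echo "fix-layout done"
) 9>"$LOCKFILE"

# Client side: take the same lock, re-check the parent's layout, and fail
# the rename (e.g. with ESTALE) if the layout changed since the lookup.
do_rename() (
    flock -x 9
    echo "layout re-checked, rename done"
) 9>"$LOCKFILE"

fix_layout
do_rename
```

Because both paths contend on the same lock, a rename can no longer interleave with a fix-layout of the same directory, which closes the window described in the RCA.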
The file still exists on the removed brick; it has just not been migrated off it. The following are two approaches we can take to handle this:

1. Include this scenario as a failure to migrate and update the status accordingly. Modify the rebalance status message to ask the sysadmin to check the removed brick for any files that might not have been migrated and ensure that s/he moves them to the volume.

2. Provide a script to crawl the removed brick once the rebalance is complete and move any files found to the volume mount point. This would require additional testing and dev effort.
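A rough sketch of what the crawl script in approach 2 could look like — the paths and skip rules here are assumptions for illustration, not an actual RHS tool. It walks the removed brick, skips gluster's internal .glusterfs tree, and copies each leftover regular file to the same relative path on the mounted volume:

```shell
# Hypothetical sketch of approach 2. BRICK/MOUNT are stand-in scratch paths;
# in the report above they would be /rhs/brick10 and /mnt/rtest.
BRICK=$(mktemp -d)
MOUNT=$(mktemp -d)

# Simulate a file the rebalance left behind on the removed brick.
mkdir -p "$BRICK/d25"
echo data > "$BRICK/d25/filenew25"

# Crawl the brick, skipping gluster's internal .glusterfs directory, and
# copy each leftover regular file to the same relative path on the mount.
# (A real script would also skip DHT linkto stubs: zero-size, sticky-bit
# files carrying a trusted.glusterfs.dht.linkto xattr.)
(
    cd "$BRICK" || exit 1
    find . -name .glusterfs -prune -o -type f -print | while IFS= read -r f; do
        mkdir -p "$MOUNT/$(dirname "$f")"
        cp -p "$f" "$MOUNT/$f"
    done
)
```

Copying through the mount point (rather than directly into a brick) is what lets DHT hash each file onto the correct remaining brick.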
Please review and sign off the edited doc text.
Cancelling need_info as Nithya reviewed and signed off doc text during review meeting.
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1], hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report against the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html