Bug 1136349
Summary: | DHT - remove-brick - data loss - when remove-brick with 'start' is in progress, perform rename operation on files. commit remove-brick, after status is 'completed' and few files are missing. | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Susant Kumar Palai <spalai> |
Component: | distribute | Assignee: | Susant Kumar Palai <spalai> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | mainline | CC: | gluster-bugs, jdarcy, joe, lmohanty, mzywusko, nbalacha, nsathyan, rgowdapp, rhs-bugs, rwheeler, spalai, srangana, vbellur |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-3.7.0 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | 969020 | Environment: | |
Last Closed: | 2015-05-15 17:08:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 969020 | ||
Bug Blocks: | | |
Comment 1
Anand Avati
2014-09-02 12:01:55 UTC
Is this issue applicable to the 3.6, 3.5 and 3.4 release branches?

For those who can't see the private version (mostly internal process stuff), here are the good bits.

Description of problem:
DHT - remove-brick - data loss: while a remove-brick 'start' operation is migrating data, rename files on the mount point, then commit the remove-brick after its status reports 'completed'; a few files end up missing. (Not related to defect 963896; the hash layout is not the problem here.)

Version-Release number of selected component (if applicable): 3.3.0.10rhs-1.el6.x86_64

How reproducible: always

Steps to Reproduce:

1. Had a cluster of 3 RHS servers and a DHT volume with four bricks, mounted via FUSE:

[root@rhsauto031 ~]# gluster v info dist
Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2
Brick4: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10

[root@localhost rtest]# mount | grep test
glusterfs#rhsauto018.lab.eng.blr.redhat.com:/dist on /mnt/rtest type fuse (rw,default_permissions,allow_other,max_read=131072)

2. Run remove-brick with the start option:

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 start
Remove Brick start successful

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                             Node   Rebalanced-files        size     scanned    failures        status
                        ---------        -----------  ----------  ----------  ----------  ------------
                        localhost                  0           0           0           0   not started
rhsauto031.lab.eng.blr.redhat.com                  0           0           0           0   not started
rhsauto018.lab.eng.blr.redhat.com                668    29360128        2054           0   in progress

3. While data migration is in progress, rename files from the mount point:

[root@localhost rtest]# for i in {1..100}; do mv d1/f$i d1/filenew$i; done
[root@localhost rtest]# for i in {1..100}; do for j in {1..100}; do mv d$j/f$i d$j/filenew$i; done; done
[root@localhost rtest]# ls d25/filenew25
d25/filenew25

4. Verify that data migration is still in progress and check the hash range for the brick being removed (the trusted.glusterfs.dht value is decoded in a sketch after the reproduction steps):

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x0130dae00573491ba4b214ac872624e7

[root@rhsauto018 rpm]# getfattr -d -m . -e hex /rhs/brick10/d25
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick10/d25
trusted.gfid=0xb5cc8353864e4a2dbdee8afabd099686
trusted.glusterfs.dht=0x00000001000000000000000000000000

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                             Node   Rebalanced-files        size     scanned    failures        status
                        ---------        -----------  ----------  ----------  ----------  ------------
                        localhost                  0           0           0           0   not started
rhsauto031.lab.eng.blr.redhat.com                  0           0           0           0   not started
rhsauto018.lab.eng.blr.redhat.com               6839  1062207488       21350           0   in progress

5. Once migration has completed, commit the remove-brick:

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 status
                             Node   Rebalanced-files        size     scanned    failures        status
                        ---------        -----------  ----------  ----------  ----------  ------------
                        localhost                  0           0           0           0   not started
rhsauto031.lab.eng.blr.redhat.com                  0           0           0           0   not started
rhsauto018.lab.eng.blr.redhat.com               7165  1138753536       22257           0     completed

[root@rhsauto038 rpm]# gluster volume remove-brick dist rhsauto018.lab.eng.blr.redhat.com:/rhs/brick10 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick commit successful

[root@rhsauto038 rpm]# gluster v info dist
Volume Name: dist
Type: Distribute
Volume ID: 0130dae0-0573-491b-a4b2-14ac872624e7
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhsauto018.lab.eng.blr.redhat.com:/rhs/brick2
Brick2: rhsauto038.lab.eng.blr.redhat.com:/rhs/brick2
Brick3: rhsauto031.lab.eng.blr.redhat.com:/rhs/brick2

6. On the mount point a few files are missing:

[root@localhost rtest]# ls d25/filenew25
ls: d25/filenew25: No such file or directory

[this file was present on the mount point; see step 3]

[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

7. Verify on all bricks (including the removed brick) to find which brick still holds this file:

[root@rhsauto018 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto031 ~]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory
[root@rhsauto038 rpm]# ls -l /rhs/brick2/d25/filenew25
ls: cannot access /rhs/brick2/d25/filenew25: No such file or directory

Removed brick:
[root@rhsauto018 rpm]# ls -l /rhs/brick10/d25/filenew25
-rw-r--r-- 2 root root 0 May 29 15:16 /rhs/brick10/d25/filenew25

Actual results: files are missing.

Expected results: files should be present if remove-brick status says completed without any failures.

Additional info: more missing files:

[root@localhost rtest]# ls d19/filenew9
ls: d19/filenew9: No such file or directory
[root@localhost rtest]# d24/filenew23
-bash: d24/filenew23: No such file or directory
[root@localhost rtest]# d27/filenew34
-bash: d27/filenew34: No such file or directory
[root@localhost rtest]# d27/filenew35
-bash: d27/filenew35: No such file or directory
[root@localhost rtest]# d27/filenew36
-bash: d27/filenew36: No such file or directory
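For reference, the trusted.glusterfs.dht value captured in step 4 can be read as four big-endian 32-bit words; the meaning of the first two words varies between releases, but the last two hold the start and end of the hash range assigned to that brick. A minimal bash sketch of decoding it, assuming that field layout and reusing the directory from step 4:

# Sketch, assuming trusted.glusterfs.dht is four big-endian 32-bit words
# with the last two being the start/end of the brick's hash range.
layout=$(getfattr -d -m . -e hex /rhs/brick10/d25 2>/dev/null |
         awk -F= '$1 == "trusted.glusterfs.dht" {print $2}')
hex=${layout#0x}
echo "hash range start: $((16#${hex:16:8}))"
echo "hash range end:   $((16#${hex:24:8}))"

For the value seen above (0x00000001000000000000000000000000) both numbers come out as 0, i.e. the brick being decommissioned no longer owns any part of the hash range, which matches the reporter's note that the layout itself is not the problem here.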
Along with comment #2, I have another question: is the single patch the proposed fix, or will there be more patches?

Lala, the fix currently will show a message upon remove-brick commit saying the admin should check the decommissioned bricks for any files that might not have been migrated as part of the rebalance. This patch is not a complete fix; I will open a new bug for the remove-brick status part. And this patch is applicable to 3.{4,5,6}.

Opened a new bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1136702.

The solution is to tell the admin to fix it while leaving the volume in a broken state? That's unacceptable. We need a process that leaves the files *in* the volume in the event the migration fails and *then* informs the admin of the failure, so the admin can find the cause and fix it in order to try again.
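A rough way to act on the "check the decommissioned bricks" advice above can be sketched under a couple of assumptions: the brick path is the one from this report, and zero-byte files with the sticky bit set are taken to be DHT link files rather than real data. With those assumptions, listing what was never migrated off the removed brick looks like this:

# Sketch: list regular files left behind on a decommissioned brick,
# skipping the internal .glusterfs directory and zero-byte sticky-bit
# link files; anything printed is data that was not migrated.
brick=/rhs/brick10
find "$brick" -path "$brick/.glusterfs" -prune -o \
     -type f ! \( -perm -1000 -size 0 \) -print

Anything it prints would still have to be copied back into the volume through the mount point by hand, which is the manual step the commit-time warning leaves to the admin.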
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user