Description of problem:
After starting remove-brick on a distributed-replicate volume, if self-heal is triggered while migration is in progress, the operation ends with data loss.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 5x2 distributed-replicate volume and enable quota with a limit of 1TB.
2. Kill one of the bricks from any replica pair.
3. Create some data with a deep directory hierarchy on the mount point.
4. Start the remove-brick operation on any pair (except the pair in which one of the bricks is down):
   gluster v remove-brick $vol <brick1> <brick2> start
5. While migration is in progress, force-start the volume so that all the bricks come up and heal starts:
   gluster volume start <vol> force
6. Check the remove-brick status until migration completes.
7. Once the migration is complete, commit the remove-brick operation:
   gluster v remove-brick <vol> <brick1> <brick2> commit
8. Now check the number of files on the mount point.

Actual results:
There is data loss; some of the files are missing.

For every missing file, a self-heal message appears in the rebalance logs:

[2013-11-19 04:27:49.791226] I [dht-common.c:2644:dht_setxattr] 0-rebal-dht: fixing the layout of /5/2/5/1
[2013-11-19 04:27:49.795429] I [dht-rebalance.c:1116:gf_defrag_migrate_data] 0-rebal-dht: migrate data called on /5/2/5/1
[2013-11-19 04:27:49.816696] I [afr-self-heal-common.c:2843:afr_log_self_heal_completion_status] 0-rebal-replicate-0: on /5/2/5/1
[2013-11-19 04:27:49.866120] I [afr-self-heal-common.c:2843:afr_log_self_heal_completion_status] 0-rebal-replicate-0: gfid or missing entry self heal is successfully completed, on /5/2/5/1/file.1
[2013-11-19 04:27:49.867329] I [dht-rebalance.c:1333:gf_defrag_migrate_data] 0-rebal-dht: Migration operation on dir /5/2/5/1 took 0.07 secs

[root@rhs-client4 mnt1]# ll 5/2/5/1/file.1
ls: cannot access 5/2/5/1/file.1: No such file or directory

pair1
------
[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal0/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal0/5/2/5/1
trusted.afr.rebal-client-0=0x000000000000000000000000
trusted.afr.rebal-client-1=0x000000000000000000000000
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000400000
trusted.glusterfs.quota.size=0x0000000000400000

[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal1/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal1/5/2/5/1
trusted.afr.rebal-client-0=0x000000000000000000000000
trusted.afr.rebal-client-1=0x000000000000000000000000
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000400000
trusted.glusterfs.quota.size=0x0000000000400000

pair2
-------
[root@rhs-client39 ~]# getfattr -d -m . -e hex /home/rebal2/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal2/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000100000
trusted.glusterfs.quota.size=0x0000000000100000

[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal3/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal3/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000100000
trusted.glusterfs.quota.size=0x0000000000100000

pair3
------
[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal4/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal4/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

[root@rhs-client39 ~]# getfattr -d -m . -e hex /home/rebal5/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal5/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

pair4
------
[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal6/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal6/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal7/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal7/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

More info
----------
[root@rhs-client4 mnt1]# gluster v info rebal
Volume Name: rebal
Type: Distributed-Replicate
Volume ID: d29ec985-e908-4f0d-9e51-39ed79bf24f2
Status: Started
Number of Bricks: 5 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/rebal0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/rebal1
Brick3: rhs-client39.lab.eng.blr.redhat.com:/home/rebal2
Brick4: rhs-client4.lab.eng.blr.redhat.com:/home/rebal3
Brick5: rhs-client9.lab.eng.blr.redhat.com:/home/rebal4
Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/rebal5
Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/rebal6
Brick8: rhs-client9.lab.eng.blr.redhat.com:/home/rebal7
Brick9: rhs-client39.lab.eng.blr.redhat.com:/home/rebal8 ------> decommissioned
Brick10: rhs-client4.lab.eng.blr.redhat.com:/home/rebal9 ------> decommissioned
Options Reconfigured:
features.quota: on

Cluster info
-----------
rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com

Mounted on
-----------
rhs-client4.lab.eng.blr.redhat.com:/mnt1

Attached the sosreports.
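
The trusted.glusterfs.dht values in the getfattr output can be read as per-pair hash ranges for the directory /5/2/5/1. The following is a minimal sketch, not taken from this report: it assumes the value is four big-endian 32-bit words and that the last two words are the start and end of the hash range (the first two words are treated as opaque). The function names decode_range and covers_hash_space are hypothetical helpers.

```python
import struct

# trusted.glusterfs.dht values copied from the getfattr output in this report.
# Both bricks of a replica pair carry the same value, so one entry per pair.
DHT_XATTRS = {
    "pair1": "0x0000000100000000bffffffdffffffff",
    "pair2": "0x0000000100000000000000003ffffffe",
    "pair3": "0x00000001000000003fffffff7ffffffd",
    "pair4": "0x00000001000000007ffffffebffffffc",
}


def decode_range(hex_value):
    """Return (start, stop) of the hash range.

    Assumption: the xattr is 16 bytes, four big-endian 32-bit words,
    and words 3 and 4 are the range start and end.
    """
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 16:
        raise ValueError("expected a 16-byte layout value")
    _word0, _word1, start, stop = struct.unpack(">IIII", raw)
    return start, stop


def covers_hash_space(ranges):
    """True if the ranges tile 0..0xffffffff with no gap and no overlap."""
    expected_start = 0
    for start, stop in sorted(ranges):
        if start != expected_start or stop < start:
            return False
        expected_start = stop + 1
    return expected_start == 2 ** 32


if __name__ == "__main__":
    for pair, value in sorted(DHT_XATTRS.items()):
        start, stop = decode_range(value)
        print(f"{pair}: 0x{start:08x} - 0x{stop:08x}")
    all_ranges = [decode_range(v) for v in DHT_XATTRS.values()]
    print("full coverage:", covers_hash_space(all_ranges))
```

Under that assumption, the four remaining pairs together span the whole 32-bit hash space for this directory, which is consistent with the "fixing the layout" message in the rebalance log. The layout on the decommissioned pair is not shown in the report, so the sketch says nothing about it.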
*** This bug has been marked as a duplicate of bug 1032558 ***