Bug 1031971

Summary: DHT:SELF-HEAL:Remove-brick with self-heal causes data loss
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: shylesh <shmohan>
Component: glusterfs Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED DUPLICATE QA Contact: Sudhir D <sdharane>
Severity: high Docs Contact:
Priority: unspecified    
Version: 2.1 CC: vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-20 11:58:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description shylesh 2013-11-19 09:26:57 UTC
Description of problem:
After remove-brick is started on a distributed-replicate volume, triggering self-heal while migration is in progress results in data loss.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.44rhs-1.el6rhs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Create a 5x2 distributed-replicate volume and enable quota with a limit of 1TB.
2. Kill one brick of any replica pair.
3. Create data with a deep directory hierarchy on the mount point.
4. Start a remove-brick operation on any replica pair other than the one with the killed brick:
 gluster v remove-brick <vol> <brick1> <brick2> start
5. While migration is in progress, force-start the volume so that all bricks come up and self-heal starts:
 gluster volume start <vol> force
6. Check the remove-brick status until migration completes.
7. Once migration is complete, commit the remove-brick operation:
 gluster v remove-brick <vol> <brick1> <brick2> commit
8. Check the number of files on the mount point.
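
To make the check in step 8 concrete, the file listing on the mount point can be recorded after step 3 and compared with the listing after step 7. The following is a minimal sketch, not part of the original report; the helper names list_files and missing_after are hypothetical, and it assumes the mount point is readable from the client where it runs.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_after(before, after):
    """Return the sorted paths present in 'before' but absent from 'after'."""
    return sorted(before - after)
```

For example, before = list_files("/mnt1") taken after step 3 and after = list_files("/mnt1") taken after step 7 can be passed to missing_after(before, after) to list the files that disappeared.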

Actual results:
Data is lost: some of the files are missing from the mount point.

For every missing file, the rebalance log shows a self-heal entry:

[2013-11-19 04:27:49.791226] I [dht-common.c:2644:dht_setxattr] 0-rebal-dht: fixing the layout of /5/2/5/1
[2013-11-19 04:27:49.795429] I [dht-rebalance.c:1116:gf_defrag_migrate_data] 0-rebal-dht: migrate data called on /5/2/5/1
[2013-11-19 04:27:49.816696] I [afr-self-heal-common.c:2843:afr_log_self_heal_completion_status] 0-rebal-replicate-0:  on /5/2/5/1
[2013-11-19 04:27:49.866120] I [afr-self-heal-common.c:2843:afr_log_self_heal_completion_status] 0-rebal-replicate-0:  gfid or missing entry self heal  is successfully completed, on /5/2/5/1/file.1
[2013-11-19 04:27:49.867329] I [dht-rebalance.c:1333:gf_defrag_migrate_data] 0-rebal-dht: Migration operation on dir /5/2/5/1 took 0.07 secs
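
To cross-check missing files against the rebalance log, the paths named in the self-heal completion messages can be extracted. This is a rough sketch, not part of the original report: the helper name healed_paths is hypothetical, the pattern is based only on the message wording shown in the excerpt above, and it assumes each log entry sits on a single line.

```python
import re

# Pattern derived from the log excerpt in this report; other glusterfs
# versions may word the message differently (assumption).
HEAL_RE = re.compile(
    r"afr_log_self_heal_completion_status\].*"
    r"self heal\s+is successfully completed,\s+on (?P<path>\S+)"
)


def healed_paths(lines):
    """Return the paths reported in self-heal completion messages, in order."""
    paths = []
    for line in lines:
        match = HEAL_RE.search(line)
        if match:
            paths.append(match.group("path"))
    return paths
```

The returned paths can then be compared with the list of files that are no longer visible on the mount point.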


[root@rhs-client4 mnt1]# ll 5/2/5/1/file.1
ls: cannot access 5/2/5/1/file.1: No such file or directory


pair1
------
[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal0/5/2/5/1                                                                                
getfattr: Removing leading '/' from absolute path names
# file: home/rebal0/5/2/5/1
trusted.afr.rebal-client-0=0x000000000000000000000000
trusted.afr.rebal-client-1=0x000000000000000000000000
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000400000
trusted.glusterfs.quota.size=0x0000000000400000


[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal1/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal1/5/2/5/1
trusted.afr.rebal-client-0=0x000000000000000000000000
trusted.afr.rebal-client-1=0x000000000000000000000000
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000400000
trusted.glusterfs.quota.size=0x0000000000400000
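
The trusted.afr.* values on the pair1 bricks above can be read as pending-operation counters. The sketch below is not part of the original report; it assumes the usual AFR changelog encoding of three big-endian 32-bit counters (data, metadata, entry), and the function name decode_afr_changelog is hypothetical.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* value into data/metadata/entry pending counts.

    Assumes a 12-byte value holding three big-endian 32-bit counters.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte changelog value")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}
```

Under that assumption, decode_afr_changelog("0x000000000000000000000000") gives zero for all three counters, i.e. neither brick of pair1 records a pending heal for this directory.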


pair2
-------
[root@rhs-client39 ~]#  getfattr -d -m . -e hex /home/rebal2/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal2/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000100000
trusted.glusterfs.quota.size=0x0000000000100000

[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal3/5/2/5/1                                                                                
getfattr: Removing leading '/' from absolute path names
# file: home/rebal3/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000100000
trusted.glusterfs.quota.size=0x0000000000100000


pair3
------
[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal4/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal4/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

[root@rhs-client39 ~]# getfattr -d -m . -e hex /home/rebal5/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal5/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000


pair4
------
[root@rhs-client4 1]# getfattr -d -m . -e hex /home/rebal6/5/2/5/1                                                                                
getfattr: Removing leading '/' from absolute path names
# file: home/rebal6/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000

[root@rhs-client9 ~]# getfattr -d -m . -e hex /home/rebal7/5/2/5/1
getfattr: Removing leading '/' from absolute path names
# file: home/rebal7/5/2/5/1
trusted.gfid=0xd21bafa9759c417d8e147e3444aa38e3
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.f096b5c9-2558-4985-a570-fb2596026c1f.contri=0x0000000000000000
trusted.glusterfs.quota.size=0x0000000000000000
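
The trusted.glusterfs.dht values above can be decoded to see which hash range each remaining replica pair owns for /5/2/5/1. The sketch below is not part of the original report; it assumes the value is four big-endian 32-bit words whose last two words are the start and end of the hash range, and the function names are hypothetical.

```python
import struct

HASH_MAX = 0xFFFFFFFF


def decode_dht_range(hex_value):
    """Return (start, stop) of the hash range stored in a DHT layout xattr.

    Assumes four big-endian 32-bit words; only the last two are used here.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 16:
        raise ValueError("expected a 16-byte layout value")
    _word0, _word1, start, stop = struct.unpack(">IIII", raw)
    return start, stop


def find_gaps(ranges):
    """Return the (start, stop) holes in 0..HASH_MAX not covered by ranges."""
    gaps = []
    expected = 0
    for start, stop in sorted(ranges):
        if start > expected:
            gaps.append((expected, start - 1))
        expected = max(expected, stop + 1)
    if expected <= HASH_MAX:
        gaps.append((expected, HASH_MAX))
    return gaps


# Values copied from the getfattr output above, one per replica pair.
LAYOUTS = {
    "pair1": "0x0000000100000000bffffffdffffffff",
    "pair2": "0x0000000100000000000000003ffffffe",
    "pair3": "0x00000001000000003fffffff7ffffffd",
    "pair4": "0x00000001000000007ffffffebffffffc",
}

ranges = [decode_dht_range(value) for value in LAYOUTS.values()]
gaps = find_gaps(ranges)
```

Under that assumption, gaps is empty: the four ranges are contiguous and cover the whole 32-bit hash space, so the directory layout on the remaining pairs shows no hole.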






 

 
More info
----------

[root@rhs-client4 mnt1]# gluster v info rebal
 
Volume Name: rebal
Type: Distributed-Replicate
Volume ID: d29ec985-e908-4f0d-9e51-39ed79bf24f2
Status: Started
Number of Bricks: 5 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/rebal0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/rebal1
Brick3: rhs-client39.lab.eng.blr.redhat.com:/home/rebal2
Brick4: rhs-client4.lab.eng.blr.redhat.com:/home/rebal3
Brick5: rhs-client9.lab.eng.blr.redhat.com:/home/rebal4
Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/rebal5
Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/rebal6
Brick8: rhs-client9.lab.eng.blr.redhat.com:/home/rebal7
Brick9: rhs-client39.lab.eng.blr.redhat.com:/home/rebal8------>decommissioned
Brick10: rhs-client4.lab.eng.blr.redhat.com:/home/rebal9------>decommissioned
Options Reconfigured:
features.quota: on




Cluster info
-----------
rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com

Mounted on
-----------
rhs-client4.lab.eng.blr.redhat.com:/mnt1


The sosreports are attached.

Comment 3 shylesh 2013-11-20 11:58:24 UTC

*** This bug has been marked as a duplicate of bug 1032558 ***