Bug 963896

Summary: DHT - remove-brick - data loss: 'remove-brick start' sets the hash layout to 0000000000000000 on a brick other than the one being removed, migrates no files, and data written after start still goes to the brick being removed, so commit ends in data loss
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rachana Patel <racpatel>
Component: glusterfs
Assignee: shishir gowda <sgowda>
Status: CLOSED ERRATA
QA Contact: amainkar
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.1
CC: aavati, amarts, nsathyan, rcyriac, rhs-bugs, vbellur
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0.10rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 966845
Environment:
Last Closed: 2013-09-23 22:29:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 923555, 961632, 966845, 996474

Description Rachana Patel 2013-05-16 18:16:03 UTC
Description of problem:
DHT - remove-brick - data loss during remove-brick
In DHT, 'remove-brick start' sets the hash layout to 0000000000000000 on a brick other than the one named in the command
+
no files are migrated from the brick that is being removed
+
data written after the start operation still goes to the brick being removed, so commit ends in data loss

Version-Release number of selected component (if applicable):
3.4.0.8rhs-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a DHT volume, start it and mount it

[root@mia ~]# gluster volume create r1 fred.lab.eng.blr.redhat.com:/rhs/brick1/r1  cutlass.lab.eng.blr.redhat.com:/rhs/brick1/r1 fan.lab.eng.blr.redhat.com:/rhs/brick1/r1 mia.lab.eng.blr.redhat.com:/rhs/brick1/r1
volume create: r1: success: please start the volume to access data
[root@mia ~]# gluster volume start r1
volume start: r1: success
[root@mia ~]# gluster volume status r1
Status of volume: r1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick fred.lab.eng.blr.redhat.com:/rhs/brick1/r1	49155	Y	6944
Brick cutlass.lab.eng.blr.redhat.com:/rhs/brick1/r1	49155	Y	6920
Brick fan.lab.eng.blr.redhat.com:/rhs/brick1/r1		49155	Y	4183
Brick mia.lab.eng.blr.redhat.com:/rhs/brick1/r1		49153	Y	3508
NFS Server on localhost					2049	Y	3518
NFS Server on a37ff566-da82-4ae4-90c6-17763466fd36	2049	Y	4193
NFS Server on c5154da1-be15-40e2-b5f3-9be6dadafd43	2049	Y	6930
NFS Server on ad0337ac-1756-4e04-aa6f-d9c46a24130d	2049	Y	6954
 
There are no active volume tasks

mount

[root@rhsauto037 mnt]# mount -t glusterfs   fan.lab.eng.blr.redhat.com:/r1 /mnt/rtest

2. Create some files and directories inside it
[root@rhsauto037 mnt]# cd /mnt/rtest
[root@rhsauto037 rtest]#  for i in {1..20}; do mkdir d$i;  touch f"$i" ; touch d1/f"$i"; done
[root@rhsauto037 rtest]# ls
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f1   f11  f13  f15  f17  f19  f20  f4  f6  f8
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f10  f12  f14  f16  f18  f2   f3   f5  f7  f9

3. Verify the hash layout and file distribution on the backend

on cutlass:-
[root@cutlass ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@cutlass ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f13  f6  f9
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f16  f7
[root@cutlass ~]# ls -l /rhs/brick1/r1 | grep T


on mia:-
[root@mia ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@mia ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f1   f18  f2
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f10  f19  f20
[root@mia ~]# ls -l /rhs/brick1/r1 | grep T


on fred:-
[root@fred ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fred ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f12  f4
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f17  f8
[root@fred ~]# ls -l /rhs/brick1/r1 | grep T


on fan :-
[root@fan ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fan ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f11  f15  f5
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f14  f3
[root@fan ~]# ls -l /rhs/brick1/r1 | grep T
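
The trusted.glusterfs.dht values above are easier to compare once they are split into their hash range. The following is a minimal Python sketch, not GlusterFS code. It assumes the 16-byte value holds four big-endian 32-bit words in the order count, type, range start, range stop; that order is inferred from the values in this report, and the function name decode_dht_xattr is hypothetical.

```python
import struct


def decode_dht_xattr(hex_value):
    """Split a trusted.glusterfs.dht hex string into four 32-bit words."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    # Assumed word order: count, type, range start, range stop (big-endian).
    count, layout_type, start, stop = struct.unpack(">4I", raw)
    return {"count": count, "type": layout_type, "start": start, "stop": stop}


# Values copied from the getfattr output in step 3.
step3 = {
    "fred": "0x0000000100000000000000003ffffffe",
    "cutlass": "0x00000001000000003fffffff7ffffffd",
    "fan": "0x00000001000000007ffffffebffffffc",
    "mia": "0x0000000100000000bffffffdffffffff",
}

for brick, value in step3.items():
    layout = decode_dht_xattr(value)
    print("%-8s %08x - %08x" % (brick, layout["start"], layout["stop"]))
```

Read this way, the four bricks hold four consecutive ranges that together cover 00000000 to ffffffff before the remove-brick operation.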


4. Now remove one brick using the start option
[root@mia ~]# gluster volume remove-brick r1 mia.lab.eng.blr.redhat.com:/rhs/brick1/r1 start
volume remove-brick start: success
ID: db943e14-85e4-44f1-ae10-4182f14c3995
[root@mia ~]# gluster volume remove-brick r1 mia.lab.eng.blr.redhat.com:/rhs/brick1/r1 status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes            20             0      completed             0.00
                             10.70.34.80                0        0Bytes             0             0    not started             0.00
                            10.70.34.116                0        0Bytes             0             0    not started             0.00
              fan.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

The status says completed, but no files were migrated.



5. Verify on the backend. 'remove-brick start' changes the hash layout of a brick other than mia: here it changes the layout on fan to
trusted.glusterfs.dht=0x00000001000000000000000000000000
+
no files are migrated from mia


[root@cutlass ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@cutlass ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8   f13  f16  f7
d10  d13  d16  d19  d3   d6  d9   f14  f3   f9
d11  d14  d17  d2   d4   d7  f11  f15  f6
[root@cutlass ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 09:56 f11
---------T 2 root root  0 May 16 09:56 f14
---------T 2 root root  0 May 16 09:56 f15
---------T 2 root root  0 May 16 09:56 f3


[root@mia ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@mia ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f1   f18  f2   f5
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f10  f19  f20
[root@mia ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 02:46 f5



[root@fred ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fred ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f12  f17  f8
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f13  f4
[root@fred ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 07:32 f13



[root@fan ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fan ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f11  f15  f5
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f14  f3
[root@fan ~]# ls -l /rhs/brick1/r1 | grep T
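
The step 5 observation can be turned into a small consistency check. The sketch below is illustrative Python, not part of GlusterFS; the function names are hypothetical. It assumes the last two 32-bit words of the xattr are the range start and stop, and that a 0 - 0 range means the brick owns no hash range. The brick being removed is expected to be the only one with an empty range.

```python
def parse_range(hex_value):
    """Return (start, stop) from the last two 32-bit words of the xattr."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    return int(digits[16:24], 16), int(digits[24:32], 16)


def check_remove_brick_layout(xattrs, removed):
    """List layout problems seen after 'remove-brick start'."""
    problems = []
    for brick, value in xattrs.items():
        start, stop = parse_range(value)
        # Assumption: a 0 - 0 range means "no hash range assigned".
        empty = (start == 0 and stop == 0)
        if brick == removed and not empty:
            problems.append("%s is being removed but still owns a hash range" % brick)
        if brick != removed and empty:
            problems.append("%s stays in the volume but has an empty hash range" % brick)
    return problems


# Values copied from the getfattr output in step 5.
step5 = {
    "cutlass": "0x000000010000000055555555aaaaaaa9",
    "mia": "0x0000000100000000aaaaaaaaffffffff",
    "fred": "0x00000001000000000000000055555554",
    "fan": "0x00000001000000000000000000000000",
}

problems = check_remove_brick_layout(step5, removed="mia")
for line in problems:
    print(line)
```

With the step 5 values, the check flags both mia (still owns aaaaaaaa - ffffffff) and fan (empty range).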


6. Now create new files from the mount point and verify on the backend that they still go to mia

mount:-
[root@rhsauto037 rtest]#  for i in {1..20}; do   touch new"$i" ; touch d1/new"$i"; done
[root@rhsauto037 rtest]# ls
d1   d12  d15  d18  d20  d5  d8  f10  f13  f16  f19  f3  f6  f9     new11  new14  new17  new2   new4  new7
d10  d13  d16  d19  d3   d6  d9  f11  f14  f17  f2   f4  f7  new1   new12  new15  new18  new20  new5  new8
d11  d14  d17  d2   d4   d7  f1  f12  f15  f18  f20  f5  f8  new10  new13  new16  new19  new3   new6  new9

server:-
[root@cutlass ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8   f13  f16  f7     new15  new7
d10  d13  d16  d19  d3   d6  d9   f14  f3   f9     new2   new9
d11  d14  d17  d2   d4   d7  f11  f15  f6   new14  new5
[root@cutlass ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root   0 May 16 09:56 f11
---------T 2 root root   0 May 16 09:56 f14
---------T 2 root root   0 May 16 09:56 f15
---------T 2 root root   0 May 16 09:56 f3


[root@mia ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8  f10  f2   new10  new19  new4
d10  d13  d16  d19  d3   d6  d9  f18  f20  new12  new20  new6
d11  d14  d17  d2   d4   d7  f1  f19  f5   new13  new3   new8
[root@mia ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root   0 May 16 02:46 f5


[root@fred ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8   f13  f8     new16
d10  d13  d16  d19  d3   d6  d9   f17  new1   new17
d11  d14  d17  d2   d4   d7  f12  f4   new11  new18
[root@fred ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 07:32 f13


[root@fan ~]# ls /rhs/brick1/r1
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f11  f15  f5
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f14  f3
[root@fan ~]# ls -l /rhs/brick1/r1 | grep T


7. Now commit the remove-brick and check on the mount point that files are missing.
Verify on the backend and in gluster volume status that mia has been removed and that its data is lost.

server:-
[root@mia ~]# gluster volume remove-brick r1 mia.lab.eng.blr.redhat.com:/rhs/brick1/r1 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success


before commit on mount:-
[root@rhsauto037 rtest]# ls
d1   d12  d15  d18  d20  d5  d8  f10  f13  f16  f19  f3  f6  f9     new11  new14  new17  new2   new4  new7
d10  d13  d16  d19  d3   d6  d9  f11  f14  f17  f2   f4  f7  new1   new12  new15  new18  new20  new5  new8
d11  d14  d17  d2   d4   d7  f1  f12  f15  f18  f20  f5  f8  new10  new13  new16  new19  new3   new6  new9

after commit on mount:-
[root@rhsauto037 rtest]# ls
d1   d11  d13  d15  d17  d19  d20  d4  d6  d8  f11  f13  f15  f17  f4  f6  f8  new1   new14  new16  new18  new5  new9
d10  d12  d14  d16  d18  d2   d3   d5  d7  d9  f12  f14  f16  f3   f5  f7  f9  new11  new15  new17  new2   new7

server:-


[root@cutlass ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@cutlass ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root   0 May 16 09:56 f11
---------T 2 root root   0 May 16 09:58 f12
---------T 2 root root   0 May 16 09:56 f14
---------T 2 root root   0 May 16 09:56 f15
---------T 2 root root   0 May 16 09:58 f17
---------T 2 root root   0 May 16 09:56 f3
---------T 2 root root   0 May 16 09:58 f4
---------T 2 root root   0 May 16 09:58 f8
---------T 2 root root   0 May 16 09:58 new1
---------T 2 root root   0 May 16 09:58 new11
---------T 2 root root   0 May 16 09:58 new16
---------T 2 root root   0 May 16 09:58 new17
---------T 2 root root   0 May 16 09:58 new18
[root@cutlass ~]# ls /rhs/brick1/r1
d1   d13  d17  d20  d6  f11  f15  f4  f9     new15  new2
d10  d14  d18  d3   d7  f12  f16  f6  new1   new16  new5
d11  d15  d19  d4   d8  f13  f17  f7  new11  new17  new7
d12  d16  d2   d5   d9  f14  f3   f8  new14  new18  new9



[root@mia ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@mia ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root   0 May 16 02:46 f5
[root@mia ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8  f10  f2   new10  new19  new4
d10  d13  d16  d19  d3   d6  d9  f18  f20  new12  new20  new6
d11  d14  d17  d2   d4   d7  f1  f19  f5   new13  new3   new8

[root@fred ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fred ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 07:32 f13
---------T 2 root root  0 May 16 07:34 f5
[root@fred ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8   f13  f5    new11  new18
d10  d13  d16  d19  d3   d6  d9   f17  f8    new16
d11  d14  d17  d2   d4   d7  f12  f4   new1  new17



[root@fan ~]# getfattr -d -m . -e hex /rhs/brick1/r1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/r1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.glusterfs.volume-id=0x7be735a3a5e9437086505841351bc419

[root@fan ~]# ls -l /rhs/brick1/r1 | grep T
---------T 2 root root  0 May 16 02:48 f16
---------T 2 root root  0 May 16 02:48 f6
---------T 2 root root  0 May 16 02:48 f7
---------T 2 root root  0 May 16 02:48 f9
---------T 2 root root  0 May 16 02:48 new14
---------T 2 root root  0 May 16 02:48 new15
---------T 2 root root  0 May 16 02:48 new2
---------T 2 root root  0 May 16 02:48 new5
---------T 2 root root  0 May 16 02:48 new7
---------T 2 root root  0 May 16 02:48 new9
[root@fan ~]# ls /rhs/brick1/r1
d1   d12  d15  d18  d20  d5  d8   f14  f3  f7     new15  new7
d10  d13  d16  d19  d3   d6  d9   f15  f5  f9     new2   new9
d11  d14  d17  d2   d4   d7  f11  f16  f6  new14  new5

[root@fan ~]# gluster volume status r1
Status of volume: r1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick fred.lab.eng.blr.redhat.com:/rhs/brick1/r1	49155	Y	6944
Brick cutlass.lab.eng.blr.redhat.com:/rhs/brick1/r1	49155	Y	6920
Brick fan.lab.eng.blr.redhat.com:/rhs/brick1/r1		49155	Y	4183
NFS Server on localhost					2049	Y	4330
NFS Server on 8675332f-f033-4800-aa2c-9291fc868fbf	2049	Y	3665
NFS Server on ad0337ac-1756-4e04-aa6f-d9c46a24130d	2049	Y	7094
NFS Server on c5154da1-be15-40e2-b5f3-9be6dadafd43	2049	Y	7062
 
There are no active volume tasks



Actual results:
Data loss during remove-brick.
In DHT, 'remove-brick start' sets the hash layout to 0000000000000000 on a brick other than the one named in the command
+
no files are migrated from the brick that is being removed
+
data written after the start operation still goes to the brick being removed, so commit ends in data loss

Expected results:


Additional info:

Comment 4 shishir gowda 2013-05-20 06:29:08 UTC
[2013-05-19 11:12:40.218890] C [dht-selfheal.c:559:dht_get_layout_count] 0-shishir: brick2: sng1-client-2  <===subvolume being decommissioned

[2013-05-19 11:12:40.219014] C [dht-selfheal.c:781:dht_selfheal_layout_new_directory] 0-shishir: gave fix: 0 - 1431655764 on sng1-client-0 for /
[2013-05-19 11:12:40.219051] C [dht-selfheal.c:781:dht_selfheal_layout_new_directory] 0-shishir: gave fix: 1431655765 - 2863311529 on sng1-client-1 for /
[2013-05-19 11:12:40.219075] C [dht-selfheal.c:781:dht_selfheal_layout_new_directory] 0-shishir: gave fix: 2863311530 - 4294967294 on sng1-client-3 for /

<=== no layout given for subvolume sng1-client-2 (This is the correct op)

[2013-05-19 11:12:40.219099] C [dht-selfheal.c:736:dht_fix_layout_of_directory] 0-shishir: after overlapt: 0 - 1431655764 on sng1-client-0 for /
[2013-05-19 11:12:40.219122] C [dht-selfheal.c:736:dht_fix_layout_of_directory] 0-shishir: after overlapt: 0 - 0 on sng1-client-1 for /
<==== layout zeroed out for sng1-client-1 (incorrect)

[2013-05-19 11:12:40.219145] C [dht-selfheal.c:736:dht_fix_layout_of_directory] 0-shishir: after overlapt: 1431655765 - 2863311529 on sng1-client-2 for /
<=== overlap op gives layout for subvolume sng1-client-2 (incorrect)

[2013-05-19 11:12:40.219168] C [dht-selfheal.c:736:dht_fix_layout_of_directory] 0-shishir: after overlapt: 2863311530 - 4294967295 on sng1-client-3 for /

[2013-05-19 11:12:40.219201] C [dht-selfheal.c:170:dht_selfheal_dir_xattr_persubvol] 0-shishir: setting hash range 0 - 1431655764 (type 0) on subvolume sng1-client-0 for /
[2013-05-19 11:12:40.219544] C [dht-selfheal.c:170:dht_selfheal_dir_xattr_persubvol] 0-shishir: setting hash range 0 - 0 (type 0) on subvolume sng1-client-1 for /
[2013-05-19 11:12:40.219677] C [dht-selfheal.c:170:dht_selfheal_dir_xattr_persubvol] 0-shishir: setting hash range 1431655765 - 2863311529 (type 0) on subvolume sng1-client-2 for /
[2013-05-19 11:12:40.219996] C [dht-selfheal.c:170:dht_selfheal_dir_xattr_persubvol] 0-shishir: setting hash range 2863311530 - 4294967295 (type 0) on subvolume sng1-client-3 for /

<=== layouts written to the disk.
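
The decimal ranges in these log lines can be lined up with the hex xattr values that getfattr prints. The Python sketch below packs a range into that 16-byte form. It assumes four big-endian 32-bit words in the order count, type, start, stop, with count fixed at 1 as in every value shown in this report; encode_dht_xattr is a hypothetical helper, not a GlusterFS function.

```python
import struct


def encode_dht_xattr(start, stop, layout_type=0, count=1):
    """Pack a hash range into the hex form shown by 'getfattr -e hex'."""
    # Assumed word order: count, type, start, stop, each 32-bit big-endian.
    return "0x" + struct.pack(">4I", count, layout_type, start, stop).hex()


# Ranges from the "setting hash range" log lines above.
logged = {
    "sng1-client-0": (0, 1431655764),
    "sng1-client-1": (0, 0),
    "sng1-client-2": (1431655765, 2863311529),
    "sng1-client-3": (2863311530, 4294967295),
}

for subvol, (start, stop) in logged.items():
    print(subvol, encode_dht_xattr(start, stop))
```

Under that assumption, the four logged ranges encode to the same four xattr values the reporter found on the bricks after 'remove-brick start': one brick at 0 - 0 and the other three splitting the hash space in thirds.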

dht_selfheal_layout_maximize_overlap, called in dht_fix_layout_of_directory, overwrites the layouts as an optimization without considering decommissioned nodes. As a result, the wrong subvolume ends up with the zeroed-out range, and the decommissioned subvolume keeps a real one.

Suspect this is a regression caused by:

commit 4f87fd0ae2ce629576ca5f647a99888d31a46815
Author: Anand Avati <avati>
Date:   Thu Aug 30 13:15:39 2012 -0700

    dht: improve dht_fix_layout_of_directory for better re-assignment
.....
    Change-Id: I0cbbf3bfa334645728072d66aaaa80120d0b295f
    BUG: 853258
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/3883
    Tested-by: Gluster Build System <jenkins.com>
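
The effect described in this comment can be reproduced with a toy model. The Python sketch below is not the GlusterFS implementation. It assumes the optimization is a single pass of pairwise swaps that keeps a swap whenever it increases the overlap with the old layout. The old ranges are the reporter's step 3 values. The pre-optimization assignment (thirds given to fred, cutlass and fan, nothing to mia) is an assumption made by analogy with the "gave fix" log lines. The skip parameter is one plausible guard for illustration, not necessarily the fix that shipped.

```python
def overlap(a, b):
    """Hash values shared by two inclusive ranges (None means no range)."""
    if a is None or b is None:
        return 0
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def maximize_overlap(order, old, new, skip=()):
    """One pass of pairwise swaps that increase overlap with the old layout."""
    new = dict(new)
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if a in skip or b in skip:
                continue  # hypothetical guard: leave these subvolumes alone
            current = overlap(old[a], new[a]) + overlap(old[b], new[b])
            swapped = overlap(old[a], new[b]) + overlap(old[b], new[a])
            if swapped > current:
                new[a], new[b] = new[b], new[a]
    return new


order = ["fred", "cutlass", "fan", "mia"]
old = {  # step 3 layout
    "fred": (0x00000000, 0x3ffffffe),
    "cutlass": (0x3fffffff, 0x7ffffffd),
    "fan": (0x7ffffffe, 0xbffffffc),
    "mia": (0xbffffffd, 0xffffffff),
}
assigned = {  # assumed assignment before the overlap pass; mia is being removed
    "fred": (0x00000000, 0x55555554),
    "cutlass": (0x55555555, 0xaaaaaaa9),
    "fan": (0xaaaaaaaa, 0xffffffff),
    "mia": None,
}

buggy = maximize_overlap(order, old, assigned)
guarded = maximize_overlap(order, old, assigned, skip={"mia"})
print("without guard:", buggy)
print("with guard:   ", guarded)
```

In this model the unguarded pass hands fan's new range to mia, because mia's old range overlaps it more than fan's old range does. That leaves fan with no range, which matches the step 5 xattrs. With mia excluded from swapping, the assigned layout is left unchanged.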

Comment 7 Rachana Patel 2013-06-18 05:57:35 UTC
Verified on 3.4.0.9rhs-1.el6.x86_64.
Working as expected, hence moving it to verified.

Comment 9 Scott Haines 2013-09-23 22:29:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html