Bug 1032558

Summary: Remove-brick with self-heal causes data loss
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: glusterfs
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: high
Priority: high
Version: 2.1
CC: pkarampu, psriniva, spalai, vagarwal, vbellur
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0.50rhs
Doc Type: Bug Fix
Doc Text:
Previously, when one of the bricks in a replica pair was offline, a few files were not migrated from the decommissioned bricks, resulting in missing files. With this fix, data is completely migrated even when one of the bricks in the replica pair is offline.
Clones: 1032927 (view as bug list)
Bug Blocks: 1032927
Last Closed: 2014-02-25 08:05:10 UTC
Type: Bug

Description shylesh 2013-11-20 11:55:33 UTC
Description of problem:

While a remove-brick operation on a distributed-replicate volume is in progress, if one of the nodes goes down and comes back up, self-heal is triggered. When remove-brick and self-heal run together, data loss occurs.

Version-Release number of selected component (if applicable):

3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume using 3 nodes in the cluster
2. Fill up the volume with files and deep directories up to depth 5
3. Start a remove-brick of one of the replica pairs using
 gluster volume remove-brick <vol>  <b1> <b2> start
4. While the remove-brick is in progress, reboot one of the nodes so that self-heal is triggered after it comes back
5. After the remove-brick completes, check the arequal checksum on the mount (a scripted sketch of these steps follows below)
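
The following is a minimal shell sketch of the steps above, not the original test setup: the host names (node1..node3), brick paths, and volume name are placeholders.

VOL=testvol
MNT=/mnt/$VOL

# 1. Create and start a 6x2 distributed-replicate volume across 3 nodes
#    (consecutive bricks form a replica pair, so hosts must alternate).
gluster volume create $VOL replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 \
    node3:/bricks/b1 node1:/bricks/b2 \
    node2:/bricks/b2 node3:/bricks/b2 \
    node1:/bricks/b3 node2:/bricks/b3 \
    node3:/bricks/b3 node1:/bricks/b4 \
    node2:/bricks/b4 node3:/bricks/b4
gluster volume start $VOL

# 2. Mount the volume and create directories nested 5 deep, with files.
mkdir -p $MNT
mount -t glusterfs node1:/$VOL $MNT
for d in 0 1 2 3 4 5; do
    mkdir -p $MNT/$d/$d/$d/$d/$d
    for f in 0 1 2 3 4 5; do
        echo data > $MNT/$d/$d/$d/$d/$d/file.$f
    done
done

# 3. Record a baseline checksum, then start draining one replica pair.
/opt/qa/tools/arequal-checksum $MNT
gluster volume remove-brick $VOL node2:/bricks/b4 node3:/bricks/b4 start

# 4. While migration runs, reboot one node; self-heal kicks in on return.
ssh node2 reboot

# 5. Once remove-brick reports completion, re-run the checksum and compare.
gluster volume remove-brick $VOL node2:/bricks/b4 node3:/bricks/b4 status
/opt/qa/tools/arequal-checksum $MNT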

Actual results:

A few files are lost from the mount point (the arequal runs below show 9327 regular files after the remove-brick vs. 9330 before, i.e. 3 files missing).

arequal before
--------------
 [root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum .


Entry counts
Regular files   : 9330
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18661

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 00
Directories     : 10000002e01
Symbolic links  : 0
Other           : 0
Total           : 10000002e01


arequal after
-------------
[root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum  .

Entry counts
Regular files   : 9327
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18658

Metadata checksums
Regular files   : 4bc885
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : d26028d3461d4fd7daf74b8d71b5ed7c
Directories     : 362e656c4767
Symbolic links  : 0
Other           : 0
Total           : 897557052c4e5cc


[root@rhs-client4 ~]# gluster v info lo
 
Volume Name: lo
Type: Distributed-Replicate
Volume ID: 4474edd0-512a-42fd-92ac-536d1a258b42
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/lo0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/lo1
Brick3: rhs-client9.lab.eng.blr.redhat.com:/home/lo4
Brick4: rhs-client39.lab.eng.blr.redhat.com:/home/lo5
Brick5: rhs-client4.lab.eng.blr.redhat.com:/home/lo6
Brick6: rhs-client9.lab.eng.blr.redhat.com:/home/lo7
Brick7: rhs-client39.lab.eng.blr.redhat.com:/home/lo8
Brick8: rhs-client4.lab.eng.blr.redhat.com:/home/lo9
Brick9: rhs-client9.lab.eng.blr.redhat.com:/home/lo10
Brick10: rhs-client39.lab.eng.blr.redhat.com:/home/lo11

decommissioned bricks
---------------------
rhs-client39.lab.eng.blr.redhat.com:/home/lo2
rhs-client4.lab.eng.blr.redhat.com:/home/lo3


cluster info
---------------

rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com

mounted on
----------
rhs-client4.lab.eng.blr.redhat.com:/lo

One of the missing files, 0/5/0/5/file.5, is still present on the decommissioned brick:

[root@rhs-client4 lo3]# getfattr -d -m . -e hex  0/5/0/5/
# file: 0/5/0/5/
trusted.afr.lo-client-2=0x000000000000000000000000
trusted.afr.lo-client-3=0x000000000000000000000000
trusted.gfid=0x79e3bd0c99a6446bafe7c3a39bf4b0fe
trusted.glusterfs.dht=0x00000001000000000000000000000000
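
A quick way to look for data files left behind on a drained brick is sketched below. The brick path matches the decommissioned brick above; the filter is an assumption, skipping the internal .glusterfs directory and the zero-byte DHT linkto placeholders that legitimately remain after migration:

# Sketch: list non-empty regular files still sitting on the removed brick.
find /home/lo3 -path '*/.glusterfs' -prune -o -type f ! -size 0 -print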


For all of the missing files, the rebalance log from node rhs-client39.lab.eng.blr.redhat.com reports:
------------------
[2013-11-20 10:09:55.311935] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-9: remote operation failed: Transport endpoint is not connected. Path: /0/5/0/5/file.5 (3cbc4f8c-02ce-408b-8b2e-8e2d84b186af). Key: trusted.glusterfs.pathinfo

[2013-11-20 10:11:44.744731] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-0: remote operation failed: Transport endpoint is not connected. Path: /2/4/0/2/file.0 (f1d127c1-42b2-4fc5-a227-35a5ef6a8d32). Key: trusted.glusterfs.pathinfo
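
All affected paths can be pulled out of the same log; a sketch, assuming the default rebalance log location for this release:

# Sketch: list every file whose getxattr failed mid-migration.
grep 'remote operation failed: Transport endpoint is not connected' \
    /var/log/glusterfs/lo-rebalance.log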


The sosreport is attached.
------------------

Comment 3 shylesh 2013-11-20 11:58:24 UTC
*** Bug 1031971 has been marked as a duplicate of this bug. ***

Comment 4 Susant Kumar Palai 2013-11-22 12:18:41 UTC
I was able to reproduce this bug in Bigbend.

Comment 6 Susant Kumar Palai 2013-11-22 12:51:06 UTC
[root@localhost mnt]# glusterfs --version
glusterfs 3.4.0.33rhs built on Sep  8 2013 13:20:25


[root@localhost mnt]# gluster volume info
 
Volume Name: test
Type: Replicate
Volume ID: d1166e18-a761-4ce3-8ef6-5a5ccfcd79ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.190:/brick2/test_brick1
Brick2: 10.70.43.118:/brick2/test_brick1
Brick3: 10.70.42.190:/brick2/test_brick2
Brick4: 10.70.43.118:/brick2/test_brick2
Options Reconfigured:
diagnostics.client-log-level: TRACE



[root@localhost mnt]# gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 10.70.43.118:/brick2/test_brick2 start

I rebooted node 10.70.43.118 after the remove-brick operation started.
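
For context, a sketch of the checks one would normally run before committing; the volume and brick names are taken from this comment, and "gluster volume heal <vol> info" is part of the CLI in this release:

# Sketch: confirm migration completed and no heals are pending, then commit.
gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 \
    10.70.43.118:/brick2/test_brick2 status
gluster volume heal test info
gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 \
    10.70.43.118:/brick2/test_brick2 commit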


After the remove-brick commit, I found some of the files still on the removed brick.

Result of ls -R on the removed brick 10.70.43.118:/brick2/test_brick2:
./3/5/2/2:
0
1
2
3
4
5
file.1
file.2
file.3
file.4


Here is the result of "ls ./3/5/2/2/" on the mount point:

[root@localhost mnt]# ls ./3/5/2/2/
0  1  2  3  4  5  file.0  file.5
(file.1 to file.4 are missing)

Comment 9 shylesh 2013-12-24 09:50:08 UTC
Verified on 3.4.0.52rhs-1.el6rhs.x86_64

Comment 10 Pavithra 2013-12-31 09:43:58 UTC
Please review the text for technical accuracy.

Comment 13 errata-xmlrpc 2014-02-25 08:05:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html