Bug 1032558 - Remove-brick with self-heal causes data loss
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assigned To: Pranith Kumar K
QA Contact: shylesh
Keywords: ZStream
Duplicates: 1031971
Depends On:
Blocks: 1032927
Reported: 2013-11-20 06:55 EST by shylesh
Modified: 2015-05-15 14:35 EDT
CC List: 5 users

See Also:
Fixed In Version: glusterfs-3.4.0.50rhs
Doc Type: Bug Fix
Doc Text:
Previously, when one of the bricks in a replica pair was offline, a few files were not migrated from the decommissioned bricks, resulting in missing files. With this fix, data is migrated completely even when one of the bricks in the replica pair is offline.
Story Points: ---
Clone Of:
Clones: 1032927
Environment:
Last Closed: 2014-02-25 03:05:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description shylesh 2013-11-20 06:55:33 EST
Description of problem:

While a remove-brick operation on a distributed-replicate volume is in progress, if one of the nodes goes down and comes back up, self-heal is triggered. When remove-brick and self-heal run together, data loss occurs.

Version-Release number of selected component (if applicable):

3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume using 3 nodes in the cluster
2. Fill the volume with files and deep directories, up to depth 5
3. Start remove-brick on one of the replica pairs:
 gluster volume remove-brick <vol>  <b1> <b2> start
4. While remove-brick is in progress, reboot one of the nodes so that self-heal is triggered when it comes back up
5. After remove-brick completes, check the arequal checksum on the mount (see the consolidated command sketch below)
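A consolidated sketch of the flow above. Hostnames, brick paths, and the mount point are placeholders, and a 2-node 2x2 layout is shown for brevity instead of the 6x2/3-node setup actually used:

  # Create and start a distributed-replicate volume (placeholder hosts/paths).
  gluster volume create lo replica 2 \
      node1:/bricks/lo0 node2:/bricks/lo1 \
      node1:/bricks/lo2 node2:/bricks/lo3
  gluster volume start lo
  mount -t glusterfs node1:/lo /mnt/lo

  # Populate with files and deep directories, then record a baseline checksum.
  /opt/qa/tools/arequal-checksum /mnt/lo > /tmp/arequal.before

  # Decommission one replica pair.
  gluster volume remove-brick lo node1:/bricks/lo2 node2:/bricks/lo3 start

  # While migration is running, reboot one node; self-heal starts when it returns.
  ssh node2 reboot

  # Once status reports completed, commit and compare checksums.
  gluster volume remove-brick lo node1:/bricks/lo2 node2:/bricks/lo3 status
  gluster volume remove-brick lo node1:/bricks/lo2 node2:/bricks/lo3 commit
  /opt/qa/tools/arequal-checksum /mnt/lo > /tmp/arequal.after
  diff /tmp/arequal.before /tmp/arequal.after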

Actual results:

A few files are lost from the mount point.

arequal before
--------------
 [root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum .


Entry counts
Regular files   : 9330
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18661

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 00
Directories     : 10000002e01
Symbolic links  : 0
Other           : 0
Total           : 10000002e01


arequal after
-------------
[root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum  .

Entry counts
Regular files   : 9327
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18658

Metadata checksums
Regular files   : 4bc885
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : d26028d3461d4fd7daf74b8d71b5ed7c
Directories     : 362e656c4767
Symbolic links  : 0
Other           : 0
Total           : 897557052c4e5cc


[root@rhs-client4 ~]# gluster v info lo
 
Volume Name: lo
Type: Distributed-Replicate
Volume ID: 4474edd0-512a-42fd-92ac-536d1a258b42
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/lo0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/lo1
Brick3: rhs-client9.lab.eng.blr.redhat.com:/home/lo4
Brick4: rhs-client39.lab.eng.blr.redhat.com:/home/lo5
Brick5: rhs-client4.lab.eng.blr.redhat.com:/home/lo6
Brick6: rhs-client9.lab.eng.blr.redhat.com:/home/lo7
Brick7: rhs-client39.lab.eng.blr.redhat.com:/home/lo8
Brick8: rhs-client4.lab.eng.blr.redhat.com:/home/lo9
Brick9: rhs-client9.lab.eng.blr.redhat.com:/home/lo10
Brick10: rhs-client39.lab.eng.blr.redhat.com:/home/lo11

decommissioned bricks
---------------------
rhs-client39.lab.eng.blr.redhat.com:/home/lo2
rhs-client4.lab.eng.blr.redhat.com:/home/lo3
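Migration progress for the pair being removed can be checked before commit with remove-brick status, using the decommissioned bricks above:

  gluster volume remove-brick lo \
      rhs-client39.lab.eng.blr.redhat.com:/home/lo2 \
      rhs-client4.lab.eng.blr.redhat.com:/home/lo3 status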


cluster info
---------------

rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com

mounted on
----------
rhs-client4.lab.eng.blr.redhat.com:/lo

One of the missing files, 0/5/0/5/file.5, is still present on the decommissioned brick:

[root@rhs-client4 lo3]# getfattr -d -m . -e hex  0/5/0/5/
# file: 0/5/0/5/
trusted.afr.lo-client-2=0x000000000000000000000000
trusted.afr.lo-client-3=0x000000000000000000000000
trusted.gfid=0x79e3bd0c99a6446bafe7c3a39bf4b0fe
trusted.glusterfs.dht=0x00000001000000000000000000000000
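The all-zero trusted.afr changelog values above indicate no pending self-heal on this directory. To enumerate everything the migration left behind, the decommissioned brick can be walked directly; a minimal sketch, run on the node hosting the brick:

  cd /home/lo3
  # List regular files still on the decommissioned brick, skipping gluster's
  # internal .glusterfs metadata tree.
  find . -path ./.glusterfs -prune -o -type f -print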


For all the missing files, the rebalance log from node rhs-client39.lab.eng.blr.redhat.com says:
------------------
[2013-11-20 10:09:55.311935] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-9: remote operation failed: Transport endpoint is not connected. Path: /0/5/0/5/file.5 (3cbc4f8c-02ce-408b-8b2e-8e2d84b186af). Key: trusted.glusterfs.pathinfo

[2013-11-20 10:11:44.744731] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-0: remote operation failed: Transport endpoint is not connected. Path: /2/4/0/2/file.0 (f1d127c1-42b2-4fc5-a227-35a5ef6a8d32). Key: trusted.glusterfs.pathinfo
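The same signature can be collected for every skipped file by grepping the rebalance log on each node; a sketch assuming glusterfs's default log location and the volume name lo:

  grep "remote operation failed: Transport endpoint is not connected" \
      /var/log/glusterfs/lo-rebalance.log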


The sosreport is attached.
Comment 3 shylesh 2013-11-20 06:58:24 EST
*** Bug 1031971 has been marked as a duplicate of this bug. ***
Comment 4 Susant Kumar Palai 2013-11-22 07:18:41 EST
I was able to reproduce this bug in Bigbend.
Comment 6 Susant Kumar Palai 2013-11-22 07:51:06 EST
[root@localhost mnt]# glusterfs --version
glusterfs 3.4.0.33rhs built on Sep  8 2013 13:20:25


[root@localhost mnt]# gluster volume info
 
Volume Name: test
Type: Replicate
Volume ID: d1166e18-a761-4ce3-8ef6-5a5ccfcd79ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.190:/brick2/test_brick1
Brick2: 10.70.43.118:/brick2/test_brick1
Brick3: 10.70.42.190:/brick2/test_brick2
Brick4: 10.70.43.118:/brick2/test_brick2
Options Reconfigured:
diagnostics.client-log-level: TRACE
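For reference, the TRACE client log level shown above is set with a standard volume option:

  gluster volume set test diagnostics.client-log-level TRACE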



[root@localhost mnt]# gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 10.70.43.118:/brick2/test_brick2 start

I rebooted node 10.70.43.118 after the remove-brick operation started.


After the remove-brick commit, I found some of the files still on the removed brick.

Result of ls -R on the removed brick 10.70.43.118:/brick2/test_brick2:
./3/5/2/2:
0
1
2
3
4
5
file.1
file.2
file.3
file.4


Here is the result of "ls ./3/5/2/2" on the mount point:

[root@localhost mnt]# ls ./3/5/2/2/
0  1  2  3  4  5  file.0  file.5
(file.1 to file.4 are missing)
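A quick way to confirm exactly which entries were left behind is to compare the mount's listing with the removed brick's; a sketch using the hosts and paths from this reproduction:

  # On the client: directory contents as the volume sees them.
  ls /mnt/3/5/2/2 | sort > /tmp/mount.list
  # Same directory on the removed brick (over ssh from the client).
  ssh 10.70.43.118 'ls /brick2/test_brick2/3/5/2/2' | sort > /tmp/brick.list
  # Lines only in brick.list are files that were never migrated.
  comm -13 /tmp/mount.list /tmp/brick.list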
Comment 9 shylesh 2013-12-24 04:50:08 EST
Verified on 3.4.0.52rhs-1.el6rhs.x86_64
Comment 10 Pavithra 2013-12-31 04:43:58 EST
Please review the text for technical accuracy.
Comment 13 errata-xmlrpc 2014-02-25 03:05:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
