Description of problem:
While remove-brick on a distributed-replicate volume is in progress, if one of the nodes goes down and comes back up, self-heal is triggered. When remove-brick and self-heal run together, data is lost.

Version-Release number of selected component (if applicable):
3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Created a 6x2 distributed-replicate volume using 3 nodes in the cluster.
2. Filled up the volume with files and deep directories up to depth 5.
3. Started remove-brick of one of the replica pairs using: gluster volume remove-brick <vol> <b1> <b2> start
4. While remove-brick was in progress, rebooted one of the nodes so that self-heal is triggered after it comes back up.
5. After remove-brick completed, checked the arequal checksum on the mount.

Actual results:
A few files are lost from the mount point.

arequal before
--------------
[root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum .

Entry counts
Regular files   : 9330
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18661

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 00
Directories     : 10000002e01
Symbolic links  : 0
Other           : 0
Total           : 10000002e01

arequal after
-------------
[root@rhs-client4 lo]# /opt/qa/tools/arequal-checksum .

Entry counts
Regular files   : 9327
Directories     : 9331
Symbolic links  : 0
Other           : 0
Total           : 18658

Metadata checksums
Regular files   : 4bc885
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : d26028d3461d4fd7daf74b8d71b5ed7c
Directories     : 362e656c4767
Symbolic links  : 0
Other           : 0
Total           : 897557052c4e5cc

volume info
-----------
[root@rhs-client4 ~]# gluster v info lo

Volume Name: lo
Type: Distributed-Replicate
Volume ID: 4474edd0-512a-42fd-92ac-536d1a258b42
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/lo0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/lo1
Brick3: rhs-client9.lab.eng.blr.redhat.com:/home/lo4
Brick4: rhs-client39.lab.eng.blr.redhat.com:/home/lo5
Brick5: rhs-client4.lab.eng.blr.redhat.com:/home/lo6
Brick6: rhs-client9.lab.eng.blr.redhat.com:/home/lo7
Brick7: rhs-client39.lab.eng.blr.redhat.com:/home/lo8
Brick8: rhs-client4.lab.eng.blr.redhat.com:/home/lo9
Brick9: rhs-client9.lab.eng.blr.redhat.com:/home/lo10
Brick10: rhs-client39.lab.eng.blr.redhat.com:/home/lo11

decommissioned bricks
---------------------
rhs-client39.lab.eng.blr.redhat.com:/home/lo2
rhs-client4.lab.eng.blr.redhat.com:/home/lo3

cluster info
------------
rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com

mounted on
----------
rhs-client4.lab.eng.blr.redhat.com:/lo

One of the missing files, 0/5/0/5/file.5, is still present on the decommissioned brick:

[root@rhs-client4 lo3]# getfattr -d -m . -e hex 0/5/0/5/
# file: 0/5/0/5/
trusted.afr.lo-client-2=0x000000000000000000000000
trusted.afr.lo-client-3=0x000000000000000000000000
trusted.gfid=0x79e3bd0c99a6446bafe7c3a39bf4b0fe
trusted.glusterfs.dht=0x00000001000000000000000000000000

For all the missing files, the rebalance log from node rhs-client39.lab.eng.blr.redhat.com says:

[2013-11-20 10:09:55.311935] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-9: remote operation failed: Transport endpoint is not connected. Path: /0/5/0/5/file.5 (3cbc4f8c-02ce-408b-8b2e-8e2d84b186af). Key: trusted.glusterfs.pathinfo
[2013-11-20 10:11:44.744731] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-lo-client-0: remote operation failed: Transport endpoint is not connected. Path: /2/4/0/2/file.0 (f1d127c1-42b2-4fc5-a227-35a5ef6a8d32). Key: trusted.glusterfs.pathinfo

Attached the sosreport.
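For reference, a minimal shell sketch of the reproduction steps above, using the volume name 'lo' and the decommissioned brick pair from this report; the mount directory /mnt/lo and the exact sequencing are assumptions, not the commands actually run:

    # Mount the volume and record a baseline checksum.
    mount -t glusterfs rhs-client4.lab.eng.blr.redhat.com:/lo /mnt/lo
    cd /mnt/lo
    /opt/qa/tools/arequal-checksum . > /tmp/arequal.before

    # Start decommissioning one replica pair.
    gluster volume remove-brick lo \
        rhs-client39.lab.eng.blr.redhat.com:/home/lo2 \
        rhs-client4.lab.eng.blr.redhat.com:/home/lo3 start

    # While migration is running, reboot one node so that self-heal
    # kicks in alongside remove-brick when it rejoins the cluster.

    # Poll until the remove-brick status reports 'completed'.
    gluster volume remove-brick lo \
        rhs-client39.lab.eng.blr.redhat.com:/home/lo2 \
        rhs-client4.lab.eng.blr.redhat.com:/home/lo3 status

    # Compare checksums; any difference means files were lost.
    /opt/qa/tools/arequal-checksum . > /tmp/arequal.after
    diff /tmp/arequal.before /tmp/arequal.after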
*** Bug 1031971 has been marked as a duplicate of this bug. ***
I was able to reproduce this bug in Bigbend.
[root@localhost mnt]# glusterfs --version
glusterfs 3.4.0.33rhs built on Sep 8 2013 13:20:25

[root@localhost mnt]# gluster volume info

Volume Name: test
Type: Replicate
Volume ID: d1166e18-a761-4ce3-8ef6-5a5ccfcd79ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.190:/brick2/test_brick1
Brick2: 10.70.43.118:/brick2/test_brick1
Brick3: 10.70.42.190:/brick2/test_brick2
Brick4: 10.70.43.118:/brick2/test_brick2
Options Reconfigured:
diagnostics.client-log-level: TRACE

[root@localhost mnt]# gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 10.70.43.118:/brick2/test_brick2 start

I rebooted node 10.70.43.118 after the remove-brick operation started. After remove-brick commit, I found some of the files still on the removed brick.

Result of "ls -R" on the removed brick 10.70.43.118:/brick2/test_brick2:

./3/5/2/2:
0  1  2  3  4  5  file.1  file.2  file.3  file.4

And the result of "ls ./3/5/2/2" on the mount point:

[root@localhost mnt]# ls ./3/5/2/2/
0  1  2  3  4  5  file.0  file.5

(file.1 to file.4 are missing)
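The transcript above stops at the start command; the status check and the commit that followed would look like this (a sketch using the same volume and bricks):

    gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 \
        10.70.43.118:/brick2/test_brick2 status    # wait for 'completed'
    gluster volume remove-brick test 10.70.42.190:/brick2/test_brick2 \
        10.70.43.118:/brick2/test_brick2 commit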
Verified on 3.4.0.52rhs-1.el6rhs.x86_64
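(A sketch of the check implied by the reports above, assuming the same kind of setup: after remove-brick completes and is committed, the removed bricks should be fully drained and the mount should list every file.)

    # On each node, the removed brick should contain no leftover user files.
    ls -R /brick2/test_brick2
    # On the mount, arequal entry counts should match the pre-test baseline.
    /opt/qa/tools/arequal-checksum .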
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html