Description of problem:

In a 3 x (4 + 2) = 18 distributed-disperse volume, some files return input/output errors on the FUSE mount after simulating the following scenario:
1. Simulate a disk failure by killing the brick pid, then add the same disk back after formatting the drive.
2. Try to read the recovered/healed file after 2 bricks/nodes are brought down.

Version-Release number of selected component (if applicable):
glusterfs 3.7.2 built on Jun 19 2015 16:33:27
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY. You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:

Steps to Reproduce:
1. Create a 3 x (4 + 2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Simulate a disk failure by killing the pid of any brick on one node, then add the same disk back after formatting the drive.
4. Start the volume with force.
5. Self-heal creates the file with 0 bytes on the newly formatted drive.
6. Wait for self-heal to finish; it never completes and the file stays at 0 bytes.
7. Read the same file from the client; the 0-byte file is now recovered. With all bricks up, the md5sum of the file is correct.
8. Bring down 2 of the nodes.
9. Try to get the md5sum of the same recovered file; the client throws an I/O error.

Actual results:
I/O error on the recovered file.

Expected results:
There should not be any I/O error.

Additional info:

admin@node001:~$ sudo gluster volume info

Volume Name: vaulttest21
Type: Distributed-Disperse
Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.1:/media/disk2
Brick8: 10.1.2.2:/media/disk2
Brick9: 10.1.2.3:/media/disk2
Brick10: 10.1.2.4:/media/disk2
Brick11: 10.1.2.5:/media/disk2
Brick12: 10.1.2.6:/media/disk2
Brick13: 10.1.2.1:/media/disk3
Brick14: 10.1.2.2:/media/disk3
Brick15: 10.1.2.3:/media/disk3
Brick16: 10.1.2.4:/media/disk3
Brick17: 10.1.2.5:/media/disk3
Brick18: 10.1.2.6:/media/disk3
Options Reconfigured:
performance.readdir-ahead: on

*_After simulating the disk failure (node3, disk2) and adding it back after formatting the drive:_*

admin@node003:~$ date
Thu Jun 25 *16:21:58* IST 2015
admin@node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
--
admin@node003:~$ date
Thu Jun 25 *16:25:09* IST 2015
admin@node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
--
admin@node003:~$ date
Thu Jun 25 *16:41:25* IST 2015
admin@node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4

*After waiting nearly 20 minutes, self-heal still has not recovered the full data chunk. Then try to read the file using md5sum:*

root@mas03:/mnt/gluster# time md5sum up1
4650543ade404ed5a1171726e76f8b7c  up1

real	1m58.010s
user	0m6.243s
sys	0m0.778s

*The recovered chunk starts growing:*

admin@node003:~$ ls -l -h /media/disk2
total 2.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
-rw-r--r-- 2 root root 797M Jun 25 15:57 up1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4

*_To verify the healed file after two nodes (5 & 6) are taken offline:_*

root@mas03:/mnt/gluster# time md5sum up1
md5sum: up1: *Input/output error*
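The 0-byte files in the listings above are the tell-tale sign of the stalled heal on the brick. A quick way to spot such fragments on any brick directory (an illustrative helper using plain find, not a GlusterFS tool; the function name is mine):

```shell
#!/bin/bash
# List regular files of size 0 under a brick directory -- on an EC brick
# these are candidates for fragments that self-heal created but never
# filled in. Pass the brick path, e.g.:
#   list_empty_fragments /media/disk2
list_empty_fragments() {
    find "${1:?brick directory required}" -type f -size 0 -print
}
```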
We are receiving the same I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online.

Steps to reproduce:
1. Create a 3 x (4 + 2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Bring the 2 failed/down nodes back up.
6. Proactive self-heal recreates the unavailable data chunks on the 2 nodes (nodes 5 & 6) correctly.
7. Once self-heal finishes, bring down another two nodes (nodes 1 & 2).
8. Try to get the md5sum of the same recovered files, or read them (filenew1 & filenew2); the client throws an I/O error.
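To make the md5sum check in the last step repeatable, a small helper can record checksums while all bricks are up and re-verify them after the failures (a sketch using plain bash and coreutils; the function names and baseline path are mine, not GlusterFS tooling):

```shell
#!/bin/bash
# Record md5sums of files while the volume is healthy, then re-check after
# bricks are killed and healed. md5sum -c exits non-zero on a mismatch or a
# read failure, so both the corruption and the EIO cases surface here.
baseline=/tmp/ec-baseline.md5

record_checksums() {  # record_checksums <files...>
    md5sum "$@" > "$baseline"
}

verify_checksums() {  # prints "<file>: OK" or "<file>: FAILED" per file
    md5sum -c "$baseline"
}
```

Usage would be `record_checksums /mnt/gluster/filenew1 /mnt/gluster/filenew2` before killing the bricks, then `verify_checksums` after the heal and the second failure.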
Both failures seem related to bug #1235629. The bug happens because, after healing a file, an important extended attribute is not correctly repaired. When 2 other bricks are killed, the system considers that there are 3 bad copies (2 down + 1 incorrect), and therefore reports EIO. A patch for this problem is already merged into the latest release-3.7 branch. Could you check whether the current release-3.7 branch solves the problem?
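The EIO condition follows directly from the volume's numbers: each 4+2 dispersed subvolume needs at least 4 healthy fragments to serve a read, and 2 killed bricks plus 1 fragment with the unrepaired xattr leave only 3. A tiny sketch of that arithmetic (variable names are mine, purely illustrative, not GlusterFS internals):

```shell
#!/bin/bash
# Fragment-availability arithmetic for one 4+2 dispersed subvolume.
data=4        # data fragments: the minimum needed to serve a read
redundancy=2  # redundancy fragments
down=2        # bricks brought down
bad=1         # fragment left inconsistent by the incomplete heal

good=$(( data + redundancy - down - bad ))

if [ "$good" -lt "$data" ]; then
    echo "have $good, need $data -> EIO"
else
    echo "have $good, need $data -> readable"
fi
```

This mirrors the log line quoted later in this bug: "Insufficient available childs for this request (have 3, need 4)".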
Issue 1: I/O error after simulating a disk failure.
Issue 2: I/O error after simulating a node failure.

Issue 2: Proactive self-heal works properly. The I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online, has been resolved after installing the tarball "http://download.gluster.org/pub/gluster/glusterfs/nightly/sources/glusterfs-3.7.2-20150630.fb72055.tar.gz".

Issue 1: Proactive self-heal is not working, and the I/O error still exists after healing the files of the failed disk (simulated disk failure). The bug happens because proactive self-heal does not run properly. We have to run the command
  find -d -exec getfattr -h -n test {} \;
to heal the files of the failed disk. After this manual healing, the trusted.ec.config xattr is missing on the healed files. We are able to read the healed files without I/O errors if we add the trusted.ec.config xattr manually to all the healed files.
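Building on the workaround above, a small helper can report which files on a brick lack the trusted.ec.config xattr before adding it back by hand (an illustrative sketch, not a GlusterFS tool; trusted.* xattrs are only visible to root, and getfattr from the attr package must be installed):

```shell
#!/bin/bash
# List files under a brick directory that are missing a given xattr.
# Defaults to trusted.ec.config, the attribute reported missing above.
find_missing_xattr() {  # find_missing_xattr <brick-dir> [xattr-name]
    local dir=${1:?brick directory required}
    local xattr=${2:-trusted.ec.config}
    find "$dir" -type f | while read -r f; do
        # getfattr exits non-zero when the attribute is absent
        if ! getfattr -h -n "$xattr" --absolute-names "$f" >/dev/null 2>&1; then
            echo "missing $xattr: $f"
        fi
    done
}
```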
I wrote a test script to trigger the bug.

--
# cat tests/basic/ec/ec-proactive-heal.t
#!/bin/bash

. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup

ec_test_dir=$M0/test

function ec_test_generate_src()
{
    mkdir -p $ec_test_dir
    for i in `seq 0 19`
    do
        dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
    done
}

function ec_test_make()
{
    for i in `ls *.c`
    do
        file=`basename $i`
        filename=${file%.*}
        cp $i $filename.o
    done
}

## step 1
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0

## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make

## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make

## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make

## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count

## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# self-healing
TEST rm -f *.o
TEST ec_test_make

TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count

## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*

cleanup;
--

I tested on branch release-3.7 with commit ID b639cb9f62ae, and on master with commit ID 9442e7bf80f5c. On both branches the I/O error was reported in step 5, during "TEST rm -f *.o" and "TEST ec_test_make".
Please note that if we use the root directory of the mount point, i.e. set ec_test_dir to $M0, the test always passes. Hope this helps.
This bug could not be fixed in time for glusterfs-3.7.3. This is now being tracked for being fixed in glusterfs-3.7.4.
Fang Huang, thanks a ton for the test script.

Xavi, I think I found the root cause of this problem. After the heal happens, the inode is still not able to update the 'bad' mask in its inode-ctx, because of which it thinks enough good subvolumes are not present, which leads to EIO. Let's talk about this today.

[2015-08-05 00:56:52.530667] E [MSGID: 122034] [ec-common.c:546:ec_child_select] 0-patchy-disperse-0: Insufficient available childs for this request (have 3, need 4)

I modified the script to unmount and mount again so that the inode-ctx is initialized afresh, and the test passes.

--
#!/bin/bash

. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup

ec_test_dir=$M0/test

function ec_test_generate_src()
{
    mkdir -p $ec_test_dir
    for i in `seq 0 19`
    do
        dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
    done
}

function ec_test_make()
{
    for i in `ls *.c`
    do
        file=`basename $i`
        filename=${file%.*}
        cp $i $filename.o
    done
}

## step 1
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0

## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make

## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make

## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make

## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "5" ec_child_up_count $V0 0
cd -
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count

## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
cd -
# self-healing
TEST rm -f *.o
TEST ec_test_make

TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count

## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*

cleanup;
--
I have tested 3.7.3 as well as the 3.7.2 nightly build (glusterfs-3.7.2-20150726.b639cb9.tar.gz) for the I/O error and hang issues. I have found that 3.7.3 has a data corruption issue which is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz). Data gets corrupted after replacing the failed drive and running the self-heal command. We also see data corruption after recovery from a node failure, once the unavailable data chunks have been copied by the proactive self-heal daemon.

You can reproduce the bug through the following steps.

Steps to reproduce:
1. Create a 3 x (4 + 2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Calculate the md5sum of filenew1 and filenew2.
6. Bring the 2 failed/down nodes (nodes 5 & 6) back up.
7. Proactive self-heal will recreate the unavailable data chunks on the 2 nodes (nodes 5 & 6).
8. Once self-heal finishes, bring down another two nodes (nodes 1 & 2).
9. Try to get the md5sum of the same recovered files; there will be a mismatch in the md5sum values.

This bug is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz).

Also, I would like to know why proactive self-heal does not happen after replacing the failed drives. I have to run the volume heal command manually to heal the unavailable files.
REVIEW: http://review.gluster.org/11867 (cluster/ec: Fix tracking of good bricks) posted (#1) for review on release-3.7 by Xavier Hernandez (xhernandez)
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.4, please open a new bug report.

glusterfs-3.7.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12496
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user