Bug 1235964
| Summary: | Disperse volume: FUSE I/O error after self healing the failed disk files | | | |
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Backer <mdfakkeer> | |
| Component: | disperse | Assignee: | Xavi Hernandez <jahernan> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | | |
| Severity: | high | Docs Contact: | | |
| Priority: | unspecified | | | |
| Version: | 3.7.2 | CC: | bugs, fanghuang.data, gluster-bugs, jahernan, pkarampu | |
| Target Milestone: | --- | Keywords: | Triaged | |
| Target Release: | --- | | | |
| Hardware: | x86_64 | | | |
| OS: | Linux | | | |
| Whiteboard: | | | | |
| Fixed In Version: | glusterfs-3.7.4 | Doc Type: | Bug Fix | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | | | | |
| : | 1236065 (view as bug list) | Environment: | | |
| Last Closed: | 2015-09-09 09:38:02 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | | |
| Verified Versions: | | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | | |
| Cloudforms Team: | --- | Target Upstream Version: | | |
| Embargoed: | | | | |
| Bug Depends On: | 1236065 | | | |
| Bug Blocks: | 1248533 | | | |
Description
Backer
2015-06-26 08:32:14 UTC
We are receiving the same I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online.

Steps to reproduce:
1. Create a 3x(4+2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Bring the 2 failed/down nodes back up.
6. Proactive self healing recreates the unavailable data chunks on the 2 nodes (nodes 5 & 6) correctly.
7. Once self healing finishes, bring down another two nodes (nodes 1 & 2).
8. Try to get the md5sum of the same recovered files, or read them (filenew1 & filenew2): the client throws an I/O error.

Both failures seem related to bug #1235629. The bug happens because, after healing a file, an important extended attribute is not correctly repaired. When 2 other bricks are killed, the system considers that there are 3 bad copies (2 down + 1 incorrect), thus reporting EIO. There is a patch for this problem already merged into the latest release-3.7 branch. Could you try whether the current release-3.7 branch solves the problem?

Issue 1: I/O error after simulating a disk failure. Issue 2: I/O error after simulating a node failure.

Issue 2: Proactive self healing works properly. The I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online, has been resolved after installing the tar file "http://download.gluster.org/pub/gluster/glusterfs/nightly/sources/glusterfs-3.7.2-20150630.fb72055.tar.gz".

Issue 1: Proactive self healing is not working, and the I/O error still exists after healing the files of a failed disk (simulated disk failure). Because proactive self healing does not run, we have to run the "find -d -exec getfattr -h -n test {} \;" command to heal the failed disk's files. After this manual healing, the trusted.ec.config xattr is not present on the healed files. We are able to read the healed files without an I/O error if we add the trusted.ec.config xattr manually to all of them (see the sketch below, just before the test script).

I wrote a test script to trigger the bug.
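As an illustration of the manual repair just described, here is a hedged sketch; the mount point, brick paths, and the hex value are placeholders, not taken from this report:
--
# Force lookups across the mount; this is the command quoted above
# ('test' is just a dummy xattr name used to trigger a lookup on each file):
find /mnt/ec -d -exec getfattr -h -n test {} \; 2>/dev/null

# On the healed brick, check whether the EC config xattr came back:
getfattr -h -e hex -n trusted.ec.config /bricks/brick5/test/filenew1

# If it is missing, read the value from a healthy brick of the same file
# and set it by hand (the hex value below is a placeholder; use the exact
# value printed by the healthy brick):
getfattr -h -e hex -n trusted.ec.config /bricks/brick1/test/filenew1
setfattr -h -n trusted.ec.config -v 0x0000080602000200 /bricks/brick5/test/filenew1
--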
--
# cat tests/basic/ec/ec-proactive-heal.t
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc
cleanup
ec_test_dir=$M0/test
function ec_test_generate_src()
{
mkdir -p $ec_test_dir
for i in `seq 0 19`
do
dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
done
}
function ec_test_make()
{
for i in `ls *.c`
do
file=`basename $i`
filename=${file%.*}
cp $i $filename.o
done
}
## step 1
TEST glusterd
TEST pidof glusterd
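# disperse 7 redundancy 3: 7 bricks, 3 of which are redundancy, so any 4
# healthy fragments are enough to serve a request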
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
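# mount with FUSE entry/attribute caching disabled so every access reaches the bricks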
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make
## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make
## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count
## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# self-healing
TEST rm -f *.o
TEST ec_test_make
TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count
## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*
cleanup;
--
I tested on branch release-3.7 with commit ID b639cb9f62ae, and on master with commit ID 9442e7bf80f5c. On both branches the I/O error was reported in step 5, during the two tests "TEST rm -f *.o" and "TEST ec_test_make".
Please note that if we use the root directory of the mount-point, i.e. set the ec_test_dir to $M0, the test always passes.
Hope this helps.
This bug could not be fixed in time for glusterfs-3.7.3. It is now being tracked for a fix in glusterfs-3.7.4.

Fang Huang,
Thanks a ton for the test script.
Xavi,
I think I found the root cause of this problem. After the heal happens, the inode is still not able to update the 'bad' mask in its inode-ctx, because of which it thinks that enough good subvolumes are not present, which leads to EIO. Let's talk about this today.
[2015-08-05 00:56:52.530667] E [MSGID: 122034] [ec-common.c:546:ec_child_select] 0-patchy-disperse-0: Insufficient available childs for this request (have 3, need 4)
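A minimal sketch of the arithmetic behind that message, assuming the stale 'bad' mask still flags the two healed bricks (illustration only, not the actual ec code):
--
#!/bin/bash
# disperse 7 / redundancy 3: a request needs 7 - 3 = 4 healthy fragments
bricks=7
down=2        # the two bricks killed in step 5
stale_bad=2   # the two healed bricks still marked bad in the inode-ctx
have=$((bricks - down - stale_bad))
need=$((bricks - 3))
[ "$have" -lt "$need" ] && echo "EIO: have $have, need $need"  # prints: EIO: have 3, need 4
--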
I modified the script to umount and mount again so that the inode-ctx is created afresh; with that change the test passes.
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc
cleanup
ec_test_dir=$M0/test
function ec_test_generate_src()
{
mkdir -p $ec_test_dir
for i in `seq 0 19`
do
dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
done
}
function ec_test_make()
{
for i in `ls *.c`
do
file=`basename $i`
filename=${file%.*}
cp $i $filename.o
done
}
## step 1
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make
## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make
## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
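# workaround: step out of the mount, remount, and return, so the client
# builds a fresh inode-ctx without the stale 'bad' mask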
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "5" ec_child_up_count $V0 0
cd -
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count
## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
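# remount again after restarting the bricks, for the same reason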
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
cd -
# self-healing
TEST rm -f *.o
TEST ec_test_make
TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count
## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*
cleanup;
I have tested 3.7.3 as well as the 3.7.2 nightly build (glusterfs-3.7.2-20150726.b639cb9.tar.gz) for the I/O error and hang issue. I have found that 3.7.3 has a data corruption issue which is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz). Data is corrupted after replacing the failed drive and running the self heal command. We are also seeing data corruption after recovery from a node failure, when the unavailable data chunks have been copied by the proactive self heal daemon. You can reproduce the bug through the following steps.

Steps to reproduce:
1. Create a 3x(4+2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Calculate the md5sum of filenew1 and filenew2.
6. Bring the 2 failed/down nodes (nodes 5 & 6) back up.
7. Proactive self healing will recreate the unavailable data chunks on the 2 nodes (nodes 5 & 6).
8. Once self healing finishes, bring down another two nodes (nodes 1 & 2).
9. Try to get the md5sum of the same recovered files; there will be a mismatch in the md5sum values.

This bug is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz). Also, I would like to know why proactive self healing does not happen after replacing the failed drives; I have to run the volume heal command manually to heal the unavailable files.

REVIEW: http://review.gluster.org/11867 (cluster/ec: Fix tracking of good bricks) posted (#1) for review on release-3.7 by Xavier Hernandez (xhernandez)

This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.4, please open a new bug report. glusterfs-3.7.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12496
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user