Bug 1235964
| Summary: | Disperse volume: FUSE I/O error after self healing the failed disk files | | | |
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Backer <mdfakkeer> | |
| Component: | disperse | Assignee: | Xavi Hernandez <jahernan> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | | |
| Severity: | high | Docs Contact: | | |
| Priority: | unspecified | | | |
| Version: | 3.7.2 | CC: | bugs, fanghuang.data, gluster-bugs, jahernan, pkarampu | |
| Target Milestone: | --- | Keywords: | Triaged | |
| Target Release: | --- | | | |
| Hardware: | x86_64 | | | |
| OS: | Linux | | | |
| Whiteboard: | | | | |
| Fixed In Version: | glusterfs-3.7.4 | Doc Type: | Bug Fix | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | | | | |
| : | 1236065 (view as bug list) | Environment: | | |
| Last Closed: | 2015-09-09 09:38:02 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | | |
| Verified Versions: | | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | | |
| Cloudforms Team: | --- | Target Upstream Version: | | |
| Embargoed: | | | | |
| Bug Depends On: | 1236065 | | | |
| Bug Blocks: | 1248533 | | | |
Description
Backer
2015-06-26 08:32:14 UTC
We are receiving the same I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online.

Steps to reproduce:
1. Create a 3x(4+2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Bring the 2 failed/down nodes back up.
6. Proactive self healing recreates the unavailable data chunks on the 2 nodes (nodes 5 & 6) correctly.
7. Once self healing finishes, bring down another two nodes (nodes 1 & 2).
8. Try to get the md5sum of the same recovered files, or read them (filenew1 & filenew2): the client throws an I/O error.

Both failures seem related to bug #1235629. The bug happens because, after healing a file, an important extended attribute is not correctly repaired. When 2 other bricks are killed, the system considers that there are 3 bad copies (2 down + 1 incorrect), thus reporting EIO. There is a patch for this problem already merged into the latest release-3.7 branch. Could you try whether the current release-3.7 branch solves the problem?

Issue 1: I/O error after simulating a disk failure. Issue 2: I/O error after simulating a node failure.

Issue 2: Proactive self healing works properly. The I/O error after recovery of the unavailable data chunks on the failed nodes, once they come back online, has been resolved after installing the tar file "http://download.gluster.org/pub/gluster/glusterfs/nightly/sources/glusterfs-3.7.2-20150630.fb72055.tar.gz".

Issue 1: Proactive self healing is not working, and the I/O error still exists after healing the files of a failed disk (simulated disk failure). Because proactive self healing does not run, we have to run the "find -d -exec getfattr -h -n test {} \;" command to heal the failed disk's files. After this manual healing, the trusted.ec.config xattr is not present on the healed files. We are able to read the healed files without an I/O error if we add the trusted.ec.config xattr manually to all of them (see the sketch below, just before the test script).

I wrote a test script to trigger the bug.
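As an illustration of the manual repair just described, here is a hedged sketch; the mount point, brick paths, and the hex value are placeholders, not taken from this report:
--
# Force lookups across the mount; this is the command quoted above
# ('test' is just a dummy xattr name used to trigger a lookup on each file):
find /mnt/ec -d -exec getfattr -h -n test {} \; 2>/dev/null

# On the healed brick, check whether the EC config xattr came back:
getfattr -h -e hex -n trusted.ec.config /bricks/brick5/test/filenew1

# If it is missing, read the value from a healthy brick of the same file
# and set it by hand (the hex value below is a placeholder; use the exact
# value printed by the healthy brick):
getfattr -h -e hex -n trusted.ec.config /bricks/brick1/test/filenew1
setfattr -h -n trusted.ec.config -v 0x0000080602000200 /bricks/brick5/test/filenew1
--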
--
# cat tests/basic/ec/ec-proactive-heal.t
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc
cleanup
ec_test_dir=$M0/test
function ec_test_generate_src()
{
mkdir -p $ec_test_dir
for i in `seq 0 19`
do
dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
done
}
function ec_test_make()
{
for i in `ls *.c`
do
file=`basename $i`
filename=${file%.*}
cp $i $filename.o
done
}
## step 1
TEST glusterd
TEST pidof glusterd
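# disperse 7 redundancy 3: 7 bricks, 3 of which are redundancy, so any 4
# healthy fragments are enough to serve a request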
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
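# mount with FUSE entry/attribute caching disabled so every access reaches the bricks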
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make
## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make
## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count
## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# self-healing
TEST rm -f *.o
TEST ec_test_make
TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count
## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*
cleanup;
--
I tested on branch release-3.7 with commit ID b639cb9f62ae, and on master with commit ID 9442e7bf80f5c. On both branches the I/O error was reported in step 5, during the two tests "TEST rm -f *.o" and "TEST ec_test_make".
Please note that if we use the root directory of the mount-point, i.e. set the ec_test_dir to $M0, the test always passes.
Hope this helps.
This bug could not be fixed in time for glusterfs-3.7.3. It is now being tracked for a fix in glusterfs-3.7.4.

Fang Huang,
Thanks a ton for the test script.
Xavi,
I think I found the root cause of this problem. After the heal happens, the inode is still not able to update the 'bad' mask in its inode-ctx, because of which it thinks that enough good subvolumes are not present, which leads to EIO. Let's talk about this today.
[2015-08-05 00:56:52.530667] E [MSGID: 122034] [ec-common.c:546:ec_child_select] 0-patchy-disperse-0: Insufficient available childs for this request (have 3, need 4)
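A minimal sketch of the arithmetic behind that message, assuming the stale 'bad' mask still flags the two healed bricks (illustration only, not the actual ec code):
--
#!/bin/bash
# disperse 7 / redundancy 3: a request needs 7 - 3 = 4 healthy fragments
bricks=7
down=2        # the two bricks killed in step 5
stale_bad=2   # the two healed bricks still marked bad in the inode-ctx
have=$((bricks - down - stale_bad))
need=$((bricks - 3))
[ "$have" -lt "$need" ] && echo "EIO: have $have, need $need"  # prints: EIO: have 3, need 4
--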
I modified the script to umount and mount again so that the inode-ctx is created afresh; with that change the test passes.
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc
cleanup
ec_test_dir=$M0/test
function ec_test_generate_src()
{
mkdir -p $ec_test_dir
for i in `seq 0 19`
do
dd if=/dev/zero of=$ec_test_dir/$i.c bs=1024 count=2
done
}
function ec_test_make()
{
for i in `ls *.c`
do
file=`basename $i`
filename=${file%.*}
cp $i $filename.o
done
}
## step 1
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 disperse 7 redundancy 3 $H0:$B0/${V0}{0..6}
TEST $CLI volume start $V0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
## step 2
TEST ec_test_generate_src
cd $ec_test_dir
TEST ec_test_make
## step 3
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST kill_brick $V0 $H0 $B0/${V0}1
EXPECT '5' online_brick_count
TEST rm -f *.o
TEST ec_test_make
## step 4
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
# active heal
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "[0-9][0-9]*" get_shd_process_pid
TEST $CLI volume heal $V0 full
TEST rm -f *.o
TEST ec_test_make
## step 5
TEST kill_brick $V0 $H0 $B0/${V0}2
TEST kill_brick $V0 $H0 $B0/${V0}3
EXPECT '5' online_brick_count
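# workaround: step out of the mount, remount, and return, so the client
# builds a fresh inode-ctx without the stale 'bad' mask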
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "5" ec_child_up_count $V0 0
cd -
TEST rm -f *.o
TEST ec_test_make
EXPECT '5' online_brick_count
## step 6
TEST $CLI volume start $V0 force
EXPECT '7' online_brick_count
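# remount again after restarting the bricks, for the same reason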
cd -
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0
EXPECT_WITHIN $CHILD_UP_TIMEOUT "7" ec_child_up_count $V0 0
cd -
# self-healing
TEST rm -f *.o
TEST ec_test_make
TEST pidof glusterd
EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Started' volinfo_field $V0 'Status'
EXPECT '7' online_brick_count
## cleanup
cd
EXPECT_WITHIN $UMOUNT_TIMEOUT "Y" force_umount $M0
TEST $CLI volume stop $V0
TEST $CLI volume delete $V0
TEST rm -rf $B0/*
cleanup;
I have tested 3.7.3 as well as the 3.7.2 nightly build (glusterfs-3.7.2-20150726.b639cb9.tar.gz) for the I/O error and hang issue. I have found that 3.7.3 has a data corruption issue which is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz). Data is corrupted after replacing the failed drive and running the self heal command. We are also seeing data corruption after recovery from a node failure, when the unavailable data chunks have been copied by the proactive self heal daemon. You can reproduce the bug through the following steps.

Steps to reproduce:
1. Create a 3x(4+2) disperse volume across nodes.
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd.
3. Bring down 2 of the nodes (nodes 5 & 6).
4. Write some files (e.g. filenew1, filenew2). The files will be available only on 4 nodes (nodes 1, 2, 3 & 4).
5. Calculate the md5sum of filenew1 and filenew2.
6. Bring the 2 failed/down nodes (nodes 5 & 6) back up.
7. Proactive self healing will recreate the unavailable data chunks on the 2 nodes (nodes 5 & 6).
8. Once self healing finishes, bring down another two nodes (nodes 1 & 2).
9. Try to get the md5sum of the same recovered files; there will be a mismatch in the md5sum values.

This bug is not present in the 3.7.2 nightly build (glusterfs-3.7.2-20150707.36f24f5.tar.gz). Also, I would like to know why proactive self healing does not happen after replacing the failed drives; I have to run the volume heal command manually to heal the unavailable files.

REVIEW: http://review.gluster.org/11867 (cluster/ec: Fix tracking of good bricks) posted (#1) for review on release-3.7 by Xavier Hernandez (xhernandez)

This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.4, please open a new bug report. glusterfs-3.7.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12496
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user