Bug 798874

Summary: self-heal hangs in case of metadata, data self-heal w.o. any changelog
Product: [Community] GlusterFS
Reporter: Pranith Kumar K <pkarampu>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact: Shwetha Panduranga <shwetha.h.panduranga>
Severity: high
Priority: high
Version: pre-release
CC: gluster-bugs, rodrigo
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Last Closed: 2013-07-24 17:28:32 UTC
Bug Blocks: 817967

Description Pranith Kumar K 2012-03-01 07:08:45 UTC
Description of problem:
The bug is caused by a stale active_sink count left over from metadata self-heal. Data self-heal then decides that nothing needs to be done and tries to close the files it opened on the sources and sinks. Because metadata self-heal had set the active sink count to one, the close path sets the call count to 4 but performs only 2 winds. The remaining 2 replies never arrive, so the stack frame is left incomplete and the self-heal hangs.
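
The mismatch can be illustrated with a toy model. This is only a sketch, not the AFR code: the `Frame` class and `close_files` helper are hypothetical names, and the numbers 4 and 2 are taken from the description above without modelling how AFR derives them. The frame completes only when its call count reaches zero, so expecting more replies than winds leaves it pending forever.

```python
class Frame:
    """Toy stand-in for a call frame that completes once every reply has arrived."""

    def __init__(self, call_count):
        self.call_count = call_count  # replies the frame still waits for
        self.done = False

    def on_reply(self):
        # Each reply decrements the counter; the last one completes the frame.
        self.call_count -= 1
        if self.call_count == 0:
            self.done = True


def close_files(expected_replies, winds):
    """Issue `winds` calls on a frame that waits for `expected_replies` replies."""
    frame = Frame(expected_replies)
    for _ in range(winds):
        frame.on_reply()  # in this toy model every wind gets exactly one reply
    return frame


# Stale count: 4 replies expected, only 2 winds performed (numbers from the report).
stale = close_files(expected_replies=4, winds=2)
# Consistent count: expected replies match the winds actually performed.
fixed = close_files(expected_replies=2, winds=2)

print(stale.done, stale.call_count)  # False 2
print(fixed.done, fixed.call_count)  # True 0
```

In the stale case the counter stops at 2 and nothing is left to decrement it, which matches the hang described above.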



Comment 1 Anand Avati 2012-03-05 18:36:08 UTC
CHANGE: http://review.gluster.com/2849 (cluster/afr: Reset re-usable sh args in sh_*_done) merged in master by Vijay Bellur (vijay)

Comment 2 Anand Avati 2012-03-13 02:37:51 UTC
CHANGE: http://review.gluster.com/2928 (replicate: backport of 0783ca994d9ea95fd9ab3dd95d6407918f19f255) merged in release-3.2 by Anand Avati (avati)

Comment 3 Shwetha Panduranga 2012-06-04 09:03:13 UTC
Waiting for input from Pranith to verify the bug.

Comment 4 Shwetha Panduranga 2012-06-12 11:57:03 UTC
Steps to recreate the bug:-
---------------------------

1. Create a replicate volume (1x2: brick1 and brick2).
2. From the backend, create a file "file1" on brick1 and brick2. The size and ownership of "file1" should differ between brick1 and brick2.
3. Set the "background-self-heal-count" option to "0" on the volume.
4. Start the volume.
5. Create a fuse/nfs mount.
6. ls <file1> from the mount.
7. cat <file1> from the mount.
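
A rough sketch of the command sequence behind these steps, for reference. It only builds and prints the commands (a dry run) and runs nothing. The `repro_commands` helper is a hypothetical name, and the volume name, brick paths, and mount point are assumptions borrowed from the transcript in this report. Step 2 is left out because it is done by hand on each brick host.

```python
import shlex


def repro_commands(volume, bricks, mount_server, mount_point):
    """Return the reproduction commands as argument lists (nothing is executed)."""
    bricks = list(bricks)
    return [
        # Step 1: create a replicate volume with one replica per brick.
        ["gluster", "volume", "create", volume, "replica", str(len(bricks)), *bricks],
        # Step 3: disable background self-heal so the heal runs in the foreground.
        ["gluster", "volume", "set", volume, "background-self-heal-count", "0"],
        # Step 4: start the volume.
        ["gluster", "volume", "start", volume],
        # Step 5: fuse mount on the client.
        ["mount", "-t", "glusterfs", f"{mount_server}:/{volume}", mount_point],
        # Steps 6 and 7: lookup, then read.
        ["ls", f"{mount_point}/file1"],
        ["cat", f"{mount_point}/file1"],
    ]


commands = repro_commands(
    volume="vol",
    bricks=["10.16.159.184:/export_b1/dir1", "10.16.159.188:/export_b1/dir1"],
    mount_server="10.16.159.184",
    mount_point="/mnt/gfsc1",
)
for argv in commands:
    print(shlex.join(argv))
```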

Expected Result:-
-----------------
1. ls should not hang, and lookup of the file should succeed (ls, ls -l, stat).
2. cat should report EIO.
3. GFIDs are assigned to the file. Extended attributes are not set.
4. rm -f <file1> should succeed.
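
Expected results 1 and 2 could be checked from a client with something like the sketch below. The `probe` helper and the mount path are assumptions for illustration. On a mount behaving as expected above, the result would be a successful lookup together with `errno.EIO` from the read.

```python
import errno
import os


def probe(path):
    """Return (lookup_ok, read_errno); read_errno is None when the read succeeds."""
    try:
        os.stat(path)  # lookup, as done by ls -l or stat
    except OSError:
        return False, None
    try:
        with open(path, "rb") as fh:
            fh.read(1)  # read, as done by cat
    except OSError as exc:
        return True, exc.errno
    return True, None


# Hypothetical path taken from the transcript in this report.
lookup_ok, read_errno = probe("/mnt/gfsc1/file1")
print("lookup ok:", lookup_ok)
print("read reported EIO:", read_errno == errno.EIO)
```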

Comment 5 Shwetha Panduranga 2012-06-12 12:34:10 UTC
[06/12/12 - 08:20:45 root@AFR-Server1 ~]# glusterd
[06/12/12 - 08:21:11 root@AFR-Server1 ~]# ./peer_probe.sh 
Probe successful
Probe successful
Number of Peers: 2

Hostname: 10.16.159.188
Uuid: b0784ecf-5412-4c6d-a9ca-f104c2a31497
State: Peer in Cluster (Connected)

Hostname: 10.16.159.196
Uuid: ac29a04f-35e0-4ec4-8caa-b3169c8a194d
State: Peer in Cluster (Connected)
[06/12/12 - 08:21:41 root@AFR-Server1 ~]# ./create_vol_1_3.sh 
Creation of volume vol has been successful. Please start the volume to access data.
[06/12/12 - 08:21:52 root@AFR-Server1 ~]# 
[06/12/12 - 08:21:57 root@AFR-Server1 ~]# 
[06/12/12 - 08:21:58 root@AFR-Server1 ~]# gluster v set vol background-self-heal-count 0
Set volume successful
[06/12/12 - 08:22:22 root@AFR-Server1 ~]# gluster v info
 
Volume Name: vol
Type: Replicate
Volume ID: 0c1cf7ba-abd9-47da-aba0-379776511854
Status: Created
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Options Reconfigured:
cluster.background-self-heal-count: 0

Create file "file1" on Brick1 from backend:-
-------------------------------------------
[06/12/12 - 08:22:29 root@AFR-Server1 ~]# echo "Data From Brick1" > /export_b1/dir1/file1 
[06/12/12 - 08:23:47 root@AFR-Server1 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 1 root root 17 Jun 12 08:23 /export_b1/dir1/file1

Create file "file1" on Brick2 from backend:-
-------------------------------------------
[06/12/12 - 08:25:25 root@AFR-Server2 ~]# echo "Data From Brick2" > /export_b1/dir1/file1

[06/12/12 - 08:25:32 root@AFR-Server2 ~]# chown qa /export_b1/dir1/file1

[06/12/12 - 08:25:38 root@AFR-Server2 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 1 qa root 17 Jun 12 08:25 /export_b1/dir1/file1


Create file "file1" on Brick3 from backend:-
-------------------------------------------
[06/12/12 - 08:24:27 root@AFR-Server3 ~]# for i in {1..100}; do echo "This is from Brick3" >> /export_b1/dir1/file1; done

[06/12/12 - 08:24:59 root@AFR-Server3 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 1 root root 2.0K Jun 12 08:24 /export_b1/dir1/file1


[06/12/12 - 08:25:53 root@AFR-Server1 ~]# gluster v start vol
Starting volume vol has been successful

[06/12/12 - 08:25:57 root@AFR-Server1 ~]# gluster v status
Status of volume: vol
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.16.159.184:/export_b1/dir1			24009	Y	16069
Brick 10.16.159.188:/export_b1/dir1			24009	Y	10729
Brick 10.16.159.196:/export_b1/dir1			24009	Y	25019
NFS Server on localhost					38467	Y	16075
Self-heal Daemon on localhost				N/A	Y	16081
NFS Server on 10.16.159.188				38467	Y	10735
Self-heal Daemon on 10.16.159.188			N/A	Y	10740
NFS Server on 10.16.159.196				38467	Y	25024
Self-heal Daemon on 10.16.159.196			N/A	Y	25031
 
Mount Output:-
--------------

[06/12/12 - 08:25:49 root@ARF-Client1 ~]# mount -t glusterfs 10.16.159.184:/vol /mnt/gfsc1
[06/12/12 - 08:26:06 root@ARF-Client1 ~]# cd /mnt/gfsc1
[06/12/12 - 08:26:11 root@ARF-Client1 gfsc1]# ls
file1
[06/12/12 - 08:26:13 root@ARF-Client1 gfsc1]# ls -lh file1
-rw-r--r--. 1 qa root 17 Jun 12 08:25 file1
[06/12/12 - 08:26:19 root@ARF-Client1 gfsc1]# stat file1
  File: `file1'
  Size: 17        	Blocks: 8          IO Block: 131072 regular file
Device: 15h/21d	Inode: 12957359811459890708  Links: 1
Access: (0644/-rw-r--r--)  Uid: (  501/      qa)   Gid: (    0/    root)
Access: 2012-06-12 08:26:19.226629936 -0400
Modify: 2012-06-12 08:25:32.996155593 -0400
Change: 2012-06-12 08:26:13.824264595 -0400
[06/12/12 - 08:26:23 root@ARF-Client1 gfsc1]# cat file1
cat: file1: Input/output error
[06/12/12 - 08:26:32 root@ARF-Client1 gfsc1]# rm file1
rm: remove regular file `file1'? y
[06/12/12 - 08:28:14 root@ARF-Client1 gfsc1]# ls


Brick1 xattrs:-
-------------

[06/12/12 - 08:26:00 root@AFR-Server1 ~]# getfattr -d -m. -ehex /export_b1/dir1/file1 
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/file1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x6dc5370120d14a0bb3d1ca18f4fd3e14

[06/12/12 - 08:26:38 root@AFR-Server1 ~]# getfattr -d -m. -ehex /export_b1/dir1/
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x0c1cf7baabd947daaba0379776511854

[06/12/12 - 08:26:41 root@AFR-Server1 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 2 root root 17 Jun 12 08:23 /export_b1/dir1/file1

[06/12/12 - 08:27:49 root@AFR-Server1 ~]# cat /export_b1/dir1/file1
Data From Brick1

Brick2 xattrs:-
---------------

[06/12/12 - 08:25:41 root@AFR-Server2 ~]# getfattr -d -m. -ehex /export_b1/dir1/file1
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/file1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x6dc5370120d14a0bb3d1ca18f4fd3e14


[06/12/12 - 08:26:55 root@AFR-Server2 ~]# getfattr -d -m. -ehex /export_b1/dir1/
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x0c1cf7baabd947daaba0379776511854

[06/12/12 - 08:27:04 root@AFR-Server2 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 2 qa root 17 Jun 12 08:25 /export_b1/dir1/file1

[06/12/12 - 08:27:44 root@AFR-Server2 ~]# cat /export_b1/dir1/file1
Data From Brick2


Brick3 xattrs:-
---------------

[06/12/12 - 08:27:16 root@AFR-Server3 ~]# getfattr -d -m. -ehex /export_b1/dir1/file1 
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/file1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x6dc5370120d14a0bb3d1ca18f4fd3e14

[06/12/12 - 08:27:22 root@AFR-Server3 ~]# getfattr -d -m. -ehex /export_b1/dir1/
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x0c1cf7baabd947daaba0379776511854

[06/12/12 - 08:27:25 root@AFR-Server3 ~]# ls -lh /export_b1/dir1/file1
-rw-r--r--. 2 root root 2.0K Jun 12 08:24 /export_b1/dir1/file1
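
The hex values in the getfattr output above can be decoded with a short script. This is a sketch that assumes the "getfattr -d -m. -ehex" text format shown in this comment. The `parse_xattrs` helper is a hypothetical name, and treating `trusted.afr.` as the prefix of the AFR changelog xattrs is an assumption about GlusterFS naming, not something stated in this report.

```python
import uuid

# Brick1 output for file1, copied from the transcript above.
GETFATTR_OUTPUT = """\
# file: export_b1/dir1/file1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x6dc5370120d14a0bb3d1ca18f4fd3e14
"""


def parse_xattrs(text):
    """Parse 'name=0x<hex>' lines into a dict mapping name to raw bytes."""
    xattrs = {}
    for line in text.splitlines():
        if line.startswith("#") or "=" not in line:
            continue  # skip the '# file:' header and blank lines
        name, _, value = line.partition("=")
        if value.startswith("0x"):
            xattrs[name] = bytes.fromhex(value[2:])
        else:
            xattrs[name] = value.encode()
    return xattrs


xattrs = parse_xattrs(GETFATTR_OUTPUT)
gfid = uuid.UUID(bytes=xattrs["trusted.gfid"])               # 16 raw bytes as a UUID
selinux = xattrs["security.selinux"].rstrip(b"\0").decode()  # NUL-terminated label
has_changelog = any(name.startswith("trusted.afr.") for name in xattrs)

print(gfid)           # 6dc53701-20d1-4a0b-b3d1-ca18f4fd3e14
print(selinux)        # unconfined_u:object_r:file_t:s0
print(has_changelog)  # False
```

The same GFID appears on all three bricks in the output above, and none of the listings shows a changelog xattr.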

Comment 6 Shwetha Panduranga 2012-06-13 07:16:05 UTC
Verified the bug on 3.3.0qa45.