Bug 764301 (GLUSTER-2569)

Summary: Peer briefly unreachable in the middle of a data copy forces a heal at the end of the copy
Product: [Community] GlusterFS
Reporter: raf <milanraf>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: medium
Version: 3.1.3
CC: gluster-bugs, pkarampu
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Mount Type: fuse

Description raf 2011-03-21 10:16:56 UTC
Environment:
[192.168.0.1]#gluster volume create san replica 2 transport tcp 192.168.0.1:/var/gluster 192.168.0.2:/var/gluster
[192.168.0.1]#gluster volume start san
[192.168.0.1]#gluster volume set san network.ping-timeout 10
[192.168.0.3]#mount -t glusterfs 192.168.0.1:/san /mnt/gluster
Share /mnt/gluster using Samba.
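
For reference, the export might look like the following minimal Samba share definition (a sketch; the share name "san" and the guest-access setting are assumptions, since the report only states that the FUSE mount is shared via Samba):

  # /etc/samba/smb.conf (fragment)
  # The share name "san" and guest access are assumptions; the report
  # only says /mnt/gluster (the GlusterFS FUSE mount) is exported.
  [san]
      path = /mnt/gluster
      read only = no
      guest ok = yes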

Start copying two files of about 2 GB each from a Windows XP client.

In the middle of the first file's copy, run:
[192.168.0.1]#killall glusterd glusterfsd glusterfs

Wait about 10 seconds (the copy of the first file is still running) and enter:
[192.168.0.1]#service glusterd start
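
Putting the timing in one place, the server-side steps amount to the following sketch (the sleep stands in for the manual wait; note it matches the network.ping-timeout of 10 seconds set above, which is long enough for the client to declare the brick down and continue with the surviving replica alone):

[192.168.0.1]#killall glusterd glusterfsd glusterfs
[192.168.0.1]#sleep 10
[192.168.0.1]#service glusterd start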

The peer's absence did not affect the data copy in any way (it kept running smoothly).
Now the peer is back, but the first file has only been written on 192.168.0.2 (this is normal behaviour).
When the copy of the first file finishes, before the second one starts, a self-heal of the first file is somehow forced, and the second copy is postponed until the self-heal completes.
It seems that the file-close operation after the data copy triggers the self-heal process.
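
One way to observe this pending state on the backend brick is via replicate's changelog extended attributes (a sketch; the file name file1.dat is hypothetical):

[192.168.0.2]#getfattr -d -m trusted.afr -e hex /var/gluster/file1.dat

Non-zero trusted.afr.san-client-0 counters on the surviving brick mark operations that still have to be healed onto 192.168.0.1 once it returns; the heal triggered at file close is what clears them.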

I think this represents a serious limitation of GlusterFS performance.
Isn't the self-heal process supposed to run in the background with minimal use of system resources?

Comment 1 Pranith Kumar K 2011-03-22 03:47:49 UTC
With the current design it is supposed to work this way, but there is a plan to implement what you have suggested, so it is not a bug but a feature enhancement.

Comment 2 Pranith Kumar K 2011-08-23 03:37:48 UTC
The design change brought in as part of 3182 fixes this.