Bug 765234 (GLUSTER-3502)

Summary: Self-healing causes KVM VMs to hard reboot and causes IO bottlenecks
Product: [Community] GlusterFS
Component: replicate
Version: 3.3-beta
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: jblawn
Assignee: Pranith Kumar K <pkarampu>
CC: gluster-bugs, jdarcy
Mount Type: fuse
Doc Type: Bug Fix

Attachments:
dmesg - failed node

Description Pranith Kumar K 2011-09-02 00:16:33 UTC
Hi Jeremy,
     Could you please set self-heal-window-size to 1, set diagnostics.client-log-level and diagnostics.brick-log-level to DEBUG, re-run the same test, and provide all the log files from the client and brick machines along with the dmesg output from the client machine?
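
For reference, those options can be set with the gluster CLI along these lines (a sketch; the volume name test-vm is an assumption, taken from the mount output later in this bug):

    gluster volume set test-vm cluster.self-heal-window-size 1
    gluster volume set test-vm diagnostics.client-log-level DEBUG
    gluster volume set test-vm diagnostics.brick-log-level DEBUG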

Pranith

Comment 1 jblawn 2011-09-02 01:27:09 UTC
Created attachment 646

Comment 2 jblawn 2011-09-02 02:29:25 UTC
The self-healing process in 3.3beta no longer causes the VMs to become completely unresponsive due to the entire VM container (qemu image) being locked. However, when self-healing kicks off I receive the following error from qemu-kvm, which causes the VMs to hard reboot:

qemu-kvm: virtio_ioport_write: unexpected address 0x13 value 0x1

Also, IO performance is terrible: the healing process consumes nearly 100% of the IO on the disks between the two servers. I have attempted to reduce the healing IO by tuning the following:

cluster.data-self-heal-algorithm = diff
cluster.self-heal-window-size    = 4 or 8, down from 16

There was very little difference.

System config:
ArchLinux
Linux kernel 3.0.3
qemu-kvm 0.15.0-2
libvirt 0.9.4-2
gluster 3.3beta2

VMs are Windows Server 2008 with the following disk configuration:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/gluster/WinTest2.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>

Mount options for gluster client and VM filesystem:
/dev/mapper/testvg-vm on /vm type ext4 (rw,noatime,user_xattr)

test1:/test-vm on /gluster type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
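
For reference, a FUSE mount like the one above is typically created with the native glusterfs client, e.g.:

    mount -t glusterfs test1:/test-vm /gluster

(server test1 and volume test-vm taken from the mount output above).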

Please let me know if more information is needed, or if additional testing with debugging enabled would help.

Comment 3 jblawn 2011-09-02 02:31:57 UTC
Attached the failed node's dmesg; there was nothing logged to dmesg on the active node. This is a two-node cluster in which each node acts as both brick and client, self-mounting on each node for HA/live migration, etc.

With the self-heal change the IO was still high, but this time neither VM crashed. No iowrite errors were logged. Performance was still definitely impacted; is there any way to throttle this down further?

Test started at local time Sep 2 00:48. The log files are too large to attach, so here are two URLs:
http://www.cctt.org/gluster-active-node.tar.gz
http://www.cctt.org/gluster-failed-node.tar.gz

Thanks!

Comment 4 Pranith Kumar K 2011-09-02 04:26:23 UTC
(In reply to comment #3)
> Attached the failed node's dmesg; there was nothing logged to dmesg on the
> active node. This is a two-node cluster in which each node acts as both
> brick and client, self-mounting on each node for HA/live migration, etc.
> 
> With the self-heal change the IO was still high, but this time neither VM
> crashed. No iowrite errors were logged. Performance was still definitely
> impacted; is there any way to throttle this down further?
> 
> Test started at local time Sep 2 00:48. The log files are too large to
> attach, so here are two URLs:
> http://www.cctt.org/gluster-active-node.tar.gz
> http://www.cctt.org/gluster-failed-node.tar.gz
> 
> Thanks!

Hi Jeremy,
       Good to know that self-heal did not crash the VMs with a window size of 1; I made this the default in the master branch this morning. Self-heal needs to sync changes from the source node to the stale node to keep them consistent, which is the reason for the I/O you are observing. We are still working on ways to make self-heal run without affecting regular traffic, i.e. self-heal operations will happen at very low priority. I went through the logs and did not see anything out of the ordinary; you are the first to run into severe problems with the default config. Would it be possible to test the new change in your environment? That will tell us whether it needs any more improvement.
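
For reference, in the 3.3 series the files still pending self-heal can be inspected from the CLI; a sketch, with the volume name test-vm assumed from the mount output in comment 2:

    gluster volume heal test-vm info

The per-brick entry counts it reports should shrink toward zero as the sync completes.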

I can provide a build with these changes.

Pranith

Comment 5 jblawn 2011-09-02 11:04:32 UTC
Pranith,

Yes, I can test again.  I can pull the latest git and build off that.

Are there any other tunables that we should look at for storing virtual images?

Is 'cluster.data-self-heal-algorithm = diff' the preferred setting for this setup?

Thanks,

Jeremy

Comment 6 Pranith Kumar K 2011-09-02 11:26:41 UTC
(In reply to comment #5)
> Pranith,
> 
> Yes, I can test again.  I can pull the latest git and build off that.
> 
> Are there any other tunables that we should look at for storing virtual images?
> 
> Is 'cluster.data-self-heal-algorithm = diff' the preferred setting for this
> setup?
> 
> Thanks,
> 
> Jeremy

Hi Jeremy,
      I have sent the build to you by email. Please let us know your findings.

Pranith.

Comment 7 jblawn 2011-09-08 10:11:29 UTC
IO performance returned to acceptable levels after setting self-heal-window-size=1 as the default in the latest git build. There is still an issue when using virtio drivers with Windows guests during self-heal: the VM may time out and require a hard reboot. This does not occur when using a virtual IDE controller (see the sketch below).
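
As a sketch of that workaround (hypothetical, based on the disk definition in comment 2), switching the guest disk from virtio to the emulated IDE controller would look like:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/gluster/WinTest2.img'/>
      <target dev='hda' bus='ide'/>
    </disk>

The PCI address element is dropped here; libvirt assigns a drive address for IDE disks automatically.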

Comment 8 Anand Avati 2011-09-09 00:35:47 UTC
CHANGE: http://review.gluster.com/335 (Change-Id: Ib6730a708f008054fbd379889a0f6dd3b051b6ad) merged in master by Anand Avati (avati)

Comment 9 Anand Avati 2011-09-09 07:21:48 UTC
CHANGE: http://review.gluster.com/336 (Change-Id: Id8a1dffa3c3200234ad154d1749278a2d7c7021b) merged in master by Anand Avati (avati)