Bug 842254 - performance problem with VMs and replicate
Summary: performance problem with VMs and replicate
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Brian Foster
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 853680 858495
 
Reported: 2012-07-23 09:00 UTC by Pranith Kumar K
Modified: 2015-08-06 13:05 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 853680
Environment:
Last Closed: 2013-07-24 17:58:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Pranith Kumar K 2012-07-23 09:00:37 UTC
Description of problem:
Update on our progress with using KVM & Gluster:

We built a two-server (Dell R710) cluster; each box has...
 a 5 x 500 GB SATA RAID5 array (software RAID)
 an Intel 10 GbE HBA
 8 GB RAM in one box, 48 GB in the other
 2 x E5520 Xeon CPUs in both
 CentOS 6.3 installed
 Gluster 3.3 installed from the RPMs on the Gluster site


1) create a replicated gluster volume (on top of XFS) - see the volume-creation sketch after this list
2) set up qemu/kvm with a gluster volume (mounts localhost:/gluster-vol)
3) configure sanlock (this is evil!)
4) build a virtual machine with a 30 GB qcow2 image and 1 GB RAM
5) clone this VM into 4 machines
6) check that live migration works (OK)
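
For reference, a replica-2 volume of the kind described in step 1 is typically created and mounted along these lines; the server names, brick paths, and mount point below are placeholders, only the volume name (gluster-vol) comes from this report:

  # on one server: create and start a 2-way replicated volume backed by XFS bricks
  gluster volume create gluster-vol replica 2 server1:/bricks/vm-store server2:/bricks/vm-store
  gluster volume start gluster-vol

  # on each hypervisor: mount the volume locally so qemu/kvm can use it for images
  mount -t glusterfs localhost:/gluster-vol /var/lib/libvirt/images/gluster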

Start basic test cycle:
a) migrate all machines to host #1, then reboot host #2
b) watch logs for self-heal to complete (heal state can also be queried as shown after this list)
c) migrate VMs to host #2, reboot host #1
d) check logs for self-heal
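
Self-heal progress in steps (b) and (d) can also be checked directly with the heal commands that shipped in 3.3, rather than only watching the logs; the volume name matches the mount above:

  gluster volume heal gluster-vol info            # entries still pending self-heal
  gluster volume heal gluster-vol info healed     # entries healed recently
  tail -f /var/log/glusterfs/glustershd.log       # self-heal daemon log on each server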

The above cycle can be repeated numerous times, and completes without error, provided that no (or little) load is on the VM.


If I give the VMs a workload, such as running "bonnie++" on each VM, things start to break:
1) it becomes almost impossible to log in to each VM
2) the kernel on each VM starts logging hung-task timeout warnings (the messages that mention "echo 0 > /proc/sys/kernel/hung_task_timeout_secs")
3) top / uptime on the hosts shows a load average of up to 24
4) dd write speed (1K block size) to gluster is around 3 MB/s on the host - a sketch of that measurement follows this list
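
The dd figure in (4) was presumably taken with a small-block sequential write onto the mounted volume; a command along these lines reproduces that kind of measurement (the mount point and file name are placeholders):

  dd if=/dev/zero of=/mnt/gluster-vol/ddtest bs=1K count=100000 conv=fdatasync

The conv=fdatasync flag makes dd flush the file before reporting a rate, so the number reflects what actually reached the bricks rather than the page cache.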


While I agree that running bonnie++ on four VMs at once is possibly unfair, quiet machines also see load spikes (yum updates etc.). I suspect that the I/O of one VM starts blocking that of another, and the pressure builds up rapidly on gluster, which does not seem to cope well with it. Possibly this is down to the access pattern / block size of qcow2 disks?

I'm (slightly) disappointed.

Though it doesn't corrupt data, the I/O performance is < 1% of my hardware's capability. Hopefully work on buffering and other tuning will fix this? Or maybe the work mentioned on getting qemu to talk directly to gluster will fix this? (a sketch of that approach follows)
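
The qemu-to-gluster work referred to above is the native gluster block driver (libgfapi) that appeared in later QEMU releases; with it an image is addressed by a gluster:// URL and the FUSE mount is bypassed entirely. A sketch, with the server name and image path as placeholders:

  qemu-img create -f qcow2 gluster://server1/gluster-vol/vm1.qcow2 30G
  qemu-system-x86_64 ... -drive file=gluster://server1/gluster-vol/vm1.qcow2,if=virtio,cache=none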

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 yin.yin 2012-07-28 06:42:16 UTC
I have the same problem.

Comment 2 koungho 2012-08-23 09:44:24 UTC
Have the same problem, too. The GlusterFS self-heal process blocks the VM host and guests.

Comment 3 Brian Foster 2012-12-12 21:34:51 UTC
http://review.gluster.org/4119

Comment 4 Justin Clift 2013-03-11 01:58:42 UTC
As a thought, since Brian's patch has been merged, it would be interesting to hear if the problem is now solved.
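
For anyone retesting, the running version on both servers can be confirmed before repeating the cycle above; these are standard commands, with the volume name taken from the original report:

  glusterfs --version
  gluster volume info gluster-vol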

Comment 5 Jules 2015-08-06 10:01:22 UTC
This bug still exists in the latest glusterfs-server 3.6.4-1 release.

