Bug 842254 - performance problem with VMs and replicate
Summary: performance problem with VMs and replicate
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Brian Foster
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 853680 858495
 
Reported: 2012-07-23 09:00 UTC by Pranith Kumar K
Modified: 2015-08-06 13:05 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 853680
Environment:
Last Closed: 2013-07-24 17:58:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Pranith Kumar K 2012-07-23 09:00:37 UTC
Description of problem:
Update on our progress with using KVM & Gluster:

We built a two-server (Dell R710) cluster; each box has...
 a 5 x 500 GB SATA RAID5 array (software RAID)
 an Intel 10 GbE HBA
 8 GB RAM in one box, 48 GB in the other
 2 x E5520 Xeon CPUs in both
 CentOS 6.3 installed
 Gluster 3.3 installed from the RPMs on the Gluster site


1) create a replicated gluster volume (on top of XFS) - see the volume-creation sketch after this list
2) set up qemu/kvm with a gluster volume (mounts localhost:/gluster-vol)
3) configure sanlock (this is evil!)
4) build a virtual machine with a 30 GB qcow2 image and 1 GB RAM
5) clone this VM into 4 machines
6) check that live migration works (OK)
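
For reference, a replica-2 volume of the kind described in step 1 is typically created and mounted along these lines; the server names, brick paths, and mount point below are placeholders, only the volume name (gluster-vol) comes from this report:

  # on one server: create and start a 2-way replicated volume backed by XFS bricks
  gluster volume create gluster-vol replica 2 server1:/bricks/vm-store server2:/bricks/vm-store
  gluster volume start gluster-vol

  # on each hypervisor: mount the volume locally so qemu/kvm can use it for images
  mount -t glusterfs localhost:/gluster-vol /var/lib/libvirt/images/gluster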

Start basic test cycle:
a) migrate all machines to host #1, then reboot host #2
b) watch logs for self-heal to complete (heal state can also be queried as shown after this list)
c) migrate VMs to host #2, reboot host #1
d) check logs for self-heal
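
Self-heal progress in steps (b) and (d) can also be checked directly with the heal commands that shipped in 3.3, rather than only watching the logs; the volume name matches the mount above:

  gluster volume heal gluster-vol info            # entries still pending self-heal
  gluster volume heal gluster-vol info healed     # entries healed recently
  tail -f /var/log/glusterfs/glustershd.log       # self-heal daemon log on each server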

The above cycle can be repeated numerous times, and completes without error, provided that no (or little) load is on the VM.


If I give the VMs a workload, such as running "bonnie++" on each VM, things start to break:
1) it becomes almost impossible to log in to each VM
2) the kernel on each VM starts logging hung-task timeout warnings (the messages that mention "echo 0 > /proc/sys/kernel/hung_task_timeout_secs")
3) top / uptime on the hosts shows a load average of up to 24
4) dd write speed (1K block size) to gluster is around 3 MB/s on the host - a sketch of that measurement follows this list
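
The dd figure in (4) was presumably taken with a small-block sequential write onto the mounted volume; a command along these lines reproduces that kind of measurement (the mount point and file name are placeholders):

  dd if=/dev/zero of=/mnt/gluster-vol/ddtest bs=1K count=100000 conv=fdatasync

The conv=fdatasync flag makes dd flush the file before reporting a rate, so the number reflects what actually reached the bricks rather than the page cache.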


While I agree that running bonnie++ on four VMs at once is possibly unfair, quiet machines also see load spikes (yum updates etc.). I suspect that the I/O of one VM starts blocking that of another, and the pressure builds up rapidly on gluster, which does not seem to cope well with it. Possibly this is down to the access pattern / block size of qcow2 disks?

I'm (slightly) disappointed.

Though it doesn't corrupt data, the I/O performance is < 1% of my hardware's capability. Hopefully work on buffering and other tuning will fix this? Or maybe the work mentioned on getting qemu to talk directly to gluster will fix this? (a sketch of that approach follows)
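
The qemu-to-gluster work referred to above is the native gluster block driver (libgfapi) that appeared in later QEMU releases; with it an image is addressed by a gluster:// URL and the FUSE mount is bypassed entirely. A sketch, with the server name and image path as placeholders:

  qemu-img create -f qcow2 gluster://server1/gluster-vol/vm1.qcow2 30G
  qemu-system-x86_64 ... -drive file=gluster://server1/gluster-vol/vm1.qcow2,if=virtio,cache=none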

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 yin.yin 2012-07-28 06:42:16 UTC
I have the same problem.

Comment 2 koungho 2012-08-23 09:44:24 UTC
Have the same problem, too. The GlusterFS self-heal process blocks the VM host and guests.

Comment 3 Brian Foster 2012-12-12 21:34:51 UTC
http://review.gluster.org/4119

Comment 4 Justin Clift 2013-03-11 01:58:42 UTC
As a thought, since Brian's patch has been merged, it would be interesting to hear if the problem is now solved.
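
For anyone retesting, the running version on both servers can be confirmed before repeating the cycle above; these are standard commands, with the volume name taken from the original report:

  glusterfs --version
  gluster volume info gluster-vol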

Comment 5 Jules 2015-08-06 10:01:22 UTC
This bug still exists in the latest glusterfs-server 3.6.4-1 release.

