Bug 689127
Summary: | abysmal performance using btrfs for VM storage | | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | James Ralston <ralston>
Component: | kernel | Assignee: | Zach Brown <zab>
Status: | CLOSED DEFERRED | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 19 | CC: | agk, amit.shah, blair, bugzilla, cyberwizzard+redhat, eric-bugs2, erik-fedora, fche, fedoraproject, gansalmon, gholms, gmazyland, itamar, jakobunt, jdulaney, jeremy, jforbes, jonathan, kagesenshi.87, kchamart, kernel-maint, k.georgiou, kxra, lightdot, luke, madhu.chinakonda, me, michel, ms, naoki, oliver.henshaw, pbrobinson, pmrpla, rjones, sjensen, sweil, sysoutfran, tez, tmraz
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-04-23 17:24:59 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 689509 | |
Description
James Ralston
2011-03-19 17:42:18 UTC
Ok, I don't quite understand. You say that Cache=None will break Windows, so you don't use it, but when you don't use Cache=None the performance is abysmal?

When I am using btrfs to store the guest's disk image:

- If I use Cache=None with the image configured as an IDE disk, performance is abysmal.
- If I use Cache=Default with the image configured as a virtio disk, performance is abysmal.
- If I use Cache=None with the image configured as a virtio disk, the Windows virtual guest crashes.

The scenario I want to test is how the performance of (btrfs backing store, Cache=None, virtio disk) contrasts with (LVM raw volume, Cache=Default, virtio disk). But I can't, due to the incompatibility issue with Cache=None and virtio disks.

The reason I am asserting that this is a problem with btrfs is that the performance when using ext4 as a backing store is similar to the performance of using a raw LVM volume, regardless of the Cache setting. So, one could argue that this is a performance regression of btrfs relative to ext4.

Ok, can you reproduce this with a Linux guest? I need to have actual numbers so I have an idea of how crappy we are. I'm going to set up a tracker bug for btrfs issues that need to be addressed by F16 and I'll throw this one on there so I can be sure it gets addressed. Thanks.

Actually, I just realized you were using compression, which forces us to bypass all of the DIO stuff and fall back on buffered IO, which will be ridiculously slow. So can you delete the image, mount your fs without compression, re-create the image, and try again to see if you still get crappy performance?

Actually, re-creating the image was one of the first things I did after I remounted without compression, because I suspected that the compression feature would have fragmented the image. I used fallocate(1) to create a new image, and dd to copy the old image to it:

$ grep compress /proc/mounts || echo compression disabled
compression disabled
$ fallocate -n -l 8589934592 new.img
$ dd if=current.img conv=notrunc bs=16M of=new.img
$ mv new.img current.img

The performance is still hideous, though. I'll attempt to reproduce the poor performance with a Linux guest.

Another thing to try would be to mount your btrfs fs as nodatacow and see if it's just COW that's biting you.

Ok, never mind; I finally got some time to do a basic test, and for small I/Os (which qemu tends to send down), we _really_ _really_ suck compared to ext4. Doing a 40 MB dd, 4k at a time, on btrfs I get about 6 MB/s; on ext4 it's 16 MB/s. So there is definitely room for improvement, and I will now spend some time getting this to suck less.

Good luck with that. Seriously. In terms of VM backing store performance, ext4 sets the bar very high: ext4 is almost as fast as using LVM block devices directly. Considering the popularity of virtualization, btrfs needs to really close that gap if it's going to supersede LVM and/or ext4.

We are getting closer but aren't completely there yet. The problem is we do _a lot_ of memory allocations for keeping track of various things, and although I've gotten us close to ext4, we aren't there yet.

In the meantime we've already decided to make btrfs the default for F16? That can't be good... btrfs, as it exists in F15 at least, is *not* ready to become the default filesystem. Josef, I'd be willing to test custom kernels with more recent btrfs code, so long as nothing breaks on-disk compatibility with the btrfs snapshot in the F15 kernels.
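For anyone who wants to reproduce the kind of small-I/O comparison described above, a minimal sketch along these lines should expose the same gap. The mount points are placeholders, not taken from this bug, and oflag=direct is used here only to approximate qemu's cache=none O_DIRECT path:

$ dd if=/dev/zero of=/mnt/btrfs/test.img bs=4k count=10240 oflag=direct   # /mnt/btrfs is a placeholder mount point
$ dd if=/dev/zero of=/mnt/ext4/test.img bs=4k count=10240 oflag=direct    # /mnt/ext4 is a placeholder mount point

bs=4k count=10240 writes 40 MiB in 4 KiB requests, and dd prints the resulting throughput for each run. Remounting the btrfs filesystem with -o nodatacow before repeating the test is one way to check whether COW alone accounts for the difference.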
-rc3 has all of my performance fixes up to this point for this particular problem. Can you try it out and let me know how I did? In my test we're still lagging behind ext4, but these aren't real world tests so we may be better or worse.

(In reply to comment #12)
<snip>
> but these aren't real world tests so we may be better or worse.

What do you mean by real world tests?

Actually running a VM on btrfs.

Just came across this, as I'm experiencing the same issues in a completely up-to-date F15 system. I'm going to try the ext4 workaround to verify that btrfs is the culprit. I can add: on btrfs, huge gobs of I/O writes are being generated against the backing device when using a qemu raw file disk image stored on btrfs; iostat and vmstat show about 40-80 MB/s. After creating an ext4 LV on the same disk and moving the files there, everything is back to normal.

Is this still happening with the 2.6.43/3.3 F15/F16 updates? Josef, should this be moved to rawhide until it's finally resolved?

Yeah, still slowly working on it. I've noticed some improvement, but it's not where it ought to be.

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and the reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Is this still a problem with 3.9-based F19 kernels?

AFAIK, it is. But there are ways to deal with this. You can create an empty file with the same permissions and security context as the original (use chmod --reference, chown --reference, and chcon --reference to accomplish this), then chattr +C on the file. Then 'cat original > empty' to copy the contents. Then remove the original and 'mv empty original'. This will give you a VM image whose data btrfs will not apply copy-on-write to, which will largely solve the performance issue. It also works on the other major source of this problem: MySQL database files and similar kinds of entities where the file is large, there is significant internal structure, and the software using the file has its own transaction mechanism.

Or chattr +C the directory in which database tables and VM images will be stored: https://wiki.archlinux.org/index.php/Btrfs#Copy-On-Write_.28CoW.29

Since I reported this bug, I've just been using LVM logical volumes as raw block devices for VM disk images, which works well. (And I can't get rid of LVM entirely anyway, as the blasted installer STILL can't install to a preexisting btrfs filesystem; see bug 921757.) But when F19 goes alpha, I'll give this a whirl again, using the "chattr +C" trick on the parent directory.

To follow up on Eric's and Michel's suggestions, as a PSA: if you're going to [re]create any largish file, you should use fallocate(1) instead of cat. E.g.:

$ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
$ dd if=OLDFILE conv=notrunc,append of=NEWFILE && rm OLDFILE

Creating a file with fallocate() lets the filesystem perform the allocation, but also lets the filesystem know in advance precisely how big the file is going to be, so it can optimize the allocation accordingly. Especially when creating new (blank) disk images for use in a VM, using fallocate is blazingly fast.
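As a concrete illustration of the no-CoW recreation procedure described above, a minimal sketch might look like this. The image name is a placeholder, the chown/chcon steps may require root, and chattr +C must be applied while the new file is still empty for btrfs to honor it:

$ touch empty.img                              # empty.img / original.img are placeholder names
$ chmod --reference=original.img empty.img
$ chown --reference=original.img empty.img
$ chcon --reference=original.img empty.img
$ chattr +C empty.img                          # set No_COW while the file is still zero-length
$ cat original.img > empty.img
$ rm original.img && mv empty.img original.img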
For example, here's how to create a 16 GiB VM image in a fraction of a second:

$ time fallocate -l $[16*1024*1024*1024] vm.img

real    0m0.023s
user    0m0.001s
sys     0m0.022s

$ ls -s vm.img
16777216 vm.img

(In reply to comment #24)
> $ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
> $ dd if=OLDFILE conv=notrunc,append of=NEWFILE && rm OLDFILE

You are completely correct. But in the interests of coming up with solutions that are ever so slightly more optimal:

$ fallocate -n -l $(stat -c '%s' OLDFILE) NEWFILE

Re: comment 25: yes, that's simpler.

Someone sanity-check me, please: enabling compression for VM image files will result in poor performance (per Josef in comment 4). But a file that was created with no COW will also be exempt from compression, right? So in terms of avoiding things that kill performance for VM images, creating the image with the "C" attribute avoids both COW and compression, right?

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

I re-tested VM performance with a VM image created via fallocate(1) in a btrfs directory that first had the "C" attribute set on it. The performance was actually very close to LVM performance. I think for now there's no reason to leave this bug open. Josef et al. are aware of the remaining performance issues and are working to address them moving forward. The "chattr +C" work-around needs to be communicated far and wide, though.

This breaks snapshots, right?

In response to comment 29: Based on my reading of the btrfs wiki, and my own experience, snapshots override the +C/nocow option. You'll still take a performance hit whenever you modify either the source or the snapshot and trigger the COW. But once you take that initial hit, btrfs will respect the +C attribute and not continue to perform COW for the blocks that have already been duplicated. At least that's my understanding. Josef?

If you want to avoid the COW performance hit for VM images in snapshots, you'll need to create a new copy of the VM image file, so that it won't share any data blocks with any other snapshots. (See comment 25.) Alternatively, if you don't need multiple copies of the VM image file, simply delete the file in all but one of the snapshots.
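For anyone wanting to apply the directory-level work-around discussed above, here is a minimal sketch; the directory path and image name are placeholders, and files only inherit the C attribute if they are created after it has been set on the directory:

$ mkdir -p /srv/vmimages                                      # placeholder path for the image directory
$ chattr +C /srv/vmimages
$ lsattr -d /srv/vmimages                                     # verify the 'C' flag on the directory
$ fallocate -l $((16*1024*1024*1024)) /srv/vmimages/guest.img
$ lsattr /srv/vmimages/guest.img                              # the new image should show 'C' as well

Note that an existing image moved into the directory from elsewhere on the same filesystem keeps its old attributes, which is why the recreation procedure sketched earlier is still relevant for existing files.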