Red Hat Bugzilla – Bug 689127
abysmal performance using btrfs for VM storage
Last modified: 2015-05-17 21:40:00 EDT
I loaded Fedora 15 on my laptop using btrfs. I have a single Windows XP virtual kvm guest. Prior to the load, I had been using a raw LVM volume for the disk image, but I converted that to a plain file on my btrfs volume.
I read Josef Bacik's advice here, about setting "Cache" to "None" in virt-manager:
The first problem I encountered was that setting "Cache" to "None" causes the Windows XP virtio drivers to BSOD:
The second problem I encountered was that the I/O performance of the virtual guest (when using a disk image stored on btrfs) was abysmal to the point of being unusable.
Using an external USB drive, I tested three different VM backing stores: a plain file stored on a btrfs filesystem, a plain file stored on an ext4 filesystem, and a raw LVM volume.
Here are some rough performance results:
action            LVM    ext4   btrfs
boot              30s    30s    2 minutes
login             20s    20s    5 minutes
open Start menu   <1s    <1s    40s
shutdown          40s    40s    4 minutes
While I was waiting for the actions above to execute, btrfs was thrashing the disk the entire time, to the point that disk I/O was painfully slow for the host as well. For the ext4 versus btrfs test, the VM configuration was identical; the only difference was where the disk image file was located.
I assert that something is horribly wrong with btrfs.
I need the ability to create and run usable libvirt guests on my laptop, and I can't carry an external USB drive around with me just to use for guest backing store. However, as a courtesy, I'll keep btrfs around if someone wants to investigate this problem.
It may or may not be relevant, but I have multiple subvolumes in my btrfs filesystem, and the disk image file is in one of the subvolumes. I have tested compress=zlib and compress=lzo mounts for this filesystem, but currently, I'm not enabling compression at mount time.
Ok I don't quite understand. You say that Cache=None will break Windows, so you don't use it, but when you don't use Cache=None the performance is abysmal?
When I am using btrfs to store the guest's disk image:
- If I use Cache=None with the image configured as an IDE disk, performance is abysmal.
- If I use Cache=Default with the image configured as a virtio disk, performance is abysmal.
- If I use Cache=None with the image configured as a virtio disk, the Windows virtual guest crashes.
The scenario I want to test is how the performance of (btrfs backing store, Cache=None, virtio disk) contrasts with (LVM raw volume, Cache=Default, virtio disk). But I can't, due to the incompatibility issue with Cache=None and virtio disks.
I am asserting that this is a problem with btrfs because performance with ext4 as the backing store is similar to performance with a raw LVM volume, regardless of the Cache setting. So one could argue that this is a performance regression of btrfs relative to ext4.
OK, can you reproduce this with a Linux guest? I need actual numbers so I have an idea of how crappy we are. I'm going to set up a tracker bug for btrfs issues that need to be addressed by F16, and I'll throw this one on there so I can be sure it gets addressed. Thanks.
Actually I just realized you were using compression, which forces us to bypass all of the DIO stuff and fall back on buffered IO, which will be ridiculously slow. So can you delete the image and mount your fs without compression on and re-create the image and try again and see if you still get crappy performance?
Actually, re-creating the image was one of the first things I did after I remounted without compression, because I suspected that the compression feature would have fragmented the image.
I used fallocate(1) to create a new image, and dd to copy the old image to it:
$ grep compress /proc/mounts || echo compression disabled
$ fallocate -n -l 8589934592 new.img
$ dd if=current.img conv=notrunc bs=16M of=new.img
$ mv new.img current.img
The performance is still hideous, though.
I'll attempt to reproduce the poor performance with a Linux guest.
Another thing to try would be to mount your btrfs fs with nodatacow and see if it's just COW that's biting you.
OK, never mind. I finally got some time to do a basic test, and for small I/Os (which qemu tends to send down), we _really_ _really_ suck compared to ext4. Doing a 40 MB dd, 4 KB at a time, I get about 6 MB/s on btrfs; on ext4 it's 16 MB/s. So there is definitely room for improvement, and I will now spend some time getting this to suck less.
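A rough sketch of that kind of small-write test, for anyone who wants to reproduce the comparison. The exact dd invocation isn't given above, so the `conv=fsync` flag, the function name `small_write_test`, and the scratch paths in the usage note are all assumptions.

```shell
#!/bin/sh
# small_write_test FILE  (hypothetical helper)
# Writes 40 MiB to FILE in 4 KiB blocks and lets dd report throughput.
small_write_test() {
    target=$1
    # 10240 blocks x 4 KiB = 40 MiB.
    # conv=fsync makes dd flush the data to disk before it prints its
    # summary, so the page cache does not hide the filesystem's cost.
    # (Assumption: the flags used in the original test are not stated.)
    dd if=/dev/zero of="$target" bs=4k count=10240 conv=fsync
}
```

Run it once per filesystem, e.g. `small_write_test /mnt/btrfs/ddtest.img` and then `small_write_test /mnt/ext4/ddtest.img` (hypothetical mount points), and compare the rates dd prints on stderr.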
Good luck with that. Seriously.
In terms of VM backing store performance, ext4 sets the bar very high: ext4 is almost as fast as using LVM block devices directly. Considering the popularity of virtualization, btrfs needs to really close that gap if it's going to supersede LVM and/or ext4.
We are getting closer but aren't completely there yet. The problem is we do _a lot_ of memory allocations for keeping track of various things, and although I've gotten us close to ext4, we aren't there yet.
In the meantime we've already decided to make btrfs the default for F16? That can't be good...
btrfs—as it exists in F15, at least—is *not* ready to become the default filesystem.
Josef, I'd be willing to test custom kernels with more recent btrfs code, so long as nothing breaks on-disk compatibility with the btrfs snapshot in the F15 kernels.
-rc3 has all of my performance fixes up to this point for this particular problem. Can you try it out and let me know how I did? In my test we're still lagging behind ext4, but these aren't real world tests so we may be better or worse.
(In reply to comment #12)
> but these aren't real world tests so we may be better or worse.
What do you mean by real world tests?
Actually running a vm on btrfs.
Just came across this as I'm experiencing the same issues on a completely up-to-date F15 system.
I'm going to try the ext4 workaround to verify that btrfs is the culprit.
I can add:
Huge gobs of I/O writes are generated against the backing device when using a qemu raw file disk image stored on btrfs.
iostat and vmstat show about 40-80MB/s.
After creating an ext4 LV on the same disk and moving the files there, things are back to normal.
Is this still happening with the 2.6.43/3.3 F15/F16 updates?
Josef, should this be moved to rawhide until it's finally resolved?
Yeah, still slowly working on it.
I've noticed some improvement, but not where it ought to be.
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.
(As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)
More information and reason for this action is here:
Is this still a problem with 3.9 based F19 kernels?
AFAIK, it is. But there are ways to deal with this. You can create an empty file with the same permissions and security context as the original (use chmod --reference, chown --reference and chcon --reference to accomplish this), then chattr +C the file. Then 'cat original >empty' to copy the contents. Then remove the original and 'mv empty original'.
This gives you a VM image whose data btrfs will not copy-on-write, which largely solves the performance issue.
It also works on the other major source of this problem, MySQL database files and similar kinds of entities where the file is large, there is significant internal structure, and the software using the file has its own transaction mechanism.
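The steps above could be scripted roughly as follows. This is only a sketch: the function name `make_nocow_copy` and the temporary file name are made up for illustration, and it assumes the image is not in use while it is being copied.

```shell
#!/bin/sh
# make_nocow_copy ORIGINAL  (hypothetical helper)
# Replaces ORIGINAL with a copy whose data was written after +C was set.
make_nocow_copy() {
    orig=$1
    tmp="$orig.nocow.$$"      # hypothetical temporary name

    # +C only takes effect on a file that has no data yet, so start empty.
    : > "$tmp" || return 1

    # Carry over permissions, ownership and SELinux context.
    chmod --reference="$orig" "$tmp"
    chown --reference="$orig" "$tmp"
    # chcon fails where SELinux is not in use; treat that as non-fatal.
    chcon --reference="$orig" "$tmp" 2>/dev/null || true

    # chattr +C fails on filesystems that do not support the attribute.
    chattr +C "$tmp" 2>/dev/null ||
        echo "warning: could not set +C on $tmp" >&2

    # Copy the contents, then move the copy over the original.
    cat "$orig" > "$tmp" && mv "$tmp" "$orig"
}
```

Afterwards, `lsattr` on the file should list the `C` attribute when the file lives on btrfs.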
Or chattr +C the directory in which database tables and VM images will be stored:
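A minimal sketch of the directory-level approach. The function name `make_nocow_dir` and the path in the usage note are hypothetical. Note that only files created in the directory after the attribute is set inherit it; files already there are not changed.

```shell
#!/bin/sh
# make_nocow_dir DIR  (hypothetical helper)
# Creates DIR if needed and marks it +C so new files inherit the attribute.
make_nocow_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    if chattr +C "$dir" 2>/dev/null; then
        # Show the directory's own attributes; the list should include 'C'.
        lsattr -d "$dir"
    else
        echo "warning: $dir does not support chattr +C" >&2
    fi
}
```

For example, `make_nocow_dir /var/lib/libvirt/images/nocow` (a hypothetical location), then create or copy the images into it.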
Since I reported this bug, I've just been using LVM logical volumes as raw block devices for VM disk images, which works well. (And I can't get rid of LVM entirely anyway, as the blasted installer STILL can't install to a preexisting btrfs filesystem; see bug 921757.)
But when F19 goes alpha, I'll give this a whirl again, using the "chattr +C" trick on the parent directory.
To follow up on Eric's and Michel's suggestions, as a PSA: if you're going to [re]create any largish file, you should use fallocate(1) instead of cat. E.g.:
$ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
$ dd if=OLDFILE conv=notrunc of=NEWFILE && rm OLDFILE
Creating a file with fallocate() lets the filesystem perform the allocation, but also lets the filesystem know in advance precisely how big the file is going to be, so it can optimize the allocation accordingly.
Especially when creating new (blank) disk images for use in a VM, using fallocate is blazingly fast. For example, here's how to create a 16 GiB VM image in a fraction of a second:
$ time fallocate -l $((16*1024*1024*1024)) vm.img
$ ls -s vm.img
(In reply to comment #24)
> $ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
> $ dd if=OLDFILE conv=notrunc of=NEWFILE && rm OLDFILE
You are completely correct. But in the interests of coming up with solutions that are ever so slightly more optimal:
$ fallocate -n -l $(stat -c '%s' OLDFILE) NEWFILE
Re: comment 25: yes, that's simpler.
Someone sanity-check me, please: enabling compression for VM image files will result in poor performance (per Josef in comment 4). But a file that was created with no COW will also be exempt from compression, right? So in terms of avoiding things that kill performance for VM images, creating the image with the "C" attribute avoids both COW and compression, right?
This bug is being closed with INSUFFICIENT_DATA as there has not been a
response in 2 weeks. If you are still experiencing this issue,
please reopen and attach the relevant data from the latest kernel you are
running and any data that might have been requested previously.
I re-tested VM performance with a VM image created via fallocate(1) in a btrfs directory that first had the "C" attribute set on it. The performance was actually very close to LVM performance.
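For reference, the setup for that re-test can be sketched like this. The helper name `new_vm_image` and its arguments are hypothetical, and it assumes the directory already had +C set before the image is created.

```shell
#!/bin/sh
# new_vm_image DIR NAME SIZE_BYTES  (hypothetical helper)
# Preallocates DIR/NAME and reports whether the No_COW attribute applies.
new_vm_image() {
    dir=$1 name=$2 size=$3

    # A file only inherits +C if the directory had it before the file
    # was created, so check the directory first.
    if ! lsattr -d "$dir" 2>/dev/null | cut -d' ' -f1 | grep -q C; then
        echo "warning: $dir does not have the C attribute set" >&2
    fi

    # Preallocate the image so the filesystem knows its final size.
    fallocate -l "$size" "$dir/$name" || return 1

    # On btrfs the new file's attribute list should include 'C'.
    lsattr "$dir/$name" 2>/dev/null || true
}
```

For example, `new_vm_image /path/to/nocow-dir vm.img $((16*1024*1024*1024))` creates a 16 GiB image.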
I think for now, there's no reason to leave this bug open. Josef et al. are aware of the remaining performance issues and working to address them moving forward.
The "chattr +C" work-around needs to be communicated far and wide, though.
This breaks snapshots, right?
In response to comment 29:
Based on my read of the btrfs wiki, and my own experience, snapshots override the +C/nocow option.
You'll still take a performance hit whenever you modify either the source or the snapshot and trigger the COW. But once you take that initial hit, btrfs will respect the +C attribute and not continue to perform COW for the blocks that have already been duplicated.
At least that's my understanding. Josef?
If you want to avoid the COW performance hit for VM images in snapshots, you'll need to create a new copy of the VM image file, so that it won't share any data blocks with any other snapshots. (See comment 25.)
Alternatively, if you don't need multiple copies of the VM image file, simply delete the file in all but one of the snapshots.