Bug 689127
Summary: | abysmal performance using btrfs for VM storage | | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | James Ralston <ralston>
Component: | kernel | Assignee: | Zach Brown <zab>
Status: | CLOSED DEFERRED | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 19 | CC: | agk, amit.shah, blair, bugzilla, cyberwizzard+redhat, eric-bugs2, erik-fedora, fche, fedoraproject, gansalmon, gholms, gmazyland, itamar, jakobunt, jdulaney, jeremy, jforbes, jonathan, kagesenshi.87, kchamart, kernel-maint, k.georgiou, kxra, lightdot, luke, madhu.chinakonda, me, michel, ms, naoki, oliver.henshaw, pbrobinson, pmrpla, rjones, sjensen, sweil, sysoutfran, tez, tmraz
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-04-23 17:24:59 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 689509 | |
Description
James Ralston
2011-03-19 17:42:18 UTC
Ok, I don't quite understand. You say that Cache=None will break Windows, so you don't use it, but when you don't use Cache=None the performance is abysmal?

When I am using btrfs to store the guest's disk image:

- If I use Cache=None with the image configured as an IDE disk, performance is abysmal.
- If I use Cache=Default with the image configured as a virtio disk, performance is abysmal.
- If I use Cache=None with the image configured as a virtio disk, the Windows virtual guest crashes.

The scenario I want to test is how the performance of (btrfs backing store, Cache=None, virtio disk) contrasts with (LVM raw volume, Cache=Default, virtio disk). But I can't, due to the incompatibility issue with Cache=None and virtio disks.

The reason I am asserting that this is a problem with btrfs is that the performance when using ext4 as a backing store is similar to the performance of using a raw LVM volume, regardless of the Cache setting. So, one could argue that this is a performance regression of btrfs relative to ext4.

Ok, can you reproduce this with a Linux guest? I need to have actual numbers so I have an idea of how crappy we are. I'm going to set up a tracker bug for btrfs issues that need to be addressed by F16 and I'll throw this one on there so I can be sure it gets addressed. Thanks.

Actually, I just realized you were using compression, which forces us to bypass all of the DIO stuff and fall back on buffered IO, which will be ridiculously slow. So can you delete the image, mount your fs without compression, re-create the image, and try again to see if you still get crappy performance?

Actually, re-creating the image was one of the first things I did after I remounted without compression, because I suspected that the compression feature would have fragmented the image. I used fallocate(1) to create a new image, and dd to copy the old image to it:

$ grep compress /proc/mounts || echo compression disabled
compression disabled
$ fallocate -n -l 8589934592 new.img
$ dd if=current.img conv=notrunc bs=16M of=new.img
$ mv new.img current.img

The performance is still hideous, though. I'll attempt to reproduce the poor performance with a Linux guest.

Another thing to try would be to mount your btrfs fs as nodatacow and see if it's just COW that's biting you.

Ok, never mind; I finally got some time to do a basic test, and for small I/Os (which qemu tends to send down), we _really_ _really_ suck compared to ext4. Doing a 40 MB dd, 4k at a time, on btrfs I get about 6 MB/s; on ext4 it's 16 MB/s. So there is definitely room for improvement, and I will now spend some time getting this to suck less.

Good luck with that. Seriously. In terms of VM backing store performance, ext4 sets the bar very high: ext4 is almost as fast as using LVM block devices directly. Considering the popularity of virtualization, btrfs needs to really close that gap if it's going to supersede LVM and/or ext4.

We are getting closer but aren't completely there yet. The problem is we do _a lot_ of memory allocations for keeping track of various things, and although I've gotten us close to ext4, we aren't there yet.

In the meantime we've already decided to make btrfs the default for F16? That can't be good... btrfs, as it exists in F15 at least, is *not* ready to become the default filesystem. Josef, I'd be willing to test custom kernels with more recent btrfs code, so long as nothing breaks on-disk compatibility with the btrfs snapshot in the F15 kernels.
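For anyone who wants to reproduce the kind of small-I/O comparison described above, a minimal sketch along these lines should expose the same gap. The mount points are placeholders, not taken from this bug, and oflag=direct is used here only to approximate qemu's cache=none O_DIRECT path:

$ dd if=/dev/zero of=/mnt/btrfs/test.img bs=4k count=10240 oflag=direct   # /mnt/btrfs is a placeholder mount point
$ dd if=/dev/zero of=/mnt/ext4/test.img bs=4k count=10240 oflag=direct    # /mnt/ext4 is a placeholder mount point

bs=4k count=10240 writes 40 MiB in 4 KiB requests, and dd prints the resulting throughput for each run. Remounting the btrfs filesystem with -o nodatacow before repeating the test is one way to check whether COW alone accounts for the difference.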
-rc3 has all of my performance fixes up to this point for this particular problem. Can you try it out and let me know how I did? In my test we're still lagging behind ext4, but these aren't real world tests so we may be better or worse.

(In reply to comment #12)
<snip>
> but these aren't real world tests so we may be better or worse.

What do you mean by real world tests?

Actually running a VM on btrfs.

Just came across this, as I'm experiencing the same issues in a completely up-to-date F15 system. I'm going to try the ext4 workaround to verify that btrfs is the culprit. I can add: on btrfs, huge gobs of I/O writes are being generated against the backing device when using a qemu raw file disk image stored on btrfs; iostat and vmstat show about 40-80 MB/s. After creating an ext4 LV on the same disk and moving the files there, everything is back to normal.

Is this still happening with the 2.6.43/3.3 F15/F16 updates? Josef, should this be moved to rawhide until it's finally resolved?

Yeah, still slowly working on it. I've noticed some improvement, but it's not where it ought to be.

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and the reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Is this still a problem with 3.9-based F19 kernels?

AFAIK, it is. But there are ways to deal with this. You can create an empty file with the same permissions and security context as the original (use chmod --reference, chown --reference, and chcon --reference to accomplish this), then chattr +C on the file. Then 'cat original > empty' to copy the contents. Then remove the original and 'mv empty original'. This will give you a VM image whose data btrfs will not apply copy-on-write to, which will largely solve the performance issue. It also works on the other major source of this problem: MySQL database files and similar kinds of entities where the file is large, there is significant internal structure, and the software using the file has its own transaction mechanism.

Or chattr +C the directory in which database tables and VM images will be stored: https://wiki.archlinux.org/index.php/Btrfs#Copy-On-Write_.28CoW.29

Since I reported this bug, I've just been using LVM logical volumes as raw block devices for VM disk images, which works well. (And I can't get rid of LVM entirely anyway, as the blasted installer STILL can't install to a preexisting btrfs filesystem; see bug 921757.) But when F19 goes alpha, I'll give this a whirl again, using the "chattr +C" trick on the parent directory.

To follow up on Eric's and Michel's suggestions, as a PSA: if you're going to [re]create any largish file, you should use fallocate(1) instead of cat. E.g.:

$ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
$ dd if=OLDFILE conv=notrunc,append of=NEWFILE && rm OLDFILE

Creating a file with fallocate() lets the filesystem perform the allocation, but also lets the filesystem know in advance precisely how big the file is going to be, so it can optimize the allocation accordingly. Especially when creating new (blank) disk images for use in a VM, using fallocate is blazingly fast.
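As a concrete illustration of the no-CoW recreation procedure described above, a minimal sketch might look like this. The image name is a placeholder, the chown/chcon steps may require root, and chattr +C must be applied while the new file is still empty for btrfs to honor it:

$ touch empty.img                              # empty.img / original.img are placeholder names
$ chmod --reference=original.img empty.img
$ chown --reference=original.img empty.img
$ chcon --reference=original.img empty.img
$ chattr +C empty.img                          # set No_COW while the file is still zero-length
$ cat original.img > empty.img
$ rm original.img && mv empty.img original.img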
For example, here's how to create a 16 GiB VM image in a fraction of a second:

$ time fallocate -l $[16*1024*1024*1024] vm.img

real    0m0.023s
user    0m0.001s
sys     0m0.022s

$ ls -s vm.img
16777216 vm.img

(In reply to comment #24)
> $ fallocate -n -l $(find OLDFILE -printf '%s') NEWFILE
> $ dd if=OLDFILE conv=notrunc,append of=NEWFILE && rm OLDFILE

You are completely correct. But in the interests of coming up with solutions that are ever so slightly more optimal:

$ fallocate -n -l $(stat -c '%s' OLDFILE) NEWFILE

Re: comment 25: yes, that's simpler.

Someone sanity-check me, please: enabling compression for VM image files will result in poor performance (per Josef in comment 4). But a file that was created with no COW will also be exempt from compression, right? So in terms of avoiding things that kill performance for VM images, creating the image with the "C" attribute avoids both COW and compression, right?

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

I re-tested VM performance with a VM image created via fallocate(1) in a btrfs directory that first had the "C" attribute set on it. The performance was actually very close to LVM performance. I think for now there's no reason to leave this bug open. Josef et al. are aware of the remaining performance issues and are working to address them moving forward. The "chattr +C" work-around needs to be communicated far and wide, though.

This breaks snapshots, right?

In response to comment 29: Based on my reading of the btrfs wiki, and my own experience, snapshots override the +C/nocow option. You'll still take a performance hit whenever you modify either the source or the snapshot and trigger the COW. But once you take that initial hit, btrfs will respect the +C attribute and not continue to perform COW for the blocks that have already been duplicated. At least that's my understanding. Josef?

If you want to avoid the COW performance hit for VM images in snapshots, you'll need to create a new copy of the VM image file, so that it won't share any data blocks with any other snapshots. (See comment 25.) Alternatively, if you don't need multiple copies of the VM image file, simply delete the file in all but one of the snapshots.
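For anyone wanting to apply the directory-level work-around discussed above, here is a minimal sketch; the directory path and image name are placeholders, and files only inherit the C attribute if they are created after it has been set on the directory:

$ mkdir -p /srv/vmimages                                      # placeholder path for the image directory
$ chattr +C /srv/vmimages
$ lsattr -d /srv/vmimages                                     # verify the 'C' flag on the directory
$ fallocate -l $((16*1024*1024*1024)) /srv/vmimages/guest.img
$ lsattr /srv/vmimages/guest.img                              # the new image should show 'C' as well

Note that an existing image moved into the directory from elsewhere on the same filesystem keeps its old attributes, which is why the recreation procedure sketched earlier is still relevant for existing files.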