693530 – Qemu does the wrong thing with Cache=None and looks like corruption

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	qemu
Sub Component:
Version:	17
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Fedora Virtualization Maintainers
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-04-04 21:42 UTC by Josef Bacik
Modified:	2013-01-09 23:44 UTC
CC List:	16 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-05-30 14:44:55 UTC
Type:	---
Embargoed:
Dependent Products:


Description Josef Bacik 2011-04-04 21:42:50 UTC

Description of problem:
Btrfs does checksumming, so if it reads back data and the checksum doesn't look right it will complain and return EIO.  I spent some time debugging a problem with installing a Windows 7 box on Btrfs via kvm and using Cache=None, and it looked like Btrfs was corrupting the file.  Lots of printks later I got this out of direct IO:

 submiting dio 00007f6b94dbc000, offset=3281743872, len=0, write=0
 submiting dio 00007f6b88040000, offset=3281747968, len=4096, write=0
 submiting dio 00007f6bea43d000, offset=3281752064, len=4096, write=0
 submiting dio 00007f6b88040000, offset=3281756160, len=4096, write=0

This is a printk of the iovec that userspace (qemu) provided to Btrfs for the direct IO request (write=0, so these are reads), and it comes from this:

for (seg = 0; seg < nr_segs; seg++) {
        if (debug)
                printk(KERN_ERR "submiting dio %p, offset=%llu, len=%llu, write=%d\n",
                       iov[seg].iov_base, end, size, (rw & WRITE));

So qemu is providing the same memory address for two different file offsets in one request, which will obviously give you unpredictable results, and it screws up Btrfs because we think that the storage is corrupt.  So please stop doing this :).
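A reused buffer like the one in the log above can be spotted by comparing the address ranges of the segments. The following is only a minimal userspace sketch of such a check; `first_reused_segment` is a hypothetical helper name, not something that exists in qemu or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not a qemu or kernel function): return the index of
 * the first segment whose buffer overlaps an earlier segment, or -1 if all
 * segments are disjoint. Quadratic, which is acceptable for short iovecs.
 */
int first_reused_segment(const struct iovec *iov, int iovcnt)
{
    assert(iov != NULL || iovcnt == 0);

    for (int i = 1; i < iovcnt; i++) {
        uintptr_t a_start = (uintptr_t)iov[i].iov_base;
        uintptr_t a_end = a_start + iov[i].iov_len;

        if (iov[i].iov_len == 0)
            continue;               /* empty segments cannot clash */

        for (int j = 0; j < i; j++) {
            uintptr_t b_start = (uintptr_t)iov[j].iov_base;
            uintptr_t b_end = b_start + iov[j].iov_len;

            if (iov[j].iov_len == 0)
                continue;

            /* Half-open ranges overlap when each starts before the other ends. */
            if (a_start < b_end && b_start < a_end)
                return i;
        }
    }
    return -1;
}
```

Fed the four segments from the log (the second and fourth share `00007f6b88040000`), this would report the fourth segment, index 3.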

Comment 1 Eric Paris 2011-04-05 12:53:48 UTC

qemu-kvm-0.12.1.2-2.152.el6.x86_64

Comment 2 Kevin Wolf 2011-04-05 13:37:43 UTC

Windows seems to submit this kind of iov, which uses a single buffer twice. IDE guarantees that the buffer contains the data from the latest offset, so even though it looks a bit odd, it's well defined. However, qemu doesn't pay attention to this and just forwards the iov to preadv and trusts preadv to do the right thing.

Is there a specified behaviour for such cases with preadv, or is it just completely undefined?

Comment 3 Josef Bacik 2011-04-05 14:38:10 UTC

So if this was buffered IO that would work out just fine, but unfortunately with DIO we are using the iov directly to store the end result in, and because btrfs does checksumming we're using that to check to make sure the checksum came out correctly.  So I guess there are 2 options here:

1) Make qemu recognize that the guest is using the same iov_base in the same iovec and split it up into two different calls.

2) Make btrfs do that.

I'd really rather "fix" it in qemu. I would hate to have this check in btrfs and subject every dio operation to this kind of scrutiny; however, we are the ones that don't really handle this kind of thing well.  What do you guys think?

Comment 4 Kevin Wolf 2011-04-05 14:53:33 UTC

I think the question we really need to answer is which semantics preadv should have. This answer will automatically tell us which side must be fixed.

In any case I think you need to fix btrfs if it falsely detects corruption. But it might be a reasonable answer to say that it's undefined which content the buffer actually has in the end.

Comment 5 Josef Bacik 2011-04-05 15:10:46 UTC

Well, the only thing we can do is just fall back to buffered IO if we detect something like that, which is going to suck for performance, but it's better than getting EIO.  Readv will do the right thing if it's buffered IO, but if it's DIO it will depend on what the underlying storage does, and so you are at the whim of sata/scsi/whatever.
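The buffered case can be tried out from userspace. This is a small sketch, assuming a Linux system with an ordinary temporary file and no O_DIRECT; `read_same_buffer_twice` is a hypothetical demo helper. With buffered IO the segments are filled in order, so the shared buffer would be expected to end up holding the data from the later offset.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: write "AAAABBBB" to a temporary file, then read
 * all 8 bytes back with one buffered readv() whose two segments both point
 * at the same 4-byte buffer. On success returns 0 and copies the final
 * buffer contents into out[4]; returns -1 on any I/O failure.
 */
int read_same_buffer_twice(char out[4])
{
    assert(out != NULL);

    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;

    int fd = fileno(fp);
    int rc = -1;
    char buf[4] = {0};
    struct iovec iov[2] = {
        { .iov_base = buf, .iov_len = sizeof(buf) },  /* file bytes 0..3 */
        { .iov_base = buf, .iov_len = sizeof(buf) },  /* file bytes 4..7 */
    };

    if (write(fd, "AAAABBBB", 8) == 8 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        readv(fd, iov, 2) == 8) {
        memcpy(out, buf, sizeof(buf));
        rc = 0;
    }

    fclose(fp);   /* also closes fd and removes the temporary file */
    return rc;
}
```

The same iovec submitted with O_DIRECT gives no such ordering, which is the situation described in this bug.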

Comment 6 Avi Kivity 2011-04-06 10:22:54 UTC

An alternative is to have btrfs (or the kernel in general) issue the I/O up to the first reused buffer, do the checks, then issue the next I/O.  That solves both the ordering and the checksum problems.
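One way to picture this is to cut the iovec into runs of mutually disjoint segments and submit each run as its own request, in order. The code below is only a userspace sketch of that splitting logic under assumed names: `disjoint_run_length`, `submit_in_disjoint_runs` and `submit_noop` are hypothetical, and the `submit` callback stands in for whatever actually issues the I/O (preadv in qemu, or the dio submission plus checksum check in the kernel).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* True when two non-empty segments share at least one byte of memory. */
static int segments_overlap(const struct iovec *a, const struct iovec *b)
{
    uintptr_t a0 = (uintptr_t)a->iov_base;
    uintptr_t b0 = (uintptr_t)b->iov_base;

    return a->iov_len && b->iov_len &&
           a0 < b0 + b->iov_len && b0 < a0 + a->iov_len;
}

/*
 * Hypothetical helper: starting at segment `start`, return how many
 * consecutive segments can go into one request before a buffer is reused.
 * Always returns at least 1.
 */
int disjoint_run_length(const struct iovec *iov, int iovcnt, int start)
{
    assert(iov != NULL && start >= 0 && start < iovcnt);

    for (int i = start + 1; i < iovcnt; i++)
        for (int j = start; j < i; j++)
            if (segments_overlap(&iov[i], &iov[j]))
                return i - start;
    return iovcnt - start;
}

/* Placeholder callback for illustration: accepts every request. */
int submit_noop(const struct iovec *iov, int iovcnt, off_t offset)
{
    (void)iov; (void)iovcnt; (void)offset;
    return 0;
}

/*
 * Walk the iovec in order and hand each disjoint run to `submit` at the
 * matching file offset. Returns the number of requests issued, or -1 if
 * `submit` reports a failure (non-zero return).
 */
int submit_in_disjoint_runs(const struct iovec *iov, int iovcnt, off_t offset,
                            int (*submit)(const struct iovec *, int, off_t))
{
    int requests = 0;

    for (int start = 0; start < iovcnt; requests++) {
        int n = disjoint_run_length(iov, iovcnt, start);

        if (submit(&iov[start], n, offset) != 0)
            return -1;
        for (int k = 0; k < n; k++)
            offset += (off_t)iov[start + k].iov_len;
        start += n;
    }
    return requests;
}
```

For an iovec shaped like the one in the description (the fourth segment reuses the second segment's buffer), this would issue two requests: the first three segments, then the fourth on its own. Each request completes and can be verified before the shared buffer is overwritten.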

Comment 7 Fedora Admin XMLRPC Client 2012-03-15 17:52:34 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 8 Cole Robinson 2012-05-29 00:17:12 UTC

F15 is end of life in a month, so I'm wondering if this is still relevant for F16 or F17. Josef, Kevin, and/or Avi, any idea if things have changed here in qemu or kernel land?

Comment 9 Kevin Wolf 2012-05-29 08:31:05 UTC

Not in qemu, at least.

Comment 10 Cole Robinson 2012-05-29 22:39:21 UTC

Josef, has anything changed in kernel land that affects this?

Comment 11 Josef Bacik 2012-05-30 14:44:55 UTC

Yeah I fixed this a while ago.
