Bug 693530 - Qemu does the wrong thing with Cache=None and looks like corruption
Qemu does the wrong thing with Cache=None and looks like corruption
Product: Fedora
Classification: Fedora
Component: qemu (Show other bugs)
Unspecified Unspecified
unspecified Severity urgent
: ---
: ---
Assigned To: Fedora Virtualization Maintainers
Fedora Extras Quality Assurance
Depends On:
  Show dependency treegraph
Reported: 2011-04-04 17:42 EDT by Josef Bacik
Modified: 2013-01-09 18:44 EST (History)
16 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2012-05-30 10:44:55 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

  None (edit)
Description Josef Bacik 2011-04-04 17:42:50 EDT
Description of problem:
Btrfs does checksumming so if it reads back data and the checksum doesn't look right it will complain and return EIO.  I spent some time debugging a problem with installing a Windows 7 box on Btrfs via kvm and using Cache=None, and it looked like Btrfs was corrupting the file.  Lot's of printk's later I got this out of direct io

 submiting dio 00007f6b94dbc000, offset=3281743872, len=0, write=0
 submiting dio 00007f6b88040000, offset=3281747968, len=4096, write=0
 submiting dio 00007f6bea43d000, offset=3281752064, len=4096, write=0
 submiting dio 00007f6b88040000, offset=3281756160, len=4096, write=0

this is me doing a printk of the iovec that userspace (qemu) provided to Btrfs to write out, so it's this

for (seg = 0; seg < nr_segs; seg++) {
        if (debug)
                printk(KERN_ERR "submiting dio %p, offset=%llu, len=%llu, write=%d\n",
                       iov[seg].iov_base, end, size, (rw & WRITE));

so qemu is providing the same memory address for 2 different offsets, which will obviously give you unpredictable results, and it screws up Btrfs because we think that the storage is corrupt.  So please stop doing this :).
Comment 1 Eric Paris 2011-04-05 08:53:48 EDT
Comment 2 Kevin Wolf 2011-04-05 09:37:43 EDT
Windows seems to submit this kind of iovs that use a single buffer twice. IDE guarantees that the buffer contains the data from latest offset, so even though it looks a bit odd, it's well defined. However, qemu doesn't pay attention to this and just forwards the iov to preadv and trusts preadv to do the right thing.

Is there a specified behaviour for such cases with preadv, or is it just completely undefined?
Comment 3 Josef Bacik 2011-04-05 10:38:10 EDT
So if this was buffered IO that would work out just fine, but unfortunately with DIO we are using the iov directly to store the end result in, and because btrfs does checksumming we're using that to check to make sure the checksum came out correctly.  So I guess there are 2 options here

1) Make qemu recognize that the guest is using the same iov_base in the same iovec and split it up into two different calls.

2) Make btrfs do that.

I'd really rather "fix" it in qemu, I would hate to have this check in btrfs and subject every dio operation to this kind of scrutiny, however we are the ones that don't really handle this kind of thing well.  What do you guys think?
Comment 4 Kevin Wolf 2011-04-05 10:53:33 EDT
I think the question we really need to answer is which semantics preadv should have. This answer will automatically tell us which side must be fixed.

In any case I think you need to fix btrfs if it falsely detects corruption. But it might be a reasonable answer to say that it's undefined which content the buffer actually has in the end.
Comment 5 Josef Bacik 2011-04-05 11:10:46 EDT
Well the only thing we can do is just fall back to buffered IO if we detect something like that, which is going to suck for performance, but it's better than getting EIO.  Readv will do the right thing if its buffered IO, but if it's DIO it will depend on what the underlying storage does, and so you are at the whim of sata/scsi/whatever.
Comment 6 Avi Kivity 2011-04-06 06:22:54 EDT
An alternative is to have btrfs (or the kernel in general) issue the I/O up to the first reused buffer, do the checks, then issue the next I/O.  That solves both the ordering and the checksum problems.
Comment 7 Fedora Admin XMLRPC Client 2012-03-15 13:52:34 EDT
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 8 Cole Robinson 2012-05-28 20:17:12 EDT
F15 is end of life in a month, so I'm wondering if this is still relevant for F16 or F17. Josef, Kevin, and/or Avi, any idea if things have changed here in qemu or kernel land?
Comment 9 Kevin Wolf 2012-05-29 04:31:05 EDT
Not in qemu, at least.
Comment 10 Cole Robinson 2012-05-29 18:39:21 EDT
Josef, has anything changed in kernel land that affects this?
Comment 11 Josef Bacik 2012-05-30 10:44:55 EDT
Yeah I fixed this a while ago.

Note You need to log in before you can comment on or make changes to this bug.