Bug 693530

Summary: Qemu does the wrong thing with Cache=None and looks like corruption
Product: [Fedora] Fedora Reporter: Josef Bacik <jbacik>
Component: qemuAssignee: Fedora Virtualization Maintainers <virt-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 17CC: amit.shah, berrange, bstein, crobinso, dwmw2, ehabkost, eparis, itamar, jaswinder, jforbes, knoel, kwolf, ondrejj, scottt.tw, tburke, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-05-30 10:44:55 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Josef Bacik 2011-04-04 17:42:50 EDT
Description of problem:
Btrfs does checksumming so if it reads back data and the checksum doesn't look right it will complain and return EIO.  I spent some time debugging a problem with installing a Windows 7 box on Btrfs via kvm and using Cache=None, and it looked like Btrfs was corrupting the file.  Lot's of printk's later I got this out of direct io

 submiting dio 00007f6b94dbc000, offset=3281743872, len=0, write=0
 submiting dio 00007f6b88040000, offset=3281747968, len=4096, write=0
 submiting dio 00007f6bea43d000, offset=3281752064, len=4096, write=0
 submiting dio 00007f6b88040000, offset=3281756160, len=4096, write=0

this is me doing a printk of the iovec that userspace (qemu) provided to Btrfs to write out, so it's this

for (seg = 0; seg < nr_segs; seg++) {
        if (debug)
                printk(KERN_ERR "submiting dio %p, offset=%llu, len=%llu, write=%d\n",
                       iov[seg].iov_base, end, size, (rw & WRITE));

so qemu is providing the same memory address for 2 different offsets, which will obviously give you unpredictable results, and it screws up Btrfs because we think that the storage is corrupt.  So please stop doing this :).
Comment 1 Eric Paris 2011-04-05 08:53:48 EDT
Comment 2 Kevin Wolf 2011-04-05 09:37:43 EDT
Windows seems to submit this kind of iovs that use a single buffer twice. IDE guarantees that the buffer contains the data from latest offset, so even though it looks a bit odd, it's well defined. However, qemu doesn't pay attention to this and just forwards the iov to preadv and trusts preadv to do the right thing.

Is there a specified behaviour for such cases with preadv, or is it just completely undefined?
Comment 3 Josef Bacik 2011-04-05 10:38:10 EDT
So if this was buffered IO that would work out just fine, but unfortunately with DIO we are using the iov directly to store the end result in, and because btrfs does checksumming we're using that to check to make sure the checksum came out correctly.  So I guess there are 2 options here

1) Make qemu recognize that the guest is using the same iov_base in the same iovec and split it up into two different calls.

2) Make btrfs do that.

I'd really rather "fix" it in qemu, I would hate to have this check in btrfs and subject every dio operation to this kind of scrutiny, however we are the ones that don't really handle this kind of thing well.  What do you guys think?
Comment 4 Kevin Wolf 2011-04-05 10:53:33 EDT
I think the question we really need to answer is which semantics preadv should have. This answer will automatically tell us which side must be fixed.

In any case I think you need to fix btrfs if it falsely detects corruption. But it might be a reasonable answer to say that it's undefined which content the buffer actually has in the end.
Comment 5 Josef Bacik 2011-04-05 11:10:46 EDT
Well the only thing we can do is just fall back to buffered IO if we detect something like that, which is going to suck for performance, but it's better than getting EIO.  Readv will do the right thing if its buffered IO, but if it's DIO it will depend on what the underlying storage does, and so you are at the whim of sata/scsi/whatever.
Comment 6 Avi Kivity 2011-04-06 06:22:54 EDT
An alternative is to have btrfs (or the kernel in general) issue the I/O up to the first reused buffer, do the checks, then issue the next I/O.  That solves both the ordering and the checksum problems.
Comment 7 Fedora Admin XMLRPC Client 2012-03-15 13:52:34 EDT
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 8 Cole Robinson 2012-05-28 20:17:12 EDT
F15 is end of life in a month, so I'm wondering if this is still relevant for F16 or F17. Josef, Kevin, and/or Avi, any idea if things have changed here in qemu or kernel land?
Comment 9 Kevin Wolf 2012-05-29 04:31:05 EDT
Not in qemu, at least.
Comment 10 Cole Robinson 2012-05-29 18:39:21 EDT
Josef, has anything changed in kernel land that affects this?
Comment 11 Josef Bacik 2012-05-30 10:44:55 EDT
Yeah I fixed this a while ago.