Bug 217859

Summary: HVM device model 'qemu-dm' needs to handle ENOSPC sparse files
Product: Red Hat Enterprise Linux 5
Reporter: Daniel Berrangé <berrange>
Component: xen
Assignee: Daniel Berrangé <berrange>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: high
Version: 5.0
CC: xen-maint
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-01-26 20:07:21 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 217765
Bug Blocks:
Attachments:
  Propagate i/o errors back through IDE layer (flags: none)
  Propagate QEMU I/O errors back to guest through IDE layer (flags: none)

Description Daniel Berrangé 2006-11-30 14:04:47 UTC
Verified that exactly the same problem occurs in fully-virtualized guests with
sparse-file-backed disks.

+++ This bug was initially created as a clone of Bug #217765 +++

In addition to potentially providing non-sparse files initially, we need to fail
gracefully on sparse files by suspending writes to a blktap file on ENOSPC, as
otherwise the only result will be massive guest filesystem corruption.

+++ This bug was initially created as a clone of Bug #217764 +++

Description of problem:

When using sparse files to back guests, it is possible for dom0 to run out of
disk space without the guest OSes knowing.  This means the user could lose data
because dom0 ran out of space, even though the guest OS thinks it has plenty of
disk space.

...

-- Additional comment from sct on 2006-11-29 15:38 EST --
Needs to be done for qemu-dm too, for FV disk images.


-- Additional comment from berrange on 2006-11-29 15:46 EST --
tools/blktap/drivers/tapdisk.c  is the offending file in blktap.

The 'send_response' method is responsible for sending messages about I/O
completion back to the guest OS. When the request resulted in an error
condition, however, it merely does

        if (res != 0) {
                DPRINTF("*** request error %d! \n", res);
                return 0;
        }


So, the guest never sees a notification about the error. The DPRINTF ends up in
syslog, but unless you've turned on *.debug you'll never see it in any logs.
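The early return in the snippet above is the heart of the bug: the error is logged and then discarded instead of reaching the guest's response ring. A minimal sketch of the intended fix, assuming the blkif status constants from Xen's public/io/blkif.h (redefined here so the snippet stands alone, with `response_status` as a hypothetical helper):

```c
#include <stdio.h>

/* Values mirror BLKIF_RSP_OKAY / BLKIF_RSP_ERROR from Xen's
 * public/io/blkif.h; redefined here so the sketch is self-contained. */
#define BLKIF_RSP_OKAY   0
#define BLKIF_RSP_ERROR (-1)

/* Instead of logging the error and returning 0 (success), translate the
 * per-request result into the status field of the blkif response, so the
 * guest's block frontend sees the request fail. */
static int response_status(int res)
{
    if (res != 0) {
        fprintf(stderr, "*** request error %d!\n", res);
        return BLKIF_RSP_ERROR;   /* failure is now visible to the guest */
    }
    return BLKIF_RSP_OKAY;
}
```

With this shape, an ENOSPC from the backing file surfaces in the guest as a failed block request rather than a silently "successful" write.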


-- Additional comment from sct on 2006-11-29 15:57 EST --
As part of this, we urgently need a review of error handling inside blktap,
because ENOSPC _should_ be getting returned to the guest as an EIO, which, for
ext3, should result in the guest turning its filesystem readonly before
corruption occurs.

-- Additional comment from sct on 2006-11-29 16:01 EST --
<danpb> sct: yeah, i'll verify both  full & paravirt  cases

-- Additional comment from sct on 2006-11-29 16:25 EST --
High priority, it's a (potentially massive) data corrupter if we get it wrong.


-- Additional comment from berrange on 2006-11-29 18:22 EST --
Ok, this is severely broken. Here are the steps to reproduce the problem:

1. On Dom0 host, prepare a filesystem of 200 MB, and create a 1 GB sparse file
to represent the guest filesystem:

  # lvcreate -n TestData -L 200M /dev/HostVG
  # mke2fs -m 0 /dev/HostVG/TestData 
  # mkdir /xen/TestData
  # mount /dev/HostVG/TestData /xen/TestData/
  # cd /xen/TestData/
  # dd if=/dev/zero of=data.img bs=1G count=0 seek=1
  # ls -lhs data.img
  1.0K -rw-r--r-- 1 root root 1.0G Nov 29 17:59 data.img

So we have a 1 GB disk, consuming 1 KB, on a partition 200 MB in size.

2. Now take an existing PV guest and setup data.img as a secondary disk, eg

  # grep disk /etc/xen/XenGuest1 
  disk = [ "phy:/dev/HostVG/XenGuest1,xvda,w",
"tap:aio:/xen/TestData/data.img,xvdc,w" ]


3. Boot the guest & ssh into it

4. Inside the guest, create a single partition spanning the whole /dev/xvdc
disk, and mount it

# fdisk /dev/xvdc 
# mount /dev/xvdc1 /mnt/
# cd /mnt/
# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvdc1           1004M  1.3M 1003M   1% /mnt

Note that the partition has 1 GB of space.

5. Attempt to create a file of 500 MB on /mnt.

# cd /mnt/
# dd if=/dev/zero of=data.bin bs=500M count=1
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 54.4381 seconds, 9.6 MB/s
# ls -lh data.bin
-rw-r--r-- 1 root root 500M Nov 29 18:13 data.bin

Notice that the 'dd' succeeded in writing 524 MB, even though the underlying
filesystem in the host is a mere 200 MB in size. Clearly this should not have
succeeded, and yet no I/O errors were seen by the guest.

In the host, syslog shows

Nov 29 18:13:58 localhost TAPDISK: AIO did less than I asked it to.  
Nov 29 18:14:36 localhost last message repeated 67436 times



6. As an added sanity check, try copying a file onto the 'full' disk and then
comparing md5sums of it from the guest's point of view.

# cd /mnt
# cp /lib/libc.so.6 .
# umount /mnt/
# cd /
# mount /dev/xvdc1 /mnt/
# cd /mnt/
# md5sum libc.so.6 /lib/libc.so.6 
588347f8791221050f4abe95b4199a5e  libc.so.6
a16f66d50c2c2085f2d1393693901417  /lib/libc.so.6


So clearly, even though the guest thinks the data was successfully written, what
it later gets back is garbage.
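The failure mode is easy to demonstrate at the syscall level: on Linux, every write() to /dev/full fails with ENOSPC, making it a convenient stand-in for a full dom0 filesystem backing a sparse image. A sketch of the error propagation a backend needs, with `write_block` as a hypothetical helper (not a function from blktap or qemu-dm):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* A backend write path must treat both outright failure and short writes
 * as errors to be propagated to the guest; swallowing them is exactly what
 * lets the guest believe a 500 MB dd onto a 200 MB filesystem succeeded. */
static int write_block(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0)
        return -errno;   /* e.g. -ENOSPC: must reach the guest as an I/O error */
    if ((size_t)n < len)
        return -EIO;     /* short write: also an error, not a success */
    return 0;
}
```

Run against /dev/full, write_block() returns -ENOSPC; the bug is that neither blktap nor qemu-dm forwarded that result to the guest.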

Comment 1 Daniel Berrangé 2006-11-30 19:45:55 UTC
Created attachment 142510 [details]
Propagate i/o errors back through IDE layer

Ref:
http://post-office.corp.redhat.com/archives/virtualist/2006-November/msg00379.html


The core of the problem lies in the QEMU IDE codebase, tools/ioemu/hw/ide.c
In particular, four methods: ide_read_dma_cb, ide_sector_read, ide_write_dma_cb
and ide_sector_write. These methods call bdrv_read / bdrv_write, which are
contracted to return 0 on success and -1 on failure.

Fixing the code to check the return status here is pretty simple. My question
is what the best error conditions to return are, from the IDE protocol's point
of view. I'm attaching a proof-of-concept patch which returns ERR_STAT +
ICRC_ERR (aka BadSector) for read failures, and returns WRERR_STAT (aka
DeviceFault) for write failures.
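A minimal sketch of the status selection the proof-of-concept takes; the ERR_STAT, WRERR_STAT and READY_STAT bit values here are assumptions mirroring QEMU's ide.c of that era, redefined locally so the snippet stands alone:

```c
#include <stdint.h>

/* Assumed IDE status-register bits, mirroring QEMU's ide.c definitions. */
#define ERR_STAT    0x01   /* error bit set; error register gives detail */
#define WRERR_STAT  0x20   /* device fault */
#define READY_STAT  0x40   /* device ready */

/* When bdrv_read()/bdrv_write() return -1, latch an error status instead
 * of ignoring the failure. Reads report a generic error (the error
 * register would additionally carry ICRC_ERR, i.e. BadSector); writes
 * report DeviceFault so the guest does not retry sector by sector. */
static uint8_t ide_error_status(int is_write)
{
    if (is_write)
        return READY_STAT | WRERR_STAT;   /* guest log: status=0x20 { DeviceFault } */
    return READY_STAT | ERR_STAT;
}
```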

I am far from convinced I'm using the optimal status codes here though, since
I know next-to-nothing about the IDE protocol.

Originally I had write failures also returning BadSector, but the effect of
that was that the guest OS simply retried the request on the next sector,
which failed, so it retried the next sector, and so on across the entire
disk. This clearly isn't useful, so I switched to DeviceFault for write
failures. With that, though, the guest OS sees the fault, resets the IDE
device, and tries again, forever:

Nov 30 11:59:58 dhcp-4-205 kernel: hdc: dma_intr: status=0x20 { DeviceFault }
Nov 30 11:59:58 dhcp-4-205 kernel: ide: failed opcode was: unknown
Nov 30 11:59:58 dhcp-4-205 kernel: hdc: DMA disabled
Nov 30 11:59:58 dhcp-4-205 kernel: ide1: reset: success
Nov 30 12:00:28 dhcp-4-205 kernel: hdc: lost interrupt
Nov 30 12:00:28 dhcp-4-205 kernel: hdc: task_out_intr: status=0x20 { DeviceFault }
Nov 30 12:00:28 dhcp-4-205 kernel: ide: failed opcode was: unknown
Nov 30 12:00:28 dhcp-4-205 kernel: ide1: reset: success
Nov 30 12:00:58 dhcp-4-205 kernel: hdc: lost interrupt
Nov 30 12:00:58 dhcp-4-205 kernel: hdc: task_out_intr: status=0x20 { DeviceFault }
Nov 30 12:00:58 dhcp-4-205 kernel: ide: failed opcode was: unknown
Nov 30 12:00:58 dhcp-4-205 kernel: ide1: reset: success

What I think we want is for the process doing the I/O in the guest to get an
-EIO from the write() call it is making, and for the guest OS to then remount
the filesystem read-only.

Comment 2 Jay Turner 2006-12-01 14:28:07 UTC
QE ack for RHEL5.

Comment 3 Stephen Tweedie 2006-12-01 14:46:08 UTC
Two separate issues:

One-off media errors need to get propagated upwards.

But catastrophic failures of the backing store, as ENOSPC implies, may well
require more serious handling, including potentially terminating the guest with
prejudice.


Comment 4 Daniel Berrangé 2006-12-04 17:56:01 UTC
Created attachment 142755 [details]
Propagate QEMU I/O errors back to guest through IDE layer

This is the updated patch merged in xen-unstable.hg to propagate QEMU I/O
errors back to the guest through the IDE layer.

Comment 5 Rik van Riel 2006-12-05 20:00:58 UTC
in xen-3.0.3-11.el5

Comment 6 Jay Turner 2007-01-26 20:07:21 UTC
xen-3.0.3-22.el5 included in 20070125.0.