Verified that exactly the same problem occurs in fullyvirt guests with sparse-file-backed disks.

+++ This bug was initially created as a clone of Bug #217765 +++

In addition to potentially providing non-sparse files initially, we need to fail gracefully on sparse files by suspending writes to a blktap file on ENOSPC, as otherwise the only result will be massive guest filesystem corruption.

+++ This bug was initially created as a clone of Bug #217764 +++

Description of problem:
When using sparse files to back guests, it is possible for dom0 to run out of disk space without the guest OSes knowing. This means the user could lose data because dom0 ran out of space, even though the guest OS thinks it has plenty of disk space.

...

-- Additional comment from sct on 2006-11-29 15:38 EST --
Needs to be done for qemu-dm too, for FV disk images.

-- Additional comment from berrange on 2006-11-29 15:46 EST --
tools/blktap/drivers/tapdisk.c is the offending file in blktap. The 'send_response' method is responsible for sending messages about I/O completion back to the guest OS. When the request resulted in an error condition, however, it merely does

    if (res != 0) {
        DPRINTF("*** request error %d! \n", res);
        return 0;
    }

So, the guest never sees a notification about the error. The DPRINTF ends up in syslog, but unless you've turned on *.debug you'll never see it in any logs.

-- Additional comment from sct on 2006-11-29 15:57 EST --
As part of this, we urgently need a review of error handling inside blktap, because ENOSPC _should_ be getting returned to the guest as an EIO, which, for ext3, should result in the guest turning its filesystem readonly before corruption occurs.

-- Additional comment from sct on 2006-11-29 16:01 EST --
<danpb> sct: yeah, i'll verify both full & paravirt cases

-- Additional comment from sct on 2006-11-29 16:25 EST --
High priority, it's a (potentially massive) data corrupter if we get it wrong.
-- Additional comment from berrange on 2006-11-29 18:22 EST --
Ok, this is severely broken. Here are the steps to reproduce the problem:

1. On the Dom0 host, prepare a filesystem of 200 MB, and create a 1 GB sparse file to represent the guest filesystem:

    # lvcreate -n TestData -L 200M /dev/HostVG
    # mke2fs -m 0 /dev/HostVG/TestData
    # mkdir /xen/TestData
    # mount /dev/HostVG/TestData /xen/TestData/
    # cd /xen/TestData/
    # dd if=/dev/zero of=data.img bs=1G count=0 seek=1
    # ls -lhs data.img
    1.0K -rw-r--r-- 1 root root 1.0G Nov 29 17:59 data.img

   So we have a 1 GB disk, consuming 1k, on a partition 200 MB in size.

2. Now take an existing PV guest and set up data.img as a secondary disk, eg

    # grep disk /etc/xen/XenGuest1
    disk = [ "phy:/dev/HostVG/XenGuest1,xvda,w", "tap:aio:/xen/TestData/data.img,xvdc,w" ]

3. Boot the guest & ssh into it.

4. Inside the guest, create a single partition across the whole /dev/xvdc disk, and mount it:

    # fdisk /dev/xvdc
    # mount /dev/xvdc1 /mnt/
    # cd /mnt/
    # df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/xvdc1           1004M  1.3M 1003M   1% /mnt

   Note, the partition has 1 GB of space.

5. Attempt to create a file of 500 MB on /mnt:

    # cd /mnt/
    # dd if=/dev/zero of=data.bin bs=500M count=1
    1+0 records in
    1+0 records out
    524288000 bytes (524 MB) copied, 54.4381 seconds, 9.6 MB/s
    # ls -lh data.bin
    -rw-r--r-- 1 root root 500M Nov 29 18:13 data.bin

   Notice that the 'dd' succeeded in writing 524 MB, even though the underlying filesystem in the host is a mere 200 MB in size. Clearly this should not have succeeded, and yet no I/O errors were seen by the guest.

   In the host, syslog shows

    Nov 29 18:13:58 localhost TAPDISK: AIO did less than I asked it to.
    Nov 29 18:14:36 localhost last message repeated 67436 times

6. As an added sanity check, try copying a file into the 'full' disk and then comparing md5sums of it from the guest's POV:

    # cd /mnt
    # cp /lib/libc.so.6 .
    # umount /mnt/
    # cd /
    # mount /dev/xvdc1 /mnt/
    # cd /mnt/
    # md5sum libc.so.6 /lib/libc.so.6
    588347f8791221050f4abe95b4199a5e  libc.so.6
    a16f66d50c2c2085f2d1393693901417  /lib/libc.so.6

So clearly, even though the guest thinks the data was successfully written, what it later gets back is garbage.
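The "AIO did less than I asked it to" message in step 5 is the backend noticing a short write and then carrying on regardless. A rough sketch of the check that would turn that condition into a reportable error follows. The function name completion_status is hypothetical, and the convention that a negative result carries -errno is an assumption about how the AIO completion is reported; mapping a short transfer to -EIO is a choice made for this sketch, not something taken from the tapdisk source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Turn an AIO completion result into 0 or a negative errno.
 *   res       - bytes transferred, or -errno on failure (assumed convention)
 *   requested - bytes the request asked for
 * A short transfer is treated as a failure rather than a warning, since
 * the guest has been told nothing about the bytes that were dropped.
 */
static int completion_status(long res, size_t requested)
{
    assert(requested > 0);          /* zero-length requests are not expected here */
    if (res < 0)
        return (int)res;            /* pass the reported errno through */
    if ((size_t)res != requested)
        return -EIO;                /* short transfer: report as an I/O error */
    return 0;
}
```

With a check of this shape, each of the 67436 repeated messages above would instead have been a failed request that could be handed back to the guest.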
Created attachment 142510 [details]
Propagate i/o errors back through IDE layer

Ref: http://post-office.corp.redhat.com/archives/virtualist/2006-November/msg00379.html

The core of the problem lies in the QEMU IDE codebase, tools/ioemu/hw/ide.c, in particular four methods: ide_read_dma_cb, ide_sector_read, ide_write_dma_cb and ide_sector_write. These methods call bdrv_read / bdrv_write, which are contracted to return 0 on success, -1 on failure. Fixing the code to check the return status here is pretty simple. My question is what are the best error conditions to return from the IDE protocol POV.

I'm attaching a proof-of-concept patch which returns ERR_STAT + ICRC_ERR (aka BadSector) for read failures, and returns WRERR_STAT (aka DeviceFault) for write failures. I am far from convinced I'm using the optimal status codes here though, since I know next-to-nothing about the IDE protocol.

Originally I had write failures also returning BadSector, but the effect of that was that the guest OS simply retried the request writing to the next sector, which failed, so it retried on the next sector, and so on for the entire disk. This clearly isn't too useful, so I switched to DeviceFault for write failures.
With this though, the guest OS sees the fault, resets the IDE device, and tries again, forever:

    Nov 30 11:59:58 dhcp-4-205 kernel: hdc: dma_intr: status=0x20 { DeviceFault }
    Nov 30 11:59:58 dhcp-4-205 kernel: ide: failed opcode was: unknown
    Nov 30 11:59:58 dhcp-4-205 kernel: hdc: DMA disabled
    Nov 30 11:59:58 dhcp-4-205 kernel: ide1: reset: success
    Nov 30 12:00:28 dhcp-4-205 kernel: hdc: lost interrupt
    Nov 30 12:00:28 dhcp-4-205 kernel: hdc: task_out_intr: status=0x20 { DeviceFault }
    Nov 30 12:00:28 dhcp-4-205 kernel: ide: failed opcode was: unknown
    Nov 30 12:00:28 dhcp-4-205 kernel: ide1: reset: success
    Nov 30 12:00:58 dhcp-4-205 kernel: hdc: lost interrupt
    Nov 30 12:00:58 dhcp-4-205 kernel: hdc: task_out_intr: status=0x20 { DeviceFault }
    Nov 30 12:00:58 dhcp-4-205 kernel: ide: failed opcode was: unknown
    Nov 30 12:00:58 dhcp-4-205 kernel: ide1: reset: success

What I think we want is for the process doing the I/O in the guest to get an -EIO from the write() call it is doing, and for the guest OS to then re-mount the filesystem read-only.
QE ack for RHEL5.
Two separate issues: One-off media errors need to get propagated upwards. But catastrophic failures of the backing store, as ENOSPC implies, may well require more serious handling, including potentially terminating the guest with prejudice.
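One way to picture the split is a small classification step in the backend before it decides what to do with a failed request. This is only a sketch: the enum, the function name classify_io_error and the choice of pausing as the "more serious handling" are all hypothetical, and only ENOSPC is treated as catastrophic because it is the only case named in this bug.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical responses a backend could take for a failed request. */
enum io_policy {
    POLICY_NONE,             /* no error, nothing to do */
    POLICY_REPORT_TO_GUEST,  /* one-off error: propagate upwards as an I/O error */
    POLICY_PAUSE_GUEST       /* backing store is failing: stop further writes */
};

/*
 * Classify a positive errno value from the backing store.
 * Anything other than ENOSPC is treated as a one-off media error.
 */
static enum io_policy classify_io_error(int err)
{
    assert(err >= 0);        /* callers pass errno values, not -errno */
    switch (err) {
    case 0:
        return POLICY_NONE;
    case ENOSPC:
        return POLICY_PAUSE_GUEST;
    default:
        return POLICY_REPORT_TO_GUEST;
    }
}
```

Whether the serious case should pause, suspend writes, or terminate the guest outright is a policy decision this sketch does not try to settle.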
Created attachment 142755 [details]
Propagate QEMU I/O errors back to guest through IDE layer

This is the updated patch merged in xen-unstable.hg to propagate QEMU I/O errors back to the guest through the IDE layer.
in xen-3.0.3-11.el5
xen-3.0.3-22.el5 included in 20070125.0.