Bug 217765

Summary: blktap needs to handle ENOSPC sparse files
Product: Red Hat Enterprise Linux 5 Reporter: Stephen Tweedie <sct>
Component: xen Assignee: Daniel Berrangé <berrange>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: high    
Version: 5.0 CC: berrange, riel, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 5.0.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-01-26 20:05:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 217764    
Bug Blocks: 217859    
Attachments:
Description Flags
Ensure I/O failures are reported back to blktap kernel driver. none
Fix blktap I/O error reporting none

Description Stephen Tweedie 2006-11-29 20:36:55 UTC
In addition to potentially providing non-sparse files initially, we need to fail
gracefully on sparse files by suspending writes to a blktap file on ENOSPC, as
otherwise the only result will be massive guest filesystem corruption.

+++ This bug was initially created as a clone of Bug #217764 +++

Description of problem:

When using sparse files to back guests, it is possible for dom0 to run out of
disk space without the guest OSes knowing.  This means the user could lose data
because dom0 ran out of space, even though the guest OS thinks it has plenty of
disk space.
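
The gap between what a sparse file claims to hold and what dom0 has actually
allocated for it can be checked programmatically. The following is only a
minimal sketch: the helper name make_sparse is hypothetical and is not part of
blktap or Xen, and it assumes Linux, where st_blocks counts 512-byte units.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of blktap: create (or truncate) `path` as a
 * sparse file of `size` bytes, then report its apparent size and the number
 * of bytes actually allocated on disk.
 * Returns 0 on success, -1 on failure. */
int make_sparse(const char *path, off_t size,
                off_t *apparent, off_t *allocated)
{
    struct stat st;
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);

    if (fd < 0)
        return -1;

    /* Extending the file with ftruncate() allocates no data blocks. */
    if (ftruncate(fd, size) < 0 || fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    *apparent = st.st_size;
    /* Assumption: Linux, where st_blocks is in 512-byte units. */
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

A guest only ever sees the apparent size; a large difference between the two
values is the space that dom0 still has to find when the guest writes.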

...

Comment 1 Stephen Tweedie 2006-11-29 20:38:27 UTC
Needs to be done for qemu-dm too, for FV disk images.


Comment 2 Daniel Berrangé 2006-11-29 20:46:07 UTC
tools/blktap/drivers/tapdisk.c  is the offending file in blktap.

The 'send_response' method is responsible for sending messages about I/O
completion back to the guest OS. When the request resulted in an error
condition, however, it merely does

        if (res != 0) {
                DPRINTF("*** request error %d! \n", res);
                return 0;
        }


So the guest never sees a notification about the error. The DPRINTF ends up in
syslog, but unless you've turned on *.debug you'll never see it in any logs.


Comment 6 Daniel Berrangé 2006-11-29 23:22:34 UTC
OK, this is severely broken. Here are the steps to reproduce the problem:

1. On Dom0 host, prepare a filesystem of 200 MB, and create a 1 GB sparse file
to represent the guest filesystem:

  # lvcreate -n TestData -L 200M /dev/HostVG
  # mke2fs -m 0 /dev/HostVG/TestData 
  # mkdir /xen/TestData
  # mount /dev/HostVG/TestData /xen/TestData/
  # cd /xen/TestData/
  # dd if=/dev/zero of=data.img bs=1G count=0 seek=1
  # ls -lhs data.img
  1.0K -rw-r--r-- 1 root root 1.0G Nov 29 17:59 data.img

So we have 1 GB disk, consuming 1k, on a partition 200MB in size.

2. Now take an existing PV guest and set up data.img as a secondary disk, e.g.

  # grep disk /etc/xen/XenGuest1 
  disk = [ "phy:/dev/HostVG/XenGuest1,xvda,w",
"tap:aio:/xen/TestData/data.img,xvdc,w" ]


3. Boot the guest & ssh into it

4. Inside the guest create a single partition across whole /dev/xvdc disk, and
mount it

# fdisk /dev/xvdc 
# mount /dev/xvdc1 /mnt/
# cd /mnt/
# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvdc1           1004M  1.3M 1003M   1% /mnt

Note that the partition has 1 GB of space.

5. Attempt to create a file of 500 MB on /mnt.

# cd /mnt/
# dd if=/dev/zero of=data.bin bs=500M count=1
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 54.4381 seconds, 9.6 MB/s
# ls -lh data.bin
-rw-r--r-- 1 root root 500M Nov 29 18:13 data.bin

Notice that the 'dd' succeeded in writing 524 MB, even though the underlying
filesystem in the host is a mere 200 MB in size. Clearly this should not have
succeeded, and yet no I/O errors were seen by the guest.

In the host, syslog shows

Nov 29 18:13:58 localhost TAPDISK: AIO did less than I asked it to.  
Nov 29 18:14:36 localhost last message repeated 67436 times



6. As an added sanity check, try copying a file onto the 'full' disk and then
comparing md5sums of it from the guest's POV.

# cd /mnt
# cp /lib/libc.so.6 .
# umount /mnt/
# cd /
# mount /dev/xvdc1 /mnt/
# cd /mnt/
# md5sum libc.so.6 /lib/libc.so.6 
588347f8791221050f4abe95b4199a5e  libc.so.6
a16f66d50c2c2085f2d1393693901417  /lib/libc.so.6


So clearly, even though the guest thinks the data was successfully written,
what it later gets back is garbage.



Comment 7 Daniel Berrangé 2006-11-30 19:49:20 UTC
Created attachment 142511 [details]
Ensure I/O failures are reported back to blktap kernel driver.

The attached patch ensures that the 'status' field in the blkif_response_t
struct is set to BLKIF_RSP_ERROR when an I/O error occurs.

It also fixes the aio driver to signal an error based on the 'io_event.res'
field, instead of the 'res2' field. Although the latter is intended to give a
detailed error code, all current versions of the kernel just set it to 0, so
it's unusable as is.
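
The behaviour described in this comment can be sketched roughly as follows.
This is not the attached patch: every sketch_ name below is a hypothetical
local stand-in for the real blkif_response_t, BLKIF_RSP_* and io_event
definitions, declared here only so the fragment compiles on its own. It
assumes that 'res' holds the number of bytes transferred or a negative errno
value.

```c
#include <stdint.h>

/* Local stand-ins (assumptions) for BLKIF_RSP_OKAY / BLKIF_RSP_ERROR. */
#define SKETCH_RSP_OKAY    0
#define SKETCH_RSP_ERROR (-1)

/* Stand-in for the two result fields of an AIO completion event. */
struct sketch_io_event {
    long res;   /* bytes transferred, or a negative errno value */
    long res2;  /* reported above as always 0, so it is not consulted */
};

/* Stand-in for the response returned to the guest. */
struct sketch_response {
    uint64_t id;
    uint8_t  operation;
    int16_t  status;
};

/* A request failed if it did not transfer exactly the requested number of
 * bytes. This covers both negative errno results and short writes. */
int sketch_aio_failed(const struct sketch_io_event *ev, long requested)
{
    return ev->res != requested;
}

/* Always build a response, carrying the error in 'status', instead of
 * returning early and leaving the guest without any notification. */
void sketch_fill_response(struct sketch_response *rsp, uint64_t id,
                          uint8_t operation, int failed)
{
    rsp->id = id;
    rsp->operation = operation;
    rsp->status = failed ? SKETCH_RSP_ERROR : SKETCH_RSP_OKAY;
}
```

The point of the sketch is that the decision is made from 'res' alone and that
a failed request still produces a response for the guest.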

Comment 8 Daniel Berrangé 2006-11-30 19:51:52 UTC
It should be noted, however, that even with this patch while the guest does now
see I/O errors, the actual 'dd' test still succeeds without seeing -EIO or
indeed, any failure condition at all.

Kernel dmesg logs, however, are splattered with

Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1260895
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1260983
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261071
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261159
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261247
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261335
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261423
Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261511
Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261599
Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261687
Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261775
Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261863
Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261951

And the filesystem is still writable. It really ought to toggle to read-only
when the first I/O error occurs.

Comment 9 Stephen Tweedie 2006-11-30 22:29:54 UTC
"dd" won't usually see write errors because it uses async writes by default. 
You'll want to use fsync or O_DIRECT to cause IO synchronisation if you want it
to see write errors.  Normal dd can't do that, but lmdd has options for both
(it's part of lmbench).

Alternatively, test via dd to a raw device in the guest --- that will force it
to perform synchronised writes.

Note that the filesystem does not take itself offline until it sees
unrecoverable errors.  Mere data write errors don't usually count --- we detect
those but propagate them to a per-address_space (and then per-inode) error flag.
It's only errors on critical metadata that should take it offline.  Of course,
metadata is more often preallocated, or allocated early, so its writes won't
necessarily fail on a sparse backing store.
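
To illustrate the point about synchronisation, a test program can write the
data and then call fsync(), checking the result of each call. This is a
minimal sketch under stated assumptions: write_and_sync is a hypothetical
helper, not something from lmbench or dd, and whether a given error surfaces
at write() or only at fsync() depends on the kernel and filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes from `buf` to `path`, then force
 * them to stable storage. Returns 0 on success, or the errno value of the
 * first call that failed. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int err;
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);

    if (fd < 0)
        return errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            err = errno;
            close(fd);
            return err;
        }
        if (n == 0) {               /* no progress, give up */
            close(fd);
            return EIO;
        }
        p += n;
        len -= (size_t)n;
    }

    /* A buffered write() can succeed while the data is still only in the
     * page cache. fsync() waits for writeback, giving the caller a chance
     * to see an error from the block device. */
    if (fsync(fd) < 0) {
        err = errno;
        close(fd);
        return err;
    }
    if (close(fd) < 0)
        return errno;
    return 0;
}
```

Run against a file on the guest's /mnt, a non-zero return value would be the
failure that a plain buffered dd never reports.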

Comment 10 Daniel Berrangé 2006-12-04 17:54:32 UTC
Created attachment 142754 [details]
Fix blktap I/O error reporting

This is the updated patch merged with xen-unstable.hg to fix blktap I/O error
reporting

Comment 11 Rik van Riel 2006-12-05 20:01:19 UTC
in xen-3.0.3-11.el5

Comment 12 Jay Turner 2006-12-13 21:30:40 UTC
QE ack for RHEL5.

Comment 13 Jay Turner 2007-01-26 20:05:47 UTC
xen-3.0.3-22.el5 included in 20070125.0.