Bug 217765
| Field | Value |
|---|---|
| Summary | blktap needs to handle ENOSPC sparse files |
| Product | Red Hat Enterprise Linux 5 |
| Component | xen |
| Version | 5.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | high |
| Reporter | Stephen Tweedie <sct> |
| Assignee | Daniel Berrangé <berrange> |
| CC | berrange, riel, xen-maint |
| Fixed In Version | 5.0.0 |
| Doc Type | Bug Fix |
| Last Closed | 2007-01-26 20:05:47 UTC |
| Bug Depends On | 217764 |
| Bug Blocks | 217859 |

Description
Stephen Tweedie
2006-11-29 20:36:55 UTC
Needs to be done for qemu-dm too, for FV disk images.

tools/blktap/drivers/tapdisk.c is the offending file in blktap. The 'send_response' method is responsible for sending messages about I/O completion back to the guest OS. When the request resulted in an error condition, however, it merely does

    if (res != 0) {
        DPRINTF("*** request error %d! \n", res);
        return 0;
    }

So the guest never sees a notification about the error. The DPRINTF ends up in syslog, but unless you've turned on *.debug you'll never see it in any logs.

Ok, this is severely broken. Here are the steps to reproduce the problem:

1. On the Dom0 host, prepare a filesystem of 200 MB, and create a 1 GB sparse file to represent the guest filesystem:

    # lvcreate -n TestData -L 200M /dev/HostVG
    # mke2fs -m 0 /dev/HostVG/TestData
    # mkdir /xen/TestData
    # mount /dev/HostVG/TestData /xen/TestData/
    # cd /xen/TestData/
    # dd if=/dev/zero of=data.img bs=1G count=0 seek=1
    # ls -lhs data.img
    1.0K -rw-r--r-- 1 root root 1.0G Nov 29 17:59 data.img

   So we have a 1 GB disk, consuming 1k, on a partition 200 MB in size.

2. Now take an existing PV guest and set up data.img as a secondary disk, e.g.

    # grep disk /etc/xen/XenGuest1
    disk = [ "phy:/dev/HostVG/XenGuest1,xvda,w", "tap:aio:/xen/TestData/data.img,xvdc,w" ]

3. Boot the guest and ssh into it.

4. Inside the guest, create a single partition across the whole /dev/xvdc disk, and mount it:

    # fdisk /dev/xvdc
    # mount /dev/xvdc1 /mnt/
    # cd /mnt/
    # df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/xvdc1           1004M  1.3M 1003M   1% /mnt

   Note that the partition has 1 GB of space.

5. Attempt to create a file of 500 MB on /mnt:

    # cd /mnt/
    # dd if=/dev/zero of=data.bin bs=500M count=1
    1+0 records in
    1+0 records out
    524288000 bytes (524 MB) copied, 54.4381 seconds, 9.6 MB/s
    # ls -lh data.bin
    -rw-r--r-- 1 root root 500M Nov 29 18:13 data.bin

   Notice that 'dd' succeeded in writing 524 MB, even though the underlying filesystem in the host is a mere 200 MB in size. Clearly this should not have succeeded, and yet no I/O errors were seen by the guest. In the host, syslog shows

    Nov 29 18:13:58 localhost TAPDISK: AIO did less than I asked it to.
    Nov 29 18:14:36 localhost last message repeated 67436 times

6. As an added sanity check, try copying a file onto the 'full' disk and then comparing md5sums of it from the guest's POV:

    # cd /mnt
    # cp /lib/libc.so.6 .
    # umount /mnt/
    # cd /
    # mount /dev/xvdc1 /mnt/
    # cd /mnt/
    # md5sum libc.so.6 /lib/libc.so.6
    588347f8791221050f4abe95b4199a5e  libc.so.6
    a16f66d50c2c2085f2d1393693901417  /lib/libc.so.6

   So clearly, even though the guest thinks the data was successfully written, what it later gets back is garbage.

Created attachment 142511 [details]
Ensure I/O failures are reported back to blktap kernel driver.
The attached patch ensures that the 'status' field in the blkif_response_t struct
is set to BLKIF_RSP_ERROR when an I/O error occurs.
It also fixes the aio driver to signal an error based on the 'io_event.res'
field instead of the 'res2' field. Although the latter is intended to give a
detailed error code, all current versions of the kernel just set it to 0, so
it's unusable as is.
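
The idea described above can be sketched roughly as follows. This is not the attached patch: the `sketch_*` structs and functions are hypothetical, trimmed-down stand-ins for `blkif_response_t`, `struct io_event` and the tapdisk code paths. Treating a short completion (fewer bytes than requested) as a failure is an assumption, suggested by the "AIO did less than I asked it to" syslog message. Only the two `BLKIF_RSP_*` values are taken from Xen's public blkif header.

```c
/* Response status codes as defined in Xen's public io/blkif.h. */
#define BLKIF_RSP_ERROR (-1)
#define BLKIF_RSP_OKAY    0

/* Hypothetical stand-in for blkif_response_t (illustration only). */
struct sketch_response {
    unsigned long id;   /* request id echoed back to the guest */
    int status;         /* BLKIF_RSP_OKAY or BLKIF_RSP_ERROR */
};

/* Hypothetical stand-in for the res/res2 fields of struct io_event. */
struct sketch_io_event {
    long res;           /* bytes transferred, or a negative errno value */
    long res2;          /* reported as 0 by the kernels discussed here */
};

/*
 * Decide failure from 'res' only; 'res2' is ignored.
 * Returns 1 if the completion is negative (-errno) or short, else 0.
 * Counting a short transfer as an error is an assumption of this sketch.
 */
int sketch_aio_failed(const struct sketch_io_event *ev, long expected_bytes)
{
    return ev->res != expected_bytes;
}

/*
 * Fill in the response instead of returning early on error, so the
 * guest always receives a completion carrying the right status.
 */
void sketch_fill_response(struct sketch_response *rsp,
                          unsigned long id, int res)
{
    rsp->id = id;
    rsp->status = (res != 0) ? BLKIF_RSP_ERROR : BLKIF_RSP_OKAY;
}
```

With these helpers, a completion whose `res` is `-ENOSPC` or smaller than the request size yields a response with `status == BLKIF_RSP_ERROR`, which the guest's block frontend can then turn into an I/O error.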
It should be noted, however, that even with this patch, while the guest does now see I/O errors, the actual 'dd' test still succeeds without seeing -EIO or, indeed, any failure condition at all. Kernel dmesg logs, however, are splattered with

    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1260895
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1260983
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261071
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261159
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261247
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261335
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261423
    Nov 30 14:33:46 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261511
    Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261599
    Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261687
    Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261775
    Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261863
    Nov 30 14:33:47 dhcp-5-203 kernel: end_request: I/O error, dev xvdc, sector 1261951

And the filesystem is still writable. It really ought to toggle to read-only when the first I/O error occurs.

"dd" won't usually see write errors because it uses async writes by default. You'll want to use fsync or O_DIRECT to cause I/O synchronisation if you want it to see write errors. Normal dd can't do that, but lmdd has options for both (it's part of lmbench). Alternatively, test via dd to a raw device in the guest; that will force it to perform synchronised writes.

Note that the filesystem does not take itself offline until it sees unrecoverable errors. Mere data write errors don't usually count: we detect those but propagate them to a per-address_space (and then per-inode) error flag. It's only errors on critical metadata that should take it offline. Of course, metadata is more often preallocated, or allocated early, so its writes won't necessarily fail on a sparse backing store.

Created attachment 142754 [details]
Fix blktap I/O error reporting
This is the updated patch merged with xen-unstable.hg to fix blktap I/O error reporting.

in xen-3.0.3-11.el5

QE ack for RHEL5.

xen-3.0.3-22.el5 included in 20070125.0.
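
For anyone re-testing this, the earlier remark that a plain buffered write will not usually see write errors, and that fsync is needed to surface them, can be illustrated with a small C helper. This is a minimal sketch, not lmdd or any tool from this bug: `write_and_sync` is a hypothetical name, and it uses only the standard POSIX `open`, `write`, `fsync` and `close` calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'len' bytes from 'buf' to 'path' and force
 * them to stable storage. Returns 0 on success or a negative errno value.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int err;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            err = errno;
            close(fd);
            return -err;
        }
        if (n == 0) {               /* no progress; give up rather than spin */
            close(fd);
            return -EIO;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * A successful write() may only mean the data reached the page cache.
     * Errors from the deferred write-back are reported by fsync().
     */
    if (fsync(fd) < 0) {
        err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

Run inside the guest against a file on /mnt, a helper like this would be expected to return an error once the backing store fills up, provided the backend reports the failure to the guest.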