Description of problem:

I am trying to install an FC6 Xen guest on my FC6 host OS, with the guest's image stored on an XFS partition of the host OS. However, the installation process fails badly: the image gets corrupted sooner or later while the packages are being installed, and the installation then bombs out with "rpmdb: PANIC: fatal region error detected; run recovery".

On one of my roughly ten attempts, the installation actually succeeded. But when I ran "yum update" in this survivor, it died too, also corrupting the ext3 partition in the image stored on the xfs partition, as I found out on VM shutdown:

===
 Updating  : gtk2                         ##################### [ 23/111]
 Updating  : GConf2                       ##################### [ 24/111]
 Updating  : selinux-policy               ##################### [ 25/111]
 Updating  : parted                       ##################### [ 26/111]
 Installing: startup-notification         ##################### [ 27/111]
 Updating  : mesa-libGL                   ##################### [ 28/111]
 Installing: libXres                      ##################### [ 29/111]
error: libwnck-2.16.0-4.fc6: Header V3 DSA signature: BAD, key ID 4f2a6fd2
 Installing: notification-daemon          ##################### [ 31/111]
error: unpacking of archive failed on file /usr/lib/notification-daemon-1.0/engines/libstandard.so;45663b4e: cpio: read
 Installing: libnotify                    ##################### [ 32/111]
error: unpacking of archive failed on file /usr/lib/libnotify.so.1.1.0;45663b4e: cpio: read
 Installing: bluez-gnome                  ##################### [ 33/111]
error: unpacking of archive failed on file /usr/share/doc/bluez-gnome-0.6/COPYING;45663b4e: cpio: MD5 sum mismatch
*** glibc detected *** /usr/bin/python: malloc(): memory corruption: 0x0a262668 ***
^C
^C
[1]+  Stopped                 yum update
===

Strangely, I didn't notice any other problems with XEN partitions on the host OS, although /usr, /var, and /var/lib are also xfs, and "rpm -qVa" does not show any corruption. Perhaps the problem is specific to sparse files such as the image? If I store the image on an ext3 partition, no such corruption happens. Today was actually the first time I used FC6 and the first time I played with XFS.

Version-Release number of selected component (if applicable):
Current packages (yum update) as of Fri Nov 24 03:08:52 CET 2006

How reproducible:
It is spurious and occurs randomly during the package installation in the guest OS or, as written above, during operation of the running OS, if it gets that far.

Steps to Reproduce:
1. Install FC6 with "selinux=0 xfs text askmethod"; yum update; boot the updated kernel
2. /usr/sbin/xenguest-install -n test -f /home/test.img -r 230 -s 2 -l ftp://ftp.fu-berlin.de/linux/fedora/core/6/i386/os -x "ip=(IP) netmask=(netmask) gateway=(gateway) dns=(dns) ks=ftp://.../ks.cfg"

   ERROR: Installs currently require 256 megs of RAM.
   How much RAM should be allocated (in megabytes)? 230
   Would you like to enable graphics support? (yes or no) yes
   ...
=== ks.cfg ===
install
url --url ftp://ftp.fu-berlin.de/linux/fedora/core/6/i386/os
lang en_US
keyboard de-latin1-nodeadkeys
xconfig --driver "nv" --resolution 1024x768 --depth 24
network --device eth0 --bootproto static --ip (IP) --netmask (netmask) --gateway (gateway) --nameserver (nameserver) --hostname test.(domain)
rootpw --iscrypted (encrypted root pw)
firewall --enabled --port=22:tcp
firstboot --disable
authconfig --enableshadow --enablemd5
selinux --disabled
timezone right/Europe/Berlin
bootloader --location=mbr --driveorder=xvda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all
part swap --size=512 --asprimary
part / --fstype ext3 --size=1 --grow
%packages
@core
tzdata
yum
===

3. wait

Actual results:
Bombs out with guest OS image corruption.

Expected results:
Should proceed flawlessly.

Additional info:
The machine has 512 MB RAM; thus I can only use 230 MB for the guest VM.

uname -a:
Linux (host machine) 2.6.18-1.2849.fc6xen #1 SMP Fri Nov 10 13:56:52 EST 2006 i686 i686 i386 GNU/Linux

Layout on host OS:
#part /boot --fstype ext3 --size=128 --asprimary
#part /home --fstype xfs --size=4096
#part /var --fstype xfs --size=2048
#part /usr --fstype xfs --size=2048
#part swap --size=1024 --asprimary
#part / --fstype ext3 --size=512 --asprimary
#part /var/lib --fstype xfs --size=1 --grow
"any other problems with XEN partitions" should read "any other problems with xfs partitions", sorry.
Does the underlying xfs partition check OK after you encounter the guest FS corruption?
I experienced this first-hand yesterday too - the blktap driver for file-backed domains appears to be totally non-functional on XFS. While it does read operations OK, any write operation appears to be completely lost. Switching to the legacy loopback-based driver for file domains makes everything work as expected on XFS. blktap itself works fine on ext3, NFS, or GFS. The main difference I can think of between the blktap and loop drivers is that blktap uses O_DIRECT + async IO for all disk writes, while the loop driver is synchronous and buffered. Perhaps XFS is not supporting this O_DIRECT/aio combination correctly (or even at all?)
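For reference, here is a minimal sketch of the I/O pattern in question - my own illustration of an O_DIRECT + libaio write, *not* the actual blktap code (file name and sizes are made up for the example). Build with gcc -Wall -O2 odirect_aio.c -laio.

/* Minimal O_DIRECT + Linux native AIO write - illustration only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	void *buf;
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);

	if (fd < 0) { perror("open"); return 1; }

	/* O_DIRECT requires sector-aligned buffer, offset and length */
	if (posix_memalign(&buf, 512, 4096)) { fprintf(stderr, "memalign failed\n"); return 1; }
	memset(buf, 1, 4096);

	if (io_setup(1, &ctx) < 0) { perror("io_setup"); return 1; }
	io_prep_pwrite(&cb, fd, buf, 4096, 0);	/* async write at offset 0 */
	if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { perror("io_getevents"); return 1; }
	printf("write completed, res=%ld\n", (long)ev.res);

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}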
XFS should handle O_DIRECT & aio just fine (AFAIK xfs was the first linux filesystem to support O_DIRECT) :) ... I'll have to read up on blktap to see how it works.
The host OS xfs partition hosting the guest OS image checks fine.
Eric - FYI, the blktap code in question is actually mostly in userspace. Only a small stub is in the kernel; it exposes a magic device to userspace, and a daemon in userspace deals with the actual I/O. From the Xen SRPM, look at the stuff under the tools/blktap/ directory within the Xen tar.gz. There are several different drivers available - in Fedora/RHEL we default to the 'tap:aio:' driver, which is the tools/blktap/drivers/block-aio.c source file.
I wrote above that "It is spurious and occurs randomly during the package installation in the guest OS or, as written above, during operation of the running OS, if it gets that far." But in fact, it seems not to be random at all. I just ran another test, and it failed at the exact same point:

 Installing: startup-notification         ##################### [ 27/111]
 Updating  : mesa-libGL                   ##################### [ 28/111]
 Installing: libXres                      ##################### [ 29/111]
error: libwnck-2.16.0-4.fc6: Header V3 DSA signature: BAD, key ID 4f2a6fd2
 Installing: notification-daemon          ##################### [ 31/111]
/var/tmp/rpm-tmp.38377: line 3:  1642 Segmentation fault      gconftool-2 --makefile-install-rule /etc/gconf/schemas/notification-daemon.schemas >/dev/null
 Installing: libnotify                    ##################### [ 32/111]
error: unpacking of archive failed on file /usr/lib/libnotify.so.1.1.0;45671e73: cpio: MD5 sum mismatch
 Installing: bluez-gnome                  ##################### [ 33/111]
error: unpacking of archive failed on file /usr/bin/bluetooth-properties;45671e73: cpio: read
*** glibc detected *** /usr/bin/python: double free or corruption (!prev): 0x0b179258 ***

I conjecture that the apparent randomness is really a dependency on the different memory size and swap device size parameters I used. I will test this hypothesis further and see whether I can find out exactly which parameters determine whether it fails already during installation or only later during yum update.

In fact, "free" shows the following:

[1]+  Stopped                 yum update
# free
             total       used       free     shared    buffers     cached
Mem:        235700     216064      19636          0      10012     131576
-/+ buffers/cache:      74476     161224
Swap:       522104          4     522100

It seems to have just written its first bytes to the virtual swap partition; perhaps there is some relation to this. It also leads to the further conjecture that the corruption issue might be bound to certain areas of the image file or something similar.
It is not at all random - it happens very consistently. If you install a guest using a block-device-backed disk and then add a second disk backed by a file on XFS, you can easily reproduce the problem by writing to the second disk. Stephen Tweedie's verify-data tool will show that the problem happens on basically *any* write - they never successfully write their data out; future reads of just-written data always return zeros instead of the magic bit pattern.

http://people.redhat.com/sct/src/verify-data/
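In case it helps to see the idea, the following is a rough illustration of that kind of check - my own sketch, *not* Stephen Tweedie's actual tool (whose exact options and output I have not reproduced here): write a recognizable pattern at a fixed stride across a device, read it back, and report blocks that come back as zeros.

/* verify_sketch.c - illustrative pattern write/read-back check.
 * NOT the real verify-data tool; all names here are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096
#define MAGIC 0x5a

int main(int argc, char **argv)
{
	char wbuf[BLKSZ], rbuf[BLKSZ];
	long stride;
	int fd, i, nblocks;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <device> <stride-bytes> <blocks>\n", argv[0]);
		return 1;
	}
	stride = atol(argv[2]);
	nblocks = atoi(argv[3]);
	memset(wbuf, MAGIC, BLKSZ);

	fd = open(argv[1], O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* write pass: put the magic pattern at every <stride> bytes */
	for (i = 0; i < nblocks; i++)
		if (pwrite(fd, wbuf, BLKSZ, (off_t)i * stride) != BLKSZ) {
			perror("pwrite"); return 1;
		}
	fsync(fd);

	/* read pass: a block that comes back all zeros lost its write */
	for (i = 0; i < nblocks; i++) {
		if (pread(fd, rbuf, BLKSZ, (off_t)i * stride) != BLKSZ) {
			perror("pread"); return 1;
		}
		if (memcmp(rbuf, wbuf, BLKSZ) != 0)
			printf("offset %lld: pattern lost (first byte 0x%02x)\n",
			       (long long)i * stride, (unsigned char)rbuf[0]);
	}
	close(fd);
	return 0;
}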
I will try verify-data. But I am a little bit confused... Basically, as I wrote above, I already succeeded in installing the guest system completely, and only a subsequent yum update triggered the issue (albeit twice at the very same position). How would that be consistent with "they never successfully write their data out"? Either our problems are different, or there must be some condition for the problem to occur. Can you describe in a little more detail what exactly you are observing?
Okay, I can confirm the verify-data diagnostics. I swapoff'd the swap partition (/dev/xvda2) inside a guest whose image was stored on an xfs partition and used it for testing. I ran

  verify-data /dev/xvda2 10M

and it showed exactly what Daniel said. Note the following:

1. Blocks that will later read back bad already take a suspiciously long time to write.
2. If you subsequently run verify-data with the identical command line, it no longer produces any errors. If, however, you run verify-data with the step divided by two (say, verify-data /dev/xvda2 5M), it produces the problem only for every second block. So evidently the problem happens only the *first time* a block is written to.
3. If you dd if=/dev/zero of=/dev/loop0 onto the losetup'd image on the host to fill in any holes and make it non-sparse, the problem disappears entirely.
Thanks for the testing --- this sounds very much like a core kernel problem, not a bug in xen, so I'll reassign for now.
Yes, so it seems. I reproduced it with http://developer.osdl.org/daniel/AIO/TESTS/aiodio_sparse.c plus the following patch:

--- aiodio_sparse.c.1	2003-11-12 20:53:46.000000000 +0100
+++ aiodio_sparse.c	2006-11-25 03:48:33.347155503 +0100
@@ -29,8 +29,8 @@
 
 	p = buf;
 	while (size > 0) {
-		if (*buf != 0) {
-			fprintf(stderr, "non zero buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
+		if (*buf != 1) {
+			fprintf(stderr, "non one buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
 				buf - p, (unsigned int)buf[0],
 				size > 1 ? (unsigned int)buf[1] : 0,
 				size > 2 ? (unsigned int)buf[2] : 0,
@@ -147,7 +147,7 @@
 			unlink(filename);
 			return;
 		}
-		memset(bufptr, 0, writesize);
+		memset(bufptr, 1, writesize);
 		io_prep_pwrite(iocbs[i], fd, bufptr, writesize, offset);
 		offset += writesize;
 	}

(gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio)

I cross-checked on an ext3 partition, where it works fine. I also checked the current FC6 non-xen kernel (2.6.18-1.2849.fc6), which shows the same problem. So it is a general XFS problem, unrelated to xen.
Addendum: It actually sometimes fails on ext3, too, but not in a deterministic fashion. On xfs it reproducibly outputs

  non one buffer at buf[0] => 0x00,00,00,00
  non-zero read at offset 0

whereas on ext3 it sometimes passes and sometimes yields the same error, but with a varying, high offset.
Unfortunately, I only now actually read aiodio_sparse.c, and it seems to do something entirely different from what the name and description suggested at first... It needs to be checked further.
Created attachment 142103 [details]
Patch for aiodio_sparse.c

I changed the patch to first write the file with O_DIRECT, with gaps, and then verify it with classic I/O. It now deterministically fails on xfs and works on ext3. I am, however, not at all sure whether it really tests what it is supposed to test.

It writes its output to the file "file"; if I run "hexdump file" after it has completed on xfs, I get

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0010000 0101 0101 0101 0101 0101 0101 0101 0101
*
0020000 0000 0000 0000 0000 0000 0000 0000 0000
*
0510000 0101 0101 0101 0101 0101 0101 0101 0101
...

while on ext3 I get

0000000 0101 0101 0101 0101 0101 0101 0101 0101
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0500000 0101 0101 0101 0101 0101 0101 0101 0101
*
0510000 0000 0000 0000 0000 0000 0000 0000 0000
*

So xfs seems to put the data just behind the block where it is supposed to go, if I understood correctly. If you read the original aiodio_sparse.c, you will see why it 'accidentally' just sometimes passed on ext3 and failed on xfs. Please comment if the patched aiodio_sparse.c is for some reason still inappropriate.
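To make the idea concrete, here is a minimal standalone sketch of what the attached patch tests - written from scratch for illustration, so the file name, constants, and structure are assumptions rather than the attachment's actual code: O_DIRECT+AIO writes of pattern chunks into the holes of a sparse file, followed by verification with ordinary buffered reads.

/* Sketch of the attachment's test idea; not the attached patch itself.
 * gcc -g -Wall -O2 sparse_sketch.c -o sparse_sketch -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TESTFILE "file"
#define CHUNK    65536		/* bytes written per chunk */
#define GAP      (80 * CHUNK)	/* stride, leaving a hole between chunks */
#define NCHUNKS  20

int main(void)
{
	void *wbuf;
	static char rbuf[CHUNK];
	io_context_t ctx = 0;
	int fd, i;

	if (posix_memalign(&wbuf, 512, CHUNK)) return 1;
	memset(wbuf, 1, CHUNK);

	/* phase 1: O_DIRECT AIO writes into the holes of a sparse file */
	fd = open(TESTFILE, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
	if (fd < 0) { perror("open O_DIRECT"); return 1; }
	if (ftruncate(fd, (off_t)NCHUNKS * GAP)) { perror("ftruncate"); return 1; }
	if (io_setup(1, &ctx) < 0) { perror("io_setup"); return 1; }
	for (i = 0; i < NCHUNKS; i++) {
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		io_prep_pwrite(&cb, fd, wbuf, CHUNK, (long long)i * GAP);
		if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { perror("io_getevents"); return 1; }
	}
	io_destroy(ctx);
	close(fd);

	/* phase 2: verify with classic buffered I/O */
	fd = open(TESTFILE, O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }
	for (i = 0; i < NCHUNKS; i++) {
		if (pread(fd, rbuf, CHUNK, (off_t)i * GAP) != CHUNK) { perror("pread"); return 1; }
		if (memcmp(rbuf, wbuf, CHUNK) != 0)
			printf("chunk %d at offset %lld reads back wrong (first byte 0x%02x)\n",
			       i, (long long)i * GAP, (unsigned char)rbuf[0]);
	}
	close(fd);
	free(wbuf);
	return 0;
}

With the bug present, every chunk on xfs should be reported; on ext3 the sketch should print nothing.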
I now did a little bit of testing inside a xen guest, with a fresh sparse image created by dd if=/dev/zero of=test count=1 seek=1G bs=1 and attached as /dev/xvdb (hexdump shows the image as entirely NUL). If I run ./verify-data /dev/xvdb 25M inside the guest and hexdump the image afterwards on the host, the image obviously contains lots of data from previously deleted files on the XFS partition.
Created attachment 142119 [details]
testcase

I extended the aiodio_sparse testcase patch so that O_DIRECT, and writing into the hole of a sparse file vs. writing at the end of a file, can be turned on and off.

Compile with: gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio

./aiodio_sparse     runs the test (fails on xfs, with hexdump output as above; passes on ext3)
./aiodio_sparse -S  runs the test with appending instead of sparse file hole writing (passes)
./aiodio_sparse -D  runs the test without O_DIRECT (passes)

Supply -k to keep the file "file" in any case.
rtc's new patch seems to show a genuine problem. It is writing staggered blocks of 1s, but when read back we get zeros. Further, if we bmap the file:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..255]:        192..447          0 (192..447)         256 10000
   1: [256..10239]:    hole                                  9984
   2: [10240..10495]:  10432..10687      0 (10432..10687)     256 10000
   3: [10496..20479]:  hole                                  9984
 ...

it looks like unwritten extents aren't getting converted?

[root@magnesium test]# ls -i file
133 file
[root@magnesium test]# xfs_db -r /dev/hda8
xfs_db> inode 133
xfs_db> p
...
xfs_db> fsblock 11543
xfs_db> type bmapbtd
xfs_db> p
...
recs[1-40] = [startoff,startblock,blockcount,extentflag]
1:[0,24,16,1] 2:[16,40,16,0] 3:[1280,1304,16,1] 4:[1296,1320,16,0]
5:[2560,2584,16,1] 6:[2576,2600,16,0] 7:[3840,3864,16,1] 8:[3856,3880,16,0]
9:[5120,5144,16,1] 10:[5136,5160,16,0] 11:[6400,6424,16,1] 12:[6416,6440,16,0]
13:[7680,7704,16,1] 14:[7696,7720,16,0] 15:[8960,8984,16,1] 16:[8976,9000,16,0]
17:[10240,10264,16,1] 18:[10256,10280,16,0] 19:[11520,11544,16,1] 20:[11536,11560,16,0]
21:[12800,12824,16,1] 22:[12816,12840,16,0] 23:[14080,14104,16,1] 24:[14096,14120,16,0]
25:[15360,15384,16,1] 26:[15376,15400,16,0] 27:[16640,16664,16,1] 28:[16656,16680,16,0]
29:[17920,17944,16,1] 30:[17936,17960,16,0] 31:[19200,19224,16,1] 32:[19216,19240,16,0]
33:[20480,20504,16,1] 34:[20496,20520,16,0] 35:[21760,21784,16,1] 36:[21776,21800,16,0]
37:[23040,23064,16,1] 38:[23056,23080,16,0] 39:[24320,24344,16,1] 40:[24336,24360,16,0]
xfs_db> fsblock 24
xfs_db> p
000: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
020: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
040: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
...

So the data is there, but the unwritten extents weren't unflagged. If I make the filesystem with -dunwritten=0, it passes.
I can confirm the results of xfs_db (the beginning of the file actually contains 01s), but I get slightly different results with bmap:

file:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET          TOTAL FLAGS
   0: [0..127]:        3688..3815          0 (3688..3815)         128 10000
   1: [128..255]:      654080..654207      0 (654080..654207)     128
   2: [256..10239]:    hole                                      9984
   3: [10240..10367]:  314360..314487      0 (314360..314487)     128 10000
   4: [10368..10495]:  654336..654463      0 (654336..654463)     128
   5: [10496..20479]:  hole                                      9984
 ...

And, as already mentioned in comment #15, if I hexdump the file, I get not only NULs but also 01s (in the areas that correspond to extents 1, 4, and so on).
Yep, I should have used xfs_bmap -vp to break out the unwritten parts more clearly.

Also, there is definitely stale data coming through here (so repeating the test may show up 1s from previous runs...):

[root@magnesium mnt]# lmdd opat=1 bs=1m count=128 of=/dev/hda8
134.2177 MB in 0.9101 secs, 147.4722 MB/sec
[root@magnesium mnt]# mkfs.xfs -f -dsize=100m /dev/hda8
meta-data=/dev/hda8              isize=256    agcount=6, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=24576, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
[root@magnesium mnt]# mount /dev/hda8 test
[root@magnesium mnt]# cd test
[root@magnesium test]# /root/aiodio_sparse_rtc -k
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 0
*** WARNING *** file has not been unlinked; if you don't rm it manually first, it may influence the next run
[root@magnesium test]# hexdump file | more
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0014000 0000 0002 0004 0002 0008 0002 000c 0002   <--- stale disk data from lmdd
0014010 0010 0002 0014 0002 0018 0002 001c 0002
0014020 0020 0002 0024 0002 0028 0002 002c 0002
0014030 0030 0002 0034 0002 0038 0002 003c 0002
0014040 0040 0002 0044 0002 0048 0002 004c 0002
...
[root@magnesium test]# xfs_bmap -vp file | more
file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:        96..223           0 (96..223)          128 10000
   1: [128..255]:      224..351          0 (224..351)         128
   2: [256..10239]:    hole                                  9984
   3: [10240..10367]:  10336..10463      0 (10336..10463)     128 10000
   4: [10368..10495]:  10464..10591      0 (10464..10591)     128
   5: [10496..20479]:  hole                                  9984
 ...

This almost looks like it is converting the extents at the wrong offsets. If we look at bmap output from a filesystem w/o unwritten extents:

[root@magnesium test]# more /tmp/xfs_bmap_nounwritten
file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..127]:        96..223           0 (96..223)          128
   1: [128..10239]:    hole                                 10112
   2: [10240..10367]:  10336..10463      0 (10336..10463)     128
   3: [10368..20479]:  hole                                 10112

this actually does look like it's converting extents at the wrong offsets.
Created attachment 142124 [details]
testcase plus dirty free block generation

Actually, what you get when you hexdump the file is not data written by the most recent aiodio_sparse run; it is data from unallocated blocks on the disk. It just happened to contain 01s from previous aiodio_sparse runs if you had invoked it several times. I changed the patch to create dirty free blocks first; these then clearly show up in the resulting file.
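The dirty-free-block generation works along these lines - a sketch of the assumed approach, since the actual attachment may do it differently: fill the filesystem's free space with a marker byte, sync, and delete the filler file, so that freed blocks carry the marker and any stale-data exposure becomes obvious in a hexdump.

/* Sketch only - assumed approach for dirtying free blocks. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MARKER 0xdb	/* arbitrary pattern, easy to spot in hexdump */

int main(void)
{
	char buf[65536];
	int fd;

	memset(buf, MARKER, sizeof(buf));
	fd = open("filler", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* write until the filesystem is full */
	while (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
		;
	fsync(fd);		/* make sure the marker reaches the disk */
	close(fd);
	unlink("filler");	/* the freed blocks now contain the marker */
	return 0;
}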
Whoops, taking the xen folks off cc: - sorry about that, guys.
This looks like a core kernel problem; I'll see if I can get this fixed upstream.

finished_one_bio calls dio_complete, which calls xfs_end_io_direct with an offset, but:

	offset = dio->iocb->ki_pos;

so this is the *current*, post-IO position, not the IO start point that dio_complete expects. So xfs converts the wrong region. Ouch! (This matches the hexdumps in comment #15: a 64k chunk written at offset 0 gets its unwritten-extent conversion applied at offset 65536 = 0x10000 instead.) XFS seems to be the only filesystem that uses the offset passed to the end_io function, so only it is affected.

This seems to fix it up:

Index: linux-2.6.18/fs/direct-io.c
===================================================================
--- linux-2.6.18.orig/fs/direct-io.c
+++ linux-2.6.18/fs/direct-io.c
@@ -256,7 +256,8 @@ static void finished_one_bio(struct dio
 			if (dio->io_error)
 				transferred = dio->io_error;
 
-			dio_complete(dio, offset, transferred);
+			/* dio_complete wants starting point of IO */
+			dio_complete(dio, offset-transferred, transferred);
 
 			/* Complete AIO later if falling back to buffered i/o */
 			if (dio->result == dio->size ||

[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# xfs_info . | grep unwritten
         =                       sunit=0      swidth=0 blks, unwritten=1
Zach Brown's upstream AIO work actually seems to resolve this issue, so rather than trying to patch things up in 2.6.19, I think we'll just let FC6 catch up with upstream to resolve this. If someone has a need that is more urgent than this, let me know and I'll see if I can find a little time to verify the patch in the previous comment & get it into FC6 prior to 2.6.20....
This is a regression that I introduced when fixing a problem with reading a partial block at the end of a file opened with O_DIRECT. For the offending patch, see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178084. I agree that the right approach is to revert the behaviour.
Thanks to Eric for the following link: http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg02646.html
Minor update: I re-ran the patched aiodio_sparse testcase with printk's ahead of the calls to dio_complete, on both rhel5 and 2.6.21.

rhel5:

finished_one_bio call dio_complete offset 65536, ki_pos 65536
finished_one_bio call dio_complete offset 5308416, ki_pos 5308416
finished_one_bio call dio_complete offset 10551296, ki_pos 10551296
finished_one_bio call dio_complete offset 15794176, ki_pos 15794176
finished_one_bio call dio_complete offset 21037056, ki_pos 21037056
finished_one_bio call dio_complete offset 26279936, ki_pos 26279936
finished_one_bio call dio_complete offset 31522816, ki_pos 31522816
finished_one_bio call dio_complete offset 36765696, ki_pos 36765696
finished_one_bio call dio_complete offset 42008576, ki_pos 42008576
finished_one_bio call dio_complete offset 47251456, ki_pos 47251456
finished_one_bio call dio_complete offset 52494336, ki_pos 52494336
finished_one_bio call dio_complete offset 57737216, ki_pos 57737216
finished_one_bio call dio_complete offset 62980096, ki_pos 62980096
finished_one_bio call dio_complete offset 68222976, ki_pos 68222976
finished_one_bio call dio_complete offset 73465856, ki_pos 73465856
finished_one_bio call dio_complete offset 78708736, ki_pos 78708736
finished_one_bio call dio_complete offset 83951616, ki_pos 83951616
finished_one_bio call dio_complete offset 89194496, ki_pos 89194496
finished_one_bio call dio_complete offset 94437376, ki_pos 94437376
finished_one_bio call dio_complete offset 99680256, ki_pos 99680256

2.6.21:

dio_bio_end_aio call dio_complete offset/ki_pos 0
dio_bio_end_aio call dio_complete offset/ki_pos 5242880
dio_bio_end_aio call dio_complete offset/ki_pos 10485760
dio_bio_end_aio call dio_complete offset/ki_pos 15728640
dio_bio_end_aio call dio_complete offset/ki_pos 20971520
dio_bio_end_aio call dio_complete offset/ki_pos 26214400
dio_bio_end_aio call dio_complete offset/ki_pos 31457280
dio_bio_end_aio call dio_complete offset/ki_pos 36700160
dio_bio_end_aio call dio_complete offset/ki_pos 41943040
dio_bio_end_aio call dio_complete offset/ki_pos 47185920
dio_bio_end_aio call dio_complete offset/ki_pos 52428800
dio_bio_end_aio call dio_complete offset/ki_pos 57671680
dio_bio_end_aio call dio_complete offset/ki_pos 62914560
dio_bio_end_aio call dio_complete offset/ki_pos 68157440
dio_bio_end_aio call dio_complete offset/ki_pos 73400320
dio_bio_end_aio call dio_complete offset/ki_pos 78643200
dio_bio_end_aio call dio_complete offset/ki_pos 83886080
dio_bio_end_aio call dio_complete offset/ki_pos 89128960
dio_bio_end_aio call dio_complete offset/ki_pos 94371840
dio_bio_end_aio call dio_complete offset/ki_pos 99614720

So it seems clear that upstream (with Zach Brown's rewrite) calls dio_complete with the IO start point, whereas rhel5 calls it with the dio finish point (start + length of the IO).
Eric, can you post the kernel patches that generated this output, please? Thanks!
Sure... lazily pasting whitespace-mangled patches, sorry:

rhel5:

--- fs/direct-io.c.orig	2007-04-30 20:04:17.000000000 -0500
+++ fs/direct-io.c	2007-04-30 16:36:35.000000000 -0500
@@ -256,6 +256,7 @@
 			if (dio->io_error)
 				transferred = dio->io_error;
 
+			printk("finished_one_bio call dio_complete offset %lld, ki_pos %lld, block_in file %lld bytes\n", offset, dio->iocb->ki_pos, dio->block_in_file << dio->blkbits);
 			dio_complete(dio, offset, transferred);
 
 			/* Complete AIO later if falling back to buffered i/o */
@@ -1138,6 +1139,7 @@
 		if (rw == READ && (offset + transferred > i_size))
 			transferred = i_size - offset;
 	}
+	printk("direct_io_worker call dio_complete offset %lld, ki_pos %lld\n", offset, dio->iocb->ki_pos);
 	dio_complete(dio, offset, transferred);
 
 	if (ret == 0)
 		ret = transferred;

2.6.21:

Index: linux-2.6.21/fs/direct-io.c
===================================================================
--- linux-2.6.21.orig/fs/direct-io.c
+++ linux-2.6.21/fs/direct-io.c
@@ -283,7 +283,9 @@ static int dio_bio_end_aio(struct bio *b
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+		int ret;
+		printk("dio_bio_end_aio call dio_complete offset/ki_pos %lld\n", dio->iocb->ki_pos);
+		ret = dio_complete(dio, dio->iocb->ki_pos, 0);
 		aio_complete(dio->iocb, ret, 0);
 		kfree(dio);
 	}
@@ -1112,6 +1114,7 @@ direct_io_worker(int rw, struct kiocb *i
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 	BUG_ON(!dio->is_async && ret2 != 0);
 	if (ret2 == 0) {
+		printk("direct_io_worker call dio-complete offset %lld , ki_pos %lld\n", offset, dio->iocb->ki_pos);
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);
 	} else
It looks like the patch is newer than the output. The RHEL 5 runs didn't print out a block_in_file column.
Jeff, that's true, sorry - I updated it after we talked on IRC. With that patch, the block_in_file value was the same as the other values, i.e.:

finished_one_bio call dio_complete offset 65536, ki_pos 65536, block_in file 65536
This bug should be fixed in current kernels. Can you please re-test and update the bug as appropriate? Thanks!
FWIW, running the patched aiodio_sparse testcase on xfs with the F8 kernel was fine. I suppose the original Xen test would be the clincher. :)
I'm going to close this. We think it's fixed. If it isn't, feel free to re-open.