Bug 217098
Summary: | XFS issue with AIO + DIO + sparse files
---|---
Product: | Fedora
Component: | kernel
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Version: | 6
Hardware: | i386
OS: | Linux
Reporter: | Peter Backes <rtc>
Assignee: | Jeff Moyer <jmoyer>
QA Contact: | Brian Brock <bbrock>
CC: | berrange, esandeen, sct, wtogami, yugzhang
Doc Type: | Bug Fix
Last Closed: | 2007-10-04 19:09:40 UTC
Description - Peter Backes, 2006-11-24 02:58:52 UTC
"any other problems with XEN partitions" should read "any other problems with xfs partitions", sorry.

Does the underlying xfs partition check OK after you encounter the guest FS corruption?

I have experienced this first hand yesterday too - the blktap driver for file-backed domains appears to be totally non-functional on XFS. While it will do read operations OK, any write operation appears to be completely lost. Changing to the legacy loopback-based driver for file domains, everything works as expected on XFS. The same works fine on ext3, NFS or GFS. The main difference I can think of between the blktap & loop drivers is that blktap uses O_DIRECT + async I/O for doing all disk writes, while the loop driver is synchronous & buffered. Perhaps XFS is not supporting this O_DIRECT/AIO combination correctly (or even at all?)

XFS should handle O_DIRECT & AIO just fine (AFAIK xfs was the first Linux filesystem to support O_DIRECT) :) ... I'll have to read up on blktap to see how it works.

The host OS xfs partition hosting the guest OS image checks fine.

Eric - FYI, the blktap code in question is actually mostly in userspace - only a small stub is in the kernel, which exposes a magic device to userspace - a daemon in userspace deals with the actual I/O. From the Xen SRPM, look at the stuff under the tools/blktap/ directory within the Xen tar.gz. There are several different drivers available - in Fedora/RHEL we default to the 'tap:aio:' driver, which is the tools/blktap/drivers/block-aio.c source file.

I wrote above that "It is spurious and occurs randomly during the package installation in the guest OS, or as written above also during operation in the running OS, if it proceeds that far." But in fact, it seems not to be random at all.
I just ran another test, and it failed at the exact same point:

```
Installing: startup-notification  ##################### [ 27/111]
Updating  : mesa-libGL            ##################### [ 28/111]
Installing: libXres               ##################### [ 29/111]
error: libwnck-2.16.0-4.fc6: Header V3 DSA signature: BAD, key ID 4f2a6fd2
Installing: notification-daemon   ##################### [ 31/111]
/var/tmp/rpm-tmp.38377: line 3:  1642 Segmentation fault  gconftool-2 --makefile-install-rule /etc/gconf/schemas/notification-daemon.schemas >/dev/null
Installing: libnotify             ##################### [ 32/111]
error: unpacking of archive failed on file /usr/lib/libnotify.so.1.1.0;45671e73: cpio: MD5 sum mismatch
Installing: bluez-gnome           ##################### [ 33/111]
error: unpacking of archive failed on file /usr/bin/bluetooth-properties;45671e73: cpio: read
*** glibc detected *** /usr/bin/python: double free or corruption (!prev): 0x0b179258 ***
```

I conjecture that the apparent randomness is actually a dependency on the different memory size and swap device size parameters I used. I will test this hypothesis further and see if I can find out exactly which parameters determine whether it fails already during installation or only later during yum update. In fact, "free" shows the following:

```
[1]+  Stopped                 yum update
# free
             total       used       free     shared    buffers     cached
Mem:        235700     216064      19636          0      10012     131576
-/+ buffers/cache:      74476     161224
Swap:       522104          4     522100
```

It seems to have just written its first bytes to the virtual swap partition; perhaps there is some relation to this. It also leads to the further conjecture that the corruption issue might be bound to certain areas of the image file or something similar.

It is not at all random - it happens very consistently. If you install a guest using a block-device-backed disk, and then add a second disk backed by a file on XFS, you can easily reproduce the problem by writing to the 2nd disk.
Stephen Tweedie's verify-data tool will show that the problem happens on basically *any* write - they never successfully write their data out - future reads of data just written always return NULs instead of the magic bit pattern. http://people.redhat.com/sct/src/verify-data/

I will try verify-data. But I am a little bit confused... Basically, as I wrote above, I already succeeded in installing the guest system completely, and only a subsequent yum update triggered the issue (albeit twice at the very same position). How would that be consistent with "they never successfully write their data out"? Either our problems are different or there must be some condition for the problem to occur. Can you describe in a little more detail what exactly you are observing?

Okay, I can confirm the verify-data diagnostics. I swapoff'd the swap partition (/dev/xvda2) inside a guest whose image was stored on an xfs partition and used it for testing. I ran verify-data /dev/xvda2 10M and it showed exactly what Daniel said. Note the following:

1. Blocks that will be bad when reading already take a suspiciously long time to write.
2. If you subsequently run verify-data with the identical command line, it will not produce any error anymore. If you however run verify-data with the step divided by two (say, verify-data /dev/xvda2 5M) it will produce the problem only for every second block. So obviously the problem happens only the *first time* a block is written to.
3. If you dd if=/dev/zero of=/dev/loop0 on the losetup'd image on the host to remove any holes and make it non-sparse, the problem disappears entirely.

Thanks for the testing --- this sounds very much like a core kernel problem, not a bug in xen, so I'll reassign for now.

Yes, so it seems.
I reproduced it with http://developer.osdl.org/daniel/AIO/TESTS/aiodio_sparse.c plus the following patch:

```diff
--- aiodio_sparse.c.1	2003-11-12 20:53:46.000000000 +0100
+++ aiodio_sparse.c	2006-11-25 03:48:33.347155503 +0100
@@ -29,8 +29,8 @@
 	p = buf;
 	while (size > 0) {
-		if (*buf != 0) {
-			fprintf(stderr, "non zero buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
+		if (*buf != 1) {
+			fprintf(stderr, "non one buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
 				buf - p, (unsigned int)buf[0],
 				size > 1 ? (unsigned int)buf[1] : 0,
 				size > 2 ? (unsigned int)buf[2] : 0,
@@ -147,7 +147,7 @@
 		unlink(filename);
 		return;
 	}
-	memset(bufptr, 0, writesize);
+	memset(bufptr, 1, writesize);
 	io_prep_pwrite(iocbs[i], fd, bufptr, writesize, offset);
 	offset += writesize;
 }
```

(gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio)

I cross-checked on an ext3 partition, where it works fine. I also checked the current FC6 non-xen kernel (2.6.18-1.2849.fc6), which shows the same problem. So it is a general XFS problem, unrelated to xen.

Addendum: It actually sometimes fails on ext3, too, but not in a deterministic fashion. On xfs it reproducibly outputs "non one buffer at buf[0] => 0x00,00,00,00" and "non-zero read at offset 0", whereas on ext3 it sometimes passes and sometimes yields the same error, but at varying, high offsets. Unfortunately I only now actually read aiodio_sparse.c, and it seems to do something entirely different from what the name and description suggested at first... It needs to be further checked.

Created attachment 142103 [details]
Patch for aiodio_sparse.c
I changed the patch to first write the file with O_DIRECT with gaps and then
verify it with classic I/O. It deterministically fails on xfs now and works on
ext3. I am however not sure at all if it really tests what it is supposed to
test. It writes output to the file "file"; and if you run "hexdump file" after
it has completed on xfs, I get
```
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0010000 0101 0101 0101 0101 0101 0101 0101 0101
*
0020000 0000 0000 0000 0000 0000 0000 0000 0000
*
0510000 0101 0101 0101 0101 0101 0101 0101 0101
...
```
while on ext3 I get
```
0000000 0101 0101 0101 0101 0101 0101 0101 0101
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0500000 0101 0101 0101 0101 0101 0101 0101 0101
*
0510000 0000 0000 0000 0000 0000 0000 0000 0000
*
```
So xfs seems to put the data just behind the block where it is supposed to go, if
I understood correctly.
If you read the original aiodio_sparse.c, you will see why it 'accidentally'
just sometimes passed on ext3 and failed on xfs.
Please comment if the patched aiodio_sparse.c is for some reason still
inappropriate.
I now did a little bit of testing inside a xen guest, with a fresh "dd if=/dev/zero of=test count=1 seek=1G bs=1" mounted as /dev/xvdb (hexdump lists it as entirely nul). If I run ./verify-data /dev/xvdb 25M inside the guest and hexdump the image afterwards on the host, there is obviously lots of data in it from previously deleted files on the XFS partition.

Created attachment 142119 [details]
testcase
I extended the aiodio_sparse testcase patch so that O_DIRECT, and writing into the
hole of a sparse file vs. writing at the end of the file, can each be turned on and off.

Compile with:

```
gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio
```

./aiodio_sparse runs the test (fails on xfs, with the hexdump shown above; passes on ext3).
./aiodio_sparse -S runs the test with appending instead of sparse-file hole writing (passes).
./aiodio_sparse -D runs the test without O_DIRECT (passes).
Supply -k to keep the file "file" in any case.
rtc's new patch seems to show a genuine problem. It is writing staggered blocks of 1s, but when read back we get zeros. Further, if we bmap the file:

```
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
  0: [0..255]:        192..447          0 (192..447)         256 10000
  1: [256..10239]:    hole                                  9984
  2: [10240..10495]:  10432..10687      0 (10432..10687)     256 10000
  3: [10496..20479]:  hole                                  9984
...
```

looks like unwritten extents aren't getting converted?

```
[root@magnesium test]# ls -i file
133 file
[root@magnesium test]# xfs_db -r /dev/hda8
xfs_db> inode 133
xfs_db> p
...
xfs_db> fsblock 11543
xfs_db> type bmapbtd
xfs_db> p
...
recs[1-40] = [startoff,startblock,blockcount,extentflag]
1:[0,24,16,1] 2:[16,40,16,0] 3:[1280,1304,16,1] 4:[1296,1320,16,0]
5:[2560,2584,16,1] 6:[2576,2600,16,0] 7:[3840,3864,16,1] 8:[3856,3880,16,0]
9:[5120,5144,16,1] 10:[5136,5160,16,0] 11:[6400,6424,16,1] 12:[6416,6440,16,0]
13:[7680,7704,16,1] 14:[7696,7720,16,0] 15:[8960,8984,16,1] 16:[8976,9000,16,0]
17:[10240,10264,16,1] 18:[10256,10280,16,0] 19:[11520,11544,16,1] 20:[11536,11560,16,0]
21:[12800,12824,16,1] 22:[12816,12840,16,0] 23:[14080,14104,16,1] 24:[14096,14120,16,0]
25:[15360,15384,16,1] 26:[15376,15400,16,0] 27:[16640,16664,16,1] 28:[16656,16680,16,0]
29:[17920,17944,16,1] 30:[17936,17960,16,0] 31:[19200,19224,16,1] 32:[19216,19240,16,0]
33:[20480,20504,16,1] 34:[20496,20520,16,0] 35:[21760,21784,16,1] 36:[21776,21800,16,0]
37:[23040,23064,16,1] 38:[23056,23080,16,0] 39:[24320,24344,16,1] 40:[24336,24360,16,0]
xfs_db> fsblock 24
xfs_db> p
000: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
020: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
040: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
...
```

so the data is there, but the unwritten extents weren't unflagged. If I make the filesystem with -dunwritten=0, it passes.
I can confirm the results of xfs_db (the beginning of the file actually contains 01s), but I get slightly different results with bmap:

```
file:
EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET          TOTAL FLAGS
  0: [0..127]:        3688..3815          0 (3688..3815)         128 10000
  1: [128..255]:      654080..654207      0 (654080..654207)     128
  2: [256..10239]:    hole                                      9984
  3: [10240..10367]:  314360..314487      0 (314360..314487)     128 10000
  4: [10368..10495]:  654336..654463      0 (654336..654463)     128
  5: [10496..20479]:  hole                                      9984
...
```

and, as already mentioned in comment #15, if I hexdump the file, I get not only nuls but also 01s (in the areas that correspond to EXT 1, 4, and so on).

Yep, I should have used xfs_bmap -vp to break out the unwritten parts more clearly. Also, there is definitely stale data coming through here (so repeating the test may show up 1s from previous runs...):

```
[root@magnesium mnt]# lmdd opat=1 bs=1m count=128 of=/dev/hda8
134.2177 MB in 0.9101 secs, 147.4722 MB/sec
[root@magnesium mnt]# mkfs.xfs -f -dsize=100m /dev/hda8
meta-data=/dev/hda8              isize=256    agcount=6, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=24576, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
[root@magnesium mnt]# mount /dev/hda8 test
[root@magnesium mnt]# cd test
[root@magnesium test]# /root/aiodio_sparse_rtc -k
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 0
*** WARNING *** file has not been unlinked; if you don't rm it manually first, it may influence the next run
[root@magnesium test]# hexdump file | more
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0014000 0000 0002 0004 0002 0008 0002 000c 0002   <--- stale disk data from lmdd
0014010 0010 0002 0014 0002 0018 0002 001c 0002
0014020 0020 0002 0024 0002 0028 0002 002c 0002
0014030 0030 0002 0034 0002 0038 0002 003c 0002
0014040 0040 0002 0044 0002 0048 0002 004c 0002
...
```
```
[root@magnesium test]# xfs_bmap -vp file | more
file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
  0: [0..127]:        96..223           0 (96..223)          128 10000
  1: [128..255]:      224..351          0 (224..351)         128
  2: [256..10239]:    hole                                  9984
  3: [10240..10367]:  10336..10463      0 (10336..10463)     128 10000
  4: [10368..10495]:  10464..10591      0 (10464..10591)     128
  5: [10496..20479]:  hole                                  9984
...
```

This almost looks like it is converting the extents at the wrong offsets. Compare the bmap output from a filesystem without unwritten extents:

```
[root@magnesium test]# more /tmp/xfs_bmap_nounwritten
file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
  0: [0..127]:        96..223           0 (96..223)          128
  1: [128..10239]:    hole                                 10112
  2: [10240..10367]:  10336..10463      0 (10336..10463)     128
  3: [10368..20479]:  hole                                 10112
```

Created attachment 142124 [details]
testcase plus dirty free block generation
Actually, what you get if you hexdump the file is not data written by the most
recent aiodio_sparse run, but data from unallocated blocks on the disk. It just
happened to contain 01s from previous aiodio_sparse runs if you invoked it
several times. I changed the patch to create dirty free blocks; these then
clearly show up in the resulting file.
Whoops - take the xen folks off cc: - sorry about that, guys.

This looks like a core kernel problem; I'll see if I can get this fixed upstream. finished_one_bio calls dio_complete, which calls xfs_end_io_direct with an offset, but:

```c
offset = dio->iocb->ki_pos;
```

so this is the *current*, post-I/O position, not the I/O start point that dio_complete expects. So xfs converts the wrong region. Ouch! XFS seems to be the only filesystem that uses the offset passed to the end_io function, so only it is affected. This seems to fix it up:

```diff
Index: linux-2.6.18/fs/direct-io.c
===================================================================
--- linux-2.6.18.orig/fs/direct-io.c
+++ linux-2.6.18/fs/direct-io.c
@@ -256,7 +256,8 @@ static void finished_one_bio(struct dio
 	if (dio->io_error)
 		transferred = dio->io_error;
 
-	dio_complete(dio, offset, transferred);
+	/* dio_complete wants starting point of IO */
+	dio_complete(dio, offset-transferred, transferred);
 
 	/* Complete AIO later if falling back to buffered i/o */
 	if (dio->result == dio->size ||
```

```
[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# /root/aiodio_sparse_rtc
[root@magnesium test]# xfs_info . | grep unwritten
         =                       sunit=0      swidth=0 blks, unwritten=1
```

Zach Brown's upstream AIO work actually seems to resolve this issue, so rather than trying to patch things up in 2.6.19, I think we'll just let FC6 catch up with upstream to resolve this. If someone has a need that is more urgent than this, let me know and I'll see if I can find a little time to verify the patch in the previous comment and get it into FC6 prior to 2.6.20...

This is a regression that I introduced when fixing a problem with reading a partial block at the end of a file opened with O_DIRECT. For the offending patch, see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178084. I agree that the right approach is to revert the behaviour.
Thanks to Eric for the following link: http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg02646.html

Minor update; I re-ran the patched aiodio_sparse testcase with printk's ahead of the calls to dio_complete on both rhel5 and 2.6.21.

rhel5:

```
finished_one_bio call dio_complete offset 65536, ki_pos 65536
finished_one_bio call dio_complete offset 5308416, ki_pos 5308416
finished_one_bio call dio_complete offset 10551296, ki_pos 10551296
finished_one_bio call dio_complete offset 15794176, ki_pos 15794176
finished_one_bio call dio_complete offset 21037056, ki_pos 21037056
finished_one_bio call dio_complete offset 26279936, ki_pos 26279936
finished_one_bio call dio_complete offset 31522816, ki_pos 31522816
finished_one_bio call dio_complete offset 36765696, ki_pos 36765696
finished_one_bio call dio_complete offset 42008576, ki_pos 42008576
finished_one_bio call dio_complete offset 47251456, ki_pos 47251456
finished_one_bio call dio_complete offset 52494336, ki_pos 52494336
finished_one_bio call dio_complete offset 57737216, ki_pos 57737216
finished_one_bio call dio_complete offset 62980096, ki_pos 62980096
finished_one_bio call dio_complete offset 68222976, ki_pos 68222976
finished_one_bio call dio_complete offset 73465856, ki_pos 73465856
finished_one_bio call dio_complete offset 78708736, ki_pos 78708736
finished_one_bio call dio_complete offset 83951616, ki_pos 83951616
finished_one_bio call dio_complete offset 89194496, ki_pos 89194496
finished_one_bio call dio_complete offset 94437376, ki_pos 94437376
finished_one_bio call dio_complete offset 99680256, ki_pos 99680256
```

2.6.21:

```
dio_bio_end_aio call dio_complete offset/ki_pos 0
dio_bio_end_aio call dio_complete offset/ki_pos 5242880
dio_bio_end_aio call dio_complete offset/ki_pos 10485760
dio_bio_end_aio call dio_complete offset/ki_pos 15728640
dio_bio_end_aio call dio_complete offset/ki_pos 20971520
dio_bio_end_aio call dio_complete offset/ki_pos 26214400
dio_bio_end_aio call dio_complete offset/ki_pos 31457280
dio_bio_end_aio call dio_complete offset/ki_pos 36700160
dio_bio_end_aio call dio_complete offset/ki_pos 41943040
dio_bio_end_aio call dio_complete offset/ki_pos 47185920
dio_bio_end_aio call dio_complete offset/ki_pos 52428800
dio_bio_end_aio call dio_complete offset/ki_pos 57671680
dio_bio_end_aio call dio_complete offset/ki_pos 62914560
dio_bio_end_aio call dio_complete offset/ki_pos 68157440
dio_bio_end_aio call dio_complete offset/ki_pos 73400320
dio_bio_end_aio call dio_complete offset/ki_pos 78643200
dio_bio_end_aio call dio_complete offset/ki_pos 83886080
dio_bio_end_aio call dio_complete offset/ki_pos 89128960
dio_bio_end_aio call dio_complete offset/ki_pos 94371840
dio_bio_end_aio call dio_complete offset/ki_pos 99614720
```

so it seems clear that upstream (with Zach Brown's rewrite) is calling dio_complete with the I/O start point, whereas rhel5 is calling it with the dio finish point (start + transferred).

Eric, can you post the kernel patches that generated this output, please? Thanks!

Sure...
Lazy pasting whitespace-mangled patches, sorry:

```diff
--- fs/direct-io.c.orig	2007-04-30 20:04:17.000000000 -0500
+++ fs/direct-io.c	2007-04-30 16:36:35.000000000 -0500
@@ -256,6 +256,7 @@
 	if (dio->io_error)
 		transferred = dio->io_error;
 
+	printk("finished_one_bio call dio_complete offset %lld, ki_pos %lld, block_in file %lld bytes\n", offset, dio->iocb->ki_pos, dio->block_in_file << dio->blkbits);
 	dio_complete(dio, offset, transferred);
 
 	/* Complete AIO later if falling back to buffered i/o */
@@ -1138,6 +1139,7 @@
 		if (rw == READ && (offset + transferred > i_size))
 			transferred = i_size - offset;
 	}
+	printk("direct_io_worker call dio_complete offset %lld, ki_pos %lld\n", offset, dio->iocb->ki_pos);
 	dio_complete(dio, offset, transferred);
 
 	if (ret == 0)
 		ret = transferred;
```

```diff
Index: linux-2.6.21/fs/direct-io.c
===================================================================
--- linux-2.6.21.orig/fs/direct-io.c
+++ linux-2.6.21/fs/direct-io.c
@@ -283,7 +283,9 @@ static int dio_bio_end_aio(struct bio *b
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+		int ret;
+		printk("dio_bio_end_aio call dio_complete offset/ki_pos %lld\n", dio->iocb->ki_pos);
+		ret = dio_complete(dio, dio->iocb->ki_pos, 0);
 		aio_complete(dio->iocb, ret, 0);
 		kfree(dio);
 	}
@@ -1112,6 +1114,7 @@ direct_io_worker(int rw, struct kiocb *i
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 	BUG_ON(!dio->is_async && ret2 != 0);
 	if (ret2 == 0) {
+		printk("direct_io_worker call dio-complete offset %lld , ki_pos %lld\n", offset, dio->iocb->ki_pos);
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);
 	} else
```

It looks like the patch is newer than the output. The RHEL 5 runs didn't print out a block_in_file column.

Jeff, that's true, sorry - I updated it after we talked on IRC.
With that patch, the block_in_file value was the same as the other values, i.e.:

```
finished_one_bio call dio_complete offset 65536, ki_pos 65536, block_in file 65536
```

This bug should be fixed in current kernels. Can you please re-test and update the bug as appropriate? Thanks!

FWIW, running the patched aiodio_sparse on xfs with the F8 kernel was fine. I suppose the original Xen test would be the clincher. :)

I'm going to close this. We think it's fixed. If it isn't, feel free to re-open.