Bug 217098

Summary:     XFS issue with AIO + DIO + sparse files
Product:     Fedora
Version:     6
Component:   kernel
Hardware:    i386
OS:          Linux
Status:      CLOSED ERRATA
Severity:    medium
Priority:    medium
Reporter:    Peter Backes <rtc>
Assignee:    Jeff Moyer <jmoyer>
QA Contact:  Brian Brock <bbrock>
CC:          berrange, esandeen, sct, wtogami, yugzhang
Doc Type:    Bug Fix
Last Closed: 2007-10-04 19:09:40 UTC
Attachments:
  Patch for aiodio_sparse.c
  testcase
  testcase plus dirty free block generation

Description Peter Backes 2006-11-24 02:58:52 UTC
Description of problem:
I am trying to install an FC6 Xen guest on my FC6 host OS, with the guest's
image stored on an XFS partition of the host OS. However, the installation
process fails badly: the image gets corrupted sooner or later while the
packages are installed, and the installation then bombs out with "rpmdb:
PANIC: fatal region error detected; run recovery". Once out of about 10
attempts, the installation succeeded, but when I ran "yum update" in this
survivor, it died too, also corrupting the ext3 partition in the image stored
on the xfs partition, as I found out on VM shutdown:
===
  Updating  : gtk2                         ##################### [ 23/111] 
  Updating  : GConf2                       ##################### [ 24/111] 
  Updating  : selinux-policy               ##################### [ 25/111] 
  Updating  : parted                       ##################### [ 26/111] 
  Installing: startup-notification         ##################### [ 27/111] 
  Updating  : mesa-libGL                   ##################### [ 28/111] 
  Installing: libXres                      ##################### [ 29/111] 
error: libwnck-2.16.0-4.fc6: Header V3 DSA signature: BAD, key ID 4f2a6fd2
  Installing: notification-daemon          ##################### [ 31/111] 
error: unpacking of archive failed on file
/usr/lib/notification-daemon-1.0/engines/libstandard.so;45663b4e: cpio: read
  Installing: libnotify                    ##################### [ 32/111] 
error: unpacking of archive failed on file /usr/lib/libnotify.so.1.1.0;45663b4e:
cpio: read
  Installing: bluez-gnome                  ##################### [ 33/111] 
error: unpacking of archive failed on file
/usr/share/doc/bluez-gnome-0.6/COPYING;45663b4e: cpio: MD5 sum mismatch
*** glibc detected *** /usr/bin/python: malloc(): memory corruption: 0x0a262668 ***


^C
^C
[1]+  Stopped                 yum update
===

Strangely, I didn't notice any other problems with XEN partitions on the host
OS, although /usr, /var, and /var/lib are also xfs. rpm -qVa does not show any
corruption... Perhaps it is specific to sparse files such as the image?

If I store the image on an ext3 partition, no such corruption happens.

Today was actually the first time I used FC6 and the first time I played with
XFS.

Version-Release number of selected component (if applicable):
Current packages (yum update) as of Fri Nov 24 03:08:52 CET 2006

How reproducible:
It is spurious and occurs randomly during the package installation in the guest
OS, or as written above also during operation in the running OS, if it proceeds
that far.

Steps to Reproduce:
1. Installed FC6 with "selinux=0 xfs text askmethod"; yum update; booted updated
kernel
2.
/usr/sbin/xenguest-install -n test -f /home/test.img -r 230 -s 2 -l
ftp://ftp.fu-berlin.de/linux/fedora/core/6/i386/os -x "ip=(IP) netmask=(netmask)
gateway=(gateway) dns=(dns) ks=ftp://.../ks.cfg"

ERROR: Installs currently require 256 megs of RAM.

How much RAM should be allocated (in megabytes)? 230
Would you like to enable graphics support? (yes or no) yes
...

=== ks.cfg ===
install
url --url ftp://ftp.fu-berlin.de/linux/fedora/core/6/i386/os
lang en_US
keyboard de-latin1-nodeadkeys
xconfig --driver "nv" --resolution 1024x768 --depth 24
network --device eth0 --bootproto static --ip (IP) --netmask (netmask) --gateway
(gateway) --nameserver (nameserver) --hostname test.(domain)
rootpw --iscrypted (encrypted root pw)
firewall --enabled --port=22:tcp
firstboot --disable
authconfig --enableshadow --enablemd5
selinux --disabled
timezone right/Europe/Berlin
bootloader --location=mbr --driveorder=xvda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all
part swap --size=512 --asprimary
part / --fstype ext3 --size=1 --grow

%packages
@core
tzdata
yum
===

3. wait
  
Actual results:
bombs out with guest OS image corruption.

Expected results:
should proceed flawlessly.

Additional info:
The machine has 512 MB of RAM; thus I can only use 230 MB for the guest VM.
uname -a:
Linux (host machine) 2.6.18-1.2849.fc6xen #1 SMP Fri Nov 10 13:56:52 EST 2006
i686 i686 i386 GNU/Linux
Layout on host OS:
#part /boot --fstype ext3 --size=128 --asprimary
#part /home --fstype xfs --size=4096
#part /var --fstype xfs --size=2048
#part /usr --fstype xfs --size=2048
#part swap --size=1024 --asprimary
#part / --fstype ext3 --size=512 --asprimary
#part /var/lib --fstype xfs --size=1 --grow

Comment 1 Peter Backes 2006-11-24 03:01:04 UTC
"any other problems with XEN partitions" should read "any other problems with
xfs partitions", sorry.

Comment 2 Eric Sandeen 2006-11-24 03:31:38 UTC
Does the underlying xfs partition check OK after you encounter the guest FS
corruption?

Comment 3 Daniel Berrangé 2006-11-24 03:56:30 UTC
I experienced this first hand yesterday too - the blktap driver for file-backed
domains appears to be totally non-functional on XFS. While it does read
operations OK, any write operation appears to be completely lost. Changing to
the legacy loopback-based driver for file-backed domains makes everything work
as expected on XFS. The same blktap driver works fine on ext3, NFS or GFS.

The main difference I can think of between the blktap & loop drivers is that
blktap uses O_DIRECT + async IO for all disk writes, while the loop driver is
synchronous & buffered. Perhaps XFS is not supporting this O_DIRECT/aio
combination correctly (or even at all?)
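
To illustrate the I/O pattern in question: a minimal libaio + O_DIRECT write
looks roughly like the following (a sketch only, not the actual blktap code;
the file name and sizes are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	/* O_DIRECT needs sector-aligned buffers, sizes and offsets */
	if (posix_memalign(&buf, 512, 65536))
		return 1;
	memset(buf, 0xab, 65536);

	fd = open("disk.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;

	/* queue one asynchronous direct write at offset 0 ... */
	io_prep_pwrite(&cb, fd, buf, 65536, 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/* ... and reap its completion event */
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;
	printf("aio write completed, res=%ld\n", (long)ev.res);

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}

(compile with gcc -Wall -O2 aio_write.c -o aio_write -laio)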



Comment 4 Eric Sandeen 2006-11-24 04:00:39 UTC
XFS should handle O_DIRECT & aio just fine (AFAIK xfs was the first linux
filesystem to support O_DIRECT) :) ... I'll have to read up on blktap to see how
it works.

Comment 5 Peter Backes 2006-11-24 15:05:55 UTC
The host OS xfs partition hosting the guest OS image checks fine.

Comment 6 Daniel Berrangé 2006-11-24 15:46:03 UTC
Eric - FYI, the blktap code in question is actually mostly in userspace - only
a small stub is in the kernel, which exposes a magic device to userspace; a
daemon in userspace handles the actual I/O. From the Xen SRPM, look at the
stuff under the tools/blktap/ directory within the Xen tar.gz. There are
several different drivers available - in Fedora/RHEL we default to the
'tap:aio:' driver, which is the tools/blktap/drivers/block-aio.c source file.


Comment 7 Peter Backes 2006-11-24 16:43:47 UTC
I wrote above that "It is spurious and occurs randomly during the package
installation in the guest OS, or as written above also during operation in the
running OS, if it proceeds that far." But in fact, it seems not to be random at
all. I just ran another test, and it failed at the exact same point:

  Installing: startup-notification         ##################### [ 27/111] 
  Updating  : mesa-libGL                   ##################### [ 28/111] 
  Installing: libXres                      ##################### [ 29/111] 
error: libwnck-2.16.0-4.fc6: Header V3 DSA signature: BAD, key ID 4f2a6fd2
  Installing: notification-daemon          ##################### [ 31/111] 
/var/tmp/rpm-tmp.38377: line 3:  1642 Segmentation fault      gconftool-2
--makefile-install-rule /etc/gconf/schemas/notification-daemon.schemas >/dev/null
  Installing: libnotify                    ##################### [ 32/111] 
error: unpacking of archive failed on file /usr/lib/libnotify.so.1.1.0;45671e73:
cpio: MD5 sum mismatch
  Installing: bluez-gnome                  ##################### [ 33/111] 
error: unpacking of archive failed on file
/usr/bin/bluetooth-properties;45671e73: cpio: read
*** glibc detected *** /usr/bin/python: double free or corruption (!prev):
0x0b179258 ***

I conjecture that the apparent randomness is not randomness at all, but
depends on the different memory size and swap device size parameters I used. I
will test this hypothesis further and see whether I can find out exactly which
parameters determine whether it already fails during installation or only
later during yum update. In fact, "free" shows the following:

[1]+  Stopped                 yum update
# free
             total       used       free     shared    buffers     cached
Mem:        235700     216064      19636          0      10012     131576
-/+ buffers/cache:      74476     161224
Swap:       522104          4     522100

It seems to have just written its first bytes to the virtual swap partition;
perhaps that is related.

This also leads to the further conjecture that the corruption might be tied to
certain areas of the image file or something similar.



Comment 8 Daniel Berrangé 2006-11-24 18:21:15 UTC
It is not at all random - it happens very consistently. If you install a guest
using a block-device-backed disk, and then add a second disk backed by a file
on XFS, you can easily reproduce the problem by writing to the 2nd disk.
Stephen Tweedie's verify-data tool shows that the problem happens on basically
*any* write - they never successfully write their data out - future reads of
data just written always return NULLs instead of the magic bit pattern.
http://people.redhat.com/sct/src/verify-data/
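
For reference, the check that verify-data performs is conceptually along these
lines - a simplified sketch, not the actual tool (the real thing is at the URL
above); block size, pattern, range and device path here are just examples:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ   4096
#define NBLOCKS 256

/* per-block "magic" pattern derived from the block's offset */
static void fill(unsigned char *buf, off_t off)
{
	size_t i;

	for (i = 0; i < BLKSZ; i++)
		buf[i] = (unsigned char)((off / BLKSZ) + i);
}

int main(int argc, char **argv)
{
	unsigned char wbuf[BLKSZ], rbuf[BLKSZ];
	const char *path = argc > 1 ? argv[1] : "/dev/xvdb";
	int fd = open(path, O_RDWR);
	off_t off;

	if (fd < 0)
		return 1;

	/* pass 1: stamp every block in the test range with its pattern */
	for (off = 0; off < (off_t)NBLOCKS * BLKSZ; off += BLKSZ) {
		fill(wbuf, off);
		if (pwrite(fd, wbuf, BLKSZ, off) != BLKSZ)
			return 1;
	}
	fsync(fd);

	/* pass 2: read every block back and compare against the pattern */
	for (off = 0; off < (off_t)NBLOCKS * BLKSZ; off += BLKSZ) {
		fill(wbuf, off);
		if (pread(fd, rbuf, BLKSZ, off) != BLKSZ)
			return 1;
		if (memcmp(wbuf, rbuf, BLKSZ))
			fprintf(stderr, "mismatch at offset %lld\n",
				(long long)off);
	}
	close(fd);
	return 0;
}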


Comment 9 Peter Backes 2006-11-24 19:43:13 UTC
I will try verify-data. But I am a little bit confused... As I wrote above, I
already succeeded in installing the guest system completely once, and only a
subsequent yum update triggered the issue (albeit twice at the very same
position). How would that be consistent with "they never successfully write
their data out"? Either our problems are different or there must be some
condition for the problem to occur. Can you describe in a little more detail
exactly what you are observing?

Comment 10 Peter Backes 2006-11-24 20:55:43 UTC
Okay, I can confirm the verify-data diagnostics. I swapoff'd the swap partition
(/dev/xvda2) inside a guest whose image was stored on an xfs partition and used
it for testing. I ran verify-data /dev/xvda2 10M and it showed exactly what
Daniel said. Note the following:
1. Blocks that will be bad when reading already take a suspiciously long time to
write.
2. If you subsequently run verify-data with the identical command line, it no
longer produces any error. If, however, you run verify-data with the step
divided by two (say, verify-data /dev/xvda2 5M), it will produce the problem
only for every second block. So obviously the problem happens only the *first
time* a block is written to.
3. If you dd if=/dev/zero of=/dev/loop0 onto the losetup'd image on the host to
remove any holes and make it non-sparse, the problem disappears entirely.
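
For reference, a quick way to check on the host whether the image file is
(still) sparse is to compare its allocated block count against its apparent
size - a trivial sketch (the path is passed as an argument):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) != 0)
		return 1;

	/* st_blocks is counted in 512-byte units */
	printf("apparent size: %lld bytes, allocated: %lld bytes%s\n",
	       (long long)st.st_size, (long long)st.st_blocks * 512,
	       (long long)st.st_blocks * 512 < (long long)st.st_size ?
	       " (sparse)" : "");
	return 0;
}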


Comment 11 Stephen Tweedie 2006-11-24 21:22:13 UTC
Thanks for the testing --- this sounds very much like a core kernel problem, not
a bug in xen, so I'll reassign for now.

Comment 12 Peter Backes 2006-11-25 03:38:00 UTC
Yes, so it seems. I reproduced it with
http://developer.osdl.org/daniel/AIO/TESTS/aiodio_sparse.c plus the following patch:

--- aiodio_sparse.c.1	2003-11-12 20:53:46.000000000 +0100
+++ aiodio_sparse.c	2006-11-25 03:48:33.347155503 +0100
@@ -29,8 +29,8 @@
 	p = buf;
 
 	while (size > 0) {
-		if (*buf != 0) {
-			fprintf(stderr, "non zero buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
+		if (*buf != 1) {
+			fprintf(stderr, "non one buffer at buf[%d] => 0x%02x,%02x,%02x,%02x\n",
 				buf - p, (unsigned int)buf[0],
 				size > 1 ? (unsigned int)buf[1] : 0,
 				size > 2 ? (unsigned int)buf[2] : 0,
@@ -147,7 +147,7 @@
 			unlink(filename);
 			return;
 		}
-		memset(bufptr, 0, writesize);
+		memset(bufptr, 1, writesize);
 		io_prep_pwrite(iocbs[i], fd, bufptr, writesize, offset);
 		offset += writesize;
 	}

(gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio)

I cross-checked on an ext3 partition where it works fine. I also checked the
current FC6 non-xen kernel (2.6.18-1.2849.fc6), which shows the same problem. So
it is a general XFS problem, unrelated to xen.


Comment 13 Peter Backes 2006-11-25 05:16:52 UTC
Addendum: It actually sometimes fails on ext3, too, but not in a deterministic
fashion. On xfs it reproducibly outputs

non one buffer at buf[0] => 0x00,00,00,00
non-zero read at offset 0

whereas on ext3 it sometimes passes and sometimes yields the same error, but
with a varying, high offset.


Comment 14 Peter Backes 2006-11-25 05:29:13 UTC
Unfortunately I only now actually read aiodio_sparse.c, and it seems to do
something entirely different from what the name and description suggested at
first... It needs to be checked further.

Comment 15 Peter Backes 2006-11-25 06:46:13 UTC
Created attachment 142103 [details]
Patch for aiodio_sparse.c

I changed the patch to first write the file with O_DIRECT, leaving gaps, and
then verify it with classic I/O. It now deterministically fails on xfs and
works on ext3. I am however not at all sure whether it really tests what it is
supposed to test. It writes its output to the file "file"; if you run "hexdump
file" after it has completed on xfs, I get

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0010000 0101 0101 0101 0101 0101 0101 0101 0101
*
0020000 0000 0000 0000 0000 0000 0000 0000 0000
*
0510000 0101 0101 0101 0101 0101 0101 0101 0101
...

while on ext3 I get 
0000000 0101 0101 0101 0101 0101 0101 0101 0101
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0500000 0101 0101 0101 0101 0101 0101 0101 0101
*
0510000 0000 0000 0000 0000 0000 0000 0000 0000
*

So xfs seems to put the data just behind the block where it is supposed to go,
if I understood correctly.

If you read the original aiodio_sparse.c, you will see why it 'accidentally'
passed only sometimes on ext3 and failed on xfs.

Please comment if the patched aiodio_sparse.c is for some reason still
inappropriate.
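
For clarity, the idea of the attached patch boils down to roughly the
following - a condensed, illustrative sketch only, not the attachment itself
(file name, sizes and offsets are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK (64 * 1024)

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	unsigned char rbuf[BLK];
	void *wbuf;
	int fd, i;

	if (posix_memalign(&wbuf, 512, BLK))
		return 1;
	memset(wbuf, 1, BLK);

	fd = open("file", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;

	/* make the file sparse: extend it well past where we will write */
	if (ftruncate(fd, 100 * 1024 * 1024))
		return 1;

	/* AIO + O_DIRECT write of 0x01s into the middle of the hole */
	io_prep_pwrite(&cb, fd, wbuf, BLK, 5 * 1024 * 1024);
	if (io_submit(ctx, 1, cbs) != 1 ||
	    io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;
	io_destroy(ctx);
	close(fd);

	/* verify the same region with classic (buffered) I/O */
	fd = open("file", O_RDONLY);
	if (fd < 0 || pread(fd, rbuf, BLK, 5 * 1024 * 1024) != BLK)
		return 1;
	for (i = 0; i < BLK; i++) {
		if (rbuf[i] != 1) {
			fprintf(stderr, "non-one byte at offset %d\n", i);
			return 1;
		}
	}
	printf("ok\n");
	return 0;
}

(compile as above with -laio; on an affected kernel the buffered read-back
finds zeros where the 0x01s were written)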

Comment 16 Peter Backes 2006-11-25 07:28:54 UTC
I now did a little bit of testing inside a xen guest, with a fresh image
created by dd if=/dev/zero of=test count=1 seek=1G bs=1 and mounted as
/dev/xvdb (hexdump lists it as entirely NULs). If I run ./verify-data
/dev/xvdb 25M inside the guest and hexdump the image afterwards on the host,
there is obviously lots of data in it from previously deleted files on the XFS
partition.

Comment 17 Peter Backes 2006-11-25 18:29:34 UTC
Created attachment 142119 [details]
testcase

I extended the aiodio_sparse testcase patch so that O_DIRECT, and writing into
the hole of a sparse file vs. writing at the end of the file, can be turned on
and off.

compile with
gcc -g -Wall -O2 aiodio_sparse.c -o aiodio_sparse -laio

./aiodio_sparse
    runs the test (fails on xfs, hexdump showing as above, passes on ext3)
./aiodio_sparse -S
    runs the test with appending instead of  sparse file hole writing. (passes)

./aiodio_sparse -D
    runs the test without O_DIRECT (passes)

supply -k to keep the file "file" in any case.

Comment 18 Eric Sandeen 2006-11-25 20:11:46 UTC
rtc's new patch seems to show a genuine problem.  It is writing staggered 
blocks of "1's" but when read back we get zeros.  Further, if we bmap the file:

 EXT: FILE-OFFSET       BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..255]:         192..447          0 (192..447)         256 10000
   1: [256..10239]:     hole                                  9984
   2: [10240..10495]:   10432..10687      0 (10432..10687)     256 10000
   3: [10496..20479]:   hole                                  9984
...

looks like unwritten extents aren't getting converted?

[root@magnesium test]# ls -i file
133 file
[root@magnesium test]# xfs_db -r /dev/hda8
xfs_db> inode 133
xfs_db> p
...
xfs_db> fsblock 11543
xfs_db> type bmapbtd
xfs_db> p
...
recs[1-40] = [startoff,startblock,blockcount,extentflag] 1:[0,24,16,1]
2:[16,40,16,0] 3:[1280,1304,16,1] 4:[1296,1320,16,0] 5:[2560,2584,16,1]
6:[2576,2600,16,0] 7:[3840,3864,16,1] 8:[3856,3880,16,0] 9:[5120,5144,16,1]
10:[5136,5160,16,0] 11:[6400,6424,16,1] 12:[6416,6440,16,0] 13:[7680,7704,16,1]
14:[7696,7720,16,0] 15:[8960,8984,16,1] 16:[8976,9000,16,0]
17:[10240,10264,16,1] 18:[10256,10280,16,0] 19:[11520,11544,16,1]
20:[11536,11560,16,0] 21:[12800,12824,16,1] 22:[12816,12840,16,0]
23:[14080,14104,16,1] 24:[14096,14120,16,0] 25:[15360,15384,16,1]
26:[15376,15400,16,0] 27:[16640,16664,16,1] 28:[16656,16680,16,0]
29:[17920,17944,16,1] 30:[17936,17960,16,0] 31:[19200,19224,16,1]
32:[19216,19240,16,0] 33:[20480,20504,16,1] 34:[20496,20520,16,0]
35:[21760,21784,16,1] 36:[21776,21800,16,0] 37:[23040,23064,16,1]
38:[23056,23080,16,0] 39:[24320,24344,16,1] 40:[24336,24360,16,0]
xfs_db> fsblock 24
xfs_db> p
000: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
020: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
040: 01010101 01010101 01010101 01010101 01010101 01010101 01010101 01010101
...

so the data is there, but the unwritten extents weren't unflagged. Note how
each written region is still flagged unwritten (e.g. 1:[0,24,16,1] - 16 blocks
at file offset 0, extentflag 1), while the extent immediately after it
(2:[16,40,16,0]) is the one that got converted.

If I make the filesystem with -dunwritten=0, it passes.


Comment 19 Peter Backes 2006-11-25 22:09:12 UTC
I can confirm the results of xfs_db (the beginning of the file actually contains
01s), but I get slightly different results with bmap:

file:
 EXT: FILE-OFFSET       BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:         3688..3815        0 (3688..3815)       128 10000
   1: [128..255]:       654080..654207    0 (654080..654207)   128
   2: [256..10239]:     hole                                  9984
   3: [10240..10367]:   314360..314487    0 (314360..314487)   128 10000
   4: [10368..10495]:   654336..654463    0 (654336..654463)   128
   5: [10496..20479]:   hole                                  9984
...
and, as already mentioned in comment #15, if I hexdump the file, I get not
only NULs but also 01s (in the areas that correspond to EXT 1, 4, and so on).


Comment 20 Eric Sandeen 2006-11-25 22:25:25 UTC
Yep, I should have used xfs_bmap -vp to break out the unwritten parts more
clearly.

Also, there is definitely stale data coming through here (so repeating the
test may show 1s from previous runs...)

[root@magnesium mnt]# lmdd opat=1 bs=1m count=128 of=/dev/hda8
134.2177 MB in 0.9101 secs, 147.4722 MB/sec
[root@magnesium mnt]# mkfs.xfs -f -dsize=100m /dev/hda8
meta-data=/dev/hda8              isize=256    agcount=6, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=24576, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
[root@magnesium mnt]# mount /dev/hda8 test
[root@magnesium mnt]# cd test
[root@magnesium test]# /root/aiodio_sparse_rtc -k
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 0
*** WARNING *** file has not been unlinked; if you don't rm it manually first,
it may influence the next run
[root@magnesium test]# hexdump file | more
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0014000 0000 0002 0004 0002 0008 0002 000c 0002 <--- stale disk data from lmdd
0014010 0010 0002 0014 0002 0018 0002 001c 0002
0014020 0020 0002 0024 0002 0028 0002 002c 0002
0014030 0030 0002 0034 0002 0038 0002 003c 0002
0014040 0040 0002 0044 0002 0048 0002 004c 0002
...
[root@magnesium test]# xfs_bmap -vp file  | more
file:
 EXT: FILE-OFFSET       BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:         96..223           0 (96..223)          128 10000
   1: [128..255]:       224..351          0 (224..351)         128
   2: [256..10239]:     hole                                  9984
   3: [10240..10367]:   10336..10463      0 (10336..10463)     128 10000
   4: [10368..10495]:   10464..10591      0 (10464..10591)     128
   5: [10496..20479]:   hole                                  9984
...

this almost looks like it is converting the extents at the wrong offsets.

If we look at bmap output from a filesystem w/o unwritten extents:

[root@magnesium test]# more /tmp/xfs_bmap_nounwritten 
file:
 EXT: FILE-OFFSET       BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..127]:         96..223           0 (96..223)          128
   1: [128..10239]:     hole                                 10112
   2: [10240..10367]:   10336..10463      0 (10336..10463)     128
   3: [10368..20479]:   hole                                 10112

This actually looks like it's converting extents at the wrong offsets.

Comment 21 Peter Backes 2006-11-25 22:26:25 UTC
Created attachment 142124 [details]
testcase plus dirty free block generation

Actually, what you get if you hexdump the file is not data written by the most
recent aiodio_sparse run, but data from unallocated blocks on the disk. It just
happened to contain 01s from previous aiodio_sparse runs if you invoked it
several times. I changed the patch to create dirty free blocks; these then show
up clearly in the resulting file.

Comment 22 Eric Sandeen 2006-11-26 02:01:57 UTC
whoops take the xen folks off cc: - sorry about that guys

Comment 23 Eric Sandeen 2006-11-26 03:25:15 UTC
This looks like a core kernel problem; I'll see if I can get this fixed upstream.

finished_one_bio calls dio_complete, which calls xfs_end_io_direct with an
offset, but:

offset = dio->iocb->ki_pos;

so this is the -current-, post-IO position, not the IO start point that
dio_complete expects. So, xfs converts the wrong region. Ouch! (For the test's
first 64 KiB write at offset 0, for example, ki_pos is already 65536 by the
time the I/O completes, so the unwritten extent starting at 65536 gets
converted instead of the one at 0 - which matches the bmap output above.)

XFS seems to be the only filesystem that uses the offset passed to the end_io
function, so only it is affected.

This seems to fix it up:

Index: linux-2.6.18/fs/direct-io.c
===================================================================
--- linux-2.6.18.orig/fs/direct-io.c
+++ linux-2.6.18/fs/direct-io.c
@@ -256,7 +256,8 @@ static void finished_one_bio(struct dio 
 			if (dio->io_error)
 				transferred = dio->io_error;
 
-			dio_complete(dio, offset, transferred);
+			/* dio_complete wants starting point of IO */
+			dio_complete(dio, offset-transferred, transferred);
 
 			/* Complete AIO later if falling back to buffered i/o */
 			if (dio->result == dio->size ||

[root@magnesium test]# /root/aiodio_sparse_rtc 
[root@magnesium test]# /root/aiodio_sparse_rtc 
[root@magnesium test]# /root/aiodio_sparse_rtc 
[root@magnesium test]# xfs_info . | grep unwritten
         =                       sunit=0      swidth=0 blks, unwritten=1



Comment 24 Eric Sandeen 2006-12-15 21:00:59 UTC
Zach Brown's upstream AIO work actually seems to resolve this issue, so rather
than trying to patch things up in 2.6.19, I think we'll just let FC6 catch up
with upstream to resolve this.  If someone has a need that is more urgent than
this, let me know and I'll see if I can find a little time to verify the patch
in the previous comment & get it into FC6 prior to 2.6.20....



Comment 25 Jeff Moyer 2007-03-27 21:48:49 UTC
This is a regression that I introduced when fixing a problem with reading a
partial block at the end of a file opened with O_DIRECT.  For the offending
patch, see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178084.

I agree that the right approach is to revert the behaviour.

Comment 26 Jeff Moyer 2007-03-27 21:59:24 UTC
Thanks to Eric for the following link:
  http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg02646.html

Comment 27 Eric Sandeen 2007-04-30 21:32:16 UTC
Minor update; I re-ran the patched aiodio_sparse testcase with printk's ahead of
calls to dio_complete on both rhel5 and 2.6.21.

rhel5:
finished_one_bio call dio_complete offset 65536, ki_pos 65536
finished_one_bio call dio_complete offset 5308416, ki_pos 5308416
finished_one_bio call dio_complete offset 10551296, ki_pos 10551296
finished_one_bio call dio_complete offset 15794176, ki_pos 15794176
finished_one_bio call dio_complete offset 21037056, ki_pos 21037056
finished_one_bio call dio_complete offset 26279936, ki_pos 26279936
finished_one_bio call dio_complete offset 31522816, ki_pos 31522816
finished_one_bio call dio_complete offset 36765696, ki_pos 36765696
finished_one_bio call dio_complete offset 42008576, ki_pos 42008576
finished_one_bio call dio_complete offset 47251456, ki_pos 47251456
finished_one_bio call dio_complete offset 52494336, ki_pos 52494336
finished_one_bio call dio_complete offset 57737216, ki_pos 57737216
finished_one_bio call dio_complete offset 62980096, ki_pos 62980096
finished_one_bio call dio_complete offset 68222976, ki_pos 68222976
finished_one_bio call dio_complete offset 73465856, ki_pos 73465856
finished_one_bio call dio_complete offset 78708736, ki_pos 78708736
finished_one_bio call dio_complete offset 83951616, ki_pos 83951616
finished_one_bio call dio_complete offset 89194496, ki_pos 89194496
finished_one_bio call dio_complete offset 94437376, ki_pos 94437376
finished_one_bio call dio_complete offset 99680256, ki_pos 99680256

2.6.21:
dio_bio_end_aio call dio_complete offset/ki_pos 0
dio_bio_end_aio call dio_complete offset/ki_pos 5242880
dio_bio_end_aio call dio_complete offset/ki_pos 10485760
dio_bio_end_aio call dio_complete offset/ki_pos 15728640
dio_bio_end_aio call dio_complete offset/ki_pos 20971520
dio_bio_end_aio call dio_complete offset/ki_pos 26214400
dio_bio_end_aio call dio_complete offset/ki_pos 31457280
dio_bio_end_aio call dio_complete offset/ki_pos 36700160
dio_bio_end_aio call dio_complete offset/ki_pos 41943040
dio_bio_end_aio call dio_complete offset/ki_pos 47185920
dio_bio_end_aio call dio_complete offset/ki_pos 52428800
dio_bio_end_aio call dio_complete offset/ki_pos 57671680
dio_bio_end_aio call dio_complete offset/ki_pos 62914560
dio_bio_end_aio call dio_complete offset/ki_pos 68157440
dio_bio_end_aio call dio_complete offset/ki_pos 73400320
dio_bio_end_aio call dio_complete offset/ki_pos 78643200
dio_bio_end_aio call dio_complete offset/ki_pos 83886080
dio_bio_end_aio call dio_complete offset/ki_pos 89128960
dio_bio_end_aio call dio_complete offset/ki_pos 94371840
dio_bio_end_aio call dio_complete offset/ki_pos 99614720

so it seems clear that upstream (with Zach Brown's rewrite) is calling
dio_complete with the I/O start point, whereas rhel5 is calling it with the
dio finish point (the start point plus the amount transferred).

Comment 28 Jeff Moyer 2007-04-30 22:32:23 UTC
Eric, can you post the kernel patches that generated this output, please?

Thanks!

Comment 29 Eric Sandeen 2007-05-01 01:08:04 UTC
Sure... lazy pasting whitespace-mangled patches, sorry:

--- fs/direct-io.c.orig 2007-04-30 20:04:17.000000000 -0500
+++ fs/direct-io.c      2007-04-30 16:36:35.000000000 -0500
@@ -256,6 +256,7 @@
                        if (dio->io_error)
                                transferred = dio->io_error;
 
+                       printk("finished_one_bio call dio_complete offset %lld, ki_pos %lld, block_in file %lld bytes\n", offset, dio->iocb->ki_pos, dio->block_in_file << dio->blkbits);
                        dio_complete(dio, offset, transferred);
 
                        /* Complete AIO later if falling back to buffered i/o */
@@ -1138,6 +1139,7 @@
                        if (rw == READ && (offset + transferred > i_size))
                                transferred = i_size - offset;
                }
+               printk("direct_io_worker call dio_complete offset %lld, ki_pos %lld\n", offset, dio->iocb->ki_pos);
                dio_complete(dio, offset, transferred);
                if (ret == 0)
                        ret = transferred;


Index: linux-2.6.21/fs/direct-io.c
===================================================================
--- linux-2.6.21.orig/fs/direct-io.c
+++ linux-2.6.21/fs/direct-io.c
@@ -283,7 +283,9 @@ static int dio_bio_end_aio(struct bio *b
        spin_unlock_irqrestore(&dio->bio_lock, flags);
 
        if (remaining == 0) {
-               int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+               int ret;
+               printk("dio_bio_end_aio call dio_complete offset/ki_pos %lld\n", dio->iocb->ki_pos);
+               ret  = dio_complete(dio, dio->iocb->ki_pos, 0);
                aio_complete(dio->iocb, ret, 0);
                kfree(dio);
        }
@@ -1112,6 +1114,7 @@ direct_io_worker(int rw, struct kiocb *i
        spin_unlock_irqrestore(&dio->bio_lock, flags);
        BUG_ON(!dio->is_async && ret2 != 0);
        if (ret2 == 0) {
+               printk("direct_io_worker call dio-complete offset %lld , ki_pos %lld\n", offset, dio->iocb->ki_pos);
                ret = dio_complete(dio, offset, ret);
                kfree(dio);
        } else



Comment 30 Jeff Moyer 2007-05-01 15:00:12 UTC
It looks like the patch is newer than the output.  The RHEL 5 runs didn't print
out a block_in_file column.

Comment 31 Eric Sandeen 2007-05-01 15:03:40 UTC
Jeff, that's true, sorry - I updated it after we talked on IRC.  With that
patch, the block_in_file was the same value as the other values, i.e.:

finished_one_bio call dio_complete offset 65536, ki_pos 65536, block_in file 65536 

Comment 32 Jeff Moyer 2007-08-17 17:51:21 UTC
This bug should be fixed in current kernels.  Can you please re-test and update
the bug as appropriate?

Thanks!

Comment 33 Eric Sandeen 2007-08-17 18:11:53 UTC
FWIW, running the patched aiodio_sparse on xfs with the F8 kernel was fine.

I suppose the original XEN test would be the clincher. :)

Comment 34 Jeff Moyer 2007-10-04 19:09:40 UTC
I'm going to close this.  We think it's fixed.  If it isn't, feel free to re-open.