Bug 563103

Summary: sysdeps/posix/preadv.c: preadv emulation does not align user buffer, fails sometimes if file was opened with O_DIRECT
Product: [Fedora] Fedora Reporter: Richard W.M. Jones <rjones>
Component: glibcAssignee: Jeff Law <law>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 14CC: amit.shah, berrange, bstein, drjohnson1, dwmw2, ehabkost, fweimer, itamar, jakub, jaswinder, jforbes, knoel, meyering, pbonzini, rth, schwab, scottt.tw, tburke, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-01-26 16:59:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
List of RPM N-E-V-R differences between failed build & previous libguestfs-1.0.82-7.fc13
none
strace of Rawhide qemu running on RHEL 5 kernel
none
glibc preadv/pwritev alignment patch
none
Updated preadv/pwritev alignment patch for glibc.
none
Updated preadv/pwritev alignment patch for glibc. none

Description Richard W.M. Jones 2010-02-09 08:52:23 UTC
Description of problem:

[I haven't got a good handle on this bug yet, so I don't know if it's
kernel, qemu, libguestfs, the environment I'm testing it in, or what.
This is just a placeholder at the moment.]

A guest running under qemu generates large numbers of I/O errors.
For virtio the disks are unusable.  For IDE it causes intermittent errors.

This occurs with *both* virtio and IDE emulation.

Errors are like this:

(virtio)
[   40.119245] Buffer I/O error on device vdb, logical block 9
[   40.119245] Buffer I/O error on device vdb, logical block 10
[   40.171245] end_request: I/O error, dev vdb, sector 88
[   40.185187] end_request: I/O error, dev vdb, sector 112
[   40.199245] end_request: I/O error, dev vdb, sector 144
[   40.241245] end_request: I/O error, dev vdb, sector 184
[   40.250245] end_request: I/O error, dev vdb, sector 240

(ide)
[   28.851576] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   28.852086] ata1.00: BMDMA stat 0x5
[   28.853135] ata1.00: failed command: READ DMA
[   28.853688] ata1.00: cmd c8/00:30:00:00:00/00:00:00:00:00/e0 tag 0 dma 24576 in
[   28.853711]          res 41/04:30:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
[   28.854088] ata1.00: status: { DRDY ERR }
[   28.854409] ata1.00: error: { ABRT }
[   29.192484] ata1.00: configured for MWDMA2
[   29.315598] ata1.01: configured for MWDMA2
[   29.316707] ata1: EH complete

Version-Release number of selected component (if applicable):

2:qemu-kvm-0.12.2-5.fc13.x86_64
0:kernel-2.6.33-0.40.rc7.git0.fc13.x86_64

How reproducible:

Always when running libguestfs tests in Koji.  See:

http://koji.fedoraproject.org/koji/getfile?taskID=1965779&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1965551&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1968521&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1970421&name=build.log

Steps to Reproduce:
1. (Unknown -- not yet reproducible for me locally)
2.
3.
  
Actual results:


Expected results:


Additional info:

The qemu command line is:

/usr/bin/qemu-kvm \
    -drive file=test1.img,cache=off,if=virtio \
    -drive file=test2.img,cache=off,if=virtio \
    -drive file=test3.img,cache=off,if=virtio \
    -drive file=../images/test.iso,snapshot=on,if=virtio \
    -nodefaults \
    -nographic \
    -serial stdio \
    -m 500 \
    -no-reboot \
    -no-hpet \
    -net user,vlan=0,net=10.0.2.0/8 \
    -net nic,model=virtio,vlan=0 \
    -kernel /tmp/libguestfsiMAuGJ/kernel \
    -initrd /tmp/libguestfsiMAuGJ/initrd \
    -append 'panic=1 console=ttyS0 udevtimeout=300 noapic acpi=off printk.time=1 cgroup_disable=memory selinux=0 guestfs_vmchannel=tcp:10.0.2.2:58682 guestfs_verbose=1 '

In the IDE case, it's the same except 'if=ide'.

Comment 1 Richard W.M. Jones 2010-02-09 10:43:23 UTC
In the IDE case we see errors but for some reason the guest is
usually able to recover enough to continue.  For example, in the
following successful build done using IDE emulation:

http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/build.log

there were lots of IDE errors reported by the guest, but the guest
was able to recover enough to run all the tests to completion.

With virtio for whatever reason these errors are unrecoverable.

Comment 2 Richard W.M. Jones 2010-02-09 13:05:56 UTC
Additional info:

The host disk files are sparsely allocated, if that makes any difference.

Comment 3 Richard W.M. Jones 2010-02-09 15:22:36 UTC
This patch looks very interesting:

http://www.mail-archive.com/kvm@vger.kernel.org/msg28004.html

Note that the host kernel in the failure case (ie. Koji) is
something strange, like RHEL 5, so it doesn't necessarily
behave like a recent kernel would.

Comment 4 Richard W.M. Jones 2010-02-10 17:11:50 UTC
I've managed to reproduce this using a RHEL 5 kernel with
Rawhide userspace.  My mock configuration is summarised
here:

http://rwmj.wordpress.com/2010/02/10/tip-mock-build-rawhide-packages-on-rhel-5/

Comment 5 Daniel Berrangé 2010-02-10 17:24:06 UTC
Created attachment 390060 [details]
List of RPM  N-E-V-R differences between failed build & previous libguestfs-1.0.82-7.fc13

I compared the RPM N-E-V-Rs from this failed build

http://koji.fedoraproject.org/koji/getfile?taskID=1965779&name=root.log

With this previous working build

http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/root.log

The new Glibc or new QEMU builds look like good candidates for further investigation as to what's causing this

Comment 6 Richard W.M. Jones 2010-02-10 18:35:22 UTC
(In reply to comment #5)
> With this previous working build
> 
> http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/root.log

Thanks, but unfortunately this build wasn't a working build.
Although it did struggle through all the tests, it was still
throwing lots of virtual IDE errors as you can see if you look
for the word 'FAILED' in the build log:

http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/build.log

For some reason IDE seems to tolerate this, whereas with virtio
it fails immediately.  (We just switched over to using virtio which
is why this bug is affecting us now).

I have investigated this and it looks like a pread+O_DIRECT alignment
issue.  The fd of the block device backing file is opened with O_DIRECT,
but calls to pread() don't align the buffer.  (Why the buffer is not
aligned with the RHEL 5 kernel, but aligned with a Rawhide kernel
is still a mystery).

Comment 7 Richard W.M. Jones 2010-02-10 18:41:45 UTC
Created attachment 390083 [details]
strace of Rawhide qemu running on RHEL 5 kernel

This is the strace output from qemu when running on a RHEL 5
kernel.  The failure seems to stem from lines like this:

19438 pread(8, 0x2aaad8531dc0, 12288, 0) = -1 EINVAL (Invalid argument)

Comment 8 Richard W.M. Jones 2010-02-11 10:55:11 UTC
Oh joy, I see what's going on here.  The problem is the following code
in glibc:

ssize_t
PREADV (int fd, const struct iovec *vector, int count, OFF_T offset)
{
  /* Find the total number of bytes to be read.  */
  size_t bytes = 0;
  for (int i = 0; i < count; ++i)
    {
      /* Check for ssize_t overflow.  */
      if (SSIZE_MAX - bytes < vector[i].iov_len)
        {
          __set_errno (EINVAL);
          return -1;
        }
      bytes += vector[i].iov_len;
    }

  /* Allocate a temporary buffer to hold the data.  We should normally
     use alloca since it's faster and does not require synchronization
     with other threads.  But we cannot if the amount of memory
     required is too large.  */
  char *buffer;
  char *malloced_buffer __attribute__ ((__cleanup__ (ifree))) = NULL;
  if (__libc_use_alloca (bytes))
    buffer = (char *) __alloca (bytes);
  else
    {
      malloced_buffer = buffer = (char *) malloc (bytes);
      if (buffer == NULL)
        return -1;
    }

  /* Read the data.  */
  ssize_t bytes_read = PREAD (fd, buffer, bytes, offset);
[...]

Because the old RHEL 5 kernel does not have preadv system call, the code
above in glibc is invoked to emulate it.  As you can see, it emulates it
by allocating a temporary buffer and calling pread.  However the temporary
buffer is not aligned.  Because 'fd' is opened with O_DIRECT, the temporary
buffer should be aligned to 512 bytes.  As a result, the pread syscall fails
with -EINVAL.

The EINVAL failure is passed up to qemu and converts into a disk error
for the guest.

Comment 9 Richard W.M. Jones 2010-02-11 18:50:43 UTC
Created attachment 390320 [details]
glibc preadv/pwritev alignment patch

This is a potential patch to glibc.  Also a build of glibc containing
this patch:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1977849

*NOTE* I am still testing whether this patch fixes the issue.

Comment 10 Richard W.M. Jones 2010-02-11 18:54:22 UTC
The patch in comment 9 *does* fix the problem for me.
Please consider something like this for inclusion in glibc.

Comment 11 Jim Meyering 2010-02-11 19:06:24 UTC
Hi Rich,
The patch looks ok, but you may want to avoid overflow, in case adding 512 is too large.  Also, an optimization would be to use fcntl's F_GETFD to avoid that padding in the common case.  Also, is 512 always enough?  I have a vague recollection that some FS types require page-size alignment.

Comment 12 chellwig@redhat.com 2010-02-12 12:55:15 UTC
(In reply to comment #11)
> Hi Rich,
> The patch looks ok, but you may want to avoid overflow, in case adding 512 is
> too large.  Also, an optimization would be to use fcntl's F_GETFD to avoid that
> padding in the common case.  Also, is 512 always enough?  I have a vague
> recollection that some FS types require page-size alignment.    

It's more complicated than that.   Historically Linux needed page size alignment for direct I/O, which was later reduced to block device sector size.  That means so far for any recent kernel 512 was usually enough.  But now with the 4k sector drives showing up that's not going to be enough anymore.  I would suggest to stay with page size alignment for now.

Comment 13 Richard W.M. Jones 2010-02-12 15:54:59 UTC
Created attachment 390514 [details]
Updated preadv/pwritev alignment patch for glibc.

This is an updated patch.

**NOTE** I am still testing it.

Comment 14 Richard Henderson 2010-02-12 18:14:37 UTC
I think a better patch would be the simpler one.  Always align the buffer.
Don't bother checking in with fcntl.  Use __valloc instead of getpagesize
plus memalign.  __valloc is a non-exported symbol, and so does not create
an unwanted PLT entry within libc itself, and is by its definition always
page aligned.

Comment 15 Richard W.M. Jones 2010-02-12 18:51:58 UTC
Firstly I can report that the patch in comment 13 tests out OK.

(In reply to comment #14)
> I think a better patch would be the simpler one.  Always align the buffer.
> Don't bother checking in with fcntl.  Use __valloc instead of getpagesize
> plus memalign.  __valloc is a non-exported symbol, and so does not create
> an unwanted PLT entry within libc itself, and is by its definition always
> page aligned.    

Yes, using __valloc is obviously better.

Not sure about whether to always align.  There is some memory
penalty to doing that.  If we always call __valloc, then there is also
a very real thread synchronization penalty we have to pay as well.

Comment 16 Avi Kivity 2010-02-13 13:27:51 UTC
If the input buffer is unaligned, you can safely skip alignment.

Comment 17 Richard W.M. Jones 2010-02-15 11:15:28 UTC
(In reply to comment #16)
> If the input buffer is unaligned, you can safely skip alignment.    

Yup, that's a good point. I was wondering what we could do if the
input buffer was aligned, which is completely the wrong way round
to think about it.

Comment 18 Paolo Bonzini 2010-02-15 22:52:54 UTC
However, for safety I'd do the alignment (to 4k) even if the input buffer is only 512-byte aligned.

Comment 19 Richard W.M. Jones 2010-02-15 23:19:23 UTC
Discussion on qemu-devel:
http://lists.gnu.org/archive/html/qemu-devel/2010-02/threads.html#00823

Comment 20 Richard W.M. Jones 2010-02-16 11:51:00 UTC
Created attachment 394509 [details]
Updated preadv/pwritev alignment patch for glibc.

Updated patch:

 - Uses __valloc.
 - Avoid doing fcntl test if callers buffers are not aligned.

**NOTE** I am still testing this patch.

Comment 21 Richard W.M. Jones 2010-02-16 14:42:25 UTC
Comment on attachment 394509 [details]
Updated preadv/pwritev alignment patch for glibc.

Patch is bad.

Comment 22 Bug Zapper 2010-03-15 14:28:27 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 23 Richard W.M. Jones 2010-03-30 10:29:11 UTC
Still present in Rawhide.

Comment 24 Richard W.M. Jones 2010-03-30 10:51:04 UTC
I worked around this in libguestfs by adding a LD_PRELOAD
object which just intercepts calls to p{read,write}{,64}v and
makes them all return ENOSYS:

http://cvs.fedoraproject.org/viewvc/devel/libguestfs/libguestfs.spec?r1=1.163&r2=1.166

The bug still exists in glibc, but at least we can now run the
libguestfs tests in Koji.

Comment 25 d. johnson 2010-07-06 16:53:33 UTC
Thank you for taking the time to report this bug. Updates to this package have been released since it was first reported. If you have time to update the package and re-test, please do so and report the results here. You can obtain the updated package by typing 'yum update <package>' or using the graphical updater, Software Update.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 26 Richard W.M. Jones 2010-07-06 17:31:45 UTC
This is a very hard bug to test, but I can see from looking
at the latest glibc code and comparing this to comment 8
that the bug is not fixed yet.

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/preadv.c;h=791077399e32ccf9066f47f40b12eb22ab0ccabc;hb=HEAD#l47

Comment 27 Glauber Costa 2010-07-07 12:53:45 UTC
RHEL has a very easy reproducer for this, here:

https://bugzilla.redhat.com/show_bug.cgi?id=606953

Comment 28 Glauber Costa 2010-07-07 12:56:15 UTC
Ooops, forget last comment.
Wrong bug.

Right one is https://bugzilla.redhat.com/show_bug.cgi?id=555649

Comment 29 Bug Zapper 2010-07-30 10:50:21 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 30 chellwig@redhat.com 2010-10-15 20:30:09 UTC
Did this ever get fixed?

Comment 31 Richard W.M. Jones 2010-10-18 07:36:34 UTC
(In reply to comment #30)
> Did this ever get fixed?

No.  However IIUC the latest glibc in Fedora was changed
so that it now assumes a more recent kernel, and therefore
the preadv/pwritev emulation path is compiled out, which
is a fix of sorts.

Comment 32 Fedora Admin XMLRPC Client 2011-11-14 19:43:55 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 33 Jeff Law 2012-01-26 16:59:37 UTC
Given everyone's time constraints and the fact that glibc is assuming a reasonably modern kernel which avoids this problem, I'm closing this BZ.

Comment 34 Richard W.M. Jones 2012-01-26 18:31:00 UTC
I was bitten by this bug again just yesterday.

The theory that glibc requires a modern kernel is a sound one.
Unfortunately our Brew (and Koji) builders still run modern
glibc on some ancient RHEL 5 kernel.

Comment 35 Jeff Law 2012-01-26 20:01:51 UTC
It's funny you should mention that, I was bitten by robust list bits a couple times in the last few weeks.  Our releng folks told me the build machines are supposed to be running RHEL 6 kernels and are supposed to be updated roughly monthly.  Clearly that's not the case.

Sigh....

Comment 36 Richard W.M. Jones 2012-09-27 18:16:48 UTC
RHEL 6 builds in brew still use a RHEL 5 builder.  I
added 'uname -a' to one of my builds:

Linux x86-009.build.bos.redhat.com 2.6.18-308.13.1.el5 #1 SMP Thu Jul 26 05:45:09 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

I am told that RHEL *7* builds run on a RHEL 6 builder.