Description of problem:

[I haven't got a good handle on this bug yet, so I don't know if it's
kernel, qemu, libguestfs, the environment I'm testing it in, or what.
This is just a placeholder at the moment.]

A guest running under qemu generates large numbers of I/O errors, with
both virtio and IDE emulation.  For virtio the disks are unusable; for
IDE the errors are intermittent.

Errors are like this:

(virtio)

[   40.119245] Buffer I/O error on device vdb, logical block 9
[   40.119245] Buffer I/O error on device vdb, logical block 10
[   40.171245] end_request: I/O error, dev vdb, sector 88
[   40.185187] end_request: I/O error, dev vdb, sector 112
[   40.199245] end_request: I/O error, dev vdb, sector 144
[   40.241245] end_request: I/O error, dev vdb, sector 184
[   40.250245] end_request: I/O error, dev vdb, sector 240

(ide)

[   28.851576] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   28.852086] ata1.00: BMDMA stat 0x5
[   28.853135] ata1.00: failed command: READ DMA
[   28.853688] ata1.00: cmd c8/00:30:00:00:00/00:00:00:00:00/e0 tag 0 dma 24576 in
[   28.853711]          res 41/04:30:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
[   28.854088] ata1.00: status: { DRDY ERR }
[   28.854409] ata1.00: error: { ABRT }
[   29.192484] ata1.00: configured for MWDMA2
[   29.315598] ata1.01: configured for MWDMA2
[   29.316707] ata1: EH complete

Version-Release number of selected component (if applicable):
2:qemu-kvm-0.12.2-5.fc13.x86_64
0:kernel-2.6.33-0.40.rc7.git0.fc13.x86_64

How reproducible:
Always when running libguestfs tests in Koji.  See:

http://koji.fedoraproject.org/koji/getfile?taskID=1965779&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1965551&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1968521&name=build.log
http://koji.fedoraproject.org/koji/getfile?taskID=1970421&name=build.log

Steps to Reproduce:
1. (Unknown -- not yet reproducible for me locally)
2.
3.
Actual results:


Expected results:


Additional info:

The qemu command line is:

/usr/bin/qemu-kvm \
    -drive file=test1.img,cache=off,if=virtio \
    -drive file=test2.img,cache=off,if=virtio \
    -drive file=test3.img,cache=off,if=virtio \
    -drive file=../images/test.iso,snapshot=on,if=virtio \
    -nodefaults \
    -nographic \
    -serial stdio \
    -m 500 \
    -no-reboot \
    -no-hpet \
    -net user,vlan=0,net=10.0.2.0/8 \
    -net nic,model=virtio,vlan=0 \
    -kernel /tmp/libguestfsiMAuGJ/kernel \
    -initrd /tmp/libguestfsiMAuGJ/initrd \
    -append 'panic=1 console=ttyS0 udevtimeout=300 noapic acpi=off printk.time=1 cgroup_disable=memory selinux=0 guestfs_vmchannel=tcp:10.0.2.2:58682 guestfs_verbose=1 '

In the IDE case, it's the same except 'if=ide'.
In the IDE case we see errors but for some reason the guest is usually able to recover enough to continue. For example, in the following successful build done using IDE emulation: http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/build.log there were lots of IDE errors reported by the guest, but the guest was able to recover enough to run all the tests to completion. With virtio for whatever reason these errors are unrecoverable.
Additional info: The host disk files are sparsely allocated, if that makes any difference.
This patch looks very interesting:

http://www.mail-archive.com/kvm@vger.kernel.org/msg28004.html

Note that the host kernel in the failure case (i.e. Koji) is something
strange, like RHEL 5, so it doesn't necessarily behave like a recent
kernel would.
I've managed to reproduce this using a RHEL 5 kernel with Rawhide userspace. My mock configuration is summarised here: http://rwmj.wordpress.com/2010/02/10/tip-mock-build-rawhide-packages-on-rhel-5/
Created attachment 390060 [details]
List of RPM N-E-V-R differences between failed build & previous libguestfs-1.0.82-7.fc13

I compared the RPM N-E-V-Rs from this failed build:

http://koji.fedoraproject.org/koji/getfile?taskID=1965779&name=root.log

with this previous working build:

http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/root.log

The new glibc or new qemu builds look like good candidates for further
investigation into what's causing this.
(In reply to comment #5)
> With this previous working build
>
> http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/root.log

Thanks, but unfortunately this build wasn't a working build.  Although
it did struggle through all the tests, it was still throwing lots of
virtual IDE errors, as you can see if you look for the word 'FAILED'
in the build log:

http://kojipkgs.fedoraproject.org/packages/libguestfs/1.0.82/7.fc13/data/logs/x86_64/build.log

For some reason IDE seems to tolerate this, whereas with virtio it
fails immediately.  (We just switched over to using virtio, which is
why this bug is affecting us now.)

I have investigated this and it looks like a pread+O_DIRECT alignment
issue.  The fd of the block device backing file is opened with
O_DIRECT, but calls to pread() don't align the buffer.  (Why the
buffer is not aligned with the RHEL 5 kernel, but aligned with a
Rawhide kernel, is still a mystery.)
Created attachment 390083 [details]
strace of Rawhide qemu running on RHEL 5 kernel

This is the strace output from qemu when running on a RHEL 5 kernel.
The failure seems to stem from lines like this:

19438 pread(8, 0x2aaad8531dc0, 12288, 0) = -1 EINVAL (Invalid argument)
Oh joy, I see what's going on here.  The problem is the following code
in glibc:

  ssize_t
  PREADV (int fd, const struct iovec *vector, int count, OFF_T offset)
  {
    /* Find the total number of bytes to be read.  */
    size_t bytes = 0;
    for (int i = 0; i < count; ++i)
      {
        /* Check for ssize_t overflow.  */
        if (SSIZE_MAX - bytes < vector[i].iov_len)
          {
            __set_errno (EINVAL);
            return -1;
          }
        bytes += vector[i].iov_len;
      }

    /* Allocate a temporary buffer to hold the data.  We should normally
       use alloca since it's faster and does not require synchronization
       with other threads.  But we cannot if the amount of memory
       required is too large.  */
    char *buffer;
    char *malloced_buffer __attribute__ ((__cleanup__ (ifree))) = NULL;
    if (__libc_use_alloca (bytes))
      buffer = (char *) __alloca (bytes);
    else
      {
        malloced_buffer = buffer = (char *) malloc (bytes);
        if (buffer == NULL)
          return -1;
      }

    /* Read the data.  */
    ssize_t bytes_read = PREAD (fd, buffer, bytes, offset);
    [...]

Because the old RHEL 5 kernel does not have the preadv system call, the
code above in glibc is invoked to emulate it.  As you can see, it
emulates it by allocating a temporary buffer and calling pread().
However, the temporary buffer is not aligned.  Because 'fd' is opened
with O_DIRECT, the temporary buffer should be aligned to 512 bytes.  As
a result, the pread syscall fails with -EINVAL.  The EINVAL failure is
passed up to qemu and converted into a disk error for the guest.
Created attachment 390320 [details]
glibc preadv/pwritev alignment patch

This is a potential patch to glibc.  Also a build of glibc containing
this patch:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1977849

*NOTE* I am still testing whether this patch fixes the issue.
The patch in comment 9 *does* fix the problem for me. Please consider something like this for inclusion in glibc.
Hi Rich,

The patch looks OK, but you may want to avoid overflow, in case adding
512 makes the size too large.  Also, an optimization would be to use
fcntl's F_GETFL (to check for O_DIRECT) and avoid that padding in the
common case.  Also, is 512 always enough?  I have a vague recollection
that some FS types require page-size alignment.
(In reply to comment #11)
> The patch looks ok, but you may want to avoid overflow, in case adding
> 512 is too large.  Also, an optimization would be to use fcntl's
> F_GETFL to avoid that padding in the common case.  Also, is 512 always
> enough?  I have a vague recollection that some FS types require
> page-size alignment.

It's more complicated than that.  Historically Linux needed page-size
alignment for direct I/O; this was later reduced to the block device's
sector size.  That means that so far, for any recent kernel, 512 was
usually enough.  But now with 4k-sector drives showing up, that's not
going to be enough any more.  I would suggest staying with page-size
alignment for now.
Created attachment 390514 [details]
Updated preadv/pwritev alignment patch for glibc.

This is an updated patch.  **NOTE** I am still testing it.
I think a better patch would be the simpler one.  Always align the
buffer.  Don't bother checking with fcntl.  Use __valloc instead of
getpagesize plus memalign.  __valloc is a non-exported symbol, and so
does not create an unwanted PLT entry within libc itself, and is by
definition always page-aligned.
Firstly I can report that the patch in comment 13 tests out OK.

(In reply to comment #14)
> I think a better patch would be the simpler one.  Always align the
> buffer.  Don't bother checking in with fcntl.  Use __valloc instead of
> getpagesize plus memalign.  __valloc is a non-exported symbol, and so
> does not create an unwanted PLT entry within libc itself, and is by
> its definition always page aligned.

Yes, using __valloc is obviously better.

Not sure about whether to always align.  There is some memory penalty
to doing that.  If we always call __valloc, then there is also a very
real thread-synchronization penalty we have to pay as well.
If the input buffer is unaligned, you can safely skip alignment.
(In reply to comment #16)
> If the input buffer is unaligned, you can safely skip alignment.

Yup, that's a good point.  I was wondering what we could do if the
input buffer was aligned, which is completely the wrong way round to
think about it.
However, for safety I'd do the alignment (to 4k) even if the input buffer is only 512-byte aligned.
Discussion on qemu-devel: http://lists.gnu.org/archive/html/qemu-devel/2010-02/threads.html#00823
Created attachment 394509 [details]
Updated preadv/pwritev alignment patch for glibc.

Updated patch:

- Uses __valloc.
- Avoids doing the fcntl test if the caller's buffers are not aligned.

**NOTE** I am still testing this patch.
Comment on attachment 394509 [details]
Updated preadv/pwritev alignment patch for glibc.

Patch is bad.
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Still present in Rawhide.
I worked around this in libguestfs by adding an LD_PRELOAD object
which just intercepts calls to p{read,write}{,64}v and makes them all
return ENOSYS:

http://cvs.fedoraproject.org/viewvc/devel/libguestfs/libguestfs.spec?r1=1.163&r2=1.166

The bug still exists in glibc, but at least we can now run the
libguestfs tests in Koji.
Thank you for taking the time to report this bug. Updates to this package have been released since it was first reported. If you have time to update the package and re-test, please do so and report the results here. You can obtain the updated package by typing 'yum update <package>' or using the graphical updater, Software Update. -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
This is a very hard bug to test, but I can see from looking at the
latest glibc code, and comparing it with comment 8, that the bug is
not fixed yet:

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/preadv.c;h=791077399e32ccf9066f47f40b12eb22ab0ccabc;hb=HEAD#l47
RHEL has a very easy reproducer for this, here: https://bugzilla.redhat.com/show_bug.cgi?id=606953
Oops, ignore my last comment.  Wrong bug.  The right one is:

https://bugzilla.redhat.com/show_bug.cgi?id=555649
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle. Changing version to '14'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Did this ever get fixed?
(In reply to comment #30)
> Did this ever get fixed?

No.  However, IIUC the latest glibc in Fedora was changed so that it
now assumes a more recent kernel, and therefore the preadv/pwritev
emulation path is compiled out, which is a fix of sorts.
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
Given everyone's time constraints and the fact that glibc is assuming a reasonably modern kernel which avoids this problem, I'm closing this BZ.
I was bitten by this bug again just yesterday.  The theory that glibc
can assume a modern kernel is a sound one.  Unfortunately our Brew
(and Koji) builders still run a modern glibc on some ancient RHEL 5
kernel.
It's funny you should mention that, I was bitten by robust list bits a couple times in the last few weeks. Our releng folks told me the build machines are supposed to be running RHEL 6 kernels and are supposed to be updated roughly monthly. Clearly that's not the case. Sigh....
RHEL 6 builds in brew still use a RHEL 5 builder.  I added 'uname -a'
to one of my builds:

Linux x86-009.build.bos.redhat.com 2.6.18-308.13.1.el5 #1 SMP Thu Jul 26 05:45:09 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

I am told that RHEL *7* builds run on a RHEL 6 builder.