537077 – error codes aren't always propagated up through the block layer (e.g. -ENOSPC)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kvm
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Kevin Wolf
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	537083 560623

Reported:	2009-11-12 12:09 UTC by Eduardo Habkost
Modified:	2013-01-09 22:00 UTC
CC List:	13 users
Fixed In Version:	kvm-83-153.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:	520693
Clones:	537083 560623
Environment:
Last Closed:	2010-03-30 07:52:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0271	0	normal	SHIPPED_LIVE	Important: kvm security, bug fix and enhancement update	2010-03-29 13:19:48 UTC

Description Eduardo Habkost 2009-11-12 12:09:20 UTC

This issue was found while testing Bug #520693: in some cases, when a block device read or write fails, the original error code is not propagated up to the caller.

An example is grow_refcount_table(), which always returns -EIO on write errors, whatever the underlying error was. There are probably many other places where error codes are lost the same way.
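
To illustrate the pattern described above, here is a minimal C sketch. It is not the actual qemu code: write_fn, update_table_lossy(), update_table_propagating() and fail_with_enospc() are hypothetical names made up for this example. It only shows the difference between flattening every failure to -EIO and passing the callee's error code through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical low-level write hook: returns 0 on success or a negative errno. */
typedef int (*write_fn)(const void *buf, size_t len);

/* Lossy pattern: any failure is reported as -EIO, so the real cause is lost. */
static int update_table_lossy(write_fn do_write, const void *buf, size_t len)
{
    assert(do_write != NULL);
    if (do_write(buf, len) < 0) {
        return -EIO; /* -ENOSPC from the callee would be hidden here */
    }
    return 0;
}

/* Propagating pattern: the callee's error code is returned unchanged. */
static int update_table_propagating(write_fn do_write, const void *buf, size_t len)
{
    assert(do_write != NULL);
    int ret = do_write(buf, len);
    if (ret < 0) {
        return ret; /* caller can still tell -ENOSPC from -EIO */
    }
    return 0;
}

/* Fake backend for the example: behaves like a volume that is full. */
static int fail_with_enospc(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -ENOSPC;
}
```

With the fake backend, update_table_lossy() reports -EIO while update_table_propagating() reports -ENOSPC, which is the distinction this bug is about.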


+++ This bug was initially created as a clone of Bug #520693 +++

[...]
--- Additional comment from rsibley on 2009-09-18 15:22:50 EDT ---

Testing kvm-83-105.el5_4.3 on rhel5.4 released.

w2k3 VM w/20GB raw device format Qcow2, w/ latest VIRTIO drivers.

Using IOmeter Sequential Writes 2K, 4K, 8K, 16K, 32K, 64K transfer request size.

On RHEV-M I get the following errors:

VM has paused due to no Storage space error.

VM has paused due to unknown storage error (error code: EIO).

The errors above occurred during sequential writes @32k and 64K transfer req size.


I'm also seeing a 2x increase in throughput over Bare Metal w/sequential reads @64K trns req size.

[...]
--- Additional comment from ehabkost on 2009-11-06 08:33:33 EDT ---

(In reply to comment #41)
> The host has an MSA 1000 total amount of disk storage 474GB, used space 87GB,
> free space 387GB.

I noticed that the qcow images are being stored on LVM volumes, and VDSM expects qemu to pause the guest on ENOSPC errors so the volume can be resized and the guest could continue. This means that the "no storage space" error is expected. The problem is that ENOSPC errors are not being handled properly on some parts of the code. Sometimes this causes crashes like the one you saw, and sometimes it makes the code incorrectly return EIO instead of ENOSPC.

However, this may be a different issue, not related to the patches being tracked here at this BZ. We need to know if the same crashes are reproducible using a KVM version without the performance patches. See Comment #25 for info about versions with and without those patches.

Comparing versions with and without the 64k-cluster patch (added on kvm-83-117.el5) is interesting, too.

--- Additional comment from kwolf on 2009-11-09 07:18:32 EDT ---

(In reply to comment #42)
> (In reply to comment #41)
> > The host has an MSA 1000 total amount of disk storage 474GB, used space 87GB,
> > free space 387GB.
> 
> I noticed that the qcow images are being stored on LVM volumes, and VDSM
> expects qemu to pause the guest on ENOSPC errors so the volume can be resized
> and the guest could continue. This means that the "no storage space" error is
> expected. The problem is that ENOSPC errors are not being handled properly on
> some parts of the code. Sometimes this causes crashes like the one you saw, and
> sometimes it makes the code incorrectly return EIO instead of ENOSPC.
> 
> However, this may be a different issue, not related to the patches being
> tracked here at this BZ. We need to know if the same crashes are reproducible
> using a KVM version without the performance patches. See Comment #25 for info
> about versions with and without those patches.

I think Ulrich found the reason why these patches seem to cause the problems in his comment https://bugzilla.redhat.com/show_bug.cgi?id=531827#c17. The error always existed, but it was ignored previously. A side effect of the performance patches is that the errors are propagated to the caller now.

Trying to summarize the findings and what we need to do:
- The occurrence of ENOSPC is to be expected and not a bug
- We need to apply the performance patches (we know now that they are a bug fix, too)
- We need to backport the error handling fix for grow_refcount_table from upstream
- We need to make sure that no EIO is returned when ENOSPC is meant in fact (this one still needs some analysis)

Does this list look right? Did I forget anything?
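
The summary above depends on the caller being able to tell ENOSPC apart from other errors. The following C sketch shows why, under stated assumptions: the enum values and function names are hypothetical and only loosely mirror the two RHEV-M messages quoted earlier; they are not taken from qemu, VDSM or RHEV-M.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reasons a management layer might record for a paused guest. */
enum pause_reason {
    PAUSE_NONE,                  /* write succeeded, nothing to do */
    PAUSE_NO_SPACE,              /* -ENOSPC: the volume can be extended */
    PAUSE_UNKNOWN_STORAGE_ERROR  /* anything else, e.g. -EIO */
};

/* Map the result of a write (0 or a negative errno) to a pause reason. */
static enum pause_reason classify_write_result(int ret)
{
    if (ret >= 0) {
        return PAUSE_NONE;
    }
    switch (-ret) {
    case ENOSPC:
        return PAUSE_NO_SPACE;
    default:
        return PAUSE_UNKNOWN_STORAGE_ERROR;
    }
}

/* Only a "no space" pause can be fixed by growing the volume and resuming. */
static bool can_recover_by_extending(enum pause_reason reason)
{
    assert(reason >= PAUSE_NONE && reason <= PAUSE_UNKNOWN_STORAGE_ERROR);
    return reason == PAUSE_NO_SPACE;
}
```

If a lower layer turns -ENOSPC into -EIO, classify_write_result() ends up at PAUSE_UNKNOWN_STORAGE_ERROR, and the resize-and-continue path is never taken.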

Comment 7 YangFeng 2010-02-04 09:01:52 UTC

Hi Eduardo,

I am a tester on the Beijing team. I failed to reproduce bug 537077 on kvm-83-137.el5, so I think I may have misunderstood your steps. Could you tell me the detailed steps and the IOmeter configuration you used?


Following are the steps I used:
1.  Start a VM with command:
#qemu-kvm -drive file=/usr/kvm-autotest/client/tests/kvm/images/win2003-32-virtio.qcow2,if=virtio,boot=on  -net nic,vlan=0,model=e1000,macaddr=00:AE:A9:13:4E:00 -net tap,vlan=0,ifname=e1000_0_6001,script=/usr/kvm-autotest/client/tests/kvm/scripts/qemu-ifup-switch,downscript=no -m 2048 -smp 1 -usbdevice tablet -rtc-td-hack -no-hpet -cpu qemu64,+sse2 -no-kvm-pit-reinjection -vnc :0

2.  Log in to the VM with vncviewer.

3.  Install IOmeter, downloaded from http://sourceforge.net/projects/iometer/files/iometer-stable/2006-07-27/iometer-2006.07.27.win32.i386-setup.exe/download.

4.  Run IOmeter with write-only 2K, 4K, 8K, 16K, 32K, and 64K transfer requests on C: for almost one hour.

I did not see any error message in the console or in IOmeter.

Could you help me verify the test steps and show me the IOmeter configuration you used?

Thanks very much!

Comment 8 Eduardo Habkost 2010-02-04 11:01:05 UTC

I didn't do any testing; I just filed this bug for the error found in bug #520693 comment 8. The problem was found by Bob Sibley.

You'll probably need RHEV-M and VDSM to reproduce this bug, as it depends on -ENOSPC errors being reported during normal operation. You won't get any -ENOSPC errors if you use a pre-existing image. RHEV-M, on the other hand, uses on-demand resizing of LVM volumes and depends on proper -ENOSPC error reporting.

Comment 9 YangFeng 2010-02-24 08:44:33 UTC

Hi all,

Thanks for your help!

I would like to confirm with you how to verify this bug.


1: I tried the following on kvm-83-105.el5_4.13 and kvm-83-153.el5.
1. Create a qcow2 image on an LVM volume
    [root@localhost ~]# pvcreate /dev/sda1
    Physical volume "/dev/sda1" successfully created
    [root@localhost ~]# vgcreate vgtest /dev/sda1
    Volume group "vgtest" successfully created
    [root@localhost ~]# lvcreate -n lvtest -L 1G vgtest
    Logical volume "lvtest" created
    [root@localhost ~]# qemu-img create -f qcow2 /dev/vgtest/lvtest 5G
    Formatting '/dev/vgtest/lvtest', fmt=qcow2, size=5242880 kB
2. Start a VM with command:
    [root@localhost ~]# /usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -drive file=win2003-64-virtio.qcow2,if=virtio,boot=on,format=qcow2,cache=off,werror=stop -smp 2 -m 2G -vnc :1 -net nic,macaddr=20:20:20:11:12:56,model=virtio,vlan=0 -net tap,script=/etc/qemu-ifup,vlan=0 -monitor stdio -drive file=/dev/vgtest/lvtest,if=virtio,format=qcow2,cache=off,werror=stop
3. Run "notify all on" in the monitor
   (qemu) notify all on
4. Run IOmeter sequential writes with a 32K transfer request size in the guest.
5. In the monitor I see the following messages:
(qemu) # RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# RTC: new time is UTC-28802
# VM is stopped due to disk write error: virtio1: No space left on device
info status
VM status: paused
(qemu) c
(qemu) # VM is stopped due to disk write error: virtio1: No space left on device

So I failed to reproduce this bug on both kvm-83-105.el5_4.13 and kvm-83-153.el5.

Could I use this procedure to verify this bug?


2: Kevin Wolf suggested the following in his email:

What you could do to manually provoke an I/O error is to use an image on
NFS, set a breakpoint in one of the functions that didn't return the
right error code, stop NFS when the function is called and look if the
right results is returned. Uli Obergfell has described this kind of
reproducing such bugs in Bugzilla, you can refer to
https://bugzilla.redhat.com/show_bug.cgi?id=531827#c17 for example.

Could I follow this approach to verify this bug? If so, how many functions do I need to set breakpoints in?
How can we make sure that all the error codes are now correct?
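
As a possible complement to the NFS-and-breakpoint method, and not something proposed in this bug, a real ENOSPC can be provoked on Linux by writing to /dev/full, which fails every write with "No space left on device". The C sketch below uses a hypothetical helper, probe_write(), to show a write path that hands the kernel's error code back unchanged instead of a generic -EIO. It does not exercise qemu itself.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one byte to an existing file or device.
 * Returns 0 on success, or the negative errno reported by open()/write().
 */
static int probe_write(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        return -errno; /* e.g. -ENOENT if the path does not exist */
    }

    char byte = 0;
    ssize_t n = write(fd, &byte, 1);
    int ret = (n < 0) ? -errno : 0; /* save errno before close() can change it */

    close(fd);
    return ret;
}
```

On a Linux host where /dev/full is present and writable, probe_write("/dev/full") is expected to return -ENOSPC; a path that does not exist yields -ENOENT.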


3: We have a qcow2 automated test suite based on autotest. It contains functional tests such as:
    - prepare:
        only image_copy
    - raw_function:
        only qemuio_test
    - normal:
        only boot reboot linux_s4 mig2.load.dbench
    - function:
        only fillup_disk format_disk multi_disk.default ioquit hdparm
    - endurance:
        only autotest.bonnie autotest.parallel_dd autotest.dbench autotest.ctcs2 autotest.iozone autotest.disktest
    - end:
        only shutdown

Could I run this suite on IDE/VIRTIO, Windows/Linux qcow2 images to verify this bug?
I think this approach is more reasonable than the others.
Any suggestions on how to verify this bug?

Comment 10 YangFeng 2010-03-02 06:41:42 UTC

Ran the qcow2 test suite with IDE/VIRTIO, Windows/Linux qcow2 images on kvm-83-153.el5.
No KVM bug was found.
A few cases failed because of script or test environment issues, so I waived them in the list.

Job links in our virtlab:
https://virtlab.englab.nay.redhat.com/job/6505/details/
https://virtlab.englab.nay.redhat.com/job/6527/details/
https://virtlab.englab.nay.redhat.com/job/6506/details/
https://virtlab.englab.nay.redhat.com/job/6467/details/
https://virtlab.englab.nay.redhat.com/job/6407/details/
https://virtlab.englab.nay.redhat.com/job/6389/details/

Comment 16 errata-xmlrpc 2010-03-30 07:52:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html