624607 – [qemu] [rhel6] guest installation stop (pause) on 'eother' event over COW disks (thin-provisioning)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Luiz Capitulino
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	559201 580954 QMPBlockError

Reported:	2010-08-17 08:29 UTC by Haim
Modified:	2014-01-13 00:46 UTC
CC List:	15 users
Fixed In Version:	qemu-kvm-0.12.1.2-2.119.el6
Doc Type:	Bug Fix
Doc Text:	IMPORTANT: this is an internal interface consumed only by libvirt. Users should only know about libvirt related impact and new functionality (which is not described here). Cause: The BLOCK_IO_ERROR event provides limited error information. Consequence: Debugging of I/O related errors is limited. Change: Add more information to the BLOCK_IO_ERROR event. Result: It's now easier to debug I/O related errors.
Clone Of:
Clones:	QMPBlockError
Environment:
Last Closed:	2011-05-19 11:29:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1005654	0	high	CLOSED	two VM has been paused due to a storage IO error	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHSA-2011:0534	0	normal	SHIPPED_LIVE	Important: qemu-kvm security, bug fix, and enhancement update	2011-05-19 11:20:36 UTC

Internal Links: 1005654

Description Haim 2010-08-17 08:29:49 UTC

Description of problem:

I hit the following issue while working with 'qemu-kvm-0.12.1.2-2.109'.
I created a guest machine with 2 COW disks and 1 RAW disk on iSCSI storage.
During guest OS installation, I got the following error from libvirt:

0:22:58.515: debug : qemuMonitorJSONIOProcessEvent:99 : handle BLOCK_IO_ERROR handler=0x4795d0 data=0x7fbe3c1ab300
10:22:58.515: debug : qemuMonitorEmitIOError:856 : mon=0x7fbe14095c70
10:22:58.515: debug : qemuMonitorJSONIOProcess:188 : Total used 206 bytes out of 206 available in buffer
10:22:58.515: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-poo 42 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/2b497bc2-8c67-4390-9172-fc42c1ae9cb3/30e11db4-4249-42c3-8940-a1b861e83ced virtio-disk1 1 eother

or 

libvirtEventLoop::INFO::2010-08-17 10:22:56,294::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,297::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,298::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,298::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,299::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother

The guest machine pauses and refuses to resume, although there is no actual problem with the storage.


It looks like it's related to the fact that I'm using a thin-provisioned (COW) disk.
This is a test blocker for us.
I am available in the #{KVM,VIRT,TLV} IRC rooms if more information is needed.

2.6.32-59.1.el6.x86_64
libvirt-0.8.1-23.el6.x86_64
vdsm-4.9-12.3.x86_64
device-mapper-multipath-0.4.9-25.el6.x86_64
lvm2-2.02.72-4.el6.x86_64
qemu-kvm-0.12.1.2-2.109.el6.x86_64


repro: 

1) using libvirt, create guest machine with 2 disks;
   - 4G cow 
   - 1G raw
2) install Linux OS on it (on 4G partition).

Comment 7 Luiz Capitulino 2010-08-19 13:08:17 UTC

We report EPERM; this is probably EACCES.

Comment 8 Kevin Wolf 2010-08-19 13:20:35 UTC

Yes, the cause was an EACCES.
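
For illustration only: the logs in the description carry the reason 'eother', while other events in this report carry 'eperm', 'enospc' and 'eio'. The following is a minimal sketch, assuming the vendor extension names only a handful of errno values and falls back to 'eother' for everything else. The table and the function classify_reason are hypothetical and are not taken from the qemu-kvm source; they only show why a coarse reason string can hide the concrete errno.

```python
import errno
import os

# Hypothetical table: only the reason strings that appear in this bug report.
# The real mapping inside qemu-kvm may differ.
KNOWN_REASONS = {
    errno.EPERM: "eperm",
    errno.ENOSPC: "enospc",
    errno.EIO: "eio",
}


def classify_reason(err: int) -> str:
    """Return a short reason name, or "eother" for any unlisted errno."""
    return KNOWN_REASONS.get(err, "eother")


if __name__ == "__main__":
    # EPERM and EACCES are distinct errno values with distinct messages.
    for err in (errno.EPERM, errno.EACCES, errno.ENOSPC, errno.EIO):
        print(err, errno.errorcode[err], classify_reason(err), os.strerror(err))
```

Under this assumption an EACCES failure would surface only as 'eother', which is why the numeric errno is the useful piece of information for debugging.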

Comment 9 Luiz Capitulino 2010-10-14 13:17:56 UTC

We have two possible solutions for this one:

 1. Wait for the new error framework upstream. This is the Right Thing,
    as the new framework is going to allow for inclusion of error objects
    in events, which is the right fix for this BZ.

    The problem is that the work on the new framework has barely started and
    we don't know how long it will take and how hard it's going to be to
    backport it.

 2. Just extend the rhel6's vendor extension. This is easy to do, but we
    obviously can't go too far, as this solution doesn't scale and that
    interface doesn't and won't exist upstream either.

So, maybe we can try 1 and, if it fails, do 2. In any case I believe libvirt will have to be updated too.

Comment 10 Kevin Wolf 2010-10-14 14:18:38 UTC

This is mostly a debugging feature, so it would already be very helpful to have the concrete errno value in the QMP message even if libvirt doesn't use it (bonus points for an additional human readable error message via strerror). When given a stopped VM that has failed, I can easily detach libvirt/VDSM and attach manually with netcat to the QMP socket. Actually, QE was able to do that themselves and provide me with the QMP error message.

However, in the reported cases with 6.0, I had to log in on that machine myself, install debuginfos, attach gdb and set a breakpoint at the right place, continue the VM and catch the next error this way, just to find the error number somewhere in the backtrace.

Of course, this would still be a workaround until the Right Thing is available, but it should be very easy to implement.
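
The manual approach described above (attaching to the QMP socket by hand) can also be scripted. This is a rough sketch, assuming the monitor is exposed on a UNIX socket, that it sends a greeting line first, that it expects the qmp_capabilities command before anything else, and that every message arrives as one JSON object per line. The socket path passed to watch_block_errors is a placeholder, and both function names are made up for this example.

```python
import json
import socket
from typing import IO, Iterator


def qmp_events(stream: IO[str], name: str) -> Iterator[dict]:
    """Yield messages from a line-delimited JSON stream whose "event" is name."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        # Greetings and command replies have no "event" key and are skipped.
        if msg.get("event") == name:
            yield msg


def watch_block_errors(path: str) -> None:
    """Connect to a QMP UNIX socket at path and print BLOCK_IO_ERROR data."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        stream.readline()  # discard the greeting line
        stream.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        stream.flush()
        for event in qmp_events(stream, "BLOCK_IO_ERROR"):
            print(event["data"])
```

Only one client can normally hold the monitor connection, so libvirt/VDSM would have to be detached first, as noted above.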

Comment 11 Luiz Capitulino 2010-10-15 13:03:49 UTC

(In reply to comment #10)
> This is mostly a debugging feature, so it would already be very helpful to have
> the concrete errno value in the QMP message even if libvirt doesn't use it
> (bonus points for an additional human readable error message via strerror).
> When given a stopped VM that has failed, I can easily detach libvirt/VDSM and
> attach manually with netcat to the QMP socket. Actually, QE was able to do that
> themselves and provide me with the QMP error message.
> 
> However, in the reported cases with 6.0, I had to log in on that machine
> myself, install debuginfos, attach gdb and set a breakpoint at the right place,
> continue the VM and catch the next error this way, just to find the error
> number somewhere in the backtrace.

Ouch. Does it happen often? If it does, I'll consider fixing it for the
Z stream.

Comment 12 Kevin Wolf 2010-10-15 13:16:48 UTC

Not too often, I did it this way like three or four times. I think having it in 6.1 would be good enough.

Comment 13 Luiz Capitulino 2010-11-02 19:51:28 UTC

(In reply to comment #10)
> This is mostly a debugging feature, so it would already be very helpful to have
> the concrete errno value in the QMP message even if libvirt doesn't use it
> (bonus points for an additional human readable error message via strerror).

Let me confirm I got this right.

Adding the errno value in the QMP event and a human message to stderr would be enough to solve this issue for rhel6.1?

I'm asking because that solution is unlikely to be visible to regular users, IOW, regular users are going to see what's reported by libvirt, like what is described in the original report.

However, it's very unlikely we'll get the Right Thing in time. Just want to be sure we're on the same page.
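
As a sketch of what is being proposed here (the numeric errno plus a strerror-style message), and not the actual qemu-kvm patch, the extra information could be assembled as shown below. The helper names debug_info and monitor_line are hypothetical, and the text layout of the human-readable line is an assumption.

```python
import os


def debug_info(err: int) -> dict:
    """Build the extra payload: a human-readable message plus the raw errno."""
    return {"message": os.strerror(err), "errno": err}


def monitor_line(device: str, err: int) -> str:
    """Format a human-readable line for the monitor or stderr."""
    return "block I/O error in device '%s': %s (%d)" % (
        device,
        os.strerror(err),
        err,
    )


if __name__ == "__main__":
    print(debug_info(28))
    print(monitor_line("drive-virtio-disk0", 28))
```

The message text comes from the platform's strerror, so it may vary between systems; the errno value is the stable part.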

Comment 14 Kevin Wolf 2010-11-03 08:45:55 UTC

I think we are on the same page.

Of course, I'd be happy to see the Right Thing with integration in libvirt and VDSM, but I understand that this won't be ready for 6.1. So I think it would be already a major improvement if attaching to QMP manually would be enough, so that you wouldn't need to use gdb to debug problems.

Comment 19 Luiz Capitulino 2010-11-10 16:07:18 UTC

Changed the 'Version' field by accident; changed it back to 6.0 and updated the correct field (which is 'Target Release').

Comment 22 Shirley Zhou 2010-12-23 09:27:18 UTC

Reproduced this bug on 113:
{"timestamp": {"seconds": 1293087190, "microseconds": 26525}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "eperm", "operation": "write", "action": "stop"}}

{"timestamp": {"seconds": 1293088121, "microseconds": 201391}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "enospc", "operation": "write", "action": "stop"}}

{"timestamp": {"seconds": 1293093292, "microseconds": 341860}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}

Verified this bug on qemu-kvm-0.12.1.2-2.128.el6.x86_64:

(qemu) block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)

{"timestamp": {"seconds": 1293093630, "microseconds": 236418}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Operation not permitted", "errno": 1}, "__com.redhat_reason": "eperm", "operation": "write", "action": "stop"}}


(qemu) block I/O error in device 'drive-virtio-disk0': No space left on device (28)
{"timestamp": {"seconds": 1293094234, "microseconds": 457275}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "No space left on device", "errno": 28}, "__com.redhat_reason": "enospc", "operation": "write", "action": "stop"}}

(qemu) block I/O error in device 'drive-virtio-disk0': Input/output error (5)
{"timestamp": {"seconds": 1293093711, "microseconds": 499425}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}

The monitor and QMP messages above show that the debuggability of the BLOCK_IO_ERROR event has improved.
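
A short sketch of how a debugging script might read these events. The sample string is one of the events quoted above; summarize is a hypothetical helper, and it treats "__com.redhat_debug_info" as optional because the events from the older build do not carry it.

```python
import json

# One of the events from the verified build, copied from this comment.
SAMPLE = (
    '{"timestamp": {"seconds": 1293093630, "microseconds": 236418}, '
    '"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", '
    '"__com.redhat_debug_info": {"message": "Operation not permitted", '
    '"errno": 1}, "__com.redhat_reason": "eperm", "operation": "write", '
    '"action": "stop"}}'
)


def summarize(raw: str) -> str:
    """Return a one-line summary of a BLOCK_IO_ERROR event."""
    data = json.loads(raw)["data"]
    # Older builds omit the debug info, so fall back to an empty dict.
    info = data.get("__com.redhat_debug_info") or {}
    return "%s on %s: reason=%s errno=%s message=%s action=%s" % (
        data["operation"],
        data["device"],
        data.get("__com.redhat_reason"),
        info.get("errno"),
        info.get("message"),
        data["action"],
    )


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

With an event from the older build, the errno and message fields come back as None, which is the gap this bug is about.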

Comment 23 Haim 2011-01-04 15:01:01 UTC

verified on:

vdsm-4.9-34.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6_0.3.x86_64

Installed a fresh operating system on the 4G COW disk, followed the log, and saw that lvextend was initiated when the high-water mark was reached and the disk was extended from 0.5 to 2G.

Comment 24 Luiz Capitulino 2011-05-05 17:46:00 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
IMPORTANT: this is an internal interface consumed only by libvirt. Users should only know about libvirt related impact and new functionality (which is not described here).

Cause: The BLOCK_IO_ERROR event provides limited error information.

Consequence: Debugging of I/O related errors is limited.

Change: Add more information to the BLOCK_IO_ERROR event.

Result: It's now easier to debug I/O related errors.

Comment 25 Kate Grainger 2011-05-11 04:58:43 UTC

Hi Luiz, does this sound about right for the external errata text?

When starting a virtual machine that uses thin-provisioning (COW) disk, QEMU would fail to connect to the virtual I/O disk and the virtual machine would go into the pause state without returning much error information. QEMU now returns more verbose error information to help you debug any I/O-related errors.

Comment 26 Luiz Capitulino 2011-05-11 13:12:26 UTC

Hi Kate, I think I would change 'QEMU would fail' to 'QEMU could fail' or 'in an I/O failure scenario'...; otherwise the text is good.

To be honest, I'm not 100% sure it makes sense to report this change to users, but it won't hurt either.

Comment 27 errata-xmlrpc 2011-05-19 11:29:41 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0534.html
