Bug 624607

Summary:	[qemu] [rhel6] guest installation stop (pause) on 'eother' event over COW disks (thin-provisioning)
Product:	Red Hat Enterprise Linux 6	Reporter:	Haim <hateya>
Component:	qemu-kvm	Assignee:	Luiz Capitulino <lcapitulino>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	medium
Version:	6.0	CC:	antillon.maurizio, armbru, danken, ehabkost, hateya, kgrainge, kwolf, llim, mgoldboi, mkenneth, szhou, tburke, virt-maint, xtian, yeylon
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	qemu-kvm-0.12.1.2-2.119.el6	Doc Type:	Bug Fix
Doc Text:	IMPORTANT: this is an internal interface consumed only by libvirt. Users should only know about libvirt related impact and new functionality (which is not described here). Cause: The BLOCK_IO_ERROR event provides limited error information. Consequence: Debugging of I/O related errors is limited. Change: Add more information to the BLOCK_IO_ERROR event. Result: It's now easier to debug I/O related errors.	Story Points:	---
Clone Of:
Clones:	QMPBlockError		Environment:
Last Closed:	2011-05-19 11:29:41 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	559201, 580954, 643019

Description Haim 2010-08-17 08:29:49 UTC

Description of problem:

I hit the following issue when working with 'qemu-kvm-0.12.1.2-2.109'.
I created a guest machine with 2 COW disks and 1 RAW disk on iSCSI storage.
During guest OS installation, I get the following error from libvirt:

10:22:58.515: debug : qemuMonitorJSONIOProcessEvent:99 : handle BLOCK_IO_ERROR handler=0x4795d0 data=0x7fbe3c1ab300
10:22:58.515: debug : qemuMonitorEmitIOError:856 : mon=0x7fbe14095c70
10:22:58.515: debug : qemuMonitorJSONIOProcess:188 : Total used 206 bytes out of 206 available in buffer
10:22:58.515: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-poo 42 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/2b497bc2-8c67-4390-9172-fc42c1ae9cb3/30e11db4-4249-42c3-8940-a1b861e83ced virtio-disk1 1 eother

or 

libvirtEventLoop::INFO::2010-08-17 10:22:56,294::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,297::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,298::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,298::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::INFO::2010-08-17 10:22:56,299::libvirtvm::739::vds.vmlog.a95ab45d-3dd5-424e-a3c7-315d40d9248c::abnormal vm stop device virtio-disk1 error eother

The guest machine goes into the paused state and cannot be resumed, even though there is no actual problem with the storage.


It looks like it's related to the fact that I'm using a thin-provisioning (COW) disk.
This is a test blocker for us.
I am available in the #{KVM,VIRT,TLV} IRC rooms for more info if needed.

2.6.32-59.1.el6.x86_64
libvirt-0.8.1-23.el6.x86_64
vdsm-4.9-12.3.x86_64
device-mapper-multipath-0.4.9-25.el6.x86_64
lvm2-2.02.72-4.el6.x86_64
qemu-kvm-0.12.1.2-2.109.el6.x86_64


repro: 

1) using libvirt, create guest machine with 2 disks;
   - 4G cow 
   - 1G raw
2) install Linux OS on it (on 4G partition).

Comment 7 Luiz Capitulino 2010-08-19 13:08:17 UTC

We report EPERM, this is probably EACCES.

Comment 8 Kevin Wolf 2010-08-19 13:20:35 UTC

Yes, the cause was an EACCES.
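
For illustration only, a minimal Python sketch of how an EACCES failure could end up reported as 'eother'. It assumes the vendor reason string is picked from a short list of errno values; the reason names are the ones that appear in the events quoted in this report, but the table and the coarse_reason() helper are hypothetical and are not taken from the qemu-kvm source.

```python
import errno
import os

# Reason strings that appear in this report's events. The table itself is
# an assumption made for illustration, not a copy of the qemu-kvm code.
KNOWN_REASONS = {
    errno.EPERM: "eperm",
    errno.ENOSPC: "enospc",
    errno.EIO: "eio",
}


def coarse_reason(err):
    """Map an errno value to a coarse reason, falling back to 'eother'."""
    return KNOWN_REASONS.get(err, "eother")


# EPERM and EACCES are distinct errno values, so a table that only knows
# EPERM cannot name an EACCES failure.
for err in (errno.EPERM, errno.EACCES):
    print(errno.errorcode[err], err, os.strerror(err), "->", coarse_reason(err))
```

Under that assumption, anything outside the table collapses into 'eother', which is why the raw errno value is needed to tell the cases apart.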

Comment 9 Luiz Capitulino 2010-10-14 13:17:56 UTC

We have two possible solutions for this one:

 1. Wait for the new error framework upstream. This is the Right Thing,
    as the new framework is going to allow for inclusion of error objects
    in events, which is the right fix for this BZ.

    The problem is that the work on the new framework has barely started and
    we don't know how long it will take and how hard it's going to be to
    backport it.

 2. Just extend the RHEL6 vendor extension. This is easy to do, but we
    obviously can't go too far, as this solution doesn't scale and that
    interface doesn't and won't exist upstream either.

So, maybe we can try to do 1, and if it fails we do 2. In any case I believe libvirt will have to be updated too.

Comment 10 Kevin Wolf 2010-10-14 14:18:38 UTC

This is mostly a debugging feature, so it would already be very helpful to have the concrete errno value in the QMP message even if libvirt doesn't use it (bonus points for an additional human readable error message via strerror). When given a stopped VM that has failed, I can easily detach libvirt/VDSM and attach manually with netcat to the QMP socket. Actually, QE was able to do that themselves and provide me with the QMP error message.

However, in the reported cases with 6.0, I had to log in on that machine myself, install debuginfos, attach gdb and set a breakpoint at the right place, continue the VM and catch the next error this way, just to find the error number somewhere in the backtrace.

Of course, this would still be a workaround until the Right Thing is available, but it should be very easy to implement.
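
The manual attach described above (netcat on the QMP socket) could also be scripted. The following is a rough Python sketch, not a supported tool. It assumes the monitor is exposed as a UNIX socket, that QMP messages arrive one JSON object per line, and that nothing else (libvirt/VDSM) is attached to the socket at the same time. The function names and the socket path in the usage comment are hypothetical.

```python
import json
import socket


def qmp_handshake(sock, reader):
    """Read the QMP greeting, then leave capabilities negotiation mode."""
    greeting = json.loads(reader.readline())
    request = json.dumps({"execute": "qmp_capabilities"}).encode("utf-8")
    sock.sendall(request + b"\r\n")
    while True:
        msg = json.loads(reader.readline())
        # The command reply carries "return" (or "error"); skip anything else.
        if "return" in msg or "error" in msg:
            return greeting, msg


def iter_block_io_errors(reader):
    """Yield the 'data' member of every BLOCK_IO_ERROR event on the stream."""
    for line in reader:
        if not line.strip():
            continue
        msg = json.loads(line)
        if msg.get("event") == "BLOCK_IO_ERROR":
            yield msg["data"]


def watch(path):
    """Connect to a QMP UNIX socket and print block I/O error events."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("r", encoding="utf-8") as reader:
            qmp_handshake(sock, reader)
            for data in iter_block_io_errors(reader):
                print(data)


# Usage (hypothetical socket path):
# watch("/path/to/qmp.sock")
```

Requests are written with sock.sendall() rather than through the same text file object used for reading, so that buffered but unread lines are not discarded when the request is sent.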

Comment 11 Luiz Capitulino 2010-10-15 13:03:49 UTC

(In reply to comment #10)
> This is mostly a debugging feature, so it would already be very helpful to have
> the concrete errno value in the QMP message even if libvirt doesn't use it
> (bonus points for an additional human readable error message via strerror).
> When given a stopped VM that has failed, I can easily detach libvirt/VDSM and
> attach manually with netcat to the QMP socket. Actually, QE was able to do that
> themselves and provide me with the QMP error message.
> 
> However, in the reported cases with 6.0, I had to log in on that machine
> myself, install debuginfos, attach gdb and set a breakpoint at the right place,
> continue the VM and catch the next error this way, just to find the error
> number somewhere in the backtrace.

Ouch. Does it happen often? If it does, I'll consider fixing it for the
Z stream.

Comment 12 Kevin Wolf 2010-10-15 13:16:48 UTC

Not too often, I did it this way like three or four times. I think having it in 6.1 would be good enough.

Comment 13 Luiz Capitulino 2010-11-02 19:51:28 UTC

(In reply to comment #10)
> This is mostly a debugging feature, so it would already be very helpful to have
> the concrete errno value in the QMP message even if libvirt doesn't use it
> (bonus points for an additional human readable error message via strerror).

Let me confirm I got this right.

Would adding the errno value to the QMP event and printing a human-readable message to stderr be enough to solve this issue for RHEL 6.1?

I'm asking because that solution is unlikely to be visible to regular users, IOW, regular users are going to see what's reported by libvirt, like what is described in the original report.

However, it's very unlikely we'll get the Right Thing in time. Just want to be sure we're on the same page.

Comment 14 Kevin Wolf 2010-11-03 08:45:55 UTC

I think we are on the same page.

Of course, I'd be happy to see the Right Thing with integration in libvirt and VDSM, but I understand that this won't be ready for 6.1. So I think it would already be a major improvement if attaching to QMP manually were enough, so that you wouldn't need to use gdb to debug problems.

Comment 19 Luiz Capitulino 2010-11-10 16:07:18 UTC

Changed the 'version' field by accident; changed it back to 6.0 and updated the correct field (which is 'Target Release').

Comment 22 Shirley Zhou 2010-12-23 09:27:18 UTC

Reproduced this bug on 113:
{"timestamp": {"seconds": 1293087190, "microseconds": 26525}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "eperm", "operation": "write", "action": "stop"}}

{"timestamp": {"seconds": 1293088121, "microseconds": 201391}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "enospc", "operation": "write", "action": "stop"}}

{"timestamp": {"seconds": 1293093292, "microseconds": 341860}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}

Verified this bug on qemu-kvm-0.12.1.2-2.128.el6.x86_64:

(qemu) block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)

{"timestamp": {"seconds": 1293093630, "microseconds": 236418}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Operation not permitted", "errno": 1}, "__com.redhat_reason": "eperm", "operation": "write", "action": "stop"}}


(qemu) block I/O error in device 'drive-virtio-disk0': No space left on device (28)
{"timestamp": {"seconds": 1293094234, "microseconds": 457275}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "No space left on device", "errno": 28}, "__com.redhat_reason": "enospc", "operation": "write", "action": "stop"}}

(qemu) block I/O error in device 'drive-virtio-disk0': Input/output error (5)
{"timestamp": {"seconds": 1293093711, "microseconds": 499425}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}

The monitor and QMP messages above show that the debuggability of the BLOCK_IO_ERROR event has improved.
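
As an illustration of how the extended event can be consumed, here is a small Python sketch that pulls the errno and message out of "__com.redhat_debug_info". The sample event is copied from the 128 output above; the describe_block_io_error() helper is hypothetical. Since this is a vendor extension and an internal interface, the field names should be treated as specific to these RHEL 6 builds.

```python
import json

# Event copied from the qemu-kvm-0.12.1.2-2.128.el6 output in this comment.
SAMPLE_EVENT = (
    '{"timestamp": {"seconds": 1293093630, "microseconds": 236418}, '
    '"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", '
    '"__com.redhat_debug_info": {"message": "Operation not permitted", '
    '"errno": 1}, "__com.redhat_reason": "eperm", "operation": "write", '
    '"action": "stop"}}'
)


def describe_block_io_error(raw):
    """Return a flat summary of a BLOCK_IO_ERROR event, or None otherwise."""
    msg = json.loads(raw)
    if msg.get("event") != "BLOCK_IO_ERROR":
        return None
    data = msg["data"]
    # Older builds (such as 113 above) do not send the debug info member.
    debug = data.get("__com.redhat_debug_info", {})
    return {
        "device": data.get("device"),
        "operation": data.get("operation"),
        "action": data.get("action"),
        "reason": data.get("__com.redhat_reason"),
        "errno": debug.get("errno"),
        "message": debug.get("message"),
    }


print(describe_block_io_error(SAMPLE_EVENT))
```

For an event from an older build, "errno" and "message" come back as None, which matches the limited information shown in the 113 output.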

Comment 23 Haim 2011-01-04 15:01:01 UTC

verified on:

vdsm-4.9-34.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6_0.3.x86_64

Installed a fresh operating system on a 4G COW disk, followed the log, and saw that lvextend was initiated when the high-water mark was reached and that the disk was extended from 0.5 to 2G.

Comment 24 Luiz Capitulino 2011-05-05 17:46:00 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
IMPORTANT: this is an internal interface consumed only by libvirt. Users should only know about libvirt related impact and new functionality (which is not described here).

Cause: The BLOCK_IO_ERROR event provides limited error information.

Consequence: Debugging of I/O related errors is limited.

Change: Add more information to the BLOCK_IO_ERROR event.

Result: It's now easier to debug I/O related errors.

Comment 25 Kate Grainger 2011-05-11 04:58:43 UTC

Hi Luiz, does this sound about right for the external errata text?

When starting a virtual machine that uses thin-provisioning (COW) disk, QEMU would fail to connect to the virtual I/O disk and the virtual machine would go into the pause state without returning much error information. QEMU now returns more verbose error information to help you debug any I/O-related errors.

Comment 26 Luiz Capitulino 2011-05-11 13:12:26 UTC

Hi Kate, I think I would change 'QEMU would fail' to 'QEMU could fail' or 'in an I/O failure scenario'..., otherwise the text is good.

To be honest, I'm not 100% sure it makes sense to report this change to users, but it won't hurt either.

Comment 27 errata-xmlrpc 2011-05-19 11:29:41 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0534.html
