Bug 596279

Summary: QMP: does not report the real cause of PCI device assignment failure
Product: Red Hat Enterprise Linux 6 Reporter: Daniel Berrangé <berrange>
Component: qemu-kvmAssignee: Luiz Capitulino <lcapitulino>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0CC: alex.williamson, armbru, juzhang, lcapitulino, mjenner, tburke, virt-maint
Target Milestone: rcFlags: juzhang: needinfo-
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.85.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-11-10 21:24:58 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 559201    

Description Daniel Berrangé 2010-05-26 14:10:22 UTC
Description of problem:
Attempting to hotplug a PCI device to a guest fails, returning a generic error message with no useful information. The useful error info is merely printed to stderr. This latter info should be included in the QMP error response


Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.64.el6 + Alex W's patches for PCI configfd support

How reproducible:
Always

Steps to Reproduce:
1. Launch a basic QEMU VM with QMP active
2. {"execute":"getfd","arguments":{"fdname":"fd-hostdev0"}}
3. Attempt to hotplug a PCI device that shares an IRQ with another device
   {"driver":"pci-assign","host":"15:00.0","id":"hostdev0","configfd":"fd-hostdev0","bus":"pci.0","addr":"0x5"}}
3.
  
Actual results:
{"error": {"class": "DeviceInitFailed", "desc": "Device 'pci-assign' could not be initialized", "data": {"device": "pci-assign"}}}

And stderr contains

Failed to assign irq for "hostdev0": Operation not permitted
Perhaps you are assigning a device that shares an IRQ with another device?



Expected results:
{"error": {"class": "DeviceInitFailed", "desc": "Device 'pci-assign' could not be initialized. Failed to assign irq for device: Operation not permitted.", "data": {"device": "pci-assign"}}}


Additional info:

Comment 2 RHEL Program Management 2010-05-28 11:55:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Alex Williamson 2010-06-18 14:41:53 UTC
This is 90% a QMP issue.  pci-assign is a victim of an error reporting infrastructure that doesn't provide a means for a driver to report a useful error message via QMP.

Comment 4 Luiz Capitulino 2010-06-23 18:14:35 UTC
This specific case surfers of two problems. The first is our problematic error infrastructure and the other is that, when planning conversions to QMP, we decided to do 'shallow' conversions, that's, we decided to get the major error cause instead of digging too deep in call chains.

The reason is simple: we will need several releases to do a full conversion to whatever error infrastructure we use.

Now, there are two ways of doing what is proposed in 'Expected results', we could:

1. Add a new error for this specific case (easy)

2. Add optional members to the QError object and use it to fill the info we want (not trivial)

Item 1 is an ugly workaround which is not suitable for upstream (and if we do it here, we would need it in other places too). Item 2 is something we would like to have, but I'm afraid we won't be able to do it on time for rhel6 (furthermore this solution is more like a wish have than a bug).

I propose moving this one to rhel6.1.

Comment 5 Markus Armbruster 2010-06-24 06:14:35 UTC
We could do a semi-shallow conversion, as suggested by Dan Berrange.  Something like:

     if (qdev_init(qdev) < 0) {
-        qerror_report(QERR_DEVICE_INIT_FAILED, driver);
+        if (!qerror_is_set())
+            qerror_report(QERR_DEVICE_INIT_FAILED, driver);
         return NULL;
     }

This permits converting drivers one by one instead of all at once.  Maybe that's what you meant by 1.  I wouldn't call it an ugly work-around, and I think it's worth a try upstream.

Comment 6 Daniel Berrangé 2010-06-24 09:23:19 UTC
> I propose moving this one to rhel6.1.    

Ordinarily I'd agree, but PCI device assignment is one of the most troublesome, likely to fail, operations the user can do with QEMU hotplug. So anything we can do to improve error reporting upon failure will reduce bogus bug reports about PCI assignment not working.

Comment 7 Luiz Capitulino 2010-06-24 13:47:30 UTC
(In reply to comment #6)
> > I propose moving this one to rhel6.1.    
> 
> Ordinarily I'd agree, but PCI device assignment is one of the most troublesome,
> likely to fail, operations the user can do with QEMU hotplug. So anything we
> can do to improve error reporting upon failure will reduce bogus bug reports
> about PCI assignment not working.    

Ok, let's try to improve it then.

Comment 8 Luiz Capitulino 2010-06-24 13:57:26 UTC
(In reply to comment #5)
> We could do a semi-shallow conversion, as suggested by Dan Berrange.  Something
> like:
> 
>      if (qdev_init(qdev) < 0) {
> -        qerror_report(QERR_DEVICE_INIT_FAILED, driver);
> +        if (!qerror_is_set())
> +            qerror_report(QERR_DEVICE_INIT_FAILED, driver);
>          return NULL;
>      }
> 
> This permits converting drivers one by one instead of all at once.  Maybe
> that's what you meant by 1.  I wouldn't call it an ugly work-around, and I
> think it's worth a try upstream.    

Yes, it's exactly what I meant by 1.

I think this is not ideal because we will keep spreading QError the wrong way, maybe upstream will care, maybe it won't.

I will try it anyway, as this is our only way to report error today.

Comment 9 Luiz Capitulino 2010-06-24 17:41:53 UTC
Ok, was working on this and found two problems:

1. There are lots of errors in hw/device-assigment.c that are reported with fprintf() (27 calls) or perror() (11 calls). Of course that the numbers are very different if we consider initialization code only

2. This code is in qemu-kvm, but not in upstream qemu. Which means that we would be extending a bad interface into a downstream fork

So, I think the most reasonable thing to do is:

1. Identify the most important errors we want to report to users, ie. the ones that will avoid what is described in comment #6

2. Let's introduce those errors in rhel6 only, as vendor extensions (assuming this is not a problem to libvirt, of course)

Comment 14 juzhang 2010-08-18 06:15:38 UTC
I tried all unused PCI device on my machine,no luck.can't reproduce this issue.in order to reproduce this issue.I borrowed a 82541nic(don't support MSI/MSIX).however,I still not sure cover this issue.

Tested with qemu-kvm-0.12.1.2-2.82.el6(unfixed version)
---------------------------------------------------------
steps:
1. Unbind device from host(82541nic,don't support MSI/MSIX)
echo "8086 107c" > /sys/bus/pci/drivers/pci-stub/new_id
echo 0000:37:04.0 > /sys/bus/pci/devices/0000\:37\:04.0/driver/unbind
echo 0000:37:04.0 > /sys/bus/pci/drivers/pci-stub/bind

2. boot guest
#/usr/libexec/qemu-kvm -S -M rhel6.0.0 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name test -rtc base=utc -boot c -drive file=/root/zhangjunyi/RHEL-Server-6.0-64-virtio.qcow2,if=none,id=drive-virtio-disk0,boot=on,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,mac=76:00:40:3F:20:20,bus=pci.0,addr=0x4,id=hello -boot c -vnc :10 -qmp tcp:0:4444,server,nowait -monitor stdio

3. telnet the qmp server
#{"execute":"qmp_capabilities"}

4.hotplug 82541nic using qmp command
#{"execute":"device_add","arguments":{"driver":"pci-assign","host":"37:04.0","id":"test1"}}


results:
1.{"error": {"class": "DeviceInitFailed", "desc": "Device 'pci-assign' could not be initialized", "data": {"device": "pci-assign"}}}
2.(qemu) Failed to assign irq for "test1": Input/output error
Perhaps you are assigning a device that shares an IRQ with another device?


Tested with qemu-kvm-0.12.1.2-2.109.el6(fixed version) using the same steps
---------------------------------------------------------
After step4.

Results:
1.{"error": {"class": "DeviceInitFailed", "desc": "Device 'pci-assign' could not be initialized: Failed to assign irq: Input/output error", "data": {"device": "pci-assign", "__com.redhat_strerror": "Input/output error", "__com.redhat_reason": "Failed to assign irq"}}}

2.(qemu) Failed to assign irq for "test1": Input/output error
Perhaps you are assigning a device that shares an IRQ with another device?


I have two question

1.why not allowed to assign device when guest all ready exist the same IRQ value.in my option,in modern pci device almost support IRQ sharing.

2.If my scenarios not cover this issue.would you please give some tips?

Comment 15 Luiz Capitulino 2010-08-20 18:12:46 UTC
(In reply to comment #14)

> 1.why not allowed to assign device when guest all ready exist the same IRQ
> value.in my option,in modern pci device almost support IRQ sharing.

Not sure if I got it.

> 2.If my scenarios not cover this issue.would you please give some tips?

It does, you can see that "desc" has been extended in the fixed package.

Comment 16 juzhang 2010-08-23 02:00:59 UTC
(In reply to comment #15)
> (In reply to comment #14)
> 
> > 1.why not allowed to assign device when guest all ready exist the same IRQ
> > value.in my option,in modern pci device almost support IRQ sharing.
> 
> Not sure if I got it.
> 
> > 2.If my scenarios not cover this issue.would you please give some tips?
> 
> It does, you can see that "desc" has been extended in the fixed package.


Thanks Luiz,about first question,if you got it,please let me via mail or IRC.

Based on comment15,mark this issue as verified.

Comment 17 Luiz Capitulino 2010-09-22 17:18:29 UTC
Any reason not to close this one?

Comment 18 releng-rhel@redhat.com 2010-11-10 21:24:58 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.