618226 – Cannot create PV guest with PCI device assigned after restarting xend

Bug 618226 - Cannot create PV guest with PCI device assigned after restarting xend

Summary: Cannot create PV guest with PCI device assigned after restarting xend

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	xen
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Michal Novotny
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	514500
TreeView+	depends on / blocked

Reported:	2010-07-26 13:13 UTC by Yufang Zhang
Modified:	2014-02-02 22:37 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-08-31 12:39:59 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
full xend.log (916.05 KB, text/plain) 2010-07-26 13:13 UTC, Yufang Zhang	no flags	Details
xm dmesg log (10.37 KB, text/plain) 2010-07-29 03:04 UTC, Yufang Zhang	no flags	Details
domain0 dmesg (22.85 KB, text/plain) 2010-07-29 03:04 UTC, Yufang Zhang	no flags	Details
xend.log (115.28 KB, text/plain) 2010-07-29 03:05 UTC, Yufang Zhang	no flags	Details
Scripts used to try to reproduce (982 bytes, application/octet-stream) 2010-07-30 11:44 UTC, Michal Novotny	no flags	Details
bug reproducer (8.87 MB, application/x-gzip) 2010-08-01 15:22 UTC, Yufang Zhang	no flags	Details
View All

Description Yufang Zhang 2010-07-26 13:13:55 UTC

Created attachment 434411 [details]
full xend.log

Description of problem:
After restarting xend, we sometimes cannot create PV guests with PCI device assigned. 

Version-Release number of selected component (if applicable):
kernel-xen-devel-2.6.18-206.el5
xen-3.0.3-114.el5virttest30.g47a7c9c
xen-debuginfo-3.0.3-114.el5virttest30.g47a7c9c
xen-devel-3.0.3-114.el5virttest30.g47a7c9c
kernel-xen-2.6.18-206.el5
xen-libs-3.0.3-114.el5virttest30.g47a7c9c

How reproducible:
Sometimes

Steps to Reproduce:
1. Bind PCI device to pciback driver
2. # xm pci-list-assignable-device 
0000:03:00.0

3. Create the a PV guest with the PCI device assigned to it
4. Shutdown the PV guest
5. Restart xend
6. Try to create the PV guest again with the PCI device assigned. Got error like this:
# xm cr /tmp/rhel5.4-64-pv.cfg -c
Using config file "/tmp/rhel5.4-64-pv.cfg".
Using <class 'grub.GrubConf.GrubConfigFile'> to parse /grub/menu.lst
Error: (22, 'Invalid argument')

  
Actual results:
The guest couldn't be created.

Expected results:
Guest should be created successfully. 

Additional info:

Comment 1 Yufang Zhang 2010-07-27 09:46:36 UTC

Testing this bug on a HP Z400 machine, we could create the PV guest with PCI device assigned successfully even after we restart xend several times.

Comment 2 Michal Novotny 2010-07-27 11:35:42 UTC

Thanks for your testing Yufang. The issue here is that the error 22 (-EINVAL = Invalid argument) that's coming from physdev_op() hypercall. Since this appears hardware-related bug I'm adding Andrew to CC list because there may be a bug in the pciback driver or somewhere in the kernel/hypervisor (the implementation of physdev_op() is *not* in the user-space and this seems to be a hypercall).

Michal

Comment 3 Michal Novotny 2010-07-27 11:36:18 UTC

Andrew,
what do you think about that?

Michal

Comment 4 Andrew Jones 2010-07-27 15:37:48 UTC

I think we should take a look at 'xm dmesg' after getting the error. We may also find dom0's and the guest's dmesg useful, so please attach them as well.

Thanks,
Drew

Comment 5 Michal Novotny 2010-07-27 15:45:23 UTC

It took me some investigation and it's coming from setupDevice() method of /usr/lib64/python2.4/site-packages/xen/xend/server/pciif.py file and the physdev_map_pirq() function of libxc as defined:

            rc = xc.physdev_map_pirq(domid = fe_domid,
                                   index = dev.irq,
                                   pirq  = dev.irq)

On the machine used for testing (Dell 760) the fe_domid is being set to the front-end domain id and the dev.irq was set to 255 so both the index and pirq was set to value of 255. The function is being defined in tools/libxc/xc_physdev.c source code file where it returns -EINVAL in the case the pIRQ is not being set or -ENOSYS when it's not being implemented in the hypervisor (which is not the case and also, the pIRQ is being set so it didn't match any of those conditions and therefore the results is the value coming from do_physdev_op() function. The type here is MAP_PIRQ_TYPE_GSI (0x1) with PHYSDEVOP_map_pirq command and then in tools/libxc/xc_private.h there's a definition of do_physdev_op() accepting the command cmd and the op argument which is using ioctl() call to as hypercall pass a privileged command to the hypervisor.

Now it's coming to the kernel at linux-2.6-xen-sparse/drivers/xen/privcmd/privcmd.c and here it passes the hypercall to the hypervisor if I understand it correctly.

In the hypervisor code there's a arch/x86/physdev.c which is having the PHYSDEVOP_map_pirq command handling defined and that is trying to map pIRQ using the physdev_map_pirq() function and this function returns -EINVAL if the map is not defined - is empty:

    if ( !map )
        return -EINVAL;

at the beginning of the code or later, for MAP_PIRQ_TYPE_GSI, when the IRQ is invalid or IOAPIC vector is not being found/valid. This is defined in arch/x86/physdev.c file on line 60. Since index should be set to value 255 and this is within the limit there may be missing/not valid IO_APIC vector (which is basically irq_vector[index] entry in the irq_vector table - the table is being filled using the assign_irq_vector() function as defined in arch/x86/io_apic.c) so if my understanding is correct, we would need more information coming directly from the hypervisor. I can see there are some dprintk() calls with relevant messages so those messages should be printed using `xm dmesg` log.  Could you please try on the problematic machine and provide me `xm dmesg` log output?

Thanks a lot,
Michal

Comment 6 Michal Novotny 2010-07-27 15:47:00 UTC

(In reply to comment #4)
> I think we should take a look at 'xm dmesg' after getting the error. We may
> also find dom0's and the guest's dmesg useful, so please attach them as well.
> 
> Thanks,
> Drew    

Based on my investigation above I guess the `xm dmesg` log is the only one to help since this is issue on all the PV guests on this host machine from what I know but only on this host. Other hosts doesn't have this issue as based on our yesterday's IRC conversation.

Michal

Comment 13 Yufang Zhang 2010-07-29 03:04:12 UTC

Created attachment 435177 [details]
xm dmesg log

Comment 14 Yufang Zhang 2010-07-29 03:04:48 UTC

Created attachment 435178 [details]
domain0 dmesg

Comment 15 Yufang Zhang 2010-07-29 03:05:45 UTC

Created attachment 435179 [details]
xend.log

Comment 17 Andrew Jones 2010-07-29 05:36:52 UTC

Ok, this is interesting. The problem occurred, then Michal rebooted the machine and then the problem couldn't reproduce. Yufang then rebooted the machine again and the problem could reproduce. It seems we're missing something here. Are we rebooting the machine the same way? i.e. normal old reboot, nothing special with grub? I don't see anything strange on the command lines of xen or dom0 from Yufang's latest log uploads, so then are we doing the same thing before we're attempting to reproduce the bug after it's been rebooted?

Is the host set up correctly to capture a core? If so, let's get one while the bug is reproducible, and then experiment more with this rebooting stuff to see if we can find what makes it go away and come back.

Drew

Comment 18 Yufang Zhang 2010-07-29 07:25:12 UTC

I think we did miss some steps in previous comments:

(1) yufang produce this bug with xen-3.0.3-114.el5virttest30.g47a7c9c
(2) yufang upgrade xen package to xen-3.0.3-114.el5 for some other reasons
(3) Michal log into the machine and reboot it, couldn't reproduce the bug after rebooting
(4) yufang log into the machine, couldn't reproduce this bug neither. 
(5) yufang upgrade xen package to xen-3.0.3-114.el5virttest30.g47a7c9c again, couldn't reproduce this bug without rebooting the host.
(6) yufang reboot the host and finally reproduce the bug again.

So I think (2) and (5) are missed in previous steps. Am I right, Michal? Don't know if this bug is reproduced only with xen-3.0.3-114.el5virttest30.g47a7c9c and if rebooting the host has some impacts on this scenario.

Comment 19 Michal Novotny 2010-07-29 11:56:41 UTC

(In reply to comment #18)
> I think we did miss some steps in previous comments:
> 
> (1) yufang produce this bug with xen-3.0.3-114.el5virttest30.g47a7c9c
> (2) yufang upgrade xen package to xen-3.0.3-114.el5 for some other reasons
> (3) Michal log into the machine and reboot it, couldn't reproduce the bug after
> rebooting
> (4) yufang log into the machine, couldn't reproduce this bug neither. 
> (5) yufang upgrade xen package to xen-3.0.3-114.el5virttest30.g47a7c9c again,
> couldn't reproduce this bug without rebooting the host.
> (6) yufang reboot the host and finally reproduce the bug again.
> 
> So I think (2) and (5) are missed in previous steps. Am I right, Michal? Don't
> know if this bug is reproduced only with xen-3.0.3-114.el5virttest30.g47a7c9c
> and if rebooting the host has some impacts on this scenario.    

Yufang,
what about having g47a7c9c package installed and having the PCI device enabled in the PV guest configuration file and trying multiple reboots? Like I say, this seems to be coming from the hypervisor. If you can reproduce it, could you please attach output of xm dmesg ?

We need a reliable reproducer so try having the g47a7c9c version of xen package and also the same version of kernel-xen package and multiple reboots please.

Thanks,
Michal

Comment 20 Andrew Jones 2010-07-29 12:26:12 UTC

(In reply to comment #19)
> We need a reliable reproducer so try having the g47a7c9c version of xen package
> and also the same version of kernel-xen package and multiple reboots please.

Michal, to me it sounds like Yufang already did exactly what you're asking him to do. I think you should now try it, not him again, to ensure what we think the difference is, was actually the difference. If the only difference between both of your reproduce attempts were the version of the userspace, then we've already narrowed this bug down to the userspace, and then even to just a handful of patches for that matter.

Comment 21 Andrew Jones 2010-07-29 12:29:40 UTC

(In reply to comment #18)
> (5) yufang upgrade xen package to xen-3.0.3-114.el5virttest30.g47a7c9c again,
> couldn't reproduce this bug without rebooting the host.
> (6) yufang reboot the host and finally reproduce the bug again.

Yufang, did you do 'service xend restart' after upgrading xen in step (5)? If not, that would explain why the reboot allowed you to reproduce (xend of course was restarted when the machine was rebooted). Without restarting xend, even if you update the package, you're still running the old code.

Comment 22 Michal Novotny 2010-07-30 11:44:37 UTC

Created attachment 435533 [details]
Scripts used to try to reproduce

Well, I did create a script doing basically all the stuff from the host boot (including probing for pciback and unbinding the device and binding it to pciback) using following versions of packages.

xen-3.0.3-114.el5virttest30.g47a7c9c
kernel-xen-2.6.18-206.el5

Also, the reason for comment 19 was that sometimes when I logged into the machine the PV guest was using the PCI device and sometimes not so this was confusing.

Try using this:
 1) reboot the host first
 2) run reproduce.sh from the attached tarball for about 10 times, it always passed and the PCI device was always visible in the guest (the scripts on the host machine are available at /root/bz618226)
  [ the script is restarting xend after the PV guest is shutdown ]
 3) reboot the machine, run a loop of 100 runs
 4) reboot the machine again, run a loop again

No problems occurred at all using the versions described above. If there would be a problem, it would create a log file named xm-dmesg-err-$TIMESTAMP.log in ./log of the reproducer tree. There was no such file and all the guests were started successfully. Of course, I changed `xm console rhel-pv-pci` to `xm shutdown rhel-pv-pci` after a short sleep to make it possible to run it in the the loop (it won't be possible when connecting to console only and not running something like `expect` tool etc.)

Could you please check the contents of reproducer script in the tarball and/or try to reproduce it yourself? I was unable to run into those issues at all. It is nice that in comment 18 you write how you got into those issue but it didn't work for me either. Could you please retest using the versions I used (above)?

Thanks,
Michal

Comment 23 Yufang Zhang 2010-08-01 15:20:17 UTC

(In reply to comment #21)
> (In reply to comment #18)
> > (5) yufang upgrade xen package to xen-3.0.3-114.el5virttest30.g47a7c9c again,
> > couldn't reproduce this bug without rebooting the host.
> > (6) yufang reboot the host and finally reproduce the bug again.
> 
> Yufang, did you do 'service xend restart' after upgrading xen in step (5)? If
> not, that would explain why the reboot allowed you to reproduce (xend of course
> was restarted when the machine was rebooted). Without restarting xend, even if
> you update the package, you're still running the old code.    

I think find the reason that triggers the problem for creating PV guest with pci device assigned. It is due to that I upgrade/downgrade xen packages after binding pci device to pciback driver but before creating the the guest. I have modified your scripts to reproduce the bug. And it is almost always reproducible with this script. You could reboot the host and have a try with this new script(A few modifications may be needed for log collecting).

Comment 24 Yufang Zhang 2010-08-01 15:22:09 UTC

Created attachment 435878 [details]
bug reproducer

Comment 25 Michal Novotny 2010-08-02 08:46:23 UTC

(In reply to comment #23)
> (In reply to comment #21)
> > (In reply to comment #18)
> > > (5) yufang upgrade xen package to xen-3.0.3-114.el5virttest30.g47a7c9c again,
> > > couldn't reproduce this bug without rebooting the host.
> > > (6) yufang reboot the host and finally reproduce the bug again.
> > 
> > Yufang, did you do 'service xend restart' after upgrading xen in step (5)? If
> > not, that would explain why the reboot allowed you to reproduce (xend of course
> > was restarted when the machine was rebooted). Without restarting xend, even if
> > you update the package, you're still running the old code.    
> 
> I think find the reason that triggers the problem for creating PV guest with
> pci device assigned. It is due to that I upgrade/downgrade xen packages after
> binding pci device to pciback driver but before creating the the guest. I have
> modified your scripts to reproduce the bug. And it is almost always
> reproducible with this script. You could reboot the host and have a try with
> this new script(A few modifications may be needed for log collecting).    

Yufang, I can see this is well-reproducible using the reproduce you attached and correct me if I'm wrong but I think you should always update both xen and kernel-xen packages and I'm not sure whether older kernel-xen version and newer xen version could be working fine together - although I don't think so.

For case of RHEL-5 upgrades the dom0 will always upgrade kernel-xen as well as xen package versions so I guess that if the upgrading and rebooting dom0 (for new kernel-xen to be loaded) will also upgrade the xen package itself and after this reboot everything is working fine (which is what I think according to my observations) we don't have to care about this that much.

Also, the problem is a bit of incompatibility before reboot since `xm dmesg` say:

(XEN) physdev.c:67: dom5: map irq with no vector 0
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) physdev.c:67: dom6: map irq with no vector 0

What *may* be connected to this very closely is the "Allowing physinfo call with newer ABI version" line AFAIK. I think this basically means that there *may* be some kind of incompatibility and that's why this warning is being printed to the xm dmesg output.

When I restart the daemon, the only new lines in `xm dmesg` are those lines:
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) physdev.c:67: dom7: map irq with no vector 0

I think hypervisor may need to be reinitialized when upgrading to the newer version of xen packages since it somehow lost track of IO_APIC_VECTOR according to this line:

(XEN) physdev.c:67: dom7: map irq with no vector 0

Like stated above, the vector is basically having the value of  IO_APIC_VECTOR(map->index) so the value is apparently being lost.

Andrew, I know you're pretty busy however I'd like to ask you about using older version of kernel-xen package and newer version of xen package. The reproducer Yufang did is basically doing upgrading or downgrading the xen package before running it. I was never able to run into those issues using any version of xen package after I rebooted dom0. Is it even supported to use older version of kernel-xen and newer xen package? If the reboot after upgrade/downgrade fix it, what could we do about this one?

Thanks,
Michal

Comment 26 Andrew Jones 2010-08-02 09:03:14 UTC

We should support [up|down]grading xen userspace without the need to change the HV/dom0 kernel. If there's a reason we can't, then things get messy; release notes, rpm dependencies, etc. Not good, and to my knowledge, it hasn't happened since 5.2 when the last large rebase was done. I didn't look at the current list of reproducer steps, but I think if there's a 'restart xend' anywhere in there, that that would be the clue to where the problem is. xend must be losing important state, or it needs to change the state in the HV to match it's new state. In other words, xend needs to deal with the pci binding across xend restarts.

Andrew

Comment 27 Michal Novotny 2010-08-02 09:49:22 UTC

(In reply to comment #26)
> We should support [up|down]grading xen userspace without the need to change the
> HV/dom0 kernel. If there's a reason we can't, then things get messy; release
> notes, rpm dependencies, etc. Not good, and to my knowledge, it hasn't happened
> since 5.2 when the last large rebase was done. I didn't look at the current
> list of reproducer steps, but I think if there's a 'restart xend' anywhere in
> there, that that would be the clue to where the problem is. xend must be losing
> important state, or it needs to change the state in the HV to match it's new
> state. In other words, xend needs to deal with the pci binding across xend
> restarts.
> 
> Andrew    

Well, I've been talking to Mirek today about this and we support it but there are some issues sometimes - like with the PCI things. The PCI passthru *may* be broken when using older kernel-xen and newer xen packages from what I've been told. Also, keep in mind that this is reproducible only on this machine.

PCI binding lives across the restart only if there was no upgrade and the HV message stays for itself that the PCI binding on the hypervisor got lost so that's why it's basically returning -EINVAL. The only thing Xend itself could do for that is that it could try to tell the HV to reinitialize the bindings for the PCI stuff.

Well, also, I think the problem may be in the pciback driver because I tried to remove the driver using `rmmod pciback` and run the script again (there's a probing for this module and binding the device to pciback then) and it was working fine to run the guest and the pci device was visible in the guest. Isn't it possible there's a bug in a pciback driver? Isn't there some kind of reinitialization call in the pciback driver?

Also, Yufang, please. If I'm working on this one, please don't change from ASSIGNED state to NEW. It would look like nobody's currently working on this one and that's not true since I'm working on this one.

Thanks,
Michal

Comment 28 Michal Novotny 2010-08-02 11:49:59 UTC

Well,
I did install -80.el5 version and it was working fine. Then I tried to upgrade to -114 and it was no longer working saying the "Invalid argument" message again. Then I tried to downgrade from -114 to -105 and it was still not working. When I downgraded to -94 version it was still not working but when I downgraded to -80 version it was working so basically it was working with the version I started with.

I did try rebooting the dom0 when having -94 version (and kernel-xen-2.6.18-206.el5) and I was able to make run the guest with PCI device connected. Then I did try downgrading to -80 version and it was working fine as well. I tried to run reproducer between -94 and -105 versions and it was working fine as well. Even for the virttest package versions is was working fine.

I did try later using -105 version installed and rebooting dom0 and also doing the same steps. I tried switching between -105 and virttest package versions and it was working fine in 19 tries that the reproducer does but the last, 20th try it failed with the message so the problem is that it's not that well reproducible. But when it was reproduced I tried downgrading to -105 version and it was not working any longer.

(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) physdev.c:67: dom25: map irq with no vector 0

When I remove module pciback and load it again, I'm seeing:

(XEN) PCI remove device 03:00.0
(XEN) PCI add device 00:00.0
(XEN) PCI add device 00:1e.0
(XEN) PCI add device 00:1f.0
(XEN) PCI add device 01:00.0
(XEN) PCI add device 03:00.0

in the `xm dmesg` output and now it's working fine and when starting guest with PCI device there was additional:

(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
(XEN) PCI add device 03:00.0
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version

so this has been done by the pciback driver itself. The problem is that is appears not to be 100% reproducible and it can take some time to reproduce it and since the "PCI add device" are being printed when the pciback device is added/probed.

Unfortunately testing was showing that it most probably was there always but we never hit it yet since this is reproducible on one machine only AFAIK.

Also, there's some kind of comment:

/* wait for xend to configure us */
err = xenbus_switch_state(dev, XenbusStateInitWait);
if (err)
goto out;

in the pciback driver itself so according to this command I guess that the reinit code in the xend is not being met correctly so I'll try to investigate the xend code path for PCI devices.

Michal

Comment 29 Michal Novotny 2010-08-02 13:59:06 UTC

Well,
I've been testing this a little more and it turned out that this is always reproducible and almost randomly so the information that is happens more often is irrelevant now since I found out that it's not 100% reproducible and sometimes it's reproducible immediately and sometimes even after 100 tries and also for just restarting xend finally (I did write a loop of 100 tries to run the guest with PCI device attached and it was reproducible as well but after 83 tries in my testing).

Nevertheless I found a workaround when this happens which is to rebind the device ID to the pciback driver. This workaround proved to be working fine for me.

# echo $BDF > /sys/bus/pci/drivers/pciback/unbind
# echo $BDF > /sys/bus/pci/drivers/pciback/bind

The BDF notation is the notation like 0000:03:00.0 or something like that. For using the PV PCI passthru the pciback driver has to be loaded which is *not* done automatically and user needs to do it manually.

The problem seems to be that when you restart xend the binding gets lost sometimes (like I say, it's not 100% reproducible). Provided the fact the device has to be unbound from the original driver and bound to the pciback driver manually the user has to unbind and rebind the device to the driver manually.

For now, we have a workaround but for obvious security reasons the pciback driver is not being loaded manually and doing the commands above automatically is not the option then.

Michal

Comment 30 Andrew Jones 2010-08-02 15:46:32 UTC

Can you reproduce it more reliably if you add another pci device into the setup and then swap the order in which they are attached after shutting down the guest and restarting xend?

Comment 31 Michal Novotny 2010-08-02 15:52:06 UTC

(In reply to comment #30)
> Can you reproduce it more reliably if you add another pci device into the setup
> and then swap the order in which they are attached after shutting down the
> guest and restarting xend?    

Nope, basically it happens randomly all the time. Sometimes it happens in 100 tries, and sometimes is requires more than 250 tries so basically no luck when adding the second guest with PCI device. So, it doesn't seem that we have a reliable reproducer for that.

Michal

Comment 32 Michal Novotny 2010-08-02 15:56:34 UTC

Also, I saw there's the PCI device mapping code in the Xend (calling lowlevel physdev_map_pirq() function) but there's no unmapping code but it's working fine on most of the configuration, more specifically all except this Dell 760.

Michal

Comment 33 Michal Novotny 2010-08-02 16:45:07 UTC

Well, I did some extended logging for libxc and I may have found something relevant. In the log files there's:

xc_physdev_map_pirq: Entering with domid=334, index=17
xc_physdev_map_pirq: Entering with domid=335, index=16
xc_physdev_map_pirq: Entering with domid=336, index=255

which is basically just a hypervisor call PHYSDEVOP_map_pirq() but the index is basically what's being set from python using this line: index = dev.irq .

Therefore, it appears that the IRQ for device is somehow bad since it should be 17 instead of 255. The value is being set to 255 from the sysfs path for the particular PCI device. This is being read from the PCI device path (/sys/bus/pci/devices/$DEVICE_ID/irq) and the results seems to be 255 for this failure case. This is what it wrote to my logfile:

...
PCI DEVICE Path: /sys/bus/pci/devices/0000:03:00.0/irq
PCI DEVICE IRQ: 17
PCI DEVICE Path: /sys/bus/pci/devices/0000:03:00.0/irq
PCI DEVICE IRQ: 255
...

reading the contents from sysfs it returned following:
# cat /sys/bus/pci/devices/0000:03:00.0/irq
255

So the IRQ that it was bound to was 255, not 17 like it should be and it in the first case. The second case is having some bogus number of IRQ and the device is not on this IRQ, which is basically why hypervisor complains using this message:

(XEN) physdev.c:67: dom3: map irq with no vector 0

Also, the device seems to be bound to the pciback driver:
# cat /sys/bus/pci/drivers/pciback/slots 
0000:03:00.0
0000:00:1b.0
# cat /sys/bus/pci/drivers/pciback/permissive 
0000:03:00.0
#

When I use the workaround mentioned above I'm having these results:

# echo "0000:03:00.0" > /sys/bus/pci/drivers/pciback/unbind
# echo "0000:03:00.0" > /sys/bus/pci/drivers/pciback/bind
# cat /sys/bus/pci/devices/0000:03:00.0/irq
17
# cat /sys/bus/pci/drivers/pciback/slots 
0000:03:00.0
0000:00:1b.0
#  cat /sys/bus/pci/drivers/pciback/permissive 
#

The interesting thing is that the device disappeared from the pciback/permissive entry. I can't claim I understand PCI and stuff but couldn't this be the issue? What does it mean anyway?

Thanks,
Michal

Comment 34 Andrew Jones 2010-08-03 07:35:16 UTC

Great work on tracking the irq. Were you able to correlate it's changes to any
of the other actions you were doing? e.g. guest restarts, xend restarts, etc.
or did it seem random? In the end it may very well be a hardware issue. To test
that we can try booting into a bare-metal kernel and modprobing the nic on and
off so it has to reinit over and over. Monitor the irq each time and see if
it's ever 255.

Note, I don't think changing from 17 to 16 indicates a problem. I doubt that it
should always be 17, but rather any free irq in a certain range. 255 on the
other hand looks like an error.

Comment 35 Michal Novotny 2010-08-03 08:48:37 UTC

(In reply to comment #34)
> Great work on tracking the irq. Were you able to correlate it's changes to any
> of the other actions you were doing? e.g. guest restarts, xend restarts, etc.
> or did it seem random?

No, it happens randomly all the time. Also, for just guest restarts it passed just 20 times and then it was stopped. Keep in mind this is not my machine and this is the remote machine so maybe somebody stopped it on person, unfortunately when starting it it was already midnight according to the machine time so I was unable to tell somebody otherwise that using e-mail. Maybe the e-mail did arrive to the right person or something... I'm currently running the guest restart testing again and in the meanwhile I'm studying the code path that's being met by Xend restart - to exclude the possibility the IRQ changes over the Xend restart. Like I told you yesterday, it may be possible that it's there but I doubt Xend does this part of the job - unless they're using some libxc call that's basically triggering it. I don't know yet but I'll let you know when having more results from testing and investigation of the code.

> In the end it may very well be a hardware issue. To test
> that we can try booting into a bare-metal kernel and modprobing the nic on and
> off so it has to reinit over and over.


Well, it doesn't reinit when the module is loaded, does it? So basically what we need to do is to modprobe and rmmod in the loop, right?


> Monitor the irq each time and see if it's ever 255.
> 

Exactly. It doesn't have to be 255, it may be something else, e.g. 253 or something else than the value that was before.

> Note, I don't think changing from 17 to 16 indicates a problem. I doubt that > it should always be 17, but rather any free irq in a certain range. 255 on the
> other hand looks like an error.    

Oh, I forgot to mention something: This was caught when using 2 guests with PCI device attached and IRQ 17 was for NIC and IRQ 16 was for Intel High Definition Audio so this is not a problem. But when the issue got reproduced it got reproduced for both the devices. IRQ 17 (NIC's) changed to 255 and IRQ 16 (HDA Intel's) changed to 253 or something like that.

Michal

Comment 36 Michal Novotny 2010-08-03 08:51:13 UTC

(In reply to comment #35)
> (In reply to comment #34)
> > Great work on tracking the irq. Were you able to correlate it's changes to any
> > of the other actions you were doing? e.g. guest restarts, xend restarts, etc.
> > or did it seem random?
> 
> No, it happens randomly all the time. Also, for just guest restarts it passed
> just 20 times and then it was stopped. Keep in mind this is not my machine and
> this is the remote machine so maybe somebody stopped it on person,
> unfortunately when starting it it was already midnight according to the machine
> time so I was unable to tell somebody otherwise that using e-mail. Maybe the
> e-mail did arrive to the right person or something... 

Oh, sorry, this paragraph was a little confusing. What I meant was that somebody *may* have stopped it on purpose and the e-mail notification may have been sent to incorrect person.

Sorry once again,
Michal

Comment 37 Michal Novotny 2010-08-03 09:53:03 UTC

Well, I did try the bare-metal testing when removing the module and probing for it again and it turns out that the IRQ changes on the device. By reinit I mean rmmod the driver and modprobe it again:

Reinit 0: 66
Reinit 1: 106
Reinit 2: 138
Reinit 3: 154

and so on. The script I used on the bare-metal kernel is very easy:

#!/bin/bash
i=0
while [ 1 ]; do
  IRQ=`cat /sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/irq`
  echo "Reinit $i: $IRQ" >> test-irq-03.log
  rmmod e1000e
  modprobe e1000e
  let i=i+1
done

So I think this is basically a hardware bug, isn't it?

What do you think, Andrew?

Thanks,
Michal

Comment 38 Andrew Jones 2010-08-03 11:20:24 UTC

(In reply to comment #37)
> So I think this is basically a hardware bug, isn't it?

No, like I said before, I think the irq will likely change (e.g. 16 was ok, even though it wasn't 17), but I don't believe it should ever be 255 (which could mean error; 255 == (char)-1). I'm not 100% sure that's true, but I'm certainly not concerned with anything that isn't 255. So the irqs you have posted for the bare-metal test all look fine to me. You could look in dmesg for error messages too while on bare-metal, but it's probably not hardware if you've done several modprobes with no strange irqs.

Comment 39 Michal Novotny 2010-08-03 12:01:24 UTC

(In reply to comment #38)
> (In reply to comment #37)
> > So I think this is basically a hardware bug, isn't it?
> 
> No, like I said before, I think the irq will likely change (e.g. 16 was ok,
> even though it wasn't 17), but I don't believe it should ever be 255 (which
> could mean error; 255 == (char)-1). I'm not 100% sure that's true, but I'm
> certainly not concerned with anything that isn't 255. So the irqs you have
> posted for the bare-metal test all look fine to me. You could look in dmesg for
> error messages too while on bare-metal, but it's probably not hardware if
> you've done several modprobes with no strange irqs.    

I can't do it anymore since the machine is the remote machine and it's having the very same driver for both NICs on the system. When I was doing this I had to ask Yufang for testing since it killed the network connection. Nevertheless the IRQ changed every reinit phase so basically I think this is why it's that strange.

First probe we have some IRQ and HV is aware of this IRQ nevertheless the IRQ changes and HV doesn't have the IO_APIC_VECTOR established for this new IRQ and it's having it for old IRQ only. Not for the new one so basically I think this is the reason for -EINVAL coming from HV -> the IRQ changes but the HV is not aware. The test confirmed that the IRQ changes but the HV is having the old IRQ bound to itself - not the new one.

Michal

Comment 40 Michal Novotny 2010-08-03 12:14:26 UTC

(In reply to comment #38)
> (In reply to comment #37)
> > So I think this is basically a hardware bug, isn't it?
> 
> No, like I said before, I think the irq will likely change (e.g. 16 was ok,
> even though it wasn't 17), but I don't believe it should ever be 255 (which
> could mean error; 255 == (char)-1). I'm not 100% sure that's true, but I'm
> certainly not concerned with anything that isn't 255. So the irqs you have
> posted for the bare-metal test all look fine to me. You could look in dmesg for
> error messages too while on bare-metal, but it's probably not hardware if
> you've done several modprobes with no strange irqs.    

Maybe I understand what you mean. Keep in mind that the IRQ was the same for all the cases it did work.

IRQ 16 was HDA Intel *always* it was working
IRQ 17 was e1000e *always* it was working

When one of those IRQs changed the HV was *not* aware and it *always* failed with -EINVAL so my guess is that if the IRQ changes the HV is not aware of it and that why it fails.

Michal

Comment 41 Andrew Jones 2010-08-03 12:30:06 UTC

OK. Sounds like you have a theory, so you should investigate more to find out if that's true. Why did you put needinfo on me to tell me this? Please only set needinfo if you actually need it, and please try to be more specific with questions, i.e. investigate your ideas deep enough first that your questions are truly specific.

Comment 42 Michal Novotny 2010-08-03 13:13:33 UTC

(In reply to comment #41)
> OK. Sounds like you have a theory, so you should investigate more to find out
> if that's true. Why did you put needinfo on me to tell me this? Please only set
> needinfo if you actually need it, and please try to be more specific with
> questions, i.e. investigate your ideas deep enough first that your questions
> are truly specific.    

Oh, sorry for the needinfo now. It was by accident. In fact I don't know how to prove the theory some other way than the way we spoke about. The devices are being added to the hypervisor on rebinding it to the pciback (or reiniting pciback using the rmmod and modprobe again) since there are new lines of "PCI remove device" and "PCI add device" . This is having the new IRQ values apparently.

I did monitor the values:

# xm dmesg | tail
(XEN) mtrr: type mismatch for d0000000,10000000 old: uncachable new: write-combining
(XEN) mtrr: type mismatch for d0000000,10000000 old: uncachable new: write-combining
... PCI add device (from modprobe pciback) ...
(XEN) PCI add device 03:00.0
# echo 0000:03:00.0 > /sys/bus/pci/drivers/e1000e/unbind
# xm dmesg | tail -n 2
(XEN) PCI add device 03:00.0
(XEN) PCI remove device 03:00.0
# echo 0000:03:00.0 > /sys/bus/pci/drivers/pciback/new_slot
# xm dmesg | tail -n 2
(XEN) PCI remove device 03:00.0
(XEN) PCI add device 03:00.0
# xm dmesg | tail -n 2
(XEN) PCI add device 03:00.0
(XEN) PCI add device 03:00.0

In standard system dmesg there were those lines:

ACPI: PCI interrupt for device 0000:03:00.0 disabled
pciback 0000:03:00.0: seizing device
PCI: Enabling device 0000:03:00.0 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI interrupt for device 0000:03:00.0 disabled

so it appears that basically removing the device and adding it back to the hypervisor (which basically calls pci_remove_device() and then pci_add_device() functions) make HV aware of new IRQ since it's telling HV a new interrupt vector and HV stores it so that's why he's aware of this - because the new IRQ from the device is being known and saved to the hypervisor.

Please note that when I rebind the device from pciback and bind to it again it always write following lines to system log (dmesg):

pciback 0000:03:00.0: seizing device
PCI: Enabling device 0000:03:00.0 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI interrupt for device 0000:03:00.0 disabled

In the pciback driver itself it's enabling the device using the pcistub_init_device() function. There's a device initialization code on the PCI bus including this code and comment:

        /* HACK: Force device (& ACPI) to determine what IRQ it's on - we
         * must do this here because pcibios_enable_device may specify
         * the pci device's true irq (and possibly its other resources)
         * if they differ from what's in the configuration space.
         * This makes the assumption that the device's resources won't
         * change after this point (otherwise this code may break!)
         */
        dev_dbg(&dev->dev, "enabling device\n");
        err = pci_enable_device(dev);
        if (err)
                goto config_release;

Basically the pci_enable_device() is what's printing this line to the dmesg:

PCI: Enabling device 0000:03:00.0 (0000 -> 0003)

So basically the interrupt vector is 17 so that's why it changes from the value of 255 to 17. Because the device directly is telling pciback that it's on IRQ 17.

To make a summary the rebinding to the pciback works because it's reinitializing the PCI device and therefore it has a valid IRQ value and then using it is disabled on the ACPI level (I don't know why it's disabled but it looks like this is done by pci_enable_device() directly) like stated on the lines:

ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 17
ACPI: PCI interrupt for device 0000:03:00.0 disabled

which is coming from acpi_pci_irq_disable() function in drivers/acpi/pci_irq.c of the kernel source codes.

I don't know what else to do with this one since I think this is pretty straight-forward if there are "PCI remove device" and "PCI add device" entries in the xm dmesg when doing the rebind and according to the seizing device messages coming from the pciback driver.

Michal

Comment 43 Andrew Jones 2010-08-03 13:41:30 UTC

Based just on what I've read here, it looks to me like the comment in the code you just quoted is the source of the problem

         * This makes the assumption that the device's resources won't
         * change after this point (otherwise this code may break!)

Does upstream still have this assumption in pciback? Or is there something more intelligent that you can try backporting and testing?

Comment 44 Michal Novotny 2010-08-03 13:43:00 UTC

Also, keep in mind that this issue was reproduced only on Dell 760 machine and the hardware of the test machine. No other system was affected so the dynamic IRQ change is what's causing the issue.

Provided the fact we have a workaround that's pretty great working - and doing those writes manually is not the option since the user is having a need to modprobe for pciback manually when adding the device.

I guess we should write a KBase article about using pciback since I wouldn't know if somebody didn't help with this pciback stuff.

I recommend some article like:

For using PCI passthru in the PV guest you have to modprobe pciback driver manually and unbind the PCI device from it's driver, create a new slot in the pciback driver containing this device and bind the device to the driver using:

# echo $BDF > /sys/bus/pci/drivers/$OLD_DRIVER/unbind
# echo $BDF > /sys/bus/pci/drivers/pciback/new_slot
# echo $BDF > /sys/bus/pci/drivers/pciback/bind

You should use BDF notation in $BDF argument. When you're running on hardware that changes the IRQ is may happen that on PV guest creation with PCI device assigned you can run into "Invalid argument" error message with appropriate line in the `xm dmesg` output telling you something like:

(XEN) physdev.c:67: dom{$DOMID}: map irq with no vector 0

This means that hypervisor is not having binding for this particular device because of device IRQ changed. To fix this you need to rebind the device to the pciback driver to get the correct IRQ value using:

# echo $BDF > /sys/bus/pci/drivers/pciback/unbind
# echo $BDF > /sys/bus/pci/drivers/pciback/bind

This will fix the IRQ value in sysfs to match the one that is being associated with the hypervisor and PV guest will be able to run again.

Michal

Comment 45 Michal Novotny 2010-08-03 13:44:54 UTC

(In reply to comment #43)
> Based just on what I've read here, it looks to me like the comment in the code
> you just quoted is the source of the problem
> 
>          * This makes the assumption that the device's resources won't
>          * change after this point (otherwise this code may break!)
> 
> Does upstream still have this assumption in pciback? Or is there something more
> intelligent that you can try backporting and testing?    

Nope, the same code is in upstream version of pciback as well.

Michal

Comment 46 Michal Novotny 2010-08-03 15:08:00 UTC

Well, I wrote article at about this issue and PV PCI passthru:

https://access.redhat.com/kb/docs/DOC-37723

Michal

Comment 47 Yufang Zhang 2010-08-04 03:43:02 UTC

(In reply to comment #44)
> Also, keep in mind that this issue was reproduced only on Dell 760 machine and
> the hardware of the test machine. No other system was affected so the dynamic
> IRQ change is what's causing the issue.
> 
> Provided the fact we have a workaround that's pretty great working - and doing
> those writes manually is not the option since the user is having a need to
> modprobe for pciback manually when adding the device.
> 
> I guess we should write a KBase article about using pciback since I wouldn't
> know if somebody didn't help with this pciback stuff.
> 
> I recommend some article like:
> 
> For using PCI passthru in the PV guest you have to modprobe pciback driver
> manually and unbind the PCI device from it's driver, create a new slot in the
> pciback driver containing this device and bind the device to the driver using:
> 
> # echo $BDF > /sys/bus/pci/drivers/$OLD_DRIVER/unbind
> # echo $BDF > /sys/bus/pci/drivers/pciback/new_slot
> # echo $BDF > /sys/bus/pci/drivers/pciback/bind
> 
> You should use BDF notation in $BDF argument. When you're running on hardware
> that changes the IRQ is may happen that on PV guest creation with PCI device
> assigned you can run into "Invalid argument" error message with appropriate
> line in the `xm dmesg` output telling you something like:
> 
> (XEN) physdev.c:67: dom{$DOMID}: map irq with no vector 0
> 
> This means that hypervisor is not having binding for this particular device
> because of device IRQ changed. To fix this you need to rebind the device to the
> pciback driver to get the correct IRQ value using:
> 
> # echo $BDF > /sys/bus/pci/drivers/pciback/unbind
> # echo $BDF > /sys/bus/pci/drivers/pciback/bind
> 
> This will fix the IRQ value in sysfs to match the one that is being associated
> with the hypervisor and PV guest will be able to run again.
> 
> Michal    

Hi Michal,
Could I ask why dynamic IRQ change happens at xen package upgrading/downgrading? And why the problem is only reproducible at Dell 760. Is it a hardware bug?

Yufang

Comment 48 Michal Novotny 2010-08-04 08:31:40 UTC

(In reply to comment #47)
> (In reply to comment #44)
> > Also, keep in mind that this issue was reproduced only on Dell 760 machine and
> > the hardware of the test machine. No other system was affected so the dynamic
> > IRQ change is what's causing the issue.
> > 
> > Provided the fact we have a workaround that's pretty great working - and doing
> > those writes manually is not the option since the user is having a need to
> > modprobe for pciback manually when adding the device.
> > 
> > I guess we should write a KBase article about using pciback since I wouldn't
> > know if somebody didn't help with this pciback stuff.
> > 
> > I recommend some article like:
> > 
> > For using PCI passthru in the PV guest you have to modprobe pciback driver
> > manually and unbind the PCI device from it's driver, create a new slot in the
> > pciback driver containing this device and bind the device to the driver using:
> > 
> > # echo $BDF > /sys/bus/pci/drivers/$OLD_DRIVER/unbind
> > # echo $BDF > /sys/bus/pci/drivers/pciback/new_slot
> > # echo $BDF > /sys/bus/pci/drivers/pciback/bind
> > 
> > You should use BDF notation in $BDF argument. When you're running on hardware
> > that changes the IRQ is may happen that on PV guest creation with PCI device
> > assigned you can run into "Invalid argument" error message with appropriate
> > line in the `xm dmesg` output telling you something like:
> > 
> > (XEN) physdev.c:67: dom{$DOMID}: map irq with no vector 0
> > 
> > This means that hypervisor is not having binding for this particular device
> > because of device IRQ changed. To fix this you need to rebind the device to the
> > pciback driver to get the correct IRQ value using:
> > 
> > # echo $BDF > /sys/bus/pci/drivers/pciback/unbind
> > # echo $BDF > /sys/bus/pci/drivers/pciback/bind
> > 
> > This will fix the IRQ value in sysfs to match the one that is being associated
> > with the hypervisor and PV guest will be able to run again.
> > 
> > Michal    
> 
> Hi Michal,
> Could I ask why dynamic IRQ change happens at xen package
> upgrading/downgrading? And why the problem is only reproducible at Dell 760. Is
> it a hardware bug?
> 
> Yufang    

Hi Yufang,
the dynamic IRQ is a hardware feature of this NIC. It happens when doing reinitializing the device and this is being done when in a loop. It's hardware related but not a bug. It's basically a feature Xen is not ready for.

Michal

Comment 49 Paolo Bonzini 2010-08-31 12:39:59 UTC

Xen does not support this hardware for PCI passthrough; the same is true also for the pvops version of pciback.  You need to use the workaround in comment #44.

Note You need to log in before you can comment on or make changes to this bug.