Bug 611379

Summary:

PCI-passthrough failed with 2.6.32-37 kernel (call track recorded)

Product:

Red Hat Enterprise Linux 6

Reporter:

Qianfeng Zhang <frzhang>

Component:

kernel

Assignee:

Alex Williamson <alex.williamson>

Status:

CLOSED DUPLICATE

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

high

Docs Contact:

Priority:

low

Version:

6.0

CC:

berrange, chrisw, clalance, ddutile, juzhang, tburke, yoyzhang

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2010-07-23 15:01:47 UTC

Attachments:

Description	Flags
dmesg information which includes the call trace	none
output of "lspci -v" after detaching the VF from the host	none
The guest configuration	none
log information when starting the guest	none
dmesg including the kernel call trace again	none
/proc/interrupts	none
lspci -vvv	none
lspci -vvv for pci_0000_02_10_0	none
lspci -vv output after the failure	none
lspci -s 000:02:10.0 INTERRUPT_PIN output	none
mce log	none
function level reset script	none
lspci -vvv	none
dmesg including the kernel call trace	none
lspci -vv after the failure	none
the guest's XML	none
Qemu log	none
flr.sh	none
#>flr_new.sh 02:10.0 output	none
dmesg including kernel call trace	none

Description Qianfeng Zhang 2010-07-05 03:33:27 UTC

Description of problem:

  This issue occurred in Huawei's lab. The machine is an Intel box OEMed by Huawei. When starting the guest with the Virtual Function attached (and detached from the host), the kernel reported a call trace, and when checking from the guest I could not see the attached PCI device.

Packages used:
    
   kernel-2.6.32-37.el6.x86_64
   libvirt-0.8.1-7.el6.x86_64
   qemu-kvm-0.12.1.2-2.68.2.el6.x86_64

How reproducible:
   It is easy to reproduce on the Huawei machine, whether we use RHEL6 or SLES10sp2 as the guest.

Steps to reproduce :
   #> lspci -v
   #> virsh nodedev-list 
   #> virsh nodedev-dettach  pci_0000_02_10_0
   #> lspci -v                // see the attachment for output
   #> virsh start <domid>     

  



Additional info:

Comment 1 Qianfeng Zhang 2010-07-05 03:35:30 UTC

Created attachment 429455 [details]
dmesg information which includes the call trace

This file includes the kernel  Call Trace

Comment 3 Qianfeng Zhang 2010-07-05 04:15:57 UTC

Created attachment 429457 [details]
output of "lspci -v" after detaching the VF from the host

Comment 4 Qianfeng Zhang 2010-07-05 04:18:22 UTC

The PCI device used for the PCI-passthrough test is an "Intel Corporation 82599EB 10-Gigabit Network Connection" network interface; the driver is "ixgbe".

Comment 7 Daniel Berrangé 2010-07-06 12:34:41 UTC

Please provide the output of 'virsh dumpxml $GUESTNAME' and the /var/log/libvirt/qemu/$GUEST.log file.

Comment 8 Qianfeng Zhang 2010-07-07 16:07:03 UTC

Created attachment 430103 [details]
The guest configuration

The guest configuration (the output of "#> virsh dumpxml rhel6-g1")

Comment 9 Qianfeng Zhang 2010-07-07 16:08:46 UTC

Created attachment 430104 [details]
log information when starting the guest

Please also look at dmesg.txt, where the kernel call trace is very clear.

Comment 10 Daniel Berrangé 2010-07-07 16:15:53 UTC

The logs contain this error message showing a device assignment failure in QEMU, which wasn't reported back to libvirt:

Failed to assign irq for "hostdev0": Input/output error
Perhaps you are assigning a device that shares an IRQ with another device?
Failed to assign irq for "hostdev0": Input/output error
Perhaps you are assigning a device that shares an IRQ with another device?


If you upgrade to the qemu-kvm RPM version from this BZ, you should get proper error reporting for this condition from libvirt:

https://bugzilla.redhat.com/show_bug.cgi?id=596279

Comment 11 Alex Williamson 2010-07-07 18:39:04 UTC

Yes, additionally bz585310 fixed an issue with failure to exit on irq setup which could be contributing:  https://bugzilla.redhat.com/show_bug.cgi?id=585310
This was fixed in qemu-kvm-0.12.1.2-2.71.el6.  Please retest with the latest bits.

Comment 12 Alex Williamson 2010-07-15 06:02:07 UTC

We really need some help from the submitter to debug this one.  Please update to latest bits and retest.  If it still fails, please provide the output of (from the host):

lspci -vvv -s 0000:02:10.0

and

setpci -s 0000:02:10.0 INTERRUPT_PIN

(replace the PCI device above with the virtual function used if different)

The only code path that seems like it could cause this error would be trying to set up a host INTx, but since this is a VF, by definition it shouldn't have an INTx, and the INTERRUPT_PIN should return 0.
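
The same check can be made against a saved copy of the device's config space. The sketch below assumes the standard PCI header layout, where the Interrupt Pin register is the single byte at offset 0x3d. The function name pci_interrupt_pin is made up for illustration and is not part of any tool mentioned in this bug.

```shell
#!/bin/sh
# Hypothetical helper: print the Interrupt Pin byte (offset 0x3d in the
# standard PCI config header) from a config-space file as a decimal number.
# 0 means the function does not use an INTx pin.
# Usage: pci_interrupt_pin /sys/bus/pci/devices/0000:02:10.0/config
pci_interrupt_pin() {
    # 0x3d = 61: skip 61 bytes, read 1 byte, print it as unsigned decimal,
    # then strip the padding spaces that od adds.
    od -An -tu1 -j61 -N1 "$1" | tr -d ' '
}
```

For a virtual function the expected result is 0; any other value would support the hardware or BIOS theory.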

Comment 13 Qianfeng Zhang 2010-07-15 13:16:04 UTC

Hi Alex
   According to my customer, after upgrading to qemu-kvm-0.12.1.2-2.71.el6, the qemu-kvm process exits very quickly, but the kernel failure and call trace are still there.

Comment 14 Alex Williamson 2010-07-15 16:37:05 UTC

I can force the same backtrace if I allow the code to try to register an INTx interrupt for a virtual function device.  At this point, it really looks like a hardware issue.  Please provide the data requested in comment 12.  I'd also like to see the output of:

sudo xxd /sys/bus/pci/devices/0000\:02\:10.0/config

(xxd is part of vim-common)

If we can see that the interrupt pin is not zero, there's some kind of hardware issue, and we need to understand whether it's a class problem with this device that needs a workaround, or some kind of point issue or BIOS defect.

Comment 15 Alex Williamson 2010-07-15 18:28:10 UTC

A colleague also just pointed out this in the provided dmesg:

pci-stub 0000:02:10.0: claimed by stub
Machine check events logged
^^^^^^^^^^^^^^^^^^^^^^^^^^^
pci-stub 0000:02:10.0: claimed by stub
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk>
device vnet0 entered promiscuous mode
br0: port 2(vnet0) entering forwarding state
assign device: host bdf = 2:10:0
IRQ handler type mismatch for IRQ 0
current handler: timer
Pid: 2014, comm: qemu-kvm Tainted: G   M       2.6.32-37.el6.x86_64 #1
Call Trace:
 [<ffffffff810d8e06>] __setup_irq+0x376/0x3b0

Please get an mcelog from the system so we can figure out if this is related.
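
To check other saved dmesg files for the same two symptoms, something like the following could be used. It is only a sketch: the function name scan_dmesg is made up, and the match strings are copied from the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a saved dmesg file contains the
# machine check notice, and list every IRQ that hit a handler type mismatch.
# Usage: scan_dmesg dmesg.txt
scan_dmesg() {
    if grep -q 'Machine check events logged' "$1"; then
        echo "mce: yes"
    else
        echo "mce: no"
    fi
    # Extract the IRQ number from each "IRQ handler type mismatch" line.
    sed -n 's/.*IRQ handler type mismatch for IRQ \([0-9][0-9]*\).*/irq-mismatch: \1/p' "$1"
}
```

Run against the excerpt above, this would report the machine check and a mismatch on IRQ 0.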

Comment 16 Alex Williamson 2010-07-15 18:55:36 UTC

Another thought is that the MCE listed in comment 15 occurs suspiciously between two bindings of the same device to the pci-stub driver.  This is caused by the below:

(In reply to comment #0)
> Steps to reproduce :
>    #> lspci -v
>    #> virsh nodedev-list 
>    #> virsh nodedev-dettach  pci_0000_02_10_0

This one unbinds the device from ixgbevf and binds it to pci-stub

>    #> lspci -v                // see the attachment for output
>    #> virsh start <domid>     

Because the domain XML contains a host device, the start will unbind the device from its current driver (pci-stub) and rebind it to the pci-stub driver.

This redundancy should not be a problem, but given the location of the MCE, let's try removing it.  After recording /sys/bus/pci/devices/0000:02:10.0/config and the mcelog, reboot the system to make sure the device is back to a working state, then simply try

#> virsh start <domid>

without first doing the nodedev-dettach.
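
To confirm which driver owns the VF at each step (ixgbevf after the reboot, pci-stub once the guest has started), the "driver" symlink in the device's sysfs directory can be read. A minimal sketch, where the function name pci_bound_driver is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the name of the driver a PCI device is bound
# to, or "none" if it is unbound.
# Usage: pci_bound_driver /sys/bus/pci/devices/0000:02:10.0
pci_bound_driver() {
    if [ -L "$1/driver" ]; then
        # The symlink points at the driver's directory; its last path
        # component is the driver name.
        basename "$(readlink "$1/driver")"
    else
        echo "none"
    fi
}
```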

Comment 17 Qianfeng Zhang 2010-07-16 12:50:04 UTC

Created attachment 432374 [details]
dmesg including the  kernel call trace again

This also includes the booting message in the kernel

Comment 18 Qianfeng Zhang 2010-07-16 12:50:48 UTC

Created attachment 432375 [details]
/proc/interrupts

Comment 19 Qianfeng Zhang 2010-07-16 12:51:36 UTC

Created attachment 432376 [details]
lspci -vvv

Comment 20 Qianfeng Zhang 2010-07-16 12:52:30 UTC

Created attachment 432377 [details]
lspci -vvv for pci_0000_02_10_0

Comment 21 Qianfeng Zhang 2010-07-16 12:54:11 UTC

Created attachment 432380 [details]
lspci -vv   output    after the failure

It looks like the PCI information of the VFs is lost after the failure.

Comment 22 Qianfeng Zhang 2010-07-16 12:55:17 UTC

Created attachment 432381 [details]
lspci -s 000:02:10.0 INTERRUPT_PIN      output

Comment 23 Qianfeng Zhang 2010-07-16 13:00:45 UTC

Created attachment 432382 [details]
mce log

Yes.  This MCE error can be obtained with "#>mcelog" each time just after rebooting the machine.  It occurs even when we don't test PCI passthrough.


I am not sure whether this MCE has something to do with the device assignment failure.  Can you point me to a testing environment that has the same "ixgbe" interface and can show PCI passthrough succeeding on RHEL6 Beta 2?

Comment 24 Qianfeng Zhang 2010-07-19 11:58:32 UTC

Alex
   Do you think the MCE log I attached is related to the failure and the kernel call trace?

Comment 25 Alex Williamson 2010-07-19 14:55:47 UTC

I don't have access to a system with an 82599EB and none are available in beaker.  The MCE seems to indicate a memory parity error on a DIMM; however, since it didn't occur in the latest dmesg, I can't correlate the MCE with the card failure.  The post-failure lspci output confirms that the device has gone into a bad state.  This explains why we take a very unexpected code path.  By the point that happens, we've done very little with the card; it hasn't even been handed over to the guest yet.  What happens if you 'modprobe -r ixgbevf' before assigning the device to a guest?  Also, with ixgbevf loaded, are you able to configure the VF devices in the host and do they work?  (Note the physical function ethX devices likely need to be up for the virtual functions to receive packets.)  I think we're either dealing with a hardware problem or possibly the ixgbevf driver isn't cleanly unbinding from the device.

Comment 26 Alex Williamson 2010-07-19 22:02:32 UTC

Created attachment 433021 [details]
function level reset script

I'm attaching a script that will do the same type of PCI function level reset that happens when a device is assigned to a guest.  With the system in a working state and lspci showing valid data for all devices, run this on the virtual function that you're attempting to assign to the guest, ex:

# flr.sh 02:10.0

You'll need to be root to run the script.  This should print out the lspci info for the device, perform an FLR, then print the resulting state of the device.  If this can generate the same type of error state with the device, then I think we can close this as a hardware defect.  If not, we probably need access to the system to debug further.
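
The attached script is not reproduced here. As a separate, read-only check, the sketch below reports whether a device advertises FLR at all, by looking for the FLReset+ flag in the DevCap section of saved "lspci -vvv" output. The function name flr_advertised is made up, and the parsing assumes the usual lspci layout in which the DevCap flags end where the DevCtl line begins.

```shell
#!/bin/sh
# Hypothetical helper: read "lspci -vvv" output for one device on stdin and
# print "yes" if the PCIe Device Capabilities (DevCap) flags include
# FLReset+, otherwise "no".
# Usage: lspci -vvv -s 0000:02:10.0 | flr_advertised
flr_advertised() {
    awk '
        /DevCap:/ { in_cap = 1 }   # start of the Device Capabilities flags
        /DevCtl:/ { in_cap = 0 }   # Device Control follows; stop looking
        in_cap && index($0, "FLReset+") { found = 1 }
        END { print (found ? "yes" : "no") }
    '
}
```

A "no" here does not mean the device cannot be reset, only that it does not advertise the standard function level reset.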

Comment 27 Qianfeng Zhang 2010-07-20 13:32:43 UTC

Hi Alex

    I just collected some information using your script.

[root@kvm22 test]# ./flr.sh 02:10.0
Before...
02:10.0 Ethernet controller: Intel Corporation 82559 Ethernet Controller Virtual Function (rev 01)
        Subsystem: Intel Corporation Device 000c
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at c0000000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at c0100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: ixgbevf
        Kernel modules: ixgbevf

Device does not support FLR

Comment 28 Qianfeng Zhang 2010-07-20 13:37:55 UTC

Created attachment 433167 [details]
lspci -vvv

"#> modprobe -r ixgbevf" was executed first.  It looks like detaching 02:10.0 causes all the VFs to be claimed by pci-stub.

Comment 29 Qianfeng Zhang 2010-07-20 13:41:17 UTC

Created attachment 433168 [details]
dmesg including the kernel call trace

dmesg information is almost the same as the previous one

Comment 30 Qianfeng Zhang 2010-07-20 13:47:08 UTC

Created attachment 433169 [details]
lspci -vv   after the  failure

almost the same as previous one

Comment 31 Qianfeng Zhang 2010-07-20 13:49:17 UTC

Created attachment 433170 [details]
the guest's  XML

As it is

Comment 32 Qianfeng Zhang 2010-07-20 13:50:03 UTC

Created attachment 433171 [details]
Qemu log

As it is

Comment 33 Qianfeng Zhang 2010-07-20 13:50:50 UTC

Alex
   If you need more information, let me know

Comment 34 Alex Williamson 2010-07-20 18:57:11 UTC

Created attachment 433241 [details]
flr.sh

New flr.sh that detects the 82599 and uses the same reset method as the Linux kernel

Comment 35 Alex Williamson 2010-07-20 19:00:14 UTC

(In reply to comment #33)
> Alex
>    If you need more information, let me know

Sorry, the original flr.sh didn't do what we wanted because the 82599 makes use of a device specific reset.  The new version detects this card and emulates the same thing the kernel does on reset.  You should see both:

Before...
<lspci output>

and

After...
<lspci output>

Please retest with this new version.  Thanks.

Comment 36 Alex Williamson 2010-07-20 22:46:32 UTC

Well, it looks like the rhel6 kernel doesn't include the 82599 device specific reset that was added to upstream.  That could be causing us to do much nastier resets, which could be causing this problem.  I've added the necessary patches to a test build, please try it here:

https://brewweb.devel.redhat.com/taskinfo?taskID=2612906

Follow the x86_64 and noarch links to get the rpms you need, install, reboot, and let us know if it resolves the problem.  Thanks.

Comment 37 Qianfeng Zhang 2010-07-22 13:41:42 UTC

Created attachment 433701 [details]
#>flr_new.sh  02:10.0   output

It looks like the MSI-X capability of the device changed from "Enable+" to "Enable-".
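
The MSI-X Enable bit defaults to 0 after a reset, so seeing "Enable-" after the script runs is consistent with the reset having taken effect. To compare the flag between saved "Before" and "After" lspci dumps, a small filter like the one below could be used. It is only a sketch, and the function name msix_enable_flag is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read lspci output for one device on stdin and print
# the MSI-X enable flag ("Enable+" or "Enable-"). Prints nothing if the
# output has no MSI-X capability line.
# Usage: lspci -v -s 0000:02:10.0 | msix_enable_flag
msix_enable_flag() {
    sed -n 's/.*MSI-X: \(Enable[+-]\).*/\1/p' | head -n 1
}
```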

Comment 38 Qianfeng Zhang 2010-07-22 13:43:18 UTC

With the kernel you provided, the failure is still there; the kernel call trace is the same as the old one.

Comment 39 Qianfeng Zhang 2010-07-22 13:44:55 UTC

Created attachment 433702 [details]
dmesg  including  kernel  call trace

Collected on 2.6.32-51test.

Comment 40 Alex Williamson 2010-07-22 13:53:47 UTC

Can we get access to this system?  We're not making any progress on debugging this.

Comment 41 Alex Williamson 2010-07-23 03:11:54 UTC

We've found a system in beaker that can reproduce, no need for access at this point.

Comment 42 Alex Williamson 2010-07-23 15:01:47 UTC


*** This bug has been marked as a duplicate of bug 617116 ***