Bug 619455

Summary: Host kernel oops after a series of virsh {attach,detach}-device
Product: Red Hat Enterprise Linux 6 Reporter: Jiri Denemark <jdenemar>
Component: kernel Assignee: Alex Williamson <alex.williamson>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: alex.williamson, chayang, ddugger, ddutile, dwmw2, gcosta, jpirko, jyang, kzhang, michen, tburke, yang.z.zhang
Target Milestone: rc Keywords: Triaged
Target Release: 6.1   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-128.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 688646 Environment:
Last Closed: 2011-05-23 20:43:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580566, 580951, 580954, 635500    
Attachments:
Description Flags
lspci -vvvs 1:0.0 none
rhel6.xml none

Description Jiri Denemark 2010-07-29 15:06:40 UTC
Description of problem:

Running 

i=1; while virsh attach-device rhel6 pci.xml; do echo $i; i=$[i+1]; sleep 1; virsh detach-device rhel6 pci.xml; sleep 1; done

results in host kernel oops after something like 250 iterations.

The operations done by the loop are roughly:
- unbind the pci device from current driver
- bind it to pci-stub
- reset it
- attach it to a guest
- detach it from the guest
- reset
- unbind from pci-stub
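
The host-side steps in this list can be sketched with the standard PCI sysfs attributes (`unbind`, `new_id`, `bind`, `reset`). This is only an approximation of what libvirt does for a managed hostdev, not its actual implementation; the function names and the `SYSFS_ROOT` override are hypothetical, and the guest attach/detach steps (the `virsh` calls) are left out.

```shell
#!/bin/sh
# Sketch of the host-side driver handoff around one attach/detach cycle.
# SYSFS_ROOT is a hypothetical override so the steps can be dry-run
# against a scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# sysfs_write RELATIVE_PATH VALUE: write VALUE into an attribute file.
sysfs_write() {
    printf '%s\n' "$2" > "$SYSFS_ROOT/$1"
}

# Unbind the device from whatever driver currently owns it, if any.
unbind_current_driver() {
    if [ -e "$SYSFS_ROOT/bus/pci/devices/$1/driver" ]; then
        sysfs_write "bus/pci/devices/$1/driver/unbind" "$1"
    fi
}

# bind_to_stub BDF "VENDOR DEVICE": hand the device to pci-stub.
bind_to_stub() {
    sysfs_write "bus/pci/drivers/pci-stub/new_id" "$2"
    # Registering the ID may already make pci-stub claim the device,
    # so only request an explicit bind when that has not happened.
    if [ ! -e "$SYSFS_ROOT/bus/pci/drivers/pci-stub/$1" ]; then
        sysfs_write "bus/pci/drivers/pci-stub/bind" "$1"
    fi
}

# Ask the kernel to reset the function (stand-in for libvirt's reset).
reset_device() {
    sysfs_write "bus/pci/devices/$1/reset" 1
}

# Release the device from pci-stub after the guest has let go of it.
unbind_from_stub() {
    sysfs_write "bus/pci/drivers/pci-stub/unbind" "$1"
}
```

On a real host these would be run as root with the full device address (for the device in this report, presumably `0000:01:00.0`) and the device's own vendor/device ID pair.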

Version-Release number of selected component (if applicable):

kernel-2.6.32-37.el6.x86_64

How reproducible:

100% so far

Additional info:

Additional info from the oops will be provided by Paolo.

Comment 2 Paolo Bonzini 2010-07-29 15:23:55 UTC
oops info coming...

RBX = RDI = R9 = R11 = R15 = 0x0000000000000000

didn't take note of other registers

RIP 0010:[<ffffffff812640ec>]   list_del+0xc/0xa0

stack frame:

         [<ffffffff81288110>]   domain_remove_dev_info+0x40/0xe0
         [<ffffffff81289167>]   domain_exit+0x27/0x190
         [<ffffffff81288659>] ? iommu_attach_domain+0xb9/0xc0
         [<ffffffff8128bada>]   get_domain_for_dev.clone.3+0x31e/0x5d0
         [<ffffffffa01e64d2>] ? tg3_nvram_read+0xc2/0x170 [tg3]
         [<ffffffff8128c38c>]   __intel_map_single+0x19c/0x210
         [<ffffffff8114e937>] ? alloc_pages_current+0x87/0xd0
         [<ffffffff8128c4fe>]   intel_alloc_coherent+0xae/0x120
         [<ffffffffa01ea371>] ? tg3_read_mem+0xa1/0x120 [tg3]

Code: 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> 8b 47 08

<ffffffff812640e0>  55                   pushq     %rbp
<ffffffff812640e1>  48 89 e5             movq      %rsp, %rbp
<ffffffff812640e4>  53                   pushq     %rbx
<ffffffff812640e5>  48 89 fb             movq      %rdi, %rbx
<ffffffff812640e8>  48 83 ec 08          subq      $8, %rsp
<ffffffff812640ec>  48 8b 47 08          movq      8(%rdi), %rax
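
The faulting instruction at list_del+0xc loads from RDI plus a displacement of 8. With RDI reported as zero, that is a read from address 0x8, which matches the "NULL pointer dereference at 0000000000000008" in the full oops in comment 17. The arithmetic, as a small sketch (the helper name is hypothetical):

```shell
# effective_address BASE DISPLACEMENT
# Prints BASE + DISPLACEMENT in hex, i.e. the address that an operand
# of the form DISPLACEMENT(%reg) refers to when %reg holds BASE.
effective_address() {
    printf '0x%x\n' "$(( $1 + $2 ))"
}

effective_address 0x0 8   # prints 0x8
```

On x86_64, offset 8 in a `struct list_head` is the `prev` pointer, so this looks like list_del() being handed a NULL list entry.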

Comment 3 RHEL Program Management 2010-07-29 15:27:42 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 4 Jiri Denemark 2010-07-30 13:58:15 UTC
This time 82 iterations were enough to trigger the bug.

Comment 5 RHEL Program Management 2010-08-18 21:25:59 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 7 Alex Williamson 2011-01-03 17:10:56 UTC
Is this still reproducible?  I'm using:

kernel-2.6.32-71.4.1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
libvirt-0.8.1-27.el6.x86_64

I've done well over 600 attach/detach cycles and haven't seen any issues.

Comment 8 Alex Williamson 2011-01-03 20:59:14 UTC
Does the device you're testing with perhaps have a PCI option ROM?  (Please provide the lspci -vvv output for the host device being assigned.)  It looks like there may be a memory leak when dealing with option ROMs.  If you can still reproduce this, please include the guest xml, the xml for the added device, and the actual host oops message.

Comment 9 Jiri Denemark 2011-01-06 14:15:58 UTC
I reproduced it after something like 150 iterations with
kernel-2.6.32-72.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6.x86_64
libvirt-0.8.6-1.el6.x86_64

and after 255 iterations with
kernel-2.6.32-94.el6.x86_64.rpm
qemu-kvm-0.12.1.2-2.128.el6.x86_64.rpm
libvirt-0.8.6-1.el6.x86_64

pci.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='1' slot='0' function='0'/>
  </source>
</hostdev>

Domain XML and output of lspci -vvv will come as attachments.

Unfortunately, I can't provide the actual oops message since it scrolls off my screen and I don't have a serial cable to redirect the output elsewhere.

Comment 10 Jiri Denemark 2011-01-06 14:17:25 UTC
Created attachment 472062
lspci -vvvs 1:0.0

Comment 11 Jiri Denemark 2011-01-06 14:18:11 UTC
Created attachment 472063
rhel6.xml

Comment 12 Alex Williamson 2011-01-06 14:48:52 UTC
Thanks Jiri.  One difference I see between our testing is that I specify the guest PCI slot, which I think is what libvirt would do too.  My xml file looks like this:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>

Do you notice in your test whether the assigned device in the guest changes its slot on each iteration?  Could you do something similar to the above and see if you still get an oops?  I also note that the tested device is a tg3, which really doesn't even work with device assignment until qemu-kvm-0.12.1.2-2.127.el6.  This could have something to do with older versions failing with fewer iterations.  It also has an option ROM, albeit a small one, so the guest process size will grow due to bz667188.  Thanks.

Comment 13 RHEL Program Management 2011-01-07 03:58:12 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 14 Suzanne Logcher 2011-01-07 16:03:41 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 15 Jiri Denemark 2011-01-10 11:21:38 UTC
I modified the pci.xml to explicitly specify the guest PCI slot the device should be hotplugged into. The oops is still reproducible after 255 iterations. To be specific, the 255th detach causes the host to crash. I'm starting to get suspicious about this magic number 255. Previously, the number of iterations needed to reproduce the issue varied quite a lot, but now it seems to be consistently 255.

Comment 16 Alex Williamson 2011-01-10 19:25:59 UTC
Jiri, can you confirm what your host system is?  I see intel_alloc_coherent in the backtrace Paolo provided, which implies an Intel VT-d system.  I just want to make sure that I shouldn't be looking for AMD IOMMU specific issues.

Comment 17 Alex Williamson 2011-01-10 21:47:43 UTC
Aha, this is a tg3 bug. I can reproduce it using the following modified script (tg3 is 0000:05:00.0 on my system):

i=1; while echo 0000:05:00.0 > /sys/bus/pci/drivers/tg3/unbind; do echo $i; i=$[i+1]; sleep 0.5; echo 0000:05:00.0 > /sys/bus/pci/drivers/tg3/bind; sleep 0.5; done

Panic:

tg3 0000:05:00.0: PME# enabled
tg3 0000:05:00.0: PCI INT A disabled
tg3 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
tg3 0000:05:00.0: PME# disabled
IOMMU: no free domain ids
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81263bac>] list_del+0xc/0xa0
PGD 36ec79067 PUD 36ec6c067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/bus/pci/drivers/tg3/bind
CPU 3 
Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nfs lockd fscache nfs_acl auth_rpcgss xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_filter ip_tables bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log kvm_intel kvm uinput sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support tg3 i7core_edac edac_core ioatdma igb dca ext4 mbcache jbd2 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sr_mod cdrom sd_mod crc_t10dif ahci dm_mod [last unloaded: microcode]

Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nfs lockd fscache nfs_acl auth_rpcgss xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_filter ip_tables bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log kvm_intel kvm uinput sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support tg3 i7core_edac edac_core ioatdma igb dca ext4 mbcache jbd2 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sr_mod cdrom sd_mod crc_t10dif ahci dm_mod [last unloaded: microcode]
Pid: 3208, comm: bash Not tainted 2.6.32-94.el6.x86_64 #1 4157CTO
RIP: 0010:[<ffffffff81263bac>]  [<ffffffff81263bac>] list_del+0xc/0xa0
RSP: 0018:ffff88036a0e9b08  EFLAGS: 00010092
RAX: 0000000000000282 RBX: 0000000000000000 RCX: 0000000000003385
RDX: 0000000000000282 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff88036a0e9b18 R08: ffffffff81b9f920 R09: 0000000000000000
R10: 0000000000000038 R11: 0000000000000000 R12: ffff8803587e9f40
R13: ffff8803587e9f50 R14: 0000000000000282 R15: 0000000000000000
FS:  00007f2c2718c700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003a86c744b0 CR3: 000000036b5ec000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 3208, threadinfo ffff88036a0e8000, task ffff88036ba034e0)
Stack:
 ffff88036a0e9b68 0000000000000000 ffff88036a0e9b48 ffffffff81287f30
<0> ffff8803587e9f40 ffff880371a5e000 ffff8803587e9f40 0000000000000000
<0> ffff88036a0e9ba8 ffffffff81288f87 ffffffffffffffff 0000000000000000
Call Trace:
 [<ffffffff81287f30>] domain_remove_dev_info+0x40/0xe0
 [<ffffffff81288f87>] domain_exit+0x27/0x190
 [<ffffffff81288479>] ? iommu_attach_domain+0xb9/0xc0
 [<ffffffff8128b8fa>] get_domain_for_dev.clone.3+0x31a/0x5d0
 [<ffffffffa01754d2>] ? tg3_nvram_read+0xc2/0x170 [tg3]
 [<ffffffff8128c1ac>] __intel_map_single+0x19c/0x210
 [<ffffffff8114975a>] ? alloc_pages_current+0x9a/0x100
 [<ffffffff8128c31e>] intel_alloc_coherent+0xae/0x120
 [<ffffffffa0179371>] ? tg3_read_mem+0xa1/0x120 [tg3]
 [<ffffffffa018c52c>] tg3_init_one+0xa9a/0x1564 [tg3]
 [<ffffffff811da04e>] ? sysfs_addrm_finish+0x4e/0x290
 [<ffffffff81271817>] local_pci_probe+0x17/0x20
 [<ffffffff81272a01>] pci_device_probe+0x101/0x120
 [<ffffffff8132a0a2>] ? driver_sysfs_add+0x62/0x90
 [<ffffffff8132a240>] driver_probe_device+0xa0/0x2a0
 [<ffffffff813293fa>] driver_bind+0xca/0x110
 [<ffffffff8132877c>] drv_attr_store+0x2c/0x30
 [<ffffffff811d84a5>] sysfs_write_file+0xe5/0x170
 [<ffffffff81165e48>] vfs_write+0xb8/0x1a0
 [<ffffffff810cca12>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81166881>] sys_write+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Code: 00 ff ff ff 89 95 fc fe ff ff e9 ab fd ff ff 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 
RIP  [<ffffffff81263bac>] list_del+0xc/0xa0
 RSP <ffff88036a0e9b08>
CR2: 0000000000000008
---[ end trace 0a9a95e2e4fa5fbc ]---
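
The line to look for in the log above is "IOMMU: no free domain ids", printed just before the NULL pointer dereference. A minimal sketch for checking a saved kernel log for it (the function name is hypothetical):

```shell
# domain_ids_exhausted LOGFILE
# Exit status 0 when LOGFILE contains the VT-d "no free domain ids"
# message that precedes the oops in this report, non-zero otherwise.
domain_ids_exhausted() {
    grep -q 'IOMMU: no free domain ids' "$1"
}
```

For example, after `dmesg > /tmp/dmesg.txt`, running `domain_ids_exhausted /tmp/dmesg.txt && echo "domain IDs exhausted"` would flag the condition, assuming the host stays up long enough to save the log.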

Comment 18 Don Dutile (Red Hat) 2011-01-13 19:19:33 UTC
*** Bug 635682 has been marked as a duplicate of this bug. ***

Comment 19 Alex Williamson 2011-01-26 15:03:15 UTC
Adding David Woodhouse.  It's become clear that this is an intel-iommu bug.  Any device that does DMA will allocate a domain ID from the iommu.  When the device is unbound from the driver, the domain ID is never freed and we eventually hit the limit of supported domain IDs.
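
The leak described here can be illustrated with a toy model: a fixed pool of domain IDs, one taken per bind, and none returned on unbind. This is a simulation only, not the intel-iommu code; the pool size is a parameter because the real limit depends on the hardware, and all names are hypothetical.

```shell
# simulate_cycles MAX_IDS CYCLES LEAK
# Runs CYCLES bind/unbind rounds against a pool of MAX_IDS domain IDs.
# With LEAK=yes, unbind never returns the ID (the behaviour described
# in this comment); with any other value it is released again.
simulate_cycles() {
    max_ids=$1 cycles=$2 leak=$3
    used=0
    i=1
    while [ "$i" -le "$cycles" ]; do
        if [ "$used" -ge "$max_ids" ]; then
            echo "bind $i failed: no free domain ids"
            return 1
        fi
        used=$((used + 1))                        # bind: allocate an ID
        [ "$leak" = yes ] || used=$((used - 1))   # unbind: free unless leaking
        i=$((i + 1))
    done
    echo "completed $cycles cycles"
}
```

With a pool of N IDs and the leak enabled, the model fails on bind N+1 no matter how many cycles are requested, which is the kind of fixed threshold that would explain a constant iteration count such as the one reported in comment 15.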

Comment 20 RHEL Program Management 2011-02-01 05:33:58 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 21 RHEL Program Management 2011-02-01 18:31:24 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 23 RHEL Program Management 2011-03-17 15:29:30 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 25 Chao Yang 2011-03-21 11:57:21 UTC
Hit the same issue with a BCM5764M (whose driver is tg3) when using the script in comment #17.

Comment 26 Aristeu Rozanski 2011-03-30 14:33:24 UTC
Patch(es) available on kernel-2.6.32-128.el6

Comment 29 Chao Yang 2011-04-12 07:25:34 UTC
Verified on kernel 2.6.32-130.el6.x86_64 with BCM5764M using script in comment
#17, unbind/bind over 1000 times, did not hit kernel panic. 
This bug has been fixed according to comment #25 and comment #29.
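
A bounded variant of the comment #17 loop, as a sketch of how a fixed-count verification run like this one could be scripted. The function name, the `SYSFS_ROOT` override and the default delay are assumptions; the fractional `sleep` matches the original script but is not guaranteed by POSIX.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for dry runs outside /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# rebind_cycles BDF DRIVER COUNT [DELAY]
# Unbinds and rebinds BDF from DRIVER COUNT times, pausing DELAY
# seconds after each step. Stops with status 1 on the first write
# that fails.
rebind_cycles() {
    bdf=$1 driver=$2 count=$3 delay=${4:-0.5}
    drvdir="$SYSFS_ROOT/bus/pci/drivers/$driver"
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s\n' "$bdf" > "$drvdir/unbind" || return 1
        sleep "$delay"
        printf '%s\n' "$bdf" > "$drvdir/bind" || return 1
        sleep "$delay"
        i=$((i + 1))
    done
    echo "completed $count cycles"
}
```

Run as root, something like `rebind_cycles 0000:05:00.0 tg3 1000` would correspond to the test described here, with the device address taken from comment #17 and adjusted for the host under test.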

Comment 30 errata-xmlrpc 2011-05-23 20:43:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Comment 31 Jiri Pirko 2011-06-03 14:16:27 UTC
This introduced a regression. See bug 710382 for details.