Bug 688646 - intel_iommu domain id exhaustion
Summary: intel_iommu domain id exhaustion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.7
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: Rhel5KvmTier2
 
Reported: 2011-03-17 16:02 UTC by Alex Williamson
Modified: 2013-01-09 23:40 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 619455
Environment:
Last Closed: 2011-07-21 10:26:22 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Comment 1 RHEL Program Management 2011-03-19 22:49:07 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Chris Ward 2011-03-21 09:18:23 UTC
@Intel,

Please confirm your intentions to validate this fix if included in 5.7.0.

Thanks.

Comment 6 Jarod Wilson 2011-03-23 21:45:17 UTC
Patch(es) available in kernel-2.6.18-250.el5
Detailed testing feedback is always welcomed.
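
For anyone checking whether a host already runs a build at or after the one named above, the following is a minimal sketch rather than an official check. The function names kernel_build and has_domain_id_fix are hypothetical. The sketch assumes a plain RHEL 5 release string of the form 2.6.18-<build>.el5 and compares only the build number against 250, so it cannot tell whether a z-stream kernel with a lower build number carries a backport of the fix.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original report.

# Print the build number of a RHEL 5 kernel release string,
# e.g. "2.6.18-238.el5" -> "238". Returns 2 for other kernels.
kernel_build() {
    case "$1" in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    local rel=${1#2.6.18-}    # strip the upstream version prefix
    echo "${rel%%.*}"         # keep everything before the first dot
}

# Succeed if the build number is at least 250, the build this
# report names as containing the patches.
has_domain_id_fix() {
    local build
    build=$(kernel_build "$1") || return 2
    [ "$build" -ge 250 ]
}
```

A typical call would be: has_domain_id_fix "$(uname -r)" && echo "build is 250 or later".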

Comment 8 Chao Yang 2011-05-31 10:26:04 UTC
Reproduced on kernel 2.6.18-238.el5. Repeatedly detaching and attaching a NIC that uses the tg3 kernel driver, more than 244 times via a script, results in a host kernel panic:
IOMMU: no free domain ids
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: 
 [<ffffffff80157b56>] list_del+0x1/0x71
PGD 3183d7067 PUD 31869c067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /bus/pci/drivers/tg3/bind
CPU 0 
Modules linked in: tun autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge ipt_REJECT xt_tcpudp ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy ksm(U) kvm_intel(U) kvm(U) joydev snd_hda_intel snd_seq_dummy sr_mod cdrom snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep sg igb snd serio_raw pcspkr shpchp 8021q soundcore i7core_edac edac_mc dca tg3 tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih scsi_transport_sas mptbase ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 5814, comm: repeated-unbind Tainted: G      2.6.18-238.el5 #1
RIP: 0010:[<ffffffff80157b56>]  [<ffffffff80157b56>] list_del+0x1/0x71
RSP: 0018:ffff8102974e3c18  EFLAGS: 00010007
RAX: ffff8102977996d0 RBX: 0000000000000000 RCX: ffffffff80319f28
RDX: ffffffff80319f28 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000286 R08: ffffffff80319f28 R09: 000000000000003d
R10: ffff8102974e38e8 R11: 0000000000000080 R12: ffff8102977996c0
R13: 0000000000002000 R14: ffff81032f0a2800 R15: 0000000000000000
FS:  00002ae33b24cf50(0000) GS:ffffffff80425000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000318361000 CR4: 00000000000026e0
Process repeated-unbind (pid: 5814, threadinfo ffff8102974e2000, task ffff810326aec040)
Stack:  0000000000000000 ffffffff80169d6c ffff81032f0a2800 ffff8102977996c0
 0000000000000030 ffffffff8016a20b ffff81032f0a2800 ffff8102977996c0
 0000000000000030 0000000000002000 ffff81032f0a2800 ffffffff8016b089
Call Trace:
 [<ffffffff80169d6c>] domain_remove_dev_info+0x16/0xab
 [<ffffffff8016a20b>] domain_exit+0x19/0x14b
 [<ffffffff8016b089>] get_domain_for_dev+0x30d/0x536
 [<ffffffff8016b2c5>] __get_valid_domain_for_dev+0x13/0x6d
 [<ffffffff8016b42a>] __intel_map_single+0x5d/0x172
 [<ffffffff8016b9e1>] intel_alloc_coherent+0xb3/0xd8
 [<ffffffff88215b52>] :tg3:tg3_init_one+0xa21/0x14a4
 [<ffffffff8016168a>] pci_device_probe+0x104/0x184
 [<ffffffff801cad74>] driver_helper+0x0/0x1b
 [<ffffffff80287e04>] klist_del+0x1d/0x2a
 [<ffffffff801cbab4>] driver_probe_device+0x52/0xaa
 [<ffffffff801cb84e>] driver_bind+0x9f/0x11b
 [<ffffffff8010fee2>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b 47 08 48 89 fb 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 
RIP  [<ffffffff80157b56>] list_del+0x1/0x71
 RSP <ffff8102974e3c18>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Fatal exception
 

-------Verified on kernel 2.6.18-264.el5 with the same NIC and the same script: after more than one thousand detach/attach cycles, the host still works fine.

-------nic card info:
lspci -vvv -s 02:00.0
	Kernel driver in use: tg3
	Kernel modules: tg3

-------script used to detach/attach nic card:
#!/bin/bash
# Unbind and rebind the NIC from tg3 in a loop, printing the iteration count.
i=1
while echo 0000:01:00.0 > /sys/bus/pci/drivers/tg3/unbind; do
    echo $i; i=$((i+1)); sleep 0.5; echo 0000:01:00.0 > /sys/bus/pci/drivers/tg3/bind; sleep 0.5
done
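
The loop above runs until an unbind fails. For verification runs with a fixed cycle count, such as the thousand-cycle run mentioned earlier, a bounded variant may be easier to use. This is only a sketch: the function name rebind_cycles, its optional driver-directory argument, and the CYCLE_DELAY variable are assumptions introduced for illustration and do not come from the original script. On an affected kernel the host may panic before the function returns, as in the trace above.

```shell
#!/bin/bash
# Hypothetical bounded variant of the reproduction loop.
# Usage: rebind_cycles <pci-address> <cycles> [driver-dir]
rebind_cycles() {
    local dev=$1 max=$2
    local driver_dir=${3:-/sys/bus/pci/drivers/tg3}
    local delay=${CYCLE_DELAY:-0.5}   # seconds to wait after each step
    local i=1

    while [ "$i" -le "$max" ]; do
        # Detach the device from the driver; stop at the first failure.
        echo "$dev" > "$driver_dir/unbind" || {
            echo "unbind failed at iteration $i" >&2
            return 1
        }
        sleep "$delay"
        # Re-attach the device to the driver.
        echo "$dev" > "$driver_dir/bind" || {
            echo "bind failed at iteration $i" >&2
            return 1
        }
        sleep "$delay"
        i=$((i + 1))
    done
    echo "completed $max cycles"
}
```

For example, rebind_cycles 0000:01:00.0 1000 would run one thousand cycles against the tg3 driver directory and report completion, or stop with an error message naming the failing iteration.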
 

-------conclusion:
Based on the above, I think this issue has been fixed.

Comment 9 Chao Yang 2011-05-31 10:30:16 UTC
Additional info:
# lspci|grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)

Comment 10 juzhang 2011-06-01 04:18:41 UTC
According to comment 9, setting this issue as verified.

Comment 11 errata-xmlrpc 2011-07-21 10:26:22 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

