Bug 707899
Summary: | The pci resource for vf is not released after hot-removing Intel 82576 NIC [rhel-5.6.z] | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Ken Reilly <kreilly> |
Component: | kernel | Assignee: | Phillip Lougher <plougher> |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.6 | CC: | chayang, chrisw, ddutile, dhoward, dwu, jarod, juzhang, moshiro, mstowe, pm-eus, prarit, qcai, tburke |
Target Milestone: | rc | Keywords: | Regression, ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-2.6.18-238.15.1.el5 | Doc Type: | Bug Fix |
Doc Text: |
Hot removing a PCIe device and subsequently hot plugging it again caused a kernel panic. This was due to a PCI resource for the SR-IOV Virtual Function (vf) not being released after the hot removal, causing the memory area in the pci_dev struct to be used by another process. With this update, when a PCIe device is removed from a system, all resources are properly released; the kernel panic no longer occurs.
|
Last Closed: | 2011-07-15 06:11:07 UTC | Type: | --- |
Bug Depends On: | 698879 | ||
Bug Blocks: |
Description
Ken Reilly
2011-05-26 09:54:43 UTC
in kernel-2.6.18-238.15.1.el5
linux-2.6-pci-sriov-release-vf-bar-resources-when-device-is-hot-unplug.patch

Reproduce and verify this issue with the following steps:

1. Load the fakephp module to fake PCI hot plug/unplug:
   # modprobe fakephp
2. Hot unplug the 82576 NIC:
   # echo -n 0 > /sys/bus/pci/slots/0000\:03\:00.0/power
3. Hot plug the 82576 NIC:
   # echo -n 1 > /sys/bus/pci/slots/0000\:02\:00.0/power

Host PCI info:
# lspci | grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

---------- Reproduced on kernel version: 2.6.18-238.el5 x86_64

After step 2:
*** actual result: host kernel panic ***

Unable to handle kernel paging request at 0000000500000001 RIP:
 [<ffffffff80096ea2>] __release_region+0x24/0x8f
PGD 30ab39067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /bus/pci/slots/0000:03:00.0/power
CPU 2
Modules linked in: fakephp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc tpm_tis snd_hwdep igb tpm sr_mod cdrom snd 8021q tg3 tpm_bios sg shpchp i7core_edac pcspkr soundcore serio_raw edac_mc dca dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih scsi_transport_sas mptbase ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 15946, comm: bash Not tainted 2.6.18-238.el5 #1
RIP: 0010:[<ffffffff80096ea2>] [<ffffffff80096ea2>] __release_region+0x24/0x8f
RSP: 0018:ffff81032f9dfcf8 EFLAGS: 00010202
RAX: 0000000000000200 RBX: ffff81032f0d9470 RCX: 00000000e37fffff
RDX: 0000000000000200 RSI: 00000000e3400000 RDI: 0000000500000001
RBP: 00000000e3400000 R08: 0000000000000004 R09: ffffc200100c0000
R10: ffff81032f9dfbb8 R11: ffffffff8022a0c5 R12: 00000000e37fffff
R13: ffff81032c0c6780 R14: ffff81032fad4800 R15: ffff81032c0c6000
FS: 00002b0fbd223f50(0000) GS:ffff81010b3e3e40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000500000001 CR3: 0000000313b63000 CR4: 00000000000006e0
Process bash (pid: 15946, threadinfo ffff81032f9de000, task ffff810329f307e0)
Stack: ffff81032fad4800 0000000000000001 ffff81032fad4800 ffffffff8015ed79
 ffffc200100c0000 0000000000000001 000000000000004b ffffffff8015ee2d
 ffff81032fad4870 ffff81032c0c6500 ffffffff882f56d8 ffffffff882e65b3
Call Trace:
 [<ffffffff8015ed79>] pci_release_region+0x84/0xa6
 [<ffffffff8015ee2d>] pci_release_selected_regions+0x1f/0x2b
 [<ffffffff882e65b3>] :igb:igb_remove+0x170/0x195
 [<ffffffff8016146b>] pci_device_remove+0x24/0x3a
 [<ffffffff801cb969>] __device_release_driver+0x9f/0xe9
 [<ffffffff801cbc60>] device_release_driver+0x2c/0x4e
 [<ffffffff801cb12e>] bus_remove_device+0x9d/0xb2
 [<ffffffff801c9e2a>] device_del+0x129/0x1a9
 [<ffffffff801c9ec7>] device_unregister+0x9/0x12
 [<ffffffff8015dc6e>] pci_stop_dev+0x25/0x57
 [<ffffffff8015dd83>] pci_remove_bus_device+0x37/0xa0
 [<ffffffff8890815e>] :fakephp:disable_slot+0x109/0x117
 [<ffffffff801671c8>] power_write_file+0xa5/0x111
 [<ffffffff8010fee2>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 48 8b 17 48 39 ea 77 3a 48 8b 47 08 4c 39 e0 72 31 83 7f 18
RIP [<ffffffff80096ea2>] __release_region+0x24/0x8f
 RSP <ffff81032f9dfcf8>
CR2: 0000000500000001
<0>Kernel panic - not syncing: Fatal exception

-------------- Verified on kernel: 2.6.18-238.17.1.el5 x86_64

I executed step 2 and step 3 in a loop 5 times; no panic occurred.

After step 2:
*** actual result: host works well, no panic ***
# echo -n 0 > /sys/bus/pci/slots/0000\:03\:00.0/power
# lspci|grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
# dmesg
...
ACPI: PCI interrupt for device 0000:03:00.1 disabled
ACPI: PCI interrupt for device 0000:03:00.0 disabled

After step 3:
*** actual result: host works well, 82576 can acquire an IP ***
# echo -n 1 > /sys/bus/pci/slots/0000\:02\:00.0/power
PCI: Enabling device 0000:03:00.0 (0100 -> 0102)
PCI: Enabling device 0000:03:00.1 (0100 -> 0102)
-bash: echo: write error: No such device
# lspci|grep Eth|grep 82576
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
# dmesg | tail
igb 0000:03:00.0: eth2: PBA No: e43709-003
igb 0000:03:00.0: Using MSI-X interrupts. 4 rx queue(s), 1 tx queue(s)
PCI: Enabling device 0000:03:00.1 (0100 -> 0102)
ACPI: PCI Interrupt 0000:03:00.1[B] -> GSI 40 (level, low) -> IRQ 186
PCI: Setting latency timer of device 0000:03:00.1 to 64
igb 0000:03:00.1: 0 vfs allocated
igb 0000:03:00.1: Intel(R) Gigabit Ethernet Network Connection
igb 0000:03:00.1: eth3: (PCIe:2.5Gb/s:Width x4) 00:1b:21:42:33:85
igb 0000:03:00.1: eth3: PBA No: e43709-003
igb 0000:03:00.1: Using MSI-X interrupts. 4 rx queue(s), 1 tx queue(s)
# ip link show | grep eth3
15: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
# ethtool -i eth3
driver: igb
version: 2.1.0-k2-1
firmware-version: 1.2-1
bus-info: 0000:03:00.1

------------- Conclusion:
Based on the above, this issue has been fixed.

According to Comment #6, moving to VERIFIED.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Hot removing a PCIe device and subsequently hot plugging it again caused a kernel panic. This was due to a PCI resource for the SR-IOV Virtual Function (vf) not being released after the hot removal, causing the memory area in the pci_dev struct to be used by another process. With this update, when a PCIe device is removed from a system, all resources are properly released; the kernel panic no longer occurs.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0927.html
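
The reproduce-and-verify procedure in this report (load fakephp, power off the 82576 slot, power a slot back on, repeat five times) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names and the SLOTS_DIR, REMOVE_SLOT, RESCAN_SLOT, CYCLES and SETTLE_SECS variables are hypothetical and not part of the report. Only the slot addresses, the modprobe call and the sysfs power files come from the steps in the description.

```shell
# Sketch of the hot-unplug / hot-plug loop from the report.
# All variable and function names below are hypothetical; the defaults
# mirror the slot addresses used in the reproduction steps.
SLOTS_DIR="${SLOTS_DIR:-/sys/bus/pci/slots}"
REMOVE_SLOT="${REMOVE_SLOT:-0000:03:00.0}"   # slot powered off in step 2
RESCAN_SLOT="${RESCAN_SLOT:-0000:02:00.0}"   # slot written to in step 3
CYCLES="${CYCLES:-5}"                        # the report looped 5 times
SETTLE_SECS="${SETTLE_SECS:-1}"              # assumed pause between writes

load_fakephp() {
    # Step 1: load the fake PCI hotplug driver (requires root).
    modprobe fakephp
}

hotplug_cycle() {
    # Step 2: hot unplug the NIC by powering off its slot.
    printf '%s' 0 > "$SLOTS_DIR/$REMOVE_SLOT/power" || return 1
    sleep "$SETTLE_SECS"
    # Step 3: hot plug. In the report this write printed
    # "write error: No such device" even though the NIC came back,
    # so a failure here is deliberately not treated as fatal.
    printf '%s' 1 > "$SLOTS_DIR/$RESCAN_SLOT/power" || true
    sleep "$SETTLE_SECS"
}

run_cycles() {
    i=1
    while [ "$i" -le "$CYCLES" ]; do
        hotplug_cycle || { echo "cycle $i: unplug failed" >&2; return 1; }
        echo "cycle $i: completed"
        i=$((i + 1))
    done
}
```

On a test host this would be run as root with `load_fakephp && run_cycles`, followed by a check of `dmesg` and `lspci | grep 82576`. On an unfixed kernel the first unplug is expected to panic the host, so it should not be run on a machine that matters.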