Bug 707899

Summary: The pci resource for vf is not released after hot-removing Intel 82576 NIC [rhel-5.6.z]
Product: Red Hat Enterprise Linux 5
Reporter: Ken Reilly <kreilly>
Component: kernel
Assignee: Phillip Lougher <plougher>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: high
Version: 5.6
CC: chayang, chrisw, ddutile, dhoward, dwu, jarod, juzhang, moshiro, mstowe, pm-eus, prarit, qcai, tburke
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.18-238.15.1.el5
Doc Type: Bug Fix
Doc Text:
Hot removing a PCIe device and subsequently hot plugging it again caused a kernel panic. This was because a PCI resource for the SR-IOV Virtual Function (VF) was not released after the hot removal, so the memory area in the pci_dev struct could be used by another process. With this update, all resources are properly released when a PCIe device is removed from a system, and the kernel panic no longer occurs.
Last Closed: 2011-07-15 06:11:07 UTC
Bug Depends On: 698879    
Bug Blocks:    

Description Ken Reilly 2011-05-26 09:54:43 UTC
This bug has been copied from bug #698879 and has been proposed
to be backported to 5.6 z-stream (EUS).

Comment 4 Phillip Lougher 2011-06-17 09:13:54 UTC
Patch included in kernel-2.6.18-238.15.1.el5:

linux-2.6-pci-sriov-release-vf-bar-resources-when-device-is-hot-unplug.patch

Comment 6 Chao Yang 2011-06-27 08:18:50 UTC
Reproduced and verified this issue with the following steps:
1. Load the fakephp module to fake PCI hot plug/unplug:
# modprobe fakephp
2. Hot unplug the 82576 NIC:
# echo -n 0 > /sys/bus/pci/slots/0000\:03\:00.0/power
3. Hot plug the 82576 NIC:
# echo -n 1 > /sys/bus/pci/slots/0000\:02\:00.0/power
Host PCI info:
# lspci | grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)


----------Reproduced on kernel version: 2.6.18-238.el5 x86_64
after step 2:
*** actual result: host kernel panic ***
Unable to handle kernel paging request at 0000000500000001 RIP: 
 [<ffffffff80096ea2>] __release_region+0x24/0x8f
PGD 30ab39067 PUD 0 
Oops: 0000 [1] SMP 
last sysfs file: /bus/pci/slots/0000:03:00.0/power
CPU 2 
Modules linked in: fakephp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc tpm_tis snd_hwdep igb tpm sr_mod cdrom snd 8021q tg3 tpm_bios sg shpchp i7core_edac pcspkr soundcore serio_raw edac_mc dca dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih scsi_transport_sas mptbase ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 15946, comm: bash Not tainted 2.6.18-238.el5 #1
RIP: 0010:[<ffffffff80096ea2>]  [<ffffffff80096ea2>] __release_region+0x24/0x8f
RSP: 0018:ffff81032f9dfcf8  EFLAGS: 00010202
RAX: 0000000000000200 RBX: ffff81032f0d9470 RCX: 00000000e37fffff
RDX: 0000000000000200 RSI: 00000000e3400000 RDI: 0000000500000001
RBP: 00000000e3400000 R08: 0000000000000004 R09: ffffc200100c0000
R10: ffff81032f9dfbb8 R11: ffffffff8022a0c5 R12: 00000000e37fffff
R13: ffff81032c0c6780 R14: ffff81032fad4800 R15: ffff81032c0c6000
FS:  00002b0fbd223f50(0000) GS:ffff81010b3e3e40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000500000001 CR3: 0000000313b63000 CR4: 00000000000006e0
Process bash (pid: 15946, threadinfo ffff81032f9de000, task ffff810329f307e0)
Stack:  ffff81032fad4800 0000000000000001 ffff81032fad4800 ffffffff8015ed79
 ffffc200100c0000 0000000000000001 000000000000004b ffffffff8015ee2d
 ffff81032fad4870 ffff81032c0c6500 ffffffff882f56d8 ffffffff882e65b3
Call Trace:
 [<ffffffff8015ed79>] pci_release_region+0x84/0xa6
 [<ffffffff8015ee2d>] pci_release_selected_regions+0x1f/0x2b
 [<ffffffff882e65b3>] :igb:igb_remove+0x170/0x195
 [<ffffffff8016146b>] pci_device_remove+0x24/0x3a
 [<ffffffff801cb969>] __device_release_driver+0x9f/0xe9
 [<ffffffff801cbc60>] device_release_driver+0x2c/0x4e
 [<ffffffff801cb12e>] bus_remove_device+0x9d/0xb2
 [<ffffffff801c9e2a>] device_del+0x129/0x1a9
 [<ffffffff801c9ec7>] device_unregister+0x9/0x12
 [<ffffffff8015dc6e>] pci_stop_dev+0x25/0x57
 [<ffffffff8015dd83>] pci_remove_bus_device+0x37/0xa0
 [<ffffffff8890815e>] :fakephp:disable_slot+0x109/0x117
 [<ffffffff801671c8>] power_write_file+0xa5/0x111
 [<ffffffff8010fee2>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016a81>] vfs_write+0xce/0x174
 [<ffffffff80017339>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b 17 48 39 ea 77 3a 48 8b 47 08 4c 39 e0 72 31 83 7f 18 
RIP  [<ffffffff80096ea2>] __release_region+0x24/0x8f
 RSP <ffff81032f9dfcf8>
CR2: 0000000500000001
 <0>Kernel panic - not syncing: Fatal exception

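For triage, the faulting symbol and the module-owned frames can be pulled out of a saved copy of the oops text above. This is a minimal sketch, not part of the original report: the helper names oops_rip_symbol and oops_module_frames are hypothetical, and it assumes the trace was saved to a file in the format shown here, where module frames are written as ":module:symbol".

```shell
# Sketch only: summarize a saved RHEL 5 style oops.
# oops_rip_symbol and oops_module_frames are illustrative helper names.

# Print the symbol on the first line that starts with "RIP: ",
# for example "__release_region+0x24/0x8f".
oops_rip_symbol() {
    sed -n 's/^RIP: .*] *\([^ ]*\) *$/\1/p' "$1" | head -n 1
}

# Print "module symbol+offset" for every call-trace frame that belongs
# to a loadable module (frames written as " [<addr>] :module:symbol").
oops_module_frames() {
    sed -n 's/^ *\[<[0-9a-f]*>] :\([^:]*\):\(.*\)$/\1 \2/p' "$1"
}
```

Run against a file holding the trace from this report, the second helper should list the igb and fakephp frames, which points at igb_remove as the driver entry point that reached __release_region.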

--------------Verified on kernel: 2.6.18-238.17.1.el5 x86_64
I executed steps 2 and 3 in a loop five times; no panic occurred.
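The repeated unplug/plug run mentioned above could be scripted roughly as follows. This is a sketch under stated assumptions, not the script actually used for verification: the names SLOTS_DIR, UNPLUG_SLOT, PLUG_SLOT, SETTLE_SECS and hotplug_cycle are hypothetical, the slot addresses are the ones from steps 2 and 3, and the pause between writes is an assumed settle time.

```shell
# Sketch only: repeat the fakephp unplug/plug cycle from steps 2 and 3.
# All variable and function names here are illustrative.
SLOTS_DIR=${SLOTS_DIR:-/sys/bus/pci/slots}
UNPLUG_SLOT=${UNPLUG_SLOT:-0000:03:00.0}   # slot written in step 2
PLUG_SLOT=${PLUG_SLOT:-0000:02:00.0}       # slot written in step 3

hotplug_cycle() {
    hc_count=$1
    hc_i=1
    while [ "$hc_i" -le "$hc_count" ]; do
        echo "iteration $hc_i: unplug $UNPLUG_SLOT"
        # Equivalent to "echo -n 0 > .../power" in step 2.
        printf 0 > "$SLOTS_DIR/$UNPLUG_SLOT/power" || return 1
        sleep "${SETTLE_SECS:-2}"   # assumed settle time
        echo "iteration $hc_i: plug via $PLUG_SLOT"
        # In this report the step 3 write printed "No such device" even
        # though the NIC came back, so its exit status is ignored here.
        printf 1 > "$SLOTS_DIR/$PLUG_SLOT/power" || true
        sleep "${SETTLE_SECS:-2}"
        hc_i=$((hc_i + 1))
    done
}
```

With the fakephp module loaded, running `hotplug_cycle 5` as root would correspond to the five iterations described; on an unfixed kernel the first unplug is expected to panic the host, so this should only be tried on a test machine.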
after step 2:
*** actual result: host works well, no panic ***
# echo -n 0 > /sys/bus/pci/slots/0000\:03\:00.0/power 
# lspci|grep Eth
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
# dmesg
...
ACPI: PCI interrupt for device 0000:03:00.1 disabled
ACPI: PCI interrupt for device 0000:03:00.0 disabled

after step 3:
*** actual result: host works well, 82576 can acquire an IP ***
# echo -n 1 > /sys/bus/pci/slots/0000\:02\:00.0/power 
PCI: Enabling device 0000:03:00.0 (0100 -> 0102)
PCI: Enabling device 0000:03:00.1 (0100 -> 0102)
-bash: echo: write error: No such device
# lspci|grep Eth|grep 82576
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
# dmesg | tail
igb 0000:03:00.0: eth2: PBA No: e43709-003
igb 0000:03:00.0: Using MSI-X interrupts. 4 rx queue(s), 1 tx queue(s)
PCI: Enabling device 0000:03:00.1 (0100 -> 0102)
ACPI: PCI Interrupt 0000:03:00.1[B] -> GSI 40 (level, low) -> IRQ 186
PCI: Setting latency timer of device 0000:03:00.1 to 64
igb 0000:03:00.1: 0 vfs allocated
igb 0000:03:00.1: Intel(R) Gigabit Ethernet Network Connection
igb 0000:03:00.1: eth3: (PCIe:2.5Gb/s:Width x4) 00:1b:21:42:33:85
igb 0000:03:00.1: eth3: PBA No: e43709-003
igb 0000:03:00.1: Using MSI-X interrupts. 4 rx queue(s), 1 tx queue(s)
# ip link show | grep eth3
15: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
# ethtool -i eth3
driver: igb
version: 2.1.0-k2-1
firmware-version: 1.2-1
bus-info: 0000:03:00.1
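The check that both 82576 functions are visible again after step 3 can be automated with a small helper. This is a hedged sketch: count_82576 is a hypothetical name, and it assumes lspci describes the NIC with the same string as in the output above.

```shell
# Sketch only: count the 82576 functions in `lspci` output read from stdin.
# count_82576 is an illustrative helper name.
count_82576() {
    grep -c 'Ethernet controller: Intel Corporation 82576'
}
```

For example, `[ "$(lspci | count_82576)" -eq 2 ]` succeeds only when both functions of the dual-port card are listed, matching the 03:00.0 and 03:00.1 entries shown in this comment.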



-------------Conclusion:
Based on the above, this issue has been fixed.

Comment 7 Chao Yang 2011-06-27 08:31:20 UTC
According to Comment #6, moving to VERIFIED

Comment 8 Martin Prpič 2011-07-12 11:52:58 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Hot removing a PCIe device and subsequently hot plugging it again caused a kernel panic. This was because a PCI resource for the SR-IOV Virtual Function (VF) was not released after the hot removal, so the memory area in the pci_dev struct could be used by another process. With this update, all resources are properly released when a PCIe device is removed from a system, and the kernel panic no longer occurs.

Comment 9 errata-xmlrpc 2011-07-15 06:11:07 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0927.html