Bug 676875 - ixgbe: update to 3.0.12-k2 causing a panic on boot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Weibing Zhang
Duplicates: 678459
Depends On:
Blocks: 6.1KnownIssues
Reported: 2011-02-11 11:46 EST by Jason Baron
Modified: 2014-06-29 19:03 EDT

See Also:
Fixed In Version: kernel-2.6.32-118.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 08:42:28 EDT


Attachments
boot panic message (3.83 KB, application/text), 2011-02-11 11:46 EST, Jason Baron
console log (715.38 KB, application/octet-stream), 2011-03-09 04:33 EST, Weibing Zhang
dmesg with kernel -119 (52.60 KB, application/octet-stream), 2011-03-09 04:34 EST, Weibing Zhang


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0542
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update
Last Updated: 2011-05-19 07:58:07 EDT

Description Jason Baron 2011-02-11 11:46:03 EST
Created attachment 478279 [details]
boot panic message

Description of problem:

On lab machine cisco-b200m1-01.gsslab.rdu.redhat.com, I'm running into a panic on boot when networking is initializing.

Version-Release number of selected component (if applicable):

reproduced on kernel -115 and -102.

How reproducible:

boot box with kernel version >= -102


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:

no panic

Additional info:

I bisected this to commit:

[netdrv] ixgbe: update to upstream version 3.0.12-k2
commit 780e4d8bafbe46a9118b0cb78ceb4e39c1af7d22

here is lspci -v for the device:

06:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
	Subsystem: Cisco Systems Inc Device 004a
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at b19a0000 (32-bit, non-prefetchable) [size=128K]
	Memory at b1940000 (32-bit, non-prefetchable) [size=256K]
	I/O ports at 1020 [size=32]
	Memory at b19c4000 (32-bit, non-prefetchable) [size=16K]
	Expansion ROM at b2000000 [disabled] [size=256K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [60] MSI-X: Enable+ Count=18 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-25-b5-ff-ff-08-17-60
	Kernel driver in use: ixgbe
	Kernel modules: ixgbe

06:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
	Subsystem: Cisco Systems Inc Device 004a
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at b1980000 (32-bit, non-prefetchable) [size=128K]
	Memory at b1900000 (32-bit, non-prefetchable) [size=256K]
	I/O ports at 1000 [size=32]
	Memory at b19c0000 (32-bit, non-prefetchable) [size=16K]
	Expansion ROM at b2040000 [disabled] [size=256K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [60] MSI-X: Enable+ Count=18 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-25-b5-ff-ff-08-17-60
	Kernel driver in use: ixgbe
	Kernel modules: ixgbe
Comment 2 Andy Gospodarek 2011-02-11 16:19:29 EST
Got it.  Posting patch upstream now....
Comment 3 Andy Gospodarek 2011-02-11 17:04:01 EST
Posted:

http://marc.info/?l=linux-netdev&m=129746077110725&w=2

We will see what Intel thinks about it.
Comment 4 RHEL Product and Program Management 2011-02-15 13:00:01 EST
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 6 Andy Gospodarek 2011-02-18 16:53:46 EST
*** Bug 678459 has been marked as a duplicate of this bug. ***
Comment 7 Aristeu Rozanski 2011-02-23 13:36:04 EST
Patch(es) available on kernel-2.6.32-118.el6
Comment 10 Weibing Zhang 2011-03-08 03:37:06 EST
While testing kernel-2.6.32-119.el6 and kernel-2.6.32-118.el6 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com with eth0 using the ixgbe driver, the kernel does not panic, but it prints the messages pasted below. Meanwhile, eth0 cannot obtain an IP address via DHCP. The kernel keeps repeating these messages on the console.



Messages:
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001000/00002000
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Receiver ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00000041/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error        
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001000/00002000
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Receiver ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00000041/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
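
For readers decoding the status words in these messages, here is a minimal sketch that is not part of the original report. It assumes the "error status/mask" pair printed by the kernel holds the PCIe AER Corrected Error Status and Corrected Error Mask registers, with bit names taken from the PCIe specification. The function name decode_corrected_status is hypothetical.

```python
# Bit positions in the PCIe AER Corrected Error Status register.
# Assumption: the logged "status/mask" pair is this register and its mask.
CORRECTED_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_corrected_status(status_mask):
    """Split a 'status/mask' string such as '00001080/00002000' into named bits."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        # Collect the names of all known bits set in value, lowest bit first.
        return [name for bit, name in sorted(CORRECTED_ERROR_BITS.items())
                if value & (1 << bit)]

    return {"reported": names(status), "masked": names(mask)}


if __name__ == "__main__":
    # Status words copied from the messages above.
    for word in ("00001080/00002000", "00001040/00002000", "00000041/00002000"):
        print(word, decode_corrected_status(word))
```

Under that assumption, the mask value 00002000 in every message means only the Advisory Non-Fatal bit is masked, and the set bits match the bracketed bit numbers the kernel prints under each status line.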
Comment 11 Andy Gospodarek 2011-03-08 13:23:02 EST
(In reply to comment #10)
> While testing kernel-2.6.32-119.el6 & kernel-2.6.32-118.el6 on
> ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com. With eth0 using the ixgbe driver,
> the kernel doesn't run into panic, but it prints message as pasted below.
> Meanwhile, eth0 cannot obtain an ip address via dhcp. Kernel repeats printing
> the message on console.
> 
> 
> 
> Messages:
> pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
> pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link
> Layer, id=0050(Transmitter ID)
> pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
> pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
> pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
> ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer,
> id=2200(Transmitter ID)
> ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
> ixgbe 0000:22:00.0:    [ 6] Bad TLP               
> ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
> ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first

Does this happen just after the ixgbe driver is loaded?  Do you have dmesg output before these messages started to appear?  I suspect this is not a new problem, but we did not see this until the ixgbe panic was fixed.
Comment 14 Weibing Zhang 2011-03-09 04:33:51 EST
Created attachment 483133 [details]
console log
Comment 15 Weibing Zhang 2011-03-09 04:34:41 EST
Created attachment 483134 [details]
dmesg with kernel -119
Comment 16 Weibing Zhang 2011-03-09 04:36:38 EST
(In reply to comment #11)
> Does this happen just after the ixgbe driver is loaded?  Do you have dmesg
> output before these messages started to appear?  I suspect this is not a new
> problem, but we did not see this until the ixgbe panic was fixed.

Console logs and dmesg are attached.

Set eth0 to ONBOOT and DHCP.
#cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE="eth0"
HWADDR="00:1B:21:2C:83:B4"
NM_CONTROLLED="yes"
ONBOOT="yes"
BOOTPROTO="dhcp"
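
A quick way to check such a file programmatically is sketched below. This is an illustration only, not something used in this bug. It assumes the simple KEY="value" layout shown above, and the helper names parse_ifcfg and uses_dhcp_on_boot are hypothetical.

```python
import shlex


def parse_ifcfg(text):
    """Parse KEY="value" lines from an ifcfg-style file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        parts = shlex.split(value)  # strips shell-style quotes
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def uses_dhcp_on_boot(settings):
    """True when the interface is brought up at boot and configured via DHCP."""
    return (settings.get("ONBOOT", "").lower() == "yes"
            and settings.get("BOOTPROTO", "").lower() == "dhcp")


if __name__ == "__main__":
    sample = '''DEVICE="eth0"
HWADDR="00:1B:21:2C:83:B4"
NM_CONTROLLED="yes"
ONBOOT="yes"
BOOTPROTO="dhcp"
'''
    print(uses_dhcp_on_boot(parse_ifcfg(sample)))
```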

Booting with kernel-2.6.32-119.el6, here is the console log. The messages appear after the system tries to obtain an IP address via DHCP.

NET: Registered protocol family 10
lo: Disabled Privacy Extensions
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  
Determining IP information for eth0...ADDRCONF(NETDEV_UP): eth0: link is not ready
ixgbe 0000:22:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: IPv6 duplicate address fe80::21b:21ff:fe2c:83b4 detected!
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=000010c1/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [ 7] Bad DLLP              
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200


Booting with kernel-2.6.32-120.el6, here is the console log. The messages appear after the ixgbe driver is loaded.

		Welcome to Red Hat Enterprise Linux Server
Starting udev: udev: starting version 147
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
piix4_smbus 0000:00:08.0: SMBus Host Controller at 0x440, revision 0
sr 0:0:0:0: Attached scsi generic sg0 type 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
scsi 2:1:0:0: Attached scsi generic sg2 type 0
scsi 2:1:1:0: Attached scsi generic sg3 type 0
scsi 2:3:0:0: Attached scsi generic sg4 type 13
dca service started, version 1.12.1
ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.0.12-k2
ixgbe: Copyright (c) 1999-2010 Intel Corporation.
ixgbe 0000:22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ixgbe 0000:22:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
ixgbe 0000:22:00.0: (PCI Express:2.5Gb/s:Width x8) 00:1b:21:2c:83:b4
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
ixgbe 0000:22:00.0: MAC: 1, PHY: 4, PBA No: E18269-001
ixgbe 0000:22:00.0: Intel(R) 10 Gigabit Network Connection
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=000010c1/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [ 7] Bad DLLP              
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200
ses 2:3:0:0: Attached Enclosure device
EDAC MC: Ver: 2.1.0 Mar  7 2011
EDAC amd64_edac:  Ver: 3.3.0 Mar  7 2011
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: ECC is enabled by BIOS.
EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
EDAC MC1: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:19.2
EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
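
To answer the question from comment 11 (do the AER messages start right after the driver loads?) for a longer log such as the attached ones, a small scan along the following lines could help. This is only a sketch: it assumes the driver banner and the "AER:" marker look exactly as in the excerpt above, and the function name aer_offset_after_driver_load is hypothetical.

```python
# Assumption: the driver announces itself with this banner, as in the log above.
IXGBE_BANNER = "ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver"


def aer_offset_after_driver_load(log_lines):
    """Return how many lines after the ixgbe banner the first AER report
    appears, or None if the banner or a later AER line is missing."""
    banner_index = None
    for index, line in enumerate(log_lines):
        if banner_index is None:
            if line.startswith(IXGBE_BANNER):
                banner_index = index
        elif "AER:" in line:
            return index - banner_index
    return None


if __name__ == "__main__":
    # Lines copied from the -120 console log above.
    excerpt = [
        "dca service started, version 1.12.1",
        "ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.0.12-k2",
        "ixgbe: Copyright (c) 1999-2010 Intel Corporation.",
        "ixgbe 0000:22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17",
        "ixgbe 0000:22:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8",
        "ixgbe 0000:22:00.0: (PCI Express:2.5Gb/s:Width x8) 00:1b:21:2c:83:b4",
        "pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200",
    ]
    print(aer_offset_after_driver_load(excerpt))
```

A small offset suggests the errors begin during driver probe rather than at DHCP time, which is the distinction drawn between the -119 and -120 logs in this comment.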
Comment 17 Andy Gospodarek 2011-03-11 09:00:50 EST
I've been testing with -122 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com and can reproduce the problem.  I'm not sure if the driver or the NIC is to blame for this issue, but I hope to narrow it down soon.  If this is the only system that demonstrates this problem, I think we can set this bug to VERIFIED.
Comment 18 Dayong Tian 2011-03-13 21:32:42 EDT
(In reply to comment #17)
> I've been testing with -122 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com and
> can reproduce the problem.  I'm not sure if the driver or the NIC is to blame
> for this issue, but I hope to narrow it down soon.  If this is the only system
> that demonstrates this problem, I think we can set this bug to VERIFIED.

Confirmed with eng-ops: the NIC on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com was connected to a private 10Gb network that did not have a DHCP server.
https://engineering.redhat.com/rt3/Ticket/Display.html?id=104321
Comment 21 errata-xmlrpc 2011-05-19 08:42:28 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
