Bug 676875

Summary: ixgbe: update to 3.0.12-k2 causing a panic on boot
Product: Red Hat Enterprise Linux 6 Reporter: Jason Baron <jbaron>
Component: kernel Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA QA Contact: Weibing Zhang <atzhang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1 CC: atzhang, dtian, hjia, jbaron, jburke, knoel, kzhang, mbelangia, peterm, ypei
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-2.6.32-118.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 12:42:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 676037    
Attachments:
Description Flags
boot panic message none
console log none
dmesg with kernel -119 none

Description Jason Baron 2011-02-11 16:46:03 UTC
Created attachment 478279 [details]
boot panic message

Description of problem:

On lab machine cisco-b200m1-01.gsslab.rdu.redhat.com, I'm running into a panic on boot when networking is initializing.

Version-Release number of selected component (if applicable):

Reproduced on kernels -115 and -102.

How reproducible:

Always; boot the box with a kernel version >= -102.


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:

no panic

Additional info:

I bisected this to commit:

[netdrv] ixgbe: update to upstream version 3.0.12-k2
commit 780e4d8bafbe46a9118b0cb78ceb4e39c1af7d22

here is lspci -v for the device:

06:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
	Subsystem: Cisco Systems Inc Device 004a
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at b19a0000 (32-bit, non-prefetchable) [size=128K]
	Memory at b1940000 (32-bit, non-prefetchable) [size=256K]
	I/O ports at 1020 [size=32]
	Memory at b19c4000 (32-bit, non-prefetchable) [size=16K]
	Expansion ROM at b2000000 [disabled] [size=256K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [60] MSI-X: Enable+ Count=18 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-25-b5-ff-ff-08-17-60
	Kernel driver in use: ixgbe
	Kernel modules: ixgbe

06:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
	Subsystem: Cisco Systems Inc Device 004a
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at b1980000 (32-bit, non-prefetchable) [size=128K]
	Memory at b1900000 (32-bit, non-prefetchable) [size=256K]
	I/O ports at 1000 [size=32]
	Memory at b19c0000 (32-bit, non-prefetchable) [size=16K]
	Expansion ROM at b2040000 [disabled] [size=256K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [60] MSI-X: Enable+ Count=18 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-25-b5-ff-ff-08-17-60
	Kernel driver in use: ixgbe
	Kernel modules: ixgbe

Comment 2 Andy Gospodarek 2011-02-11 21:19:29 UTC
Got it.  Posting patch upstream now....

Comment 3 Andy Gospodarek 2011-02-11 22:04:01 UTC
Posted:

http://marc.info/?l=linux-netdev&m=129746077110725&w=2

We will see what Intel thinks about it.

Comment 4 RHEL Program Management 2011-02-15 18:00:01 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 6 Andy Gospodarek 2011-02-18 21:53:46 UTC
*** Bug 678459 has been marked as a duplicate of this bug. ***

Comment 7 Aristeu Rozanski 2011-02-23 18:36:04 UTC
Patch(es) available on kernel-2.6.32-118.el6

Comment 10 Weibing Zhang 2011-03-08 08:37:06 UTC
While testing kernel-2.6.32-119.el6 and kernel-2.6.32-118.el6 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com with eth0 using the ixgbe driver, the kernel does not panic, but it prints the messages pasted below. Meanwhile, eth0 cannot obtain an IP address via DHCP. The kernel keeps repeating these messages on the console.



Messages:
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001000/00002000
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Receiver ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00000041/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error        
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001000/00002000
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Receiver ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00000041/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
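
As a reading aid for the "error status/mask=" fields in the messages above, here is a minimal Python sketch that decodes them as bitmasks. The names for bits 0, 6, 7 and 12 are the ones the kernel prints in this log; bits 8 and 13 are assumed from the PCIe AER Correctable Error Status register layout and do not appear in this report. The function name decode_correctable is hypothetical and is not part of any kernel tool.

```python
# Bit positions of the PCIe AER Correctable Error Status register.
# Bits 0, 6, 7 and 12 appear in the log above; bits 8 and 13 are
# assumed from the PCIe specification and are not shown in this report.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable(status_hex, mask_hex="0"):
    """Return (bit, name, masked) tuples for every bit set in status_hex."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    result = []
    for bit in range(32):
        if status & (1 << bit):
            name = CORRECTABLE_BITS.get(bit, "Unknown")
            result.append((bit, name, bool(mask & (1 << bit))))
    return result


if __name__ == "__main__":
    # Values copied from the "error status/mask=" fields in the log.
    pairs = [("00001080", "00002000"),
             ("00001040", "00002000"),
             ("00000041", "00002000")]
    for status, mask in pairs:
        print(status, decode_correctable(status, mask))
```

For status 00001080 this yields bits 7 and 12, which matches the "Bad DLLP" and "Replay Timer Timeout" lines printed for the pcieport device. The mask 00002000 sets only bit 13, so none of the reported bits are masked.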

Comment 11 Andy Gospodarek 2011-03-08 18:23:02 UTC
(In reply to comment #10)
> While testing kernel-2.6.32-119.el6 & kernel-2.6.32-118.el6 on
> ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com. With eth0 using the ixgbe driver,
> the kernel doesn't run into panic, but it prints message as pasted below.
> Meanwhile, eth0 cannot obtain an ip address via dhcp. Kernel repeats printing
> the message on console.
> 
> 
> 
> Messages:
> pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
> pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link
> Layer, id=0050(Transmitter ID)
> pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
> pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
> pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
> ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer,
> id=2200(Transmitter ID)
> ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=00001040/00002000
> ixgbe 0000:22:00.0:    [ 6] Bad TLP               
> ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
> ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first

Does this happen just after the ixgbe driver is loaded?  Do you have dmesg output before these messages started to appear?  I suspect this is not a new problem, but we did not see this until the ixgbe panic was fixed.

Comment 14 Weibing Zhang 2011-03-09 09:33:51 UTC
Created attachment 483133 [details]
console log

Comment 15 Weibing Zhang 2011-03-09 09:34:41 UTC
Created attachment 483134 [details]
dmesg with kernel -119

Comment 16 Weibing Zhang 2011-03-09 09:36:38 UTC
(In reply to comment #11)
> Does this happen just after the ixgbe driver is loaded?  Do you have dmesg
> output before these messages started to appear?  I suspect this is not a new
> problem, but we did not see this until the ixgbe panic was fixed.

Console logs and dmesg are attached.

Set eth0 to ONBOOT and DHCP.
#cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE="eth0"
HWADDR="00:1B:21:2C:83:B4"
NM_CONTROLLED="yes"
ONBOOT="yes"
BOOTPROTO="dhcp"

Here is the console log from booting kernel-2.6.32-119.el6. The messages appear after the attempt to obtain an IP address via DHCP.

NET: Registered protocol family 10
lo: Disabled Privacy Extensions
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  
Determining IP information for eth0...ADDRCONF(NETDEV_UP): eth0: link is not ready
ixgbe 0000:22:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: IPv6 duplicate address fe80::21b:21ff:fe2c:83b4 detected!
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=000010c1/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [ 7] Bad DLLP              
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200


Here is the console log from booting kernel-2.6.32-120.el6. The messages appear after the ixgbe driver is loaded.

		Welcome to Red Hat Enterprise Linux Server
Starting udev: udev: starting version 147
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
piix4_smbus 0000:00:08.0: SMBus Host Controller at 0x440, revision 0
sr 0:0:0:0: Attached scsi generic sg0 type 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
scsi 2:1:0:0: Attached scsi generic sg2 type 0
scsi 2:1:1:0: Attached scsi generic sg3 type 0
scsi 2:3:0:0: Attached scsi generic sg4 type 13
dca service started, version 1.12.1
ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.0.12-k2
ixgbe: Copyright (c) 1999-2010 Intel Corporation.
ixgbe 0000:22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ixgbe 0000:22:00.0: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
ixgbe 0000:22:00.0: (PCI Express:2.5Gb/s:Width x8) 00:1b:21:2c:83:b4
pcieport 0000:00:0a.0: AER: Multiple Corrected error received: id=2200
ixgbe 0000:22:00.0: MAC: 1, PHY: 4, PBA No: E18269-001
ixgbe 0000:22:00.0: Intel(R) 10 Gigabit Network Connection
pcieport 0000:00:0a.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0050(Transmitter ID)
pcieport 0000:00:0a.0:   device [1166:0140] error status/mask=00001080/00002000
pcieport 0000:00:0a.0:    [ 7] Bad DLLP              
pcieport 0000:00:0a.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=2200(Transmitter ID)
ixgbe 0000:22:00.0:   device [8086:10c7] error status/mask=000010c1/00002000
ixgbe 0000:22:00.0:    [ 0] Receiver Error         (First)
ixgbe 0000:22:00.0:    [ 6] Bad TLP               
ixgbe 0000:22:00.0:    [ 7] Bad DLLP              
ixgbe 0000:22:00.0:    [12] Replay Timer Timeout  
ixgbe 0000:22:00.0:   Error of this Agent(2200) is reported first
pcieport 0000:00:0a.0: AER: Corrected error received: id=2200
ses 2:3:0:0: Attached Enclosure device
EDAC MC: Ver: 2.1.0 Mar  7 2011
EDAC amd64_edac:  Ver: 3.3.0 Mar  7 2011
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: ECC is enabled by BIOS.
EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
EDAC MC1: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:19.2
EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
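
The "id=" values in the AER messages in this log can be mapped back to PCI addresses. The following minimal Python sketch assumes the standard 16-bit requester ID layout (8-bit bus, 5-bit device, 3-bit function). The helper names decode_aer_id and format_bdf are hypothetical, and the PCI domain is assumed to be 0000 because the id field does not carry it.

```python
def decode_aer_id(id_hex):
    """Split a 16-bit PCIe requester ID into (bus, device, function)."""
    value = int(id_hex, 16)
    bus = (value >> 8) & 0xFF      # upper 8 bits
    device = (value >> 3) & 0x1F   # next 5 bits
    function = value & 0x07        # lowest 3 bits
    return bus, device, function


def format_bdf(id_hex, domain=0):
    """Format the ID as domain:bus:device.function (domain is assumed)."""
    bus, device, function = decode_aer_id(id_hex)
    return "%04x:%02x:%02x.%x" % (domain, bus, device, function)


if __name__ == "__main__":
    # IDs copied from the "id=" fields in the log.
    for aer_id in ("2200", "0050"):
        print(aer_id, "->", format_bdf(aer_id))
```

Under these assumptions, id=2200 corresponds to 0000:22:00.0 (the ixgbe device) and id=0050 corresponds to 0000:00:0a.0 (the pcieport device), which agrees with the device prefixes on the log lines.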

Comment 17 Andy Gospodarek 2011-03-11 14:00:50 UTC
I've been testing with -122 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com and can reproduce the problem.  I'm not sure if the driver or the NIC is to blame for this issue, but I hope to narrow it down soon.  If this is the only system that demonstrates this problem, I think we can set this bug to VERIFIED.

Comment 18 Dayong Tian 2011-03-14 01:32:42 UTC
(In reply to comment #17)
> I've been testing with -122 on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com and
> can reproduce the problem.  I'm not sure if the driver or the NIC is to blame
> for this issue, but I hope to narrow it down soon.  If this is the only system
> that demonstrates this problem, I think we can set this bug to VERIFIED.

Confirmed with eng-ops: the NIC on ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com was connected to a private 10Gb network that did not have a DHCP server.
https://engineering.redhat.com/rt3/Ticket/Display.html?id=104321

Comment 21 errata-xmlrpc 2011-05-19 12:42:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html