Bug 500667

Summary: Hardware Error bringing up e1000e interface with jumbo frames
Product: Red Hat Enterprise Linux 5
Reporter: Orion Poplawski <orion>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 5.3
CC: dzickus, james.brown, peterm, rnickel, uwe.knop
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2010-04-22 13:57:31 UTC

Description Orion Poplawski 2009-05-13 16:07:16 UTC
Description of problem:

Since kernel-2.6.18-128.1.10.el5, I'm seeing the following at boot about 25% of the time:

May 13 09:13:41 castor kernel: e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k4
May 13 09:13:41 castor kernel: e1000e: Copyright (c) 1999-2008 Intel Corporation.
May 13 09:13:41 castor kernel: ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 44 (level, low) -> IRQ 185
May 13 09:13:41 castor kernel: intel_rng: FWH not detected
May 13 09:13:41 castor kernel: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:7f:45:1a
May 13 09:13:41 castor kernel: eth0: Intel(R) PRO/1000 Network Connection
May 13 09:13:41 castor kernel: eth0: MAC: 4, PHY: 5, PBA No: 2050ff-0ff
May 13 09:13:41 castor kernel: GSI 25 sharing vector 0x6A and IRQ 25
May 13 09:13:41 castor kernel: ACPI: PCI Interrupt 0000:05:00.1[B] -> GSI 40 (level, low) -> IRQ 106
May 13 09:13:41 castor kernel: GSI 26 sharing vector 0x7A and IRQ 26
May 13 09:13:41 castor kernel: ACPI: PCI Interrupt 0000:00:1f.3[C] -> GSI 18 (level, low) -> IRQ 122
May 13 09:13:41 castor kernel: 0000:05:00.1: Hardware Error
May 13 09:13:41 castor kernel: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:7f:45:1b
May 13 09:13:41 castor kernel: eth1: Intel(R) PRO/1000 Network Connection
May 13 09:13:41 castor kernel: eth1: MAC: 4, PHY: 5, PBA No: 2050ff-0ff
May 13 09:13:42 castor kernel: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: None
May 13 09:13:42 castor kernel: eth1: changing MTU from 1500 to 8982
May 13 09:13:42 castor kernel: eth1: Hardware Error
May 13 09:13:44 castor kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready

And eth1 does not work.

05:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
        Subsystem: Super Micro Computer Inc Unknown device 1096
        Flags: bus master, fast devsel, latency 0, IRQ 186
        Memory at d8060000 (32-bit, non-prefetchable) [size=128K]
        Memory at d8040000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2020 [size=32]
        [virtual] Expansion ROM at d8310000 [disabled] [size=64K]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 1a-45-7f-ff-ff-48-30-00

I'm backing off to 2.6.18-128.1.6.el5 for now.

Comment 1 Andy Gospodarek 2009-05-20 15:55:17 UTC
There were no changes specifically to e1000e between -128.1.6 and -128.1.10, so this is a bit odd.  Is there a particular MTU between 1500 and 8982 that works close to 100% of the time, as far as you can tell?  I'd be curious how consistently exactly 4000 or 8000 works.
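
A minimal sketch of how the MTU values asked about above could be tried in turn. The helper names (mtu_candidates, set_mtu_command, sweep) are hypothetical and not part of any tool mentioned in this bug; the sketch assumes the iproute "ip" utility is installed and that the script runs as root when not in dry-run mode. A zero exit status only means the MTU change was accepted, so the kernel log still needs to be checked for "Hardware Error" after each step.

```python
import subprocess


def mtu_candidates(low=1500, high=8982, extra=(4000, 8000)):
    """Return sorted, de-duplicated MTU values to try within [low, high]."""
    values = {low, high}
    values.update(m for m in extra if low <= m <= high)
    return sorted(values)


def set_mtu_command(iface, mtu):
    """Build the 'ip link' command line that sets the MTU on iface."""
    return ["ip", "link", "set", "dev", iface, "mtu", str(mtu)]


def sweep(iface, mtus, run=None):
    """Apply each MTU in turn and return {mtu: True if the command exited 0}.

    run is any callable that takes a command list and returns an exit
    status; it defaults to subprocess.call (assumption: caller is root).
    """
    if run is None:
        run = subprocess.call
    results = {}
    for mtu in mtus:
        results[mtu] = run(set_mtu_command(iface, mtu)) == 0
    return results


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def dry_run(cmd):
        print(" ".join(cmd))
        return 0

    sweep("eth1", mtu_candidates(), run=dry_run)
```

Passing a custom run callable keeps the sweep testable without touching a real interface; dropping that argument makes it execute the commands.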

Comment 2 Orion Poplawski 2009-05-27 16:11:57 UTC
Also seeing:

e1000e: probe of 0000:05:00.1 failed with error -2

and no presence of eth1 at all.

I'll try an MTU of 1500 for a while and see whether that makes any difference.
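
To put a number on how often the failure occurs across boots, the two symptoms quoted so far ("Hardware Error" and the failed probe) can be picked out of the system log. The following is only a sketch: the regular expressions are derived from the message lines quoted in this report, and the function name find_e1000e_errors is hypothetical. Other driver versions may word these messages differently.

```python
import re

# Patterns modelled on the kernel messages quoted in this bug report.
HARDWARE_ERROR = re.compile(r"kernel: (\S+): Hardware Error\s*$")
PROBE_FAILED = re.compile(r"e1000e: probe of (\S+) failed with error (-?\d+)")


def find_e1000e_errors(lines):
    """Return a list of (kind, device, errno_or_None) for matching lines."""
    hits = []
    for line in lines:
        m = HARDWARE_ERROR.search(line)
        if m:
            hits.append(("hardware_error", m.group(1), None))
            continue
        m = PROBE_FAILED.search(line)
        if m:
            hits.append(("probe_failed", m.group(1), int(m.group(2))))
    return hits


if __name__ == "__main__":
    sample = [
        "May 13 09:13:41 castor kernel: 0000:05:00.1: Hardware Error",
        "May 13 09:13:42 castor kernel: eth1: changing MTU from 1500 to 8982",
        "May 13 09:13:42 castor kernel: eth1: Hardware Error",
        "e1000e: probe of 0000:05:00.1 failed with error -2",
    ]
    for hit in find_e1000e_errors(sample):
        print(hit)
```

In practice the input would be the lines of a log file such as /var/log/messages, read with open() and passed in as-is.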

Comment 4 Andy Gospodarek 2009-10-19 18:18:15 UTC
Has 5.4 been tried and does it resolve this problem?

Comment 5 Andy Gospodarek 2010-04-22 13:57:31 UTC
Several errors related to the system PHY that produced failures like this:

e1000e: probe of 0000:04:00.1 failed with error -2

were fixed in RHEL 5.5.  In a few other cases, this error was fixed by a BIOS update.  Please update to the latest kernel and re-open this bug if the problem persists.
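
A rough sketch for checking whether a given RHEL 5 kernel version string is at or beyond the RHEL 5.5 level. It assumes that the RHEL 5.5 GA kernel is 2.6.18-194.el5; that number is not stated anywhere in this bug and should be verified against the errata before relying on it. The function names are hypothetical. The check compares release numbers only, so it cannot tell whether an earlier update stream received a backport of the same fixes.

```python
import re

# Assumption (not from this bug report): RHEL 5.5 GA shipped 2.6.18-194.el5.
RHEL55_GA_RELEASE = (194,)


def el5_release(version):
    """Parse '2.6.18-128.1.10.el5' into the release tuple (128, 1, 10)."""
    m = re.match(r"^2\.6\.18-([\d.]+)\.el5", version)
    if not m:
        raise ValueError("not a RHEL 5 kernel version: %r" % (version,))
    return tuple(int(part) for part in m.group(1).split("."))


def at_least_rhel55(version):
    """True if the kernel release is >= the assumed RHEL 5.5 GA release."""
    return el5_release(version) >= RHEL55_GA_RELEASE


if __name__ == "__main__":
    # Versions mentioned by the reporter, plus the assumed 5.5 GA kernel.
    for v in ("2.6.18-128.1.6.el5", "2.6.18-128.1.10.el5", "2.6.18-194.el5"):
        print(v, at_least_rhel55(v))
```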