403121 – e1000: issues with Intel ESB2/Gilgal (82563EB)

Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Version:	5.1
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Martin Jenner

Reported:	2007-11-28 16:36 UTC by (GalaxyMaster)
Modified:	2014-06-29 22:59 UTC
Doc Type:	Bug Fix
Last Closed:	2008-12-23 20:38:20 UTC

Description (GalaxyMaster) 2007-11-28 16:36:19 UTC

Description of problem:
The Intel ESB2/Gilgal (82563EB) NIC (used, for instance, on Supermicro
motherboards like this:
http://www.supermicro.com/products/motherboard/Xeon1333/5000V/X7DVL-E.cfm)
requires driver version 7.6.5-NAPI or later.  Although driver versions before
7.6.5-NAPI announce support for PCI ID 0x8086:0x1096, a system with such a NIC
becomes unreachable over the network 5-10 minutes after boot.
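
To make the version threshold above concrete, here is a minimal Python sketch that compares an e1000 version string against 7.6.5.  The function names and the parsing rule (take the leading dotted numbers and ignore suffixes such as -k2 or -NAPI) are assumptions made for this illustration; the threshold itself is only the one reported in this description.

```python
import re

# Threshold as reported in this bug description (not an official Intel figure).
MIN_VERSION = (7, 6, 5)

def parse_driver_version(text):
    """Return the leading dotted numeric part of a version string as a tuple.

    Suffixes such as "-k2" or "-NAPI" are ignored (an assumption of this sketch).
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))

def meets_minimum(text, minimum=MIN_VERSION):
    """True if the numeric part of the version is at least `minimum`."""
    return parse_driver_version(text) >= minimum
```

For example, `meets_minimum("7.3.20-k2")` is false for the driver shipped in the RHEL 5.1 kernels, while `meets_minimum("7.6.5-NAPI")` is true.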

Version-Release number of selected component (if applicable):

e1000 7.3.20-k2 as included in the latest rhel5.1 kernels

How reproducible:

The problem manifests itself on the specified hardware: network connectivity
is lost several minutes after the server starts up.

Additional info:

I have also added a description of this issue to bug #398921, and submitted a
bug report to OpenVZ bugzilla (they are using RHEL5.1 kernels and enhancing them
with OpenVZ functionality) here: http://bugzilla.openvz.org/show_bug.cgi?id=530#c6

Comment 2 Andy Gospodarek 2007-12-06 15:28:05 UTC

The Intel driver from sourceforge has some interesting heritage.  Intel did a
major refactor of the driver, but when they went to push it upstream it wasn't
too well received.  The changes were so drastic that it was determined that
Intel should split the driver into 2 versions.

The first submission of the e1000e driver simply added support for some hardware
that didn't previously exist -- this was the best way to not disturb the e1000
driver.  Recently e1000e was considered stable enough to move all the PCIe
hardware over to e1000e and those changes were made upstream.  The older e1000
driver did a poor job of driving much of the newer e1000 PCIe hardware so I
expect this will help, but I have not tested on your specific hardware to be sure.

Just yesterday I pulled these changes into my experimental gtest kernels.  It
would help if you could try them out here:

http://people.redhat.com/agospoda/#rhel5

You will probably have to change your /etc/modprobe.conf to use e1000e instead
of e1000 for your NIC[s], but other than that you should be fine.

Comment 3 (GalaxyMaster) 2007-12-19 00:49:38 UTC

(In reply to comment #2)
> Just yesterday I pulled these changes into my experimental gtest kernels.  It
> would help if you could try them out here:

I tried booting one of our servers with e1000e instead of e1000, with no luck:
e1000e just didn't recognize the 82563EB.

BTW, e1000 (7.3.20-k2) from vanilla kernel 2.6.22.1 works with 82563EB without a
single failure.

Comment 4 (GalaxyMaster) 2007-12-19 00:54:23 UTC

(In reply to comment #3)

> BTW, e1000 (7.3.20-k2) from vanilla kernel 2.6.22.1 works with 82563EB without a
> single failure.

Oh, I just spotted that NAPI is not enabled in that build of e1000, but we
need this functionality.  Perhaps there is something in the NAPI code of older
e1000 modules that locks up the NIC?  This is just speculation, since I haven't
investigated it.

Comment 5 Andy Gospodarek 2007-12-19 15:31:19 UTC

Interesting that you say your device (0x8086,0x1096) isn't supported by e1000e,
since my latest test kernels show it as an included device.

# modinfo e1000e | grep 1096
alias:          pci:v00008086d00001096sv*sd*bc*sc*i*

Is that not the pci-id of the card you are using?
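
The alias line above is a glob pattern that is matched against the device's modalias string.  A minimal Python sketch of that match follows, using fnmatch.  The helper names are hypothetical, and the subsystem IDs and class bytes in the usage example are made up for illustration, not taken from the reporter's hardware.

```python
from fnmatch import fnmatchcase

# Alias pattern copied from the modinfo output quoted in this comment.
E1000E_ALIAS = "pci:v00008086d00001096sv*sd*bc*sc*i*"

def pci_modalias(vendor, device, subvendor, subdevice,
                 base_class, sub_class, prog_if):
    """Build a PCI modalias string in the same layout as the alias above."""
    return (f"pci:v{vendor:08X}d{device:08X}"
            f"sv{subvendor:08X}sd{subdevice:08X}"
            f"bc{base_class:02X}sc{sub_class:02X}i{prog_if:02X}")

def alias_matches(alias_pattern, modalias):
    """True if the module's alias glob matches the device modalias."""
    return fnmatchcase(modalias, alias_pattern)
```

A device with vendor 0x8086 and device 0x1096 matches regardless of its subsystem IDs, e.g. `alias_matches(E1000E_ALIAS, pci_modalias(0x8086, 0x1096, 0x15D9, 0, 0x02, 0, 0))`, whereas any other device ID does not.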

Comment 6 Andy Gospodarek 2008-01-08 20:44:58 UTC

My test kernels moved those PCI ids to e1000e so they should work much better. 
Install them from here:

http://people.redhat.com/agospoda/#rhel5

and then switch modprobe.conf to use e1000e instead of e1000 for those devices
and you should notice significantly better (at least more stable) performance.
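
For illustration only, the modprobe.conf change described above can be sketched in Python as a rewrite of the alias lines.  This assumes the file binds interfaces with lines of the form "alias eth0 e1000"; the function name and the sample contents are hypothetical.  Note that it switches every interface aliased to e1000, which is wrong on a machine that also has NICs that must stay on e1000.

```python
def switch_to_e1000e(conf_text):
    """Rewrite 'alias <iface> e1000' lines to use e1000e; leave others alone."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Only touch three-field alias lines whose module is exactly e1000.
        if len(fields) == 3 and fields[0] == "alias" and fields[2] == "e1000":
            fields[2] = "e1000e"
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "alias eth0 e1000\nalias scsi_hostadapter ata_piix\n"
    print(switch_to_e1000e(sample), end="")
```

The sketch works on a string rather than on /etc/modprobe.conf directly, so the result can be reviewed before anything is written back.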

Comment 7 Trevor Cordes 2008-01-25 01:26:53 UTC

Could this be causing a problem I'm seeing?  I upgraded a server from FC5
(worked 100% ok) to F8.  Instantly our samba file sharing started giving strange
TCP errors and dropping connections.  Keeping F8 but using the F6/F5 samba
version seems to help a bit, but not much.  The box has onboard Intel LAN:

04:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet
Controller (Copper) (rev 03)
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Copyright (c) 1999-2006 Intel Corporation.
ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:04:00.0 to 64
e1000: 0000:04:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:13:20:d3:5b:18
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

Looks like the onboard LAN is PCIe in disguise.  It's using the e1000 driver,
and I cannot seem to force it to e1000e.  I'm using the latest stock F8 kernel;
modinfo shows this card's ID is assigned to e1000.

Could e1000e possibly help with my issue?  I've exhausted all other ideas at
this point.  If these cards should use e1000e, why aren't the latest F8 modules
set up that way?

I am heading out in the next few days to try a (ah, always reliable) 8139 NIC if
I can.  Thanks!

Comment 8 Andy Gospodarek 2008-01-25 04:30:46 UTC

Trevor, you may want to check this out if the kernels you are using don't
allow use of e1000e for 82573 hardware.  The e1000e driver that supports
this hardware has a workaround for the power-saving issue, but the firmware fix
described here supposedly works with the older drivers.

http://e1000.sourceforge.net/doku.php?id=known_issues#v_l_e_tx_unit_hang_messages

Comment 9 Trevor Cordes 2008-02-16 14:30:37 UTC

Update to comment #7, please ignore completely.  The strange problem was not the
e1000, it was a flaky 48p Gb switch!  It was randomly corrupting/dropping
packets.  I expected more of a $1k switch.

I did try the power-saving fix, which can't hurt.  I'll see if F8's newer
kernels support e1000e, otherwise I'm sure it will be in F9.  Thanks!

Comment 10 Andy Gospodarek 2008-12-23 20:38:20 UTC

GalaxyMaster, I'm guessing this is no longer a problem since I haven't heard from you in over a year.  Please reopen this bug if the problem still persists.  Thanks!
