477774 – e1000e devices fail to initialize on an IBM x3850 (M1) machines.

Summary: e1000e devices fail to initialize on an IBM x3850 (M1) machines.

Keywords:
Status:	CLOSED DUPLICATE of bug 496127
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Dean Nelson
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-23 16:32 UTC by Gilboa Davara
Modified:	2010-06-30 17:38 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-06-30 17:38:13 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
lspci -vv output (17.91 KB, text/plain) 2008-12-23 16:50 UTC, Gilboa Davara

Description Gilboa Davara 2008-12-23 16:32:06 UTC

Description of problem:
We are currently deploying a large number of IBM x3850 servers (w/ 4 x Xeon MP 7120).
Each machine is equipped with one or two Intel GbE NICs. (Either two dual port 82572EI/Fiber or a single quad port 82571EB/Copper)
No matter which type of NIC we connect to the machine, or in which combination (slot, copper/fiber, etc.), we get multiple "ADDRCONF(NETDEV_UP): ethX: link is not ready" messages and the NICs refuse to work, even though ethtool reports the link status as Link up/1000/Full duplex.
Ping doesn't work, including broadcast, and neither does our own in-kernel raw packet generator (dev_queue_xmit).
To test a newer kernel we installed Fedora 10, and the network devices worked out of the box.
We didn't attempt to use the latest e1000e drivers from Intel.com (0.8.6.1); we assume that these drivers should work more or less out of the box.

While we are not using the latest IBM firmware (mostly because the machines are not ours to modify) - by looking at the changelog we didn't see anything concerning RHEL5.2 support and/or PCI-E bug-fixes.

Version-Release number of selected component (if applicable):
$ uname -a
Linux IP-Probe-Hadar 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Kernel \r on an \m

How reproducible:
Always. (On a number of machines)

Additional info:
$ dmesg -c
...
$ ifconfig eth0 down
$ ifconfig eth1 down
$ rmmod e1000e
$ ifup eth0
$ ifup eth1
$ sleep 30s
$ dmesg -c
ACPI: PCI interrupt for device 0000:1a:00.0 disabled
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
PCI: Enabling device 0000:0b:00.0 (0040 -> 0043)
ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 52 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:0b:00.0 to 64
0000:0a:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:7c:e0:2a
0000:0a:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth0: MAC: 1, PHY: 1, PBA No: d76489-001
PCI: Enabling device 0000:1a:00.0 (0040 -> 0043)
ACPI: PCI Interrupt 0000:1a:00.0[A] -> GSI 55 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:1a:00.0 to 64
0000:19:00.0: eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:23:51:7c
0000:19:00.0: eth1: Intel(R) PRO/1000 Network Connection
0000:19:00.0: eth1: MAC: 1, PHY: 1, PBA No: d76489-001
ADDRCONF(NETDEV_UP): eth0: link is not ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
$

Comment 1 Gilboa Davara 2008-12-23 16:33:12 UTC

$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:15:17:7C:E0:2A
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea620000-ea640000

$ ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:15:17:23:51:7C
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea720000-ea740000

Comment 2 Gilboa Davara 2008-12-23 16:48:47 UTC

$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:   99967994          0          0          0          0          0          0          0  local-APIC-edge  timer
  1:       2384          0          0          0        738          0          0          0    IO-APIC-edge  i8042
  8:          1          0          0          0          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0          0          0          0          0    IO-APIC-edge  i8042
 14:     883627          0          0          0       7038          0          0          0    IO-APIC-edge  ide0
 50:          3          0          0          0          0          0          0          0   IO-APIC-level  eth3
 66:          0          0          0          0          0          0          0          0         PCI-MSI  eth0
 74:          0          0          0          0          0          0          0          0         PCI-MSI  eth1
201:      20095          0          0          0       2514          0          0          0   IO-APIC-level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
209:     108957          0      91641          0       2056          0          0          0   IO-APIC-level  aacraid
233:         29          0          0     762632          0          0          0       1584   IO-APIC-level  eth2
NMI:       1885       1819       1331       1270        963        902       1351       1290
LOC:   99954204   99954131   99954062   99953994   99953911   99953839   99953772   99953709
ERR:          0
MIS:          0

$ ethtool eth0
Settings for eth0:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes

$ ethtool eth1
Settings for eth1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes
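
In the /proc/interrupts output above, the PCI-MSI lines for eth0 (IRQ 66) and eth1 (IRQ 74) show zero interrupts on every CPU, while eth2 and eth3 on their IO-APIC lines have nonzero counts. The following is a minimal Python sketch, not part of the original report, of how such silent NIC lines could be picked out automatically. The function name silent_irqs and the trimmed two-CPU SAMPLE are illustrative assumptions.

```python
# Sketch: find NIC interrupt lines in /proc/interrupts whose per-CPU
# counters are all zero. SAMPLE is a trimmed, two-CPU illustration of
# the output quoted in this comment, not the full eight-CPU table.
SAMPLE = """\
           CPU0       CPU1
 50:          3          0   IO-APIC-level  eth3
 66:          0          0         PCI-MSI  eth0
 74:          0          0         PCI-MSI  eth1
233:         29     762632   IO-APIC-level  eth2
NMI:       1885       1819
ERR:          0
"""


def silent_irqs(text, prefix="eth"):
    """Return {device: irq} for devices on lines with all-zero counts."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    silent = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpu]
        # Skip rows such as ERR/MIS that do not carry one count per CPU.
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        if any(int(c) != 0 for c in counts):
            continue
        # After the counts comes the interrupt type, then the device
        # names sharing the line (comma-separated when shared).
        devices = " ".join(fields[ncpu + 1:]).replace(",", " ").split()
        for dev in devices:
            if dev.startswith(prefix):
                silent[dev] = label.strip()
    return silent


if __name__ == "__main__":
    print(silent_irqs(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/interrupts after some traffic has been attempted; an interface that reports link up but stays at zero here is not receiving any interrupts.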

Comment 3 Gilboa Davara 2008-12-23 16:50:21 UTC

Created attachment 327757
lspci -vv output

Comment 4 Gilboa Davara 2009-01-05 16:43:44 UTC

Older quad-port PCI-X/Copper and dual port PCI-X/Fiber NICs (both using e1000) work out of the box.

- Gilboa

Comment 5 Andy Gospodarek 2009-01-05 17:36:59 UTC

You are probably hitting the problem that this upstream patch fixes:

commit f8d59f7826aa73c5e7682fbed6db38020635d466
Author: Bruce Allan <bruce.w.allan>
Date:   Fri Aug 8 18:36:11 2008 -0700

    e1000e: test for unusable MSI support
    
    Some systems do not like 82571/2 use of 16-bit MSI messages and some
    other systems claim to support MSI, but neither really works.  Setup a
    test MSI handler to detect whether or not MSI is working properly, and
    if not, fallback to legacy INTx interrupts.
    
    Signed-off-by: Bruce Allan <bruce.w.allan>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher>
    Signed-off-by: Jeff Garzik <jgarzik>

This patch has been added to the e1000e driver for our RHEL5.3 update that will be shipping quite soon.

You can confirm the fix by trying kernels from here:

http://people.redhat.com/dzickus/el5/128.el5/

Please reopen this bug if testing of the kernels mentioned above does not yield positive results.  Thanks!

*** This bug has been marked as a duplicate of bug 436045 ***

Comment 6 Gilboa Davara 2009-01-07 19:44:41 UTC

Thanks for the prompt response.
We'll test the pre-5.3 kernel and see how it goes.

- Gilboa

Comment 7 Gilboa Davara 2009-01-11 13:16:08 UTC

No go.
Seeing the same (ethx: link is not ready) problem after the kernel upgrade.

- Gilboa

Comment 8 Andy Gospodarek 2009-01-12 16:07:16 UTC

That's interesting.  I'm a bit surprised that didn't work around it.  Can you send me the output from the dmesg command that shows this doesn't work?  I'd like all of it, so I can see the complete initialization of the hardware.

Comment 9 Andy Gospodarek 2009-01-12 16:17:03 UTC

I would also be curious if your machine still booted and if the NIC in question works when adding pci=nomsi to the kernel command line.

Comment 10 Gilboa Davara 2009-01-19 17:35:21 UTC

OK. The latest kernel seems to give more information.
As expected, MSI is acting up.

$ ifup eth0 && ifup eth1
$ dmesg
e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k4
e1000e: Copyright (c) 1999-2008 Intel Corporation.
PCI: Enabling device 0000:0b:00.0 (0440 -> 0443)
ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 52 (level, low) -> IRQ 74
PCI: Setting latency timer of device 0000:0b:00.0 to 64
eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:7c:e0:2a
eth0: Intel(R) PRO/1000 Network Connection
eth0: MAC: 1, PHY: 1, PBA No: d76489-001
PCI: Enabling device 0000:1a:00.0 (0440 -> 0443)
ACPI: PCI Interrupt 0000:1a:00.0[A] -> GSI 55 (level, low) -> IRQ 98
PCI: Setting latency timer of device 0000:1a:00.0 to 64
eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:23:51:7c
eth1: Intel(R) PRO/1000 Network Connection
eth1: MAC: 1, PHY: 1, PBA No: d76489-001
eth0: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth0: link is not ready
eth1: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth1: link is not ready
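
The dmesg output above shows the pattern that matters for this bug: the driver reports "MSI interrupt test failed, using legacy interrupt." and the interface then stays at "link is not ready". Below is a minimal Python sketch, assuming only the message formats quoted in this bug (including the "link becomes ready" message that appears in a later comment), that lists interfaces which fell back to legacy interrupts and never became ready. The function name stuck_after_fallback is hypothetical.

```python
import re

# Message formats as they appear in the dmesg excerpts in this bug.
FALLBACK = re.compile(r"(\w+): MSI interrupt test failed, using legacy interrupt")
NOT_READY = re.compile(r"ADDRCONF\(NETDEV_UP\): (\w+): link is not ready")
READY = re.compile(r"ADDRCONF\(NETDEV_CHANGE\): (\w+): link becomes ready")

# Lines copied from the dmesg output in this comment.
SAMPLE = """\
eth0: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth0: link is not ready
eth1: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth1: link is not ready
"""


def stuck_after_fallback(log):
    """Return interfaces that fell back to INTx and never became ready."""
    fell_back, not_ready, ready = set(), set(), set()
    buckets = ((FALLBACK, fell_back), (NOT_READY, not_ready), (READY, ready))
    for line in log.splitlines():
        for pattern, bucket in buckets:
            match = pattern.search(line)
            if match:
                bucket.add(match.group(1))
    return sorted((fell_back & not_ready) - ready)


if __name__ == "__main__":
    print(stuck_after_fallback(SAMPLE))
```

An interface that logs the fallback but later logs "link becomes ready" is not reported, which separates a working fallback from the failure described here.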

Comment 11 Andy Gospodarek 2009-01-19 18:49:35 UTC

Gilboa, this is definitely an interesting problem.  There are a few changes between the current version of e1000e (from F10) and what is in the e1000e driver in RHEL5.3.  Nothing immediately stands out as the fix for this.  Can I ask why you chose this particular NIC to be added to the system rather than using the on-board bnx2-based interfaces?

Comment 12 Gilboa Davara 2009-01-20 08:53:17 UTC

Andy,

The NICs are being used to passively monitor GbE links using dev_add_pack in promisc mode.
Intel cards yield far better results than non-Intel cards when monitoring multiple GbE ports.

In general, the ideal solution would have been to use the PCI-X version of this card instead, but for some weird reason IBM only certifies the PCI-E card... (Go figure)

- Gilboa

Comment 13 Gilboa Davara 2009-01-20 10:28:34 UTC

Please ignore my previous comments.
I had a typo in the kernel command line (pci=nnomsi :()

Device seems to work just fine with MSI disabled.
We'll conduct some low-level testing and report back.

Thanks!

- Gilboa
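
The pci=nnomsi typo above is easy to miss, because a misspelled option does not stop the machine from booting. As a minimal sketch (Python, not part of the original thread), the kernel command line can be checked for an exact nomsi value in the pci= option. The function name check_pci_nomsi, the returned strings and the example command lines in the usage below are illustrative assumptions.

```python
def check_pci_nomsi(cmdline):
    """Classify the pci= options found on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts a comma-separated list of options.
            values.extend(token[len("pci="):].split(","))
    if "nomsi" in values:
        return "nomsi set"
    # Anything that mentions msi but is not exactly "nomsi" is suspect.
    suspicious = [v for v in values if "msi" in v]
    if suspicious:
        return "possible typo: " + ",".join(suspicious)
    return "nomsi not set"


if __name__ == "__main__":
    # Hypothetical command lines; only the pci= tokens come from this bug.
    for line in ("ro root=/dev/sda1 pci=nnomsi", "ro root=/dev/sda1 pci=nomsi"):
        print(line, "->", check_pci_nomsi(line))
```

On a running system the input would be the contents of /proc/cmdline, which shows what the kernel was actually booted with.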

Comment 14 Andy Gospodarek 2009-01-20 13:15:40 UTC

Glad to hear the system is working with pci=nomsi on the command line.  At least we know something works. :-)

I'm a bit surprised that my test kernels that disable msi on the NIC don't work.  All cases I've run into with the 8257x and friends that don't work with MSI are resolved by the patch that tests and then switches to INTx.

Could you send me the output from lspci -t?

Comment 15 Gilboa Davara 2009-01-20 15:57:48 UTC

 lspci -t
-+-[0000:19]---00.0-[0000:1a-1d]----00.0
 +-[0000:14]---00.0-[0000:15-18]--
 +-[0000:0f]---00.0-[0000:10-13]--
 +-[0000:0a]---00.0-[0000:0b-0e]----00.0
 +-[0000:06]-+-00.0
 |           \-01.0-[0000:07]--+-04.0
 |                             +-04.1
 |                             +-06.0
 |                             \-06.1
 +-[0000:02]---00.0
 +-[0000:01]-+-00.0
 |           +-01.0
 |           +-01.1
 |           \-02.0
 \-[0000:00]-+-00.0
             +-01.0
             +-03.0
             +-03.1
             +-03.2
             +-0f.0
             +-0f.1
             \-0f.3

Comment 16 Andy Gospodarek 2009-03-23 21:24:27 UTC

This should be fixed with the latest kernel available from RHN.  Can you verify that the latest kernels on RHN or my test kernels at http://people.redhat.com/agospoda#rhel5 are fully functional?  Thanks!

Comment 17 Gilboa Davara 2009-03-30 17:59:53 UTC

Should we enable msi?

- Gilboa

Comment 18 Andy Gospodarek 2009-03-30 18:37:39 UTC

Yes, I would try it with msi enabled.  The changes that were made in the e1000e driver should now allow the card to work correctly on a system without pci=nomsi on the command-line.

Comment 19 Gilboa Davara 2009-04-05 14:23:30 UTC

No go.
Once we remove pci=nomsi, the e1000e devices no longer work.

- Gilboa

Comment 20 Andy Gospodarek 2009-04-06 13:03:34 UTC

That is not at all what I expected.  I will do some more looking and post a patch or comments.

Comment 21 Marco Schirrmeister 2009-05-02 15:21:16 UTC

I can confirm this too.

I also have an x3850 and ran into the same problem. e1000e was not working at all.
The link was there, but I was not able to talk to other devices on the local lan.

I also tried your latest test kernel http://people.redhat.com/agospoda/rhel5/kernel-2.6.18-140.el5.gtest.70.x86_64.rpm and it is still not working without the pci=nomsi option.

With that option it is working fine.

Andy, you mentioned in one of your posts the onboard Broadcom card and the bnx2 module.
The onboard card in the x3850 is a NetXtreme, so the tg3 module is used.

The reason we added an additional card is that the application owner wanted an EtherChannel with more than 2 Gbit.

Anyway, the server where I had the problems will still be running in test mode for 2 weeks. If you want, I can still test something. After those 2 weeks, though, I need to put that box into production.


Marco

Comment 23 Dana Rubin 2010-02-11 18:07:10 UTC

Hitting the same problem in kernel-2.6.18-164.11.1

Comment 24 Jean Delvare 2010-06-07 15:59:27 UTC

As I am investigating a related issue with the e1000e driver, I have noticed the following: on a system where MSI interrupts don't work, and the e1000e driver attempts to fall back to legacy interrupts, the driver won't work. However, if I force legacy interrupts myself by passing the e1000e driver the IntMode=0,0,0,0 option _after a reboot_, then it works fine.

So I am under the impression that the e1000e driver's code that falls back to legacy interrupts doesn't really work and leaves the card in an incorrect state, while legacy interrupt support itself does work.

I am curious if others would be able to reproduce this.
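
For anyone trying to reproduce this, the IntMode=0,0,0,0 setting mentioned above can be written as a module options line. The following Python sketch only builds that line; it assumes one IntMode value per port (as in the four-value form quoted above) and that 0 selects legacy interrupts, as described in this comment. The function name intmode_options is hypothetical, and where the line belongs (for example /etc/modprobe.conf or a file under /etc/modprobe.d/) depends on the distribution and is not confirmed by this thread.

```python
def intmode_options(num_ports, mode=0):
    """Build an 'options' line forcing one interrupt mode on every port.

    mode=0 is the legacy-interrupt setting used in this comment; other
    values are outside what this thread confirms.
    """
    if num_ports < 1:
        raise ValueError("num_ports must be at least 1")
    values = ",".join([str(mode)] * num_ports)
    return "options e1000e IntMode=" + values


if __name__ == "__main__":
    # Four ports, matching the IntMode=0,0,0,0 form quoted above.
    print(intmode_options(4))
```

Note that, per the comment above, the setting was only observed to help when applied after a reboot, not when the driver had already attempted its own fallback.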

Comment 25 Dean Nelson 2010-06-07 18:40:31 UTC

(In reply to comment #24)
> So I am under the impression that the e1000e driver's code that falls back to
> legacy interrupts doesn't really work and leaves the card in an incorrect
> state, while legacy interrupt support itself does work.
> 
> I am curious if others would be able to reproduce this.    

I was able to reproduce the problem reported by this BZ (and I believe mentioned by you in your comment) on an internal system (nec-em17), for which 'lspci -tv' produced:

> -[0000:00]-+-00.0  Intel Corporation 5000V Chipset Memory Controller Hub
>            +-02.0-[0000:01-08]--+-00.0-[0000:02-07]--+-00.0-[0000:03]--
>            |                    |                    \-02.0-[0000:07]--+-00.0  Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper)
>            |                    |                                      \-00.1  Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper)
>            |                    \-00.3-[0000:08]--
>            +-03.0-[0000:0c-0e]--+-00.0-[0000:0d]----0e.0  Promise Technology, Inc. 80333 [SuperTrak EX8350/EX16350], 80331 [SuperTrak EX8300/EX16300]
>            |                    \-00.2-[0000:0e]--
>            +-10.0  Intel Corporation 5000 Series Chipset FSB Registers
>            +-10.1  Intel Corporation 5000 Series Chipset FSB Registers
>            +-10.2  Intel Corporation 5000 Series Chipset FSB Registers
>            +-11.0  Intel Corporation 5000 Series Chipset Reserved Registers
>            +-13.0  Intel Corporation 5000 Series Chipset Reserved Registers
>            +-15.0  Intel Corporation 5000 Series Chipset FBD Registers
>            +-16.0  Intel Corporation 5000 Series Chipset FBD Registers
>            +-1c.0-[0000:12]--
>            +-1c.1-[0000:18]----00.0  Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1)
>            +-1d.0  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1
>            +-1d.1  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2
>            +-1d.2  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3
>            +-1d.3  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4
>            +-1d.7  Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller
>            +-1e.0-[0000:19]--
>            +-1f.0  Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller
>            +-1f.1  Intel Corporation 631xESB/632xESB IDE Controller
>            \-1f.3  Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller

For debugging purposes, I added two printk()s to e1000e_set_interrupt_capability(). One marked 'A' was added just before the switch statement and the other marked 'B' was added just before the last return statement. I also added a debug function to msi_capability_init() that dumped out the devices as it followed bus->parent. The bus number appears after "msi_capability_init:". The following is a small portion pulled from dmesg:

> e1000e_set_interrupt_capability:A: int_mode=0x2
>
> msi_capability_init:7: pci_dev=0xffff88012b08d000 subord=0x(null)           hdr_type=0x0 vendor=0x8086 device=0x1096 msi_enabled=1
> msi_capability_init:7: pci_bus=0xffff88012bbe1c00 parent=0xffff88012bbe1400 primary=2 secondary=7 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:2: pci_dev=0xffff88012b08b000 subord=0xffff88012bbe1c00 hdr_type=0x1 vendor=0x8086 device=0x3518 msi_enabled=1
> msi_capability_init:2: pci_bus=0xffff88012bbe1400 parent=0xffff88012bbe1000 primary=1 secondary=2 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:1: pci_dev=0xffff88012b088000 subord=0xffff88012bbe1400 hdr_type=0x1 vendor=0x8086 device=0x3500 msi_enabled=0
> msi_capability_init:1: pci_bus=0xffff88012bbe1000 parent=0xffff88012bbe0c00 primary=0 secondary=1 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:0: pci_dev=0xffff88012b009000 subord=0xffff88012bbe1000 hdr_type=0x1 vendor=0x8086 device=0x25f7 msi_enabled=1
> msi_capability_init:0: pci_bus=0xffff88012bbe0c00 parent=0x(null)           primary=0 secondary=0 PCI_BUS_FLAGS_NO_MSI=0x0
>
> e1000e_set_interrupt_capability:B: int_mode=0x1

When the e1000_test_msi_interrupt()/e1000_test_msi() tries to switch back to legacy interrupts, it only deals with the e1000e device and none of the bridges. So the only device in the above list that gets msi_enabled flag set to 0 is the e1000e. (Note that the bridge on bus 1 had msi_enabled set to 0 all along.)
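
To make the walk up bus->parent easier to read, the debug lines above can be reduced to one (bus, device ID, msi_enabled) entry per hop. This is a minimal Python sketch that assumes the ad hoc printk format shown in this comment (it is not a standard kernel log format); the function name msi_chain is hypothetical.

```python
import re

# Matches only the pci_dev lines of the debug output quoted above.
HOP = re.compile(
    r"msi_capability_init:(\d+): pci_dev=\S+.*?"
    r"device=(0x[0-9a-fA-F]+) msi_enabled=([01])"
)

# pci_dev lines copied from this comment, plus one pci_bus line to show
# that bus lines are ignored.
SAMPLE = """\
msi_capability_init:7: pci_dev=0xffff88012b08d000 subord=0x(null)           hdr_type=0x0 vendor=0x8086 device=0x1096 msi_enabled=1
msi_capability_init:7: pci_bus=0xffff88012bbe1c00 parent=0xffff88012bbe1400 primary=2 secondary=7 PCI_BUS_FLAGS_NO_MSI=0x0
msi_capability_init:2: pci_dev=0xffff88012b08b000 subord=0xffff88012bbe1c00 hdr_type=0x1 vendor=0x8086 device=0x3518 msi_enabled=1
msi_capability_init:1: pci_dev=0xffff88012b088000 subord=0xffff88012bbe1400 hdr_type=0x1 vendor=0x8086 device=0x3500 msi_enabled=0
msi_capability_init:0: pci_dev=0xffff88012b009000 subord=0xffff88012bbe1000 hdr_type=0x1 vendor=0x8086 device=0x25f7 msi_enabled=1
"""


def msi_chain(log):
    """Return [(bus, device_id, msi_enabled)] in the order the hops appear."""
    chain = []
    for line in log.splitlines():
        match = HOP.search(line)
        if match:
            bus, device, flag = match.groups()
            chain.append((int(bus), device, flag == "1"))
    return chain


if __name__ == "__main__":
    for bus, device, enabled in msi_chain(SAMPLE):
        print("bus %d device %s msi_enabled=%s" % (bus, device, enabled))
```

Run on the excerpt, the result starts at the e1000e device on bus 7 and ends at the root on bus 0, with the bridge on bus 1 being the only hop whose flag is already 0.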

I've verified that this is a problem on RHEL5 and RHEL6, but I haven't verified whether it fails on RHEL4.9, though from just looking at the code it should. I've also not verified whether upstream has this issue. But the code is pretty much the same as RHEL6, so I suspect upstream hasn't addressed this problem either.

I'm in the process of looking for a solution, but haven't been actively pursuing one for the last few months. This area of the kernel/hardware is new to me, so there are a number of areas I need to become familiar with, like how interrupts are propagated through bridges, and the impact of switching a bridge from MSI to legacy on any of the other devices that may be on its bus.

Comment 27 Jean Delvare 2010-06-08 08:25:20 UTC

Dean, I'm not sure if you're on the right track. Even in the case where legacy interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot), the bridge settings don't change, MSI support is still enabled on the bridge. I don't think MSI settings need to be the same on the e1000e device and the bridge.

But then again I am no expert in the area.

Comment 28 Dean Nelson 2010-06-08 18:32:04 UTC

(In reply to comment #27)
> Dean, I'm not sure if you're on the right track. Even in the case where legacy
> interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot),
> the bridge settings don't change, MSI support is still enabled on the bridge. I
> don't think MSI settings need to be the same on the e1000e device and the
> bridge.
> 
> But then again I am no expert in the area.    

Jean, at the moment, I'm inclined to agree with...

Back in mid-February I'd run on two systems provisioned with RHEL6, nec-em17 (mentioned above in comment #25) and hp-z200-02. The latter system looked as follows (notice that it lacks the multiple bridges of nec-em17):

> [root@hp-z200-02 ~]# lspci -tv
> -[0000:00]-+-00.0  Intel Corporation Core Processor DRAM Controller
>            +-02.0  Intel Corporation Core Processor Integrated Graphics Controller
>            +-04.0  Intel Corporation Core Processor Thermal Management Controller
>            +-19.0  Intel Corporation 82578DM Gigabit Network Connection
>            +-1a.0  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.1  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.2  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.7  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
>            +-1b.0  Intel Corporation 5 Series/3400 Series Chipset High Definition Audio
>            +-1c.0-[18]--
>            +-1c.4-[24]--
>            +-1d.0  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.1  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.2  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.3  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.7  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
>            +-1e.0-[10]--
>            +-1f.0  Intel Corporation 5 Series/3400 Series Chipset LPC Interface Controller
>            \-1f.2  Intel Corporation 82801 SATA RAID Controller
> [root@hp-z200-02 ~]#

It should be noted that in my analysis, I had forced e1000_test_msi_interrupt() to fail by ensuring that FLAG_MSI_TEST_FAILED remained set in adapter->flags. This caused e1000_test_msi() to switch from MSI to legacy interrupts.

The following is from the dmesg output of one of my runs on hp-z200-02:

> eth0: MSI interrupt test failed!
> eth0: MSI interrupt test failed, using legacy interrupt.
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> eth0: no IPv6 routers present

Because I could ssh into this system after it had booted, and the ssh session seemed quite normal and responsive, and because things weren't so good for nec-em17, I was led to consider that something with the bridges was at issue.

But this morning I tried doing the same experiment on hp-z200-02, and discovered I couldn't even ping the system.
From the serial console I saw that the output of dmesg had:

> 0000:00:19.0: eth0: MSI interrupt test failed!
> 0000:00:19.0: eth0: MSI interrupt test failed, using legacy interrupt.
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> ADDRCONF(NETDEV_UP): eth0: link is not ready

And ifconfig yielded:

> [root@hp-z200-02 test]# ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:1A:4B:0C:29:05
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>           Memory:e0400000-e0420000
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:7 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:576 (576.0 b)  TX bytes:576 (576.0 b)
> [root@hp-z200-02 test]#

This morning I also tried a few other systems, all with the same result. So, something has definitely changed from mid-February to now. And like you, I no longer think I was on the right track. I need to do some more investigation.

Did I mention that all of this is new to me? :)

Comment 29 Dean Nelson 2010-06-08 18:37:33 UTC

(In reply to comment #28)
> (In reply to comment #27)
> > Dean, I'm not sure if you're on the right track. Even in the case where legacy
> > interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot),
> > the bridge settings don't change, MSI support is still enabled on the bridge. I
> > don't think MSI settings need to be the same on the e1000e device and the
> > bridge.
> > 
> > But then again I am no expert in the area.    
> 
> Jean, at the moment, I'm inclined to agree with...

BTW, the "inclined to agree with..." didn't come out quite right. It was supposed to be "inclined to agree with you..." and was not meant in reference to your last sentence, but rather to your first sentence. Sorry for any confusion or insult this may have caused.

Comment 30 Jean Delvare 2010-06-09 09:16:29 UTC

No problem, I didn't feel offended at all :D

I too hacked e1000_test_msi_interrupt() to simulate an MSI interrupt failure and force the fallback to legacy interrupts on my laptop. Note that the network failure doesn't necessarily trigger immediately for me. The only immediate sign that something is wrong is a ping time much higher and less stable than usual (in the 1-16 ms range, sawtooth style, when it is normally in the 0-2 ms range). The actual network failure can happen after a few minutes, or after an hour.
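
Since the early symptom described above is a ping time that drifts out of its normal range well before the network actually fails, it can be watched for. This is a rough Python sketch, not a diagnostic from the thread: it treats 2 ms as the top of the normal range (the figure given in this comment for one laptop), and the 20% tolerance, the function names and the sample ping lines are all assumptions.

```python
import re

# Matches the "time=1.23 ms" field printed by common ping implementations.
RTT = re.compile(r"time[=<]([\d.]+) ?ms")


def rtts_ms(ping_output):
    """Extract round-trip times in milliseconds from ping output."""
    return [float(m.group(1)) for m in RTT.finditer(ping_output)]


def looks_degraded(rtts, normal_max_ms=2.0, tolerance=0.2):
    """True if more than `tolerance` of the samples exceed normal_max_ms."""
    if not rtts:
        return False
    over = sum(1 for r in rtts if r > normal_max_ms)
    return over / len(rtts) > tolerance


# Hypothetical ping lines illustrating the sawtooth pattern described above.
SAMPLE = """\
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.2 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=4.8 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=9.5 ms
64 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=16.0 ms
64 bytes from 192.168.1.2: icmp_seq=5 ttl=64 time=1.1 ms
"""

if __name__ == "__main__":
    print(looks_degraded(rtts_ms(SAMPLE)))
```

The thresholds would need adjusting for any link whose normal latency differs from the 0-2 ms range quoted above.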

Comment 31 Dean Nelson 2010-06-30 04:06:27 UTC

After some investigation, I believe I've found the problem reported in this BZ, at least it's the one I've been seeing running RHEL5.6, RHEL6.0 and net-next-2.6 on a number of different systems. A description of the problem can be seen in the following patch I've posted to the netdev mailing list.

http://patchwork.ozlabs.org/patch/56224/

I've also updated my test kernel rpms to include this patch, and they can be found under RHEL5 Test Packages at:

http://people.redhat.com/dnelson/#rhel5

If you've been seeing this problem, please install and test one of the rpms. And if you do, please report back whether the problem has been resolved or not.

Thanks,
Dean

Comment 32 Jean Delvare 2010-06-30 06:46:13 UTC

Wow, excellent job, Dean. I've been looking at this code for hours and couldn't see the bug. Of course, now that you explained it, it is pretty clear...

As I had 2 machines I was able to reproduce this bug on, I'll try your patch immediately and report the results.

Comment 33 Jean Delvare 2010-06-30 12:30:57 UTC

Patch tested successfully on both machines I had that were exhibiting the problem. Great job, thanks again!

Comment 35 Dean Nelson 2010-06-30 14:59:17 UTC

Both bug 496127 and this bug are reporting the same problem. Since that one has ACKs, blocker bugs and Issue-Tracker attachments, and this bug has none of those things, I'll mark this bug a duplicate of bug 496127.

*** This bug has been marked as a duplicate of bug 496127 ***

Comment 38 Dean Nelson 2010-06-30 17:38:13 UTC

Try again to close this bug as a duplicate of bug 496127. Seems there was a timing issue with the adding of comment #36 which collided with the adding of comment #35 and all that each of those included.

*** This bug has been marked as a duplicate of bug 496127 ***
