Bug 496127 - [RHEL5.5] e1000e devices fail to initialize interrupts properly
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Dean Nelson
QA Contact: Network QE
Duplicates: 477774
Depends On:
Blocks: 502912 533192 600363
Reported: 2009-04-16 15:17 EDT by Marco Schirrmeister
Modified: 2011-01-13 15:47 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 627926 (view as bug list)
Environment:
Last Closed: 2011-01-13 15:47:28 EST


Attachments
pci=nomsi is set (2.80 KB, text/plain)
2009-05-12 11:54 EDT, Marco Schirrmeister
no pci=nomsi kernel option, network is not started (2.40 KB, text/plain)
2009-05-12 11:55 EDT, Marco Schirrmeister
no pci=nomsi kernel option, network is started (2.80 KB, text/plain)
2009-05-12 11:55 EDT, Marco Schirrmeister


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0017
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 05:37:42 EST

Description Marco Schirrmeister 2009-04-16 15:17:34 EDT
Description of problem:

I am trying to configure bonding in 802.3ad mode on a RHEL5 system.
The network card is an Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06).

My problem is that the channel never comes up.

I have tried two kernels: 2.6.18-92.el5PAE and 2.6.18-128.1.1.el5PAE.

You will see below that the MII Status for the physical interfaces is DOWN.
I think that is the problem.

With the latest kernel, mii-tool is not working at all and is reporting a down state.

Ethtool is working fine.


Version-Release number of selected component (if applicable):

RHEL5 U3
Kernel 2.6.18-92.el5PAE
Kernel 2.6.18-128.1.1.el5PAE


Additional info:

configuration files
-------------------------

[root@hostname ~]# cat /etc/modprobe.conf 
alias eth0 e1000e
alias eth1 e1000e
alias eth2 tg3
alias eth3 tg3
alias bond0 bonding
options bond0 max_bonds=2


[root@hostname ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Intel Corporation 82571EB Gigabit Ethernet Controller
DEVICE=eth0
HWADDR=00:15:17:4B:2E:1C
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes


[root@hostname ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0 
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
TYPE=Ethernet
IPADDR=10.10.10.10
NETMASK=255.255.255.0
NETWORK=10.10.10.0
BROADCAST=10.10.10.255
BONDING_OPTS="mode=4 miimon=100"
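
For reference, the bonding options in an ifcfg file can be read back programmatically, for example to confirm that miimon is present before blaming the driver. The following is only a minimal Python sketch; the function name is made up for illustration and it assumes the simple KEY="value" layout used in the file above.

```python
import shlex


def parse_bonding_opts(ifcfg_text):
    """Return the BONDING_OPTS key/value pairs found in ifcfg-style text."""
    for line in ifcfg_text.splitlines():
        line = line.strip()
        if line.startswith("BONDING_OPTS="):
            # shlex.split removes the shell-style quotes around the value
            tokens = shlex.split(line.split("=", 1)[1])
            opts = {}
            for token in tokens:
                for pair in token.split():
                    key, _, value = pair.partition("=")
                    opts[key] = value
            return opts
    return {}


if __name__ == "__main__":
    sample = 'DEVICE=bond0\nBONDING_OPTS="mode=4 miimon=100"\n'
    print(parse_bonding_opts(sample))
```

With the ifcfg-bond0 content from this report, the result should contain mode 4 and a miimon interval of 100.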



Results for 2.6.18-92.el5PAE
----------------------------

[root@hostname ~]# modinfo e1000e | head -n 2
filename:       /lib/modules/2.6.18-92.el5/kernel/drivers/net/e1000e/e1000e.ko
version:        0.2.0


[root@hostname ~]# mii-tool eth0
eth0: negotiated 100baseTx-FD, link ok


[root@hostname ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 1
	Actor Key: 0
	Partner Key: 1
	Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 1


/var/log/messages output when I load the bonding module

Apr 16 19:56:30 hostname kernel: Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
Apr 16 19:56:30 hostname kernel: bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
Apr 16 19:56:30 hostname kernel: bonding: bond0: setting mode to 802.3ad (4).
Apr 16 19:56:30 hostname kernel: bonding: bond0: Setting MII monitoring interval to 100.
Apr 16 19:56:30 hostname kernel: bonding: bond0: Adding slave eth0.
Apr 16 19:56:30 hostname kernel: bonding: bond0: enslaving eth0 as a backup interface with a down link.
Apr 16 19:56:38 hostname kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

-------------------------------------------------------------------------------------



Results for 2.6.18-128.1.1.el5PAE
----------------------------

[root@hostname ~]# modinfo e1000e | head -n 2
filename:       /lib/modules/2.6.18-128.1.1.el5PAE/kernel/drivers/net/e1000e/e1000e.ko
version:        0.3.3.3-k4


[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link


[root@hostname ~]# ethtool eth0 | grep Link
	Link detected: yes


[root@hostname ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 1
	Actor Key: 0
	Partner Key: 1
	Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 1
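
The key detail in the output above is that the bond itself reports "MII Status: up" while the slave reports "down". A small Python sketch that separates the two when reading /proc/net/bonding/bond0 might look like this; the function name is illustrative, and it assumes the "Key: value" layout shown in this report.

```python
def slave_mii_status(bonding_text):
    """Map each slave interface name to its own 'MII Status' value."""
    status = {}
    current = None
    for line in bonding_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # Lines seen before the first slave section describe the bond itself
            status[current] = value
    return status


if __name__ == "__main__":
    with open("/proc/net/bonding/bond0") as handle:
        print(slave_mii_status(handle.read()))
```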


/var/log/messages output when I load the bonding module

Apr 16 20:13:49 hostname kernel: Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
Apr 16 20:13:49 hostname kernel: bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
Apr 16 20:13:49 hostname kernel: bonding: bond0: setting mode to 802.3ad (4).
Apr 16 20:13:49 hostname kernel: bonding: bond0: Setting MII monitoring interval to 100.
Apr 16 20:13:49 hostname kernel: bonding: bond0: Adding slave eth0.
Apr 16 20:13:49 hostname kernel: eth0: MSI interrupt test failed, using legacy interrupt.
Apr 16 20:13:49 hostname kernel: bonding: bond0: enslaving eth0 as a backup interface with a down link.
Apr 16 20:13:57 hostname kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

-------------------------------------------------------------------------------------
Comment 1 Jiri Pirko 2009-04-23 04:31:42 EDT
Hi Marco.

Can you please try this on latest upstream kernel downloaded from http://www.kernel.org/ ? It would be helpful to see if the issue appears there too.

Thanks.
Comment 2 Marco Schirrmeister 2009-04-23 07:16:02 EDT
Hi Jiri,

I installed kernel 2.6.29 and the channel comes up.

The e1000e version is 0.3.3.3-k6 (So only k4 changed to k6)
The bonding version is 3.5.0

[root@hostname ~]# uname -r
2.6.29-ms1

[root@hostname ~]# modinfo bonding | head -n 4
filename:       /lib/modules/2.6.29-ms1/kernel/drivers/net/bonding/bonding.ko
author:         Thomas Davis, tadavis@lbl.gov and many others
description:    Ethernet Channel Bonding Driver, v3.5.0
version:        3.5.0

[root@hostname ~]# modinfo e1000e | head -n 2
filename:       /lib/modules/2.6.29-ms1/kernel/drivers/net/e1000e/e1000e.ko
version:        0.3.3.3-k6

[root@hostname ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
	Aggregator ID: 3
	Number of ports: 2
	Actor Key: 17
	Partner Key: 8
	Partner Mac Address: 00:11:5d:15:95:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 3

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d
Aggregator ID: 3

---------------------------

With the official Red Hat kernel I also tried a newer e1000e driver from Intel, version 0.5.18.3.
It did not work with this newer driver either.

So maybe the problem is the bonding driver, at least in combination with some network cards?
Other cards are working fine.


Thanks
Marco
Comment 3 Jiri Pirko 2009-04-23 07:30:11 EDT
Marco.

Thanks for the fast feedback. Can you please check whether this is bonding-mode dependent, or whether the issue also appears in, for example, mode 1?

Thanks.
Comment 4 Marco Schirrmeister 2009-04-23 09:26:37 EDT
Jiri,

I think the issue is independent of the mode. I tested it again with mode 1.
If I set it to mode 1 (active/backup), the MII Status is also always "down" for the physical interfaces.

It looks to me as if the official kernel cannot properly determine the link status of the NIC.

I know that "mii-tool" is not meant for the newer network cards, but it at least shows different behavior between the official RHEL5 kernel and the latest kernel.
ethtool always shows the correct information.

With the official kernel, "mii-tool eth0" always shows "no link".
With the latest kernel, it shows "link ok" once the bond interface has been brought up.
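
Since mii-tool and ethtool disagree here, a third data point can help. The sketch below reads the carrier flag from sysfs; it assumes the running kernel exposes /sys/class/net/<iface>/carrier, and the function names are made up for illustration. Reading the file can fail on an interface that is administratively down, which is treated as "unknown" rather than "down".

```python
import os


def interpret_carrier(raw):
    """Translate the content of a sysfs carrier file into a readable state."""
    return {"1": "up", "0": "down"}.get(raw.strip(), "unknown")


def carrier_state(iface, sysfs_root="/sys/class/net"):
    """Return 'up', 'down' or 'unknown' for the given interface."""
    path = os.path.join(sysfs_root, iface, "carrier")
    try:
        with open(path) as handle:
            return interpret_carrier(handle.read())
    except OSError:
        # Interface missing, or the kernel rejected the read
        return "unknown"


if __name__ == "__main__":
    for name in ("eth0", "eth1"):
        print(name, carrier_state(name))
```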

With the official kernel I also noticed the following messages in /var/log/messages when I bring up the bond interface:
Apr 23 14:29:14 hostname kernel: eth0: MSI interrupt test failed, using legacy interrupt.
Apr 23 14:29:14 hostname kernel: eth1: MSI interrupt test failed, using legacy interrupt.
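
To find every interface that logged this fallback, the message can be searched for in the syslog. A minimal Python sketch follows; the helper name is illustrative and the pattern is based only on the message text quoted in this report.

```python
import re

# Matches e.g. "... kernel: eth0: MSI interrupt test failed, using legacy interrupt."
MSI_FAIL = re.compile(r"(\w+): MSI interrupt test failed")


def interfaces_with_msi_fallback(log_lines):
    """Return interface names that logged a failed MSI test, in first-seen order."""
    found = []
    for line in log_lines:
        match = MSI_FAIL.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    with open("/var/log/messages") as handle:
        print(interfaces_with_msi_fallback(handle))
```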


Here are now some results.

----------- official kernel ------------------
This was with mode 1. The NICs definitely have a link.

[root@hostname ~]# uname -r
2.6.18-128.el5PAE

[root@hostname test]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d


[root@hostname test]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link

[root@hostname test]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: 10 Mbit, half duplex, no link
-----------------------------------------------------

-------------- latest kernel ------------------
[root@hostname ~]# uname -r
2.6.29-ms1


[root@hostname ~]# ifup bond0

[root@hostname ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 2
	Actor Key: 17
	Partner Key: 8
	Partner Mac Address: 00:11:5d:15:95:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d
Aggregator ID: 1

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: negotiated 100baseTx-FD, link ok

[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok

[root@hostname ~]# ifdown bond0

[root@hostname ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
bond bond0 has no active aggregator


[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: negotiated 100baseTx-FD, link ok

[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok

-----------------------------------------------

I also tried the latest kernels that I found here: http://people.redhat.com/dzickus/el5/140.el5/i686/
The issue is the same.

Do you have any other, newer Red Hat test kernels that I could try?


Marco
Comment 5 Jiri Pirko 2009-04-23 12:28:32 EDT
Hi Marco.

Ok, this is actually what I thought. Can you please do mii-tool ethX when NICs are not enslaved to bonding interface to be sure this issue has nothing to do with bonding?

Thanks a lot.
Comment 6 Marco Schirrmeister 2009-04-23 12:44:36 EDT
With the official RHEL kernel it always shows this. You can bring the bond interface up and down as many times as you want. It's always "no link".

[root@hostname test]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link


With the latest kernel it shows "no link" after startup of the server.
After bringing up bond0 it says "link ok".

It then stays "link ok" permanently. It doesn't matter how often I bring bond0 down and up.

[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok


Btw, the line "SIOCGMIIREG on eth1 failed: Input/output error" does not occur with an older kernel, for example the -92 release.
Comment 7 Jiri Pirko 2009-04-23 14:03:55 EDT
Marco.

Actually I wanted you to test the behaviour without bonding involved. Only plain eth device and mii-tool. I expect the same results, just wanted to be sure and to rule out the bonding driver bug.

Thanks
Comment 8 Marco Schirrmeister 2009-04-23 19:18:34 EDT
Sorry, then I misunderstood you Jiri.

I reconfigured eth0 to a standalone card with an IP and below are the results.


[root@hostname ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:15:17:4B:2E:1C  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea120000-ea140000 

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link

[root@hostname ~]# ifup eth0

[root@hostname ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:15:17:4B:2E:1C  
          inet addr:10.10.10.10  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea120000-ea140000 

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link

[root@hostname ~]# ifdown eth0

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link
Comment 9 Jiri Pirko 2009-04-24 00:40:58 EDT
Ok Marco, thanks for the confirmation. This is solely e1000e issue. I'll look at this.
Comment 10 Marco Schirrmeister 2009-04-24 04:44:20 EDT
Jiri,

do you think it's solely an e1000e issue? I also tried the e1000e driver directly from Intel, version 0.5.18.3.

This newer version shows the same behavior. The channel does not come up at all.

Could it also be something else, given that it works in the latest vanilla kernel with the same e1000e version (0.3.3.3)?



Marco
Comment 11 Marco Schirrmeister 2009-04-24 10:51:51 EDT
One more update.

Yesterday, when I configured a standalone device, I only checked the link status with mii-tool and ethtool. I did not check whether I had real network connectivity. I tested this today.

With the Red Hat kernel, I have no network connectivity, either with the e1000e module that ships with the kernel or with the latest Intel version.

The link is there, but I cannot ping my default gateway. The switch also does not see my MAC address.


Marco
Comment 12 Marco Schirrmeister 2009-05-02 11:31:15 EDT
Jiri,

I found bug https://bugzilla.redhat.com/show_bug.cgi?id=477774.
I think that's the same problem that I noticed on my machine. I also have an IBM x3850 (M1).

After I added pci=nomsi to my kernel line the NICs with the e1000e driver were working fine. Also my bonding device worked fine.
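
When comparing boots with and without the workaround, it is easy to lose track of which kernel command line is actually active. The Python sketch below checks a command line string for the nomsi option; the function names are illustrative. It relies on the pci= parameter accepting a comma-separated list of options.

```python
def pci_options(cmdline):
    """Collect all options passed through pci= parameters on a kernel command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(token[len("pci="):].split(","))
    return options


def msi_disabled_on_cmdline(cmdline):
    """True if the command line asks the PCI core not to use MSI."""
    return "nomsi" in pci_options(cmdline)


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        print(msi_disabled_on_cmdline(handle.read()))
```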

So I guess my bug report is a duplicate of 477774?


Marco
Comment 13 Jiri Pirko 2009-05-03 03:01:40 EDT
Hi Marco.

It looks like a similar issue. I suggest we wait for the solution, and if it works for you, we will mark this as a duplicate.

Thanks
Comment 14 Marco Schirrmeister 2009-05-03 03:19:18 EDT
Jiri,

The box that shows the issue is still available for testing for about 2 weeks.
After that it needs to go into production.

So if I can test something, let me know. I am not sure how quickly Red Hat will have a solution.


Marco
Comment 15 Andy Gospodarek 2009-05-08 16:15:22 EDT
It sounds like this issue might be quite similar to bug 477774 since you are using the same system.  I would encourage you to try the latest test kernels located here:

http://people.redhat.com/dzickus/el5/

It looks like the last kernel you tried was 2.6.18-140, but there was a change in 2.6.18-141 that may resolve this issue.  This may be a different issue, but certainly seems like it could be a duplicate of bug 492270, so some testing would be helpful.
Comment 16 Marco Schirrmeister 2009-05-12 06:38:59 EDT
I installed kernel-2.6.18-144.el5.x86_64 and started the system without the pci=nomsi option.

The Intel NICs with the e1000e driver still do not work. Once I bring up eth0, for example, I still get the following message.

------
May 12 12:15:52 hostname: eth0: MSI interrupt test failed, using legacy interrupt.
------

It's working fine with the kernel option pci=nomsi
Comment 17 Andy Gospodarek 2009-05-12 11:20:15 EDT
Good to know that -144 doesn't work (sorry it doesn't!).

Though you get the message that indicates that the system will switch to legacy interrupts for the 82571EB, did you check /proc/interrupts to make sure it did?  It would be helpful if you could paste the contents of /proc/interrupts.
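
As a rough aid for that check, the Python sketch below pulls the rows for one interface out of /proc/interrupts text and reports whether the interrupt type mentions MSI. The function names are illustrative, and the layout it expects (a CPU header row, then an IRQ label, one counter per CPU, the interrupt type, and the device names) is an assumption that may not hold on every kernel.

```python
def interrupt_lines_for(iface, interrupts_text):
    """Return (irq, description) pairs for rows that list the given interface."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    hits = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= cpu_count + 1:
            continue  # summary rows carry counters only
        irq = fields[0].rstrip(":")
        rest = fields[1 + cpu_count:]
        devices = [name.strip(",") for name in rest]
        if iface in devices:
            hits.append((irq, " ".join(rest)))
    return hits


def uses_msi(iface, interrupts_text):
    """True if any row for the interface has an interrupt type containing 'MSI'."""
    return any("MSI" in desc for _, desc in interrupt_lines_for(iface, interrupts_text))


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        print(interrupt_lines_for("eth0", handle.read()))
```

If the driver really fell back to legacy interrupts, the row for eth0 would be expected to show an IO-APIC style type rather than an MSI one.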

Looking at the driver there seems to be a case where the test will fail, but the driver will still try and use MSI.  This patch seems like one way to resolve this (if my presumption is correct).

--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1440,7 +1440,7 @@ void e1000e_set_interrupt_capability(struct e1000_adapter *adapter)
                }
                /* Fall through */
        case E1000E_INT_MODE_LEGACY:
-               /* Don't do anything; this is the system default */
+               adapter->flags &= ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX;
                break;
        }
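
The added line in the patch above clears two bits in the adapter flags and leaves the others alone. The same arithmetic is sketched in Python below purely as an illustration. The numeric flag values are hypothetical and are not the values defined by the driver; FLAG_OTHER is an invented stand-in for any unrelated flag.

```python
# Hypothetical bit positions for illustration only; the driver defines its own values.
FLAG_MSI_ENABLED = 1 << 0
FLAG_HAS_MSIX = 1 << 1
FLAG_OTHER = 1 << 2  # stands in for any unrelated flag


def clear_msi_flags(flags):
    """Mirror of 'flags &= ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX' from the patch."""
    # ~ on a Python int gives a negative number, but & still clears only the two named bits
    return flags & ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX


if __name__ == "__main__":
    before = FLAG_MSI_ENABLED | FLAG_HAS_MSIX | FLAG_OTHER
    print(bin(before), "->", bin(clear_msi_flags(before)))
```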
Comment 18 Marco Schirrmeister 2009-05-12 11:52:38 EDT
I am attaching 3 files:

1 with pci=nomsi
2 without the kernel option: one taken before eth0 and eth1 are up and one after eth0/1 are up.
Comment 19 Marco Schirrmeister 2009-05-12 11:54:00 EDT
Created attachment 343608 [details]
pci=nomsi is set
Comment 20 Marco Schirrmeister 2009-05-12 11:55:16 EDT
Created attachment 343610 [details]
no pci=nomsi kernel option, network is not started
Comment 21 Marco Schirrmeister 2009-05-12 11:55:46 EDT
Created attachment 343611 [details]
no pci=nomsi kernel option, network is started
Comment 22 Andy Gospodarek 2009-05-12 13:34:21 EDT
I did some more snooping around and it looks like the driver code as it exists right now is fine.  MSI should be getting properly disabled in the driver.

IIRC there was a pci quirk added to address this sort of thing, that probably helps to explain why the problem does not exist with 2.6.29, but does with RHEL5's native driver and with Intel's latest driver from sourceforge.  I'll see what I can dig up.
Comment 26 f_a_f12001 2010-06-14 03:30:01 EDT
Dear all,
I have the same error on an HP workstation running Fedora 9 with the same NIC driver. This host shows strange networking behavior: every month or so it stops sending and receiving on eth0, although the rest of the system keeps working normally. Networking comes back after I reboot the machine. This is the output of "dmesg | grep eth0":
SIOCGMIIREG on eth0 failed: Input/output error                        
0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:0f:fe:4d:35:0e
0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
0000:00:19.0: eth0: MAC: 5, PHY: 6, PBA No: 1002ff-0ff
ADDRCONF(NETDEV_UP): eth0: link is not ready
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
0000:00:19.0: eth0: 10/100 speed: disabling TSO
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
0000:00:19.0: eth0: Link is Down
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
0000:00:19.0: eth0: 10/100 speed: disabling TSO
0000:00:19.0: eth0: Link is Down
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
Comment 27 Andy Gospodarek 2010-06-29 16:40:01 EDT
Marco, I think there is a good chance you are hitting a problem that Dean Nelson just fixed with the following patch:

http://patchwork.ozlabs.org/patch/56224/

He discovered that some systems that failed to initialize MSI would set registers incorrectly when trying to enable legacy interrupts.

I will ask Dean to take a look at this and hopefully he can share the bugzilla # for the RHEL5 bug that plans to include the patch linked above.
Comment 28 Andy Gospodarek 2010-06-29 17:04:39 EDT
f_a_f12001@yahoo.com, that seems like an odd problem.  You should check the logs in your switch at the same time to make sure the switch sees the link going down too.  If it does, this doesn't appear to be the system losing contact with the NIC hardware, but a serious problem with the device or switch.

Unfortunately this bug is to address some RHEL5 issues, not Fedora 9 bugs.  I cannot really help you much as F9 is not maintained anymore.

If you still have this problem when running F13 (or a kernel for F13) you can open a new bug to address the problem.
Comment 29 Dean Nelson 2010-06-29 18:39:18 EDT
(In reply to comment #27)
> Marco, I think there is a good chance you are hitting a problem that Dean
> Nelson just fixed with the following patch:
> 
> http://patchwork.ozlabs.org/patch/56224/
> 
> He discovered that some systems that failed to initialize MSI would set
> registers incorrectly when trying to enable legacy interrupts.
> 
> I will ask Dean to take a look at this and hopefully he can share the bugzilla
> # for the RHEL5 bug that plans to include the patch linked above.    

Bug 477774 is the RHEL5 bug I'm working on. And from comment #12 I see Marco is already familiar with it.

And I'd agree that both BZs look to be dealing with the same problem. I'll put together a RHEL5.6 system with the patch that fixes the problem. And then, Marco, you can prove whether they are.

Dean
Comment 30 Dean Nelson 2010-06-29 23:50:22 EDT
As promised in comment #29, I've updated my test kernel rpms to include a patch that in theory fixes the problem reported in this BZ. The patch and rpms can be found under the RHEL5 Test Packages at:

http://people.redhat.com/dnelson/#rhel5

Please test, and if you do, please report back whether the problem has been resolved or not.

Thanks,
Dean
Comment 31 Marco Schirrmeister 2010-06-30 03:34:44 EDT
Since my servers are already in production I can't easily test it.
But I will talk with my application owners to see if we can schedule some downtime or a maintenance window.

If it's possible to do that, I will try that kernel out and will let you know if it's working.

Thanks
Marco
Comment 32 Andy Gospodarek 2010-06-30 09:30:02 EDT
Marco, based on feedback in bug 477774, I feel pretty confident that the patch from Dean will fix your problem.
Comment 33 Dean Nelson 2010-06-30 10:59:17 EDT
*** Bug 477774 has been marked as a duplicate of this bug. ***
Comment 34 Marco Schirrmeister 2010-06-30 11:33:04 EDT
Andy,

thanks, I saw that comment too.
Still have no final answer from my application owners. But most likely they don't want to test it on the production box.

But once the patch is in a new updated RHEL5 kernel I will schedule an update.


Thanks again Dean and Andy.
Comment 36 Dean Nelson 2010-06-30 13:38:14 EDT
*** Bug 477774 has been marked as a duplicate of this bug. ***
Comment 38 Jarod Wilson 2010-07-12 11:43:21 EDT
in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
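
To tell quickly whether a given RHEL5 kernel is older than the build named above, the build number can be compared against 206. This Python sketch is an assumption-laden shortcut: it assumes that every 2.6.18-N.el5 build with N of 206 or higher carries the patch, and it knows nothing about backports to earlier update streams. The function names are illustrative.

```python
import platform
import re


def el5_build_number(release):
    """Extract N from a release string such as '2.6.18-144.el5'; None if it does not match."""
    match = re.match(r"2\.6\.18-(\d+)(?:\.[\d.]+)?\.el5", release)
    return int(match.group(1)) if match else None


def at_or_after_fixed_build(release, first_fixed=206):
    """True if the release is a RHEL5 build numbered first_fixed or later."""
    build = el5_build_number(release)
    return build is not None and build >= first_fixed


if __name__ == "__main__":
    running = platform.release()
    print(running, at_or_after_fixed_build(running))
```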
Comment 44 errata-xmlrpc 2011-01-13 15:47:28 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
