Bug 496127
| Summary: | [RHEL5.5] e1000e devices fail to initialize interrupts properly |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Reporter: | Marco Schirrmeister <marco> |
| Component: | kernel |
| Assignee: | Dean Nelson <dnelson> |
| Status: | CLOSED ERRATA |
| QA Contact: | Network QE <network-qe> |
| Severity: | medium |
| Priority: | low |
| Version: | 5.3 |
| CC: | agospoda, cward, f_a_f12001, gilboad, jdelvare, jpirko, jwest, kzhang, lzheng, madko, mschmidt, tao, uwe.knop |
| Target Milestone: | rc |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Cloned to: | 627926 (view as bug list) |
| Last Closed: | 2011-01-13 20:47:28 UTC |
| Bug Blocks: | 502912, 533192, 600363 |
Description by Marco Schirrmeister, 2009-04-16 19:17:34 UTC
Hi Marco. Can you please try this on the latest upstream kernel downloaded from http://www.kernel.org/ ? It would be helpful to see if the issue appears there too. Thanks.

Hi Jiri, I installed kernel 2.6.29 and the channel comes up. The e1000e version is 0.3.3.3-k6 (so only k4 changed to k6). The bonding version is 3.5.0.

[root@hostname ~]# uname -r
2.6.29-ms1
[root@hostname ~]# modinfo bonding | head -n 4
filename:       /lib/modules/2.6.29-ms1/kernel/drivers/net/bonding/bonding.ko
author:         Thomas Davis, tadavis and many others
description:    Ethernet Channel Bonding Driver, v3.5.0
version:        3.5.0
[root@hostname ~]# modinfo e1000e | head -n 2
filename:       /lib/modules/2.6.29-ms1/kernel/drivers/net/e1000e/e1000e.ko
version:        0.3.3.3-k6
[root@hostname ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 3
        Number of ports: 2
        Actor Key: 17
        Partner Key: 8
        Partner Mac Address: 00:11:5d:15:95:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 3

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d
Aggregator ID: 3

With the official Red Hat kernel I also tried a newer e1000e driver from Intel, version 0.5.18.3. It was also not working with this newer driver. So the problem is maybe the bonding driver? At least in combination with some network cards, because other cards work fine. Thanks, Marco

Marco, thanks for the fast feedback. Can you please test whether this is bonding-mode dependent, i.e. whether this issue also appears, for example, in mode 1? Thanks.

Jiri, I think it is mode-independent. I tested it again with mode 1.
If I set it to mode 1 (active/backup), the MII Status is also always "down" for the physical interfaces. It looks to me like, with the official kernel, the link status of the NIC cannot be properly determined. I know that mii-tool is not meant for the newer network cards, but it at least shows different behavior between the official RHEL5 kernel and the latest kernel; ethtool always shows the correct information. With the official kernel, "mii-tool eth0" always shows "no link". With the latest kernel it shows "link ok" once the bond interface was brought up. With the official kernel I also noticed the following messages in /var/log/messages when I bring up the bond interface:

Apr 23 14:29:14 hostname kernel: eth0: MSI interrupt test failed, using legacy interrupt.
Apr 23 14:29:14 hostname kernel: eth1: MSI interrupt test failed, using legacy interrupt.

Here are some results now.

----------- official kernel ------------------
This was with mode 1. The NICs definitely have a link.

[root@hostname ~]# uname -r
2.6.18-128.el5PAE
[root@hostname test]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d

[root@hostname test]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link
[root@hostname test]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: 10 Mbit, half duplex, no link
-----------------------------------------------------

-------------- latest kernel ------------------
[root@hostname ~]# uname -r
2.6.29-ms1
[root@hostname ~]# ifup bond0
[root@hostname ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding
Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 17
        Partner Key: 8
        Partner Mac Address: 00:11:5d:15:95:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1c
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:4b:2e:1d
Aggregator ID: 1

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: negotiated 100baseTx-FD, link ok
[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok
[root@hostname ~]# ifdown bond0
[root@hostname ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
bond bond0 has no active aggregator

[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: negotiated 100baseTx-FD, link ok
[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok
-----------------------------------------------

I also tried the latest kernels that I found here: http://people.redhat.com/dzickus/el5/140.el5/i686/ But same issue. Do you have some other, newer test Red Hat kernels that I could try? Marco

Hi Marco. Ok, this is actually what I thought. Can you please run mii-tool ethX when the NICs are not enslaved to the bonding interface, to be sure this issue has nothing to do with bonding? Thanks a lot.
With the official RHEL kernel it always shows this, no matter how many times you bring the bond interface up or down; it is always "no link".

[root@hostname test]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link

With the latest kernel it shows "no link" after startup of the server. After bringing up bond0 it says "link ok", and it then stays "link ok" permanently, no matter how often I bring bond0 down and up.

[root@hostname ~]# mii-tool eth1
SIOCGMIIREG on eth1 failed: Input/output error
eth1: negotiated 100baseTx-FD, link ok

Btw, the line "SIOCGMIIREG on eth1 failed: Input/output error" does not occur with an older kernel, for example with the -92 release. Marco.

Actually, I wanted you to test the behaviour without bonding involved: only a plain eth device and mii-tool. I expect the same results; I just wanted to be sure and to rule out a bonding driver bug. Thanks

Sorry, then I misunderstood you, Jiri.
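A side note on why mii-tool keeps disagreeing with ethtool here: mii-tool reads raw MII PHY registers via SIOCGMIIREG, which many newer NICs (including these e1000e parts) do not support, while the kernel's carrier state reflects what the driver actually reports. A minimal sketch of reading that state from sysfs; the sysfs root is parameterized so the demo can run against a simulated tree, and the interface name is just a placeholder:

```shell
# Report link state from the kernel's carrier attribute instead of raw
# MII registers.  Real use: link_state eth0 (sysfs root defaults to /sys).
link_state() {
    local iface=$1 sysfs_root=${2:-/sys}
    # carrier reads "1" when the kernel sees link; "0" or a read error
    # (e.g. interface administratively down) otherwise
    if [ "$(cat "$sysfs_root/class/net/$iface/carrier" 2>/dev/null)" = "1" ]; then
        echo "$iface: link ok"
    else
        echo "$iface: no link"
    fi
}

# Demo against a simulated sysfs tree
root=$(mktemp -d)
mkdir -p "$root/class/net/eth0"
echo 1 > "$root/class/net/eth0/carrier"
link_state eth0 "$root"   # prints "eth0: link ok"
```

`ethtool eth0 | grep 'Link detected'` reports the same driver-level state, which is why it stays correct where mii-tool does not.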
I reconfigured eth0 to a standalone card with an IP and below are the results.
[root@hostname ~]# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:15:17:4B:2E:1C
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:ea120000-ea140000
[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link
[root@hostname ~]# ifup eth0
[root@hostname ~]# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:15:17:4B:2E:1C
inet addr:10.10.10.10 Bcast:10.10.10.255 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:ea120000-ea140000
[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link
[root@hostname ~]# ifdown eth0
[root@hostname ~]# mii-tool eth0
SIOCGMIIREG on eth0 failed: Input/output error
eth0: 10 Mbit, half duplex, no link
Ok Marco, thanks for the confirmation. This is solely an e1000e issue. I'll look at this.

Jiri, do you think it's solely an e1000e issue? I also tried the e1000e driver directly from Intel, version 0.5.18.3. This newer version has the same behavior; the channel does not come up at all. Could it also be something else, given that it works in the latest vanilla kernel with the same e1000e version (0.3.3.3)? Marco

One more update. Yesterday, when I configured a standalone device, I only looked at the link status with mii-tool or ethtool; I had not tried whether I have real network connectivity. I tested this today. With the Red Hat kernel, and either the e1000e module that comes with the kernel or the latest Intel version, I have no network connectivity. The link is there, but I cannot ping my default gateway, and the switch does not see my MAC address either. Marco

Jiri, I found bug https://bugzilla.redhat.com/show_bug.cgi?id=477774. I think that's the same problem I noticed on my machine. I also have an IBM x3850 (M1). After I added pci=nomsi to my kernel line, the NICs with the e1000e driver were working fine, and my bonding device also worked fine. So I guess my bug report is a duplicate of 477774? Marco

Hi Marco. It looks like a similar issue. I suggest we wait for the solution, and if it works for you, we would mark this as a duplicate. Thanks

Jiri, I still have the box showing the issue available for testing for about 2 weeks; after that it needs to go into production. So if I can test something, let me know. I am not sure how fast Red Hat will have a solution. Marco

It sounds like this issue might be quite similar to bug 477774, since you are using the same system. I would encourage you to try the latest test kernels located here: http://people.redhat.com/dzickus/el5/ It looks like the last kernel you tried was 2.6.18-140, but there was a change in 2.6.18-141 that may resolve this issue.
This may be a different issue, but it certainly seems like it could be a duplicate of bug 492270, so some testing would be helpful.

I installed kernel-2.6.18-144.el5.x86_64 and started the system without the pci=nomsi option. The Intel NICs with the e1000e driver still do not work. Once I bring up eth0, for example, I still get the following message:

May 12 12:15:52 hostname: eth0: MSI interrupt test failed, using legacy interrupt.

It works fine with the kernel option pci=nomsi.

Good to know that -144 doesn't work (sorry it doesn't!).
Though you get the message indicating that the system will switch to legacy interrupts for the 82571EB, did you check in /proc/interrupts to make sure it actually did? It would be helpful if you could paste the contents of /proc/interrupts.
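To make that check concrete: on kernels of this era each line of /proc/interrupts names the interrupt type, roughly "PCI-MSI"/"PCI-MSI-edge" when MSI is in use and "IO-APIC-level" for a legacy interrupt. A small sketch that classifies an interface from that output; the excerpt below is invented for illustration, not output from the affected machine:

```shell
# Classify an interface's interrupt as MSI or legacy from
# /proc/interrupts-style input.  Real use: irq_type eth0 < /proc/interrupts
irq_type() {
    awk -v ifc="$1" '$NF == ifc {
        if ($0 ~ /PCI-MSI/) print ifc ": MSI"
        else                print ifc ": legacy"
    }'
}

# Invented sample: one NIC on MSI, one fallen back to legacy
sample=' 66:     123456   PCI-MSI-edge      eth0
 74:       7890   IO-APIC-level     eth1'
printf '%s\n' "$sample" | irq_type eth0   # prints "eth0: MSI"
printf '%s\n' "$sample" | irq_type eth1   # prints "eth1: legacy"
```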
Looking at the driver, there seems to be a case where the MSI test will fail but the driver will still try to use MSI. This patch seems like one way to resolve that (if my presumption is correct).
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1440,7 +1440,7 @@ void e1000e_set_interrupt_capability(struct e1000_adapter *adapter)
}
/* Fall through */
case E1000E_INT_MODE_LEGACY:
- /* Don't do anything; this is the system default */
+ adapter->flags &= ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX;
break;
}
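For clarity, the changed line clears two flag bits in one expression: since `~A & ~B` equals `~(A|B)`, `flags & ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX` turns off exactly those two bits and leaves everything else alone. A small sketch with made-up bit values (the real driver's FLAG_* constants differ):

```shell
# Sketch of the bit manipulation in the patch above: x & ~A & ~B clears
# exactly bits A and B.  The flag values here are hypothetical.
FLAG_MSI_ENABLED=$(( 1 << 0 ))   # hypothetical bit position
FLAG_HAS_MSIX=$(( 1 << 1 ))      # hypothetical bit position
OTHER_FLAG=$(( 1 << 4 ))         # an unrelated flag that must survive

flags=$(( FLAG_MSI_ENABLED | FLAG_HAS_MSIX | OTHER_FLAG ))
flags=$(( flags & ~FLAG_MSI_ENABLED & ~FLAG_HAS_MSIX ))
printf '0x%02x\n' "$flags"   # prints 0x10: MSI/MSI-X bits cleared, other bit kept
```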
I am attaching 3 files: one with pci=nomsi set, and two without the kernel option (one before eth0 and eth1 are up, and one after eth0/1 are up). Created attachment 343608 [details]
pci=nomsi is set
Created attachment 343610 [details]
no pci=nomsi kernel option, network is not started
Created attachment 343611 [details]
no pci=nomsi kernel option, network is started
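Since the attachments above are distinguished by whether pci=nomsi was in effect, a quick check of the running kernel's boot options avoids mixing up boots. A minimal sketch; the helper takes the command line as an argument so it can be exercised against arbitrary strings (real use would pass "$(cat /proc/cmdline)"):

```shell
# Report whether pci=nomsi was passed on a given kernel command line.
check_nomsi() {
    case " $1 " in
        *" pci=nomsi "*) echo "MSI globally disabled (pci=nomsi)" ;;
        *)               echo "pci=nomsi not set" ;;
    esac
}

check_nomsi "ro root=/dev/sda1 pci=nomsi quiet"   # prints the disabled message
```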
I did some more snooping around and it looks like the driver code as it exists right now is fine; MSI should be getting properly disabled in the driver. IIRC there was a PCI quirk added to address this sort of thing, which probably helps explain why the problem does not exist with 2.6.29 but does with RHEL5's native driver and with Intel's latest driver from SourceForge. I'll see what I can dig up.

Dear guys,
I have the same error on a Fedora 9 HP workstation with the same NIC driver. This host shows strange networking behavior: every month or so it stops sending or receiving on eth0, although the whole system keeps working normally, and networking comes back after I reboot the machine. "dmesg | grep eth0" gave me these messages:
SIOCGMIIREG on eth0 failed: Input/output error
0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:0f:fe:4d:35:0e
0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
0000:00:19.0: eth0: MAC: 5, PHY: 6, PBA No: 1002ff-0ff
ADDRCONF(NETDEV_UP): eth0: link is not ready
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
0000:00:19.0: eth0: 10/100 speed: disabling TSO
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
0000:00:19.0: eth0: Link is Down
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
0000:00:19.0: eth0: 10/100 speed: disabling TSO
0000:00:19.0: eth0: Link is Down
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
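As an aside for tracking flapping like the above: counting the up/down transitions in a saved dmesg capture makes it easier to correlate link events with the outage window. A small sketch; the log file name is a placeholder, and the demo just reuses a few of the messages quoted above:

```shell
# Count link up/down transitions in a saved log.
# Usage: count_flaps dmesg.txt
count_flaps() {
    awk '/Link is Up/   { up++ }
         /Link is Down/ { down++ }
         END { printf "up=%d down=%d\n", up+0, down+0 }' "$1"
}

# Demo with a few of the messages quoted above saved to a temp file
log=$(mktemp)
cat > "$log" <<'EOF'
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
0000:00:19.0: eth0: Link is Down
0000:00:19.0: eth0: Link is Up 100 Mbps Full Duplex, Flow Control: None
EOF
count_flaps "$log"   # prints "up=2 down=1"
```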
Marco, I think there is a good chance you are hitting a problem that Dean Nelson just fixed with the following patch: http://patchwork.ozlabs.org/patch/56224/ He discovered that some systems that failed to initialize MSI would set registers incorrectly when trying to enable legacy interrupts. I will ask Dean to take a look at this, and hopefully he can share the bugzilla # for the RHEL5 bug that plans to include the patch linked above.

f_a_f12001, that seems like an odd problem. You should check the logs in your switch at the same time to make sure the switch sees the link going down too. If it does, this doesn't appear to be the system losing contact with the NIC hardware, but a serious problem with the device or the switch. Unfortunately this bug is meant to address some RHEL5 issues, not Fedora 9 bugs, so I cannot really help you much, as F9 is not maintained anymore. If you still have this problem when running F13 (or a kernel for F13), you can open a new bug for it.

(In reply to comment #27)
> Marco, I think there is a good chance you are hitting a problem that Dean
> Nelson just fixed with the following patch:
>
> http://patchwork.ozlabs.org/patch/56224/
>
> He discovered that some systems that failed to initialize MSI would set
> registers incorrectly when trying to enable legacy interrupts.
>
> I will ask Dean to take a look at this and hopefully he can share the bugzilla
> # for the RHEL5 bug that plans to include the patch linked above.

Bug 477774 is the RHEL5 bug I'm working on, and from comment #12 I see Marco is already familiar with it. I'd agree that both BZs look to be dealing with the same problem. I'll put together a RHEL5.6 system with the patch that fixes the problem, and then, Marco, you can prove whether they are the same. Dean

As promised in comment #29, I've updated my test kernel rpms to include a patch that in theory fixes the problem reported in this BZ.
The patch and rpms can be found under the RHEL5 Test Packages at: http://people.redhat.com/dnelson/#rhel5 Please test, and if you do, please report back whether the problem has been resolved or not. Thanks, Dean

Since my servers are already in production, I can't easily test it, but I will talk with my application owners to see if we can schedule some downtime or a maintenance window. If that's possible, I will try that kernel out and let you know if it's working. Thanks, Marco

Marco, based on feedback in bug 477774, I feel pretty confident that the patch from Dean will fix your problem.

*** Bug 477774 has been marked as a duplicate of this bug. ***

Andy, thanks, I saw that comment too. I still have no final answer from my application owners, but most likely they don't want to test it on the production box. Once the patch is in a new updated RHEL5 kernel, I will schedule an update. Thanks again, Dean and Andy.

Fixed in kernel-2.6.18-206.el5. You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html