The card is an Intel 82572EI. It sends out DHCPDISCOVER; the server sees the DISCOVER and sends DHCPOFFER, but the client doesn't see the DHCPOFFER. Configuring an IP manually doesn't work either (it can't ping other machines, and other machines cannot ping it). It's connected to a gigabit Ethernet switch and negotiates a 1000/FD link. As a test, I installed the latest driver from Intel's site, and this worked (albeit with the message: e1000: eth1: e1000_test_msi: MSI interrupt test failed, using legacy interrupt.). I've attached 'dmesg' output.
Created attachment 195811 [details] dmesg output
Created attachment 195831 [details] lspci -v output
Intel driver which was successful was 7.6.5
Have you tried any of these kernels? http://people.redhat.com/agospoda/#rhel5
Just tried them there now - slightly different result from before. The server doesn't even see the DHCPDISCOVER request now. dmesg output doesn't show anything odd/different from first attached output. ethtool shows a correctly negotiated, functioning link.
(In reply to comment #5)
> dmesg output doesn't show anything odd/different from first attached output.
> ethtool shows a correctly negotiated, functioning link.

Same problem here. I have four of these NICs, Intel Corp. 82571EB Gigabit Ethernet Controller (rev 06), in an IBM x3850 machine. The e1000 module in 2.6.18-8.1.15.el5 initializes the card without any errors, the device is up, and everything looks OK, but the NICs only send data and never receive. Intel's module 7.6.9.1-NAPI works. It complains about MSI and switches to legacy interrupts as Sean reported, but it works (I compiled with -DDISABLE_PCI_MSI to get rid of these messages).

On my laptop (IBM T60) I had problems with e1000 too. The card is an Intel Corp. 82573L Gigabit Ethernet Controller. Using Red Hat's module I noticed delays (up to 10s) while opening an ssh connection, with no error messages and no increased error counters. After switching to Intel's module (7.6.9.1-NAPI with MSI) the delays are gone.
There have been quite a few changes to the e1000 driver in RHEL5 since 2.6.18-8.1.15.el5. Please try some of my test kernels located here: http://people.redhat.com/agospoda/#rhel5 I make an effort to keep these close to upstream drivers (upstream as in located on kernel.org not intel's sourceforge site). Thanks!
Just tested there again (kernel-2.6.18-52.el5.gtest.25.i686.rpm), no change.
(In reply to comment #8)
> Just tested there again (kernel-2.6.18-52.el5.gtest.25.i686.rpm), no change.

Same with kernel-2.6.18-52.el5.gtest.25.x86_64, which I have tested.
Have either of you tried this with the e1000e driver? There is some overlap and both of these drivers claim support for 82572EI though e1000e will probably be the permanent home.
Sorry for the delay getting back to you. The e1000e driver doesn't seem to support my card, so was unable to test it.
Yeah, sorry about that. I was grepping through the source and thought I saw that e1000e had support for this hardware as well, but realized this morning that the code I saw is actually commented out. I'll see if I can get my hands on some 82572 hardware so I can start to diagnose these problems.
i don't have any e1000e module in 2.6.18-8.1.15.el5. where is it?
It's in the test kernels that Andy posted previously (http://people.redhat.com/agospoda/#rhel5)
The e1000e driver won't work with this hardware anyway, so don't worry about it. We'll have to get this working with e1000.
Sean O Sullivan: your system has some compatibility issue with MSI interrupts. This is a depressingly common issue on HyperTransport-based systems. A BIOS upgrade may fix it, but mostly I think there has been lots of kernel work in the latest kernels to recognize these broken systems and disable MSI. Do you have any devices that *are* working with MSI (check cat /proc/interrupts)? The best thing we can do in e1000 to help you is patch in the MSI test.

Andreas Piesk: the IBM x3850 is known to be incompatible with MSI interrupts generated by 82571/2. The same patch to disable MSI interrupts on non-working MSI systems would be the solution for you.

The 82573L had quite a few issues, most of them EEPROM related, or related to ASPM being enabled but not working (specifically this problem on the T60). We have a patch to fix the driver to disable ASPM. I can post that here if Andy would like to think about merging that back into the in-kernel e1000.
Thanks Jesse, I suspected MSI problems on this particular machine (because the Intel module reports falling back to legacy interrupts due to MSI problems) and tried the kernel parameter 'pci=nomsi', but it didn't make any difference.
Jesse, Thanks a lot for the insight into the problem. I checked /proc/interrupts, and MSI wasn't mentioned there.
(In reply to comment #16)
> We have a patch to fix the driver to disable ASPM. I can post that here if Andy
> would like to think about merging that back to in-kernel e1000.

Jesse, I'd be happy to check it out and see about getting it included upstream. Many thanks for the assistance on this one too!
Andy, I posted the L1 ASPM disable patch upstream last week (for inclusion in e1000e).
Yeah, I saw those for e1000e -- I'll have to decide how we want to handle it since those devices are moving to e1000e upstream and I'm not sure how we want to handle it in RHEL.
The same error is seen with Fedora 8...

"ifup eth1" produces the following dmesg:
    APIC error on CPU0: 04(08)
    ADDRCONF(NETDEV_UP): eth1: link is not ready

ethtool eth1:
    Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: no

mii-tool eth1:
    eth1: negotiated 100baseTx-FD flow-control, link ok

lspci:
    03:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

cat /proc/interrupts:
               CPU0
       0:        140   XT-PIC-XT      timer
       1:      17460   XT-PIC-XT      i8042
       2:          0   XT-PIC-XT      cascade
       4:     483289   XT-PIC-XT      ehci_hcd:usb1, uhci_hcd:usb4
       5:    5349969   XT-PIC-XT      uhci_hcd:usb3, uhci_hcd:usb5, sata_via, cx88[0], cx88[0], cx88[0]
       6:          6   XT-PIC-XT      floppy
       7:    1218566   XT-PIC-XT      parport0
       8:          0   XT-PIC-XT      rtc
       9:          0   XT-PIC-XT      acpi
      10:    1234603   XT-PIC-XT      nvidia
      11:    1682464   XT-PIC-XT      uhci_hcd:usb2, CS46XX, eth0
      14:     189004   XT-PIC-XT      libata
      15:          0   XT-PIC-XT      libata
    2301:          0   PCI-MSI-edge   eth1
     NMI:          0
     LOC:   31985838
     ERR:       6094

uname -a:
    Linux lizard 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:36 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
Is this a problem with my latest test kernels? Support for that hardware should have moved to e1000e and it should contain the needed patches. http://people.redhat.com/agospoda/#rhel5
Still no luck - though behavior has changed. According to ethtool there is a link (1000/FD), and dmesg shows no errors. Attempting to get IP via DHCP, the client (machine in question) sends out DHCPDISCOVER's, however the server does not see them (previously server did see them & sent DHCPOFFER's, which the client didn't see/receive). It automatically uses the e1000e driver.
(In reply to comment #24)
> Still no luck - though behavior has changed.
>
> According to ethtool there is a link (1000/FD), and dmesg shows no errors.
> Attempting to get IP via DHCP, the client (machine in question) sends out
> DHCPDISCOVER's, however the server does not see them (previously server did see
> them & sent DHCPOFFER's, which the client didn't see/receive).
>
> It automatically uses the e1000e driver.

So if the system is sending out DHCPDISCOVERs but the server doesn't respond to them, I'm not sure what I can do. Can you capture them and try to figure out why they are not getting answered? Do they have invalid checksums or something? What about using a static IP? Does that work for passing traffic?
Sorry - to clarify, the server never sees the DHCPDISCOVERs (used tethereal to verify), it's not that it ignores them. I also tried static IP, no luck with that either.
That is interesting. I have heard from a few people that e1000e fixed problems with 82572s, so this one is puzzling. Is this an add-on card or an on-board card? I can try to dig up an 82572EI card, and there is a chance we have the system you have if it's on-board, so I'd like to test it out. Also, have you tried it with a different switch by any chance? That might tell me whether this is a PHY issue, so I can consider looking at the sourceforge driver for PHY fixes that may not be included in the current upstream version.
It's an add-on PCIe card. I haven't tried it with any other switches. I can, but not until the weekend.
Ok, thanks for the update. I'll see if I can track down one of these cards for testing.
(In reply to comment #16)
> Sean O Sullivan:
> your system has some compatibility issue with MSI interrupts. This is a
> depressingly common issue on Hypertransport based systems. A bios upgrade may
> fix it, but mostly I think there has been lots of kernel work to recognize these
> broken systems and disable MSI, in the latest kernels. Do you have any devices
> that *are* working with MSI (check cat /proc/interrupts) Best thing we can do in
> e1000 to help you is patch in the MSI test.
>

Sean,

Let's not forget Jesse's comment from above. Out of curiosity, can you try booting with pci=nomsi and see if that makes a difference? If there are some MSI problems on your system we will need to know about them and try to work around them in the e1000e driver. This is probably better than trying another switch.

Thanks!
I forgot to mention, that you should paste the contents of /proc/interrupts before and after adding pci=nomsi to the kernel command line.
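For comparing the two /proc/interrupts captures requested above, a small parser can report which interrupt controller each device is attached to. This is only a sketch: it assumes the column layout shown earlier in this report (IRQ number, one count per CPU, controller type, device list), which differs on newer kernels, and the names `interrupt_modes` and `uses_msi` are made up for illustration. The sample rows are taken from the Fedora 8 output posted earlier.

```python
# Sample rows copied from the /proc/interrupts output posted in this report.
SAMPLE = """\
           CPU0
  8:          0    XT-PIC-XT        rtc
 11:    1682464    XT-PIC-XT        uhci_hcd:usb2, CS46XX, eth0
2301:          0   PCI-MSI-edge     eth1
NMI:          0
LOC:   31985838
ERR:       6094
"""


def interrupt_modes(proc_interrupts_text):
    """Map each device name to the controller type listed on its IRQ row.

    Assumes the 2.6-era layout: "<irq>: <count per CPU> <controller> <devices>".
    """
    lines = proc_interrupts_text.strip().splitlines()
    # The header row has one column per CPU (CPU0, CPU1, ...).
    cpu_count = len(lines[0].split())
    modes = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # summary rows such as NMI, LOC, ERR
        fields = rest.split()
        if len(fields) <= cpu_count + 1:
            continue  # row without a device name
        controller = fields[cpu_count]
        devices = " ".join(fields[cpu_count + 1:])
        for device in devices.split(","):
            modes[device.strip()] = controller
    return modes


def uses_msi(proc_interrupts_text, device):
    """True if the device's IRQ row names an MSI controller type."""
    return "MSI" in interrupt_modes(proc_interrupts_text).get(device, "")


if __name__ == "__main__":
    for name, controller in sorted(interrupt_modes(SAMPLE).items()):
        print(name, controller)
```

Running the same function on a capture taken with pci=nomsi and one without should show whether the NIC's row moves from an MSI type to a legacy one.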
Adding "pci=nomsi" has fixed the problem for me on F8 (kernel 2.6.23.15-137.fc8).
Thanks for the feedback, Oli! This seems to confirm Jesse's statement in comment #16. Jesse, do you have some code that is not upstream right now (the 'MSI test' you referenced) that could help out with this and disable MSI when it's not working properly, or are you just referring to adding an MSI test case to the ethtool interrupt test routine(s) to help debug this? If there are no BIOS updates that will fix this, I would like to see PCI quirks added to disable MSI on the bridge chips connected to the network hardware if they are going to be problematic.
Excellent, booting with pci=nomsi resolves the issue.
Sean, that's good news -- to me this sounds like something that needs to be resolved with possibly a bios update or some quirk to account for the fact that MSI doesn't work well with that bridge. Can you attach the lspci -vvv output to this bugzilla?
Created attachment 296633 [details] lspci -vvv output
Output attached in previous comment. The BIOS is currently up to date, and I wouldn't hold my breath for Dell to release a fix for this. Hopefully whatever workarounds Intel put in their drivers can be merged into the RHEL e1000 (or e1000e) driver.
Created attachment 296673 [details] lspci -vvv output (root) Sorry - last lspci -vvv output done as non-root user, rectified.
Created attachment 296764 [details] lspci -vvv output My lspci -vvv output (Asus A8V-VM SE).
Created attachment 296801 [details] backport msi test to RHEL5 this is a patch that was ONLY compile tested. I was not able to quickly test on RHEL5+MSI enabled e1000, but I wanted to post the patch anyway. Patch was generated on RHEL5 2.6.18-53.
Jesse, The logic of this patch makes sense to me. I'm glad someone familiar with the hardware can write a good interrupt test case. I can integrate this into my test kernels (for both e1000 and e1000e drivers) and if all works well it would be good to push this upstream. -andy
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel4 Please test them and report back your results.
Tried out your latest el5 kernel, still no luck. As before, sends out DHCPDISCOVER's, however server never receives them.
Sean, So does your system report that you are using MSI or INTx on your e1000 cards that have been problematic? Please post output from /proc/interrupts on the system running my latest test kernel if you can. I would have expected this patch: http://people.redhat.com/agospoda/rhel4/0005-e1000-msi-test-and-switch-to-intx.patch to detect that MSI was not working well and continue in INTx mode.
Created attachment 302381 [details] /proc/interrupts
Created attachment 302382 [details] dmesg output

# dmesg | grep -i msi
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
0000:00:02.0: eth0: MSI interrupt test failed, using legacy interrupt.
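The grep above can be turned into a check that names the interfaces which fell back to legacy interrupts. The following is a rough sketch rather than anything shipped with the driver: the regular expression is written against only the two message variants quoted in this report, and `msi_fallback_interfaces` is a hypothetical helper name.

```python
import re

# Matches the two message variants quoted in this report:
#   e1000: eth1: e1000_test_msi: MSI interrupt test failed, using legacy interrupt.
#   0000:00:02.0: eth0: MSI interrupt test failed, using legacy interrupt.
FALLBACK = re.compile(
    r"(?P<iface>eth\d+): (?:e1000_test_msi: )?"
    r"MSI interrupt test failed, using legacy interrupt"
)

# Lines copied from the dmesg excerpts in this report.
SAMPLE_DMESG = """\
assign_interrupt_mode Found MSI capability
e1000: eth1: e1000_test_msi: MSI interrupt test failed, using legacy interrupt.
0000:00:02.0: eth0: MSI interrupt test failed, using legacy interrupt.
"""


def msi_fallback_interfaces(dmesg_text):
    """Return the sorted interface names whose MSI self-test failed."""
    return sorted({m.group("iface") for m in FALLBACK.finditer(dmesg_text)})


if __name__ == "__main__":
    print(msi_fallback_interfaces(SAMPLE_DMESG))
```

An interface listed here should then appear on a non-MSI row in /proc/interrupts once it is up; if it does not, the fallback did not take effect.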
From the output in comment #46 it seems that MSI is not being used for the e1000e card, but I find it interesting that the output from /proc/interrupts in comment #45 does not show any IRQs for eth0 -- I'm hoping this was because you had the interface down when checking /proc/interrupts. Can you check for sure when the system is running that you are using INTx rather than MSI? It seems odd that your system works fine with pci=nomsi, but with MSI supposedly disabled in the e1000e driver it doesn't work well at all.
Created attachment 303683 [details] lspci -vvv and /proc/interrupts Sorry about delay - attached is the requested.
Thanks, Sean. It certainly seems like either one of your bridges still has problems and needs to have MSI disabled, or something is still wrong with the e1000e driver we are using. Auke and Jesse, are there more changes from the sourceforge driver that we can backport to RHEL for testing, or that were missed? I've seen Auke mention to others on netdev that problems with some of the HT chipsets (specifically the ones mentioned in this BZ) would be fixed in 2.6.25 with e1000e, but was the fix in e1000e or somewhere else in the kernel? Our driver is pretty close to upstream e1000e, so I wonder what I'm missing.
I have upgraded to F9 (2.6.25/e1000e) and the problem has gone - even after removing the "pci=nomsi" kernel argument.
Thanks for the feedback, Oli. Good to know it's working for someone! :)
Created attachment 319653 [details] 0001-pci-quirk-set-En-bit-of-MSI-mapping-for-device-on.patch

Sean,

There is a patch that first appeared in 2.6.25 that may address your issue:

commit 9dc625e72309e1c919ea3e7f51d0ffca96123787
Author: Peer Chen <pchen>
Date:   Mon Feb 4 23:50:13 2008 -0800

    PCI: quirks: set 'En' bit of MSI Mapping for devices onHT-based nvidia platform

    According to HT spec, to get message interrupt from devices mapped to HT
    interrupt message, the 'En' bit of MSI Mapping capability need to be set.

    The patch do this setting in quirks code for the devices on HT-based nvidia

If you've tested on a recent upstream (Fedora or otherwise) kernel, it might be a good indication of whether or not this patch will fix it. I've done a backport to make this work on RHEL5 and attached the patch, as it's probably worth testing. I'll also try to build some test kernels later this week with the patch included, but any testing you can do in the meantime would be helpful.

Thanks!
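To illustrate what setting the 'En' bit of the MSI Mapping capability involves, here is a hedged sketch that walks the capability list in an in-memory copy of a device's PCI configuration space. It is not the attached patch and does not touch hardware. The register offsets and masks are quoted from memory of Linux's pci_regs.h and should be treated as assumptions to verify, and the function names and sample layout are invented for the example.

```python
# Offsets and masks as recalled from Linux's include/linux/pci_regs.h
# (assumptions; verify against the kernel headers before relying on them).
PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10
PCI_CAPABILITY_LIST = 0x34
PCI_CAP_ID_HT = 0x08            # HyperTransport capability ID
HT_5BIT_CAP_MASK = 0xF8         # capability type lives in the top 5 bits
HT_CAPTYPE_MSI_MAPPING = 0xA8
HT_MSI_FLAGS = 0x02             # flags byte, relative to the capability
HT_MSI_FLAGS_ENABLE = 0x01      # the 'En' bit


def enable_ht_msi_mapping(config):
    """Set En on every HT MSI Mapping capability in a config-space copy.

    `config` is a mutable bytearray of at least 256 bytes. Returns the
    offsets of the capabilities that were updated.
    """
    touched = []
    if not config[PCI_STATUS] & PCI_STATUS_CAP_LIST:
        return touched  # device has no capability list
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    while pos >= 0x40 and pos not in seen:  # guard against looping lists
        seen.add(pos)
        is_ht = config[pos] == PCI_CAP_ID_HT
        cap_type = config[pos + 3] & HT_5BIT_CAP_MASK
        if is_ht and cap_type == HT_CAPTYPE_MSI_MAPPING:
            config[pos + HT_MSI_FLAGS] |= HT_MSI_FLAGS_ENABLE
            touched.append(pos)
        pos = config[pos + 1] & 0xFC  # next-capability pointer
    return touched


def make_sample_config():
    """Hypothetical layout: an MSI capability at 0x40, HT MSI Mapping at 0x50."""
    config = bytearray(256)
    config[PCI_STATUS] = PCI_STATUS_CAP_LIST
    config[PCI_CAPABILITY_LIST] = 0x40
    config[0x40], config[0x41] = 0x05, 0x50            # MSI cap, next -> 0x50
    config[0x50], config[0x51] = PCI_CAP_ID_HT, 0x00   # HT cap, end of list
    config[0x53] = HT_CAPTYPE_MSI_MAPPING              # En bit initially clear
    return config


if __name__ == "__main__":
    cfg = make_sample_config()
    print([hex(p) for p in enable_ht_msi_mapping(cfg)], hex(cfg[0x52]))
```

The sketch applies the change to any device it is given; which devices the real quirk applies to is defined by the commit itself and is not modelled here.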
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results.
Excellent, that seems to have done it! Tested kernel-2.6.18-120.el5.gtest.59.i686.rpm Thanks a lot.
Thanks for the quick feedback, Sean! Glad to hear it's working well for you.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Updating PM score.
in kernel-2.6.18-132.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html