Bug 290701 - pci: MSI/HT problems with some nvidia bridge chips
Summary: pci: MSI/HT problems with some nvidia bridge chips
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
: ---
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-09-14 12:41 UTC by Seán O Sullivan
Modified: 2014-06-29 22:59 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:38:20 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg output (21.00 KB, application/octet-stream), 2007-09-14 12:41 UTC, Seán O Sullivan
lspci -v output (6.54 KB, application/octet-stream), 2007-09-14 12:42 UTC, Seán O Sullivan
lspci -vvv output (11.60 KB, text/plain), 2008-03-03 16:36 UTC, Seán O Sullivan
lspci -vvv output (root) (20.26 KB, text/plain), 2008-03-03 20:07 UTC, Seán O Sullivan
lspci -vvv output (25.73 KB, text/plain), 2008-03-04 17:14 UTC, Oli Wade
backport msi test to RHEL5 (4.81 KB, patch), 2008-03-04 20:56 UTC, Jesse Brandeburg
/proc/interrupts (589 bytes, text/plain), 2008-04-14 18:57 UTC, Seán O Sullivan
dmesg output (20.32 KB, text/plain), 2008-04-14 18:58 UTC, Seán O Sullivan
lspci -vvv and /proc/interrupts (21.21 KB, text/plain), 2008-04-24 20:14 UTC, Seán O Sullivan
0001-pci-quirk-set-En-bit-of-MSI-mapping-for-device-on.patch (3.01 KB, patch), 2008-10-07 15:28 UTC, Andy Gospodarek


Links
System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Seán O Sullivan 2007-09-14 12:41:46 UTC
Intel's 82572EI,

It sends out DHCPDISCOVER; the server sees the DISCOVER and sends a DHCPOFFER, but
the client doesn't see the DHCPOFFER.
Configuring the IP manually doesn't work either (it can't ping other machines, and
other machines cannot ping it).

It's connected to GB ethernet switch, and makes 1000/FD connection.


As a test, I installed the latest driver from Intel's site. This worked (albeit
with the message: e1000: eth1: e1000_test_msi: MSI interrupt test failed,
using legacy interrupt.).

I've attached 'dmesg' output.

Comment 1 Seán O Sullivan 2007-09-14 12:41:46 UTC
Created attachment 195811 [details]
dmesg output

Comment 2 Seán O Sullivan 2007-09-14 12:42:35 UTC
Created attachment 195831 [details]
lspci -v output

Comment 3 Seán O Sullivan 2007-09-14 12:45:02 UTC
Intel driver which was successful was 7.6.5

Comment 4 Andy Gospodarek 2007-09-14 14:29:10 UTC
Have you tried any of these kernels?

http://people.redhat.com/agospoda/#rhel5

Comment 5 Seán O Sullivan 2007-09-14 17:22:51 UTC
Just tried them there now - slightly different result from before.
The server doesn't even see the DHCPDISCOVER request now.

dmesg output doesn't show anything odd/different from first attached output.
ethtool shows a correctly negotiated, functioning link.

Comment 6 Andreas Piesk 2007-10-25 08:30:51 UTC
(In reply to comment #5)
> 
> dmesg output doesn't show anything odd/different from first attached output.
> ethtool shows a correctly negotiated, functioning link.

same problem here. i have four of these nics

Intel Corp. 82571EB Gigabit Ethernet Controller (rev 06)

in an IBM x3850 machine. the e1000 module in 2.6.18-8.1.15.el5 initializes the
card without any errors, the device is up, everything looks ok. but the nics
only send data, never receive.

intel's module 7.6.9.1-NAPI works. it complains about MSI and switches to legacy
interrupts as Sean reported but it works (i compiled with -DDISABLE_PCI_MSI to
get rid of these messages).


on my laptop (IBM T60) i had problems with e1000 too. the card is a

Intel Corp. 82573L Gigabit Ethernet Controller

using Redhat's module i noticed delays (up to 10s) while opening a ssh
connection, no error messages, no increased error counters. after switching to
Intel's module (7.6.9.1-NAPI with MSI) the delays are gone.


Comment 7 Andy Gospodarek 2007-10-25 12:55:09 UTC
There have been quite a few changes to the e1000 driver in RHEL5 since
2.6.18-8.1.15.el5.  

Please try some of my test kernels located here:

http://people.redhat.com/agospoda/#rhel5

I make an effort to keep these close to upstream drivers (upstream as in located
on kernel.org not intel's sourceforge site).  Thanks!

Comment 8 Seán O Sullivan 2007-10-25 13:55:05 UTC
Just tested there again (kernel-2.6.18-52.el5.gtest.25.i686.rpm), no change.

Comment 9 Andreas Piesk 2007-10-25 14:07:39 UTC
(In reply to comment #8)
> Just tested there again (kernel-2.6.18-52.el5.gtest.25.i686.rpm), no change.

same with kernel-2.6.18-52.el5.gtest.25.x86_64 which i have tested.


Comment 10 Andy Gospodarek 2007-10-25 14:37:32 UTC
Have either of you tried this with the e1000e driver?  There is some overlap and
both of these drivers claim support for 82572EI though e1000e will probably be
the permanent home.

Comment 11 Seán O Sullivan 2007-10-26 22:55:43 UTC
Sorry for the delay getting back to you.
The e1000e driver doesn't seem to support my card, so was unable to test it.

Comment 12 Andy Gospodarek 2007-10-26 23:48:08 UTC
Yeah, sorry about that.  I was grepping through the source and thought I saw
that e1000e had support for this hardware as well, but realized this morning
that the code I saw is actually commented out.  I'll see if I can get my hands
on some 82572 hardware so I can start to diagnose these problems.

Comment 13 Andreas Piesk 2007-10-29 10:24:21 UTC
i don't have any e1000e module in 2.6.18-8.1.15.el5. where is it?


Comment 14 Seán O Sullivan 2007-10-29 10:28:34 UTC
It's in the test kernels that Andy posted previously (http://people.redhat.com/
agospoda/#rhel5)

Comment 15 Andy Gospodarek 2007-10-29 12:56:39 UTC
The e1000e driver won't work with this hardware anyway, so don't worry about it.  We'll have to get this working with e1000

Comment 16 Jesse Brandeburg 2007-11-11 07:08:30 UTC
Sean O Sullivan:
your system has some compatibility issue with MSI interrupts.  This is a
depressingly common issue on Hypertransport based systems.  A bios upgrade may
fix it, but mostly I think there has been lots of kernel work to recognize these
broken systems and disable MSI, in the latest kernels.  Do you have any devices
that *are* working with MSI (check cat /proc/interrupts)?  Best thing we can do in
e1000 to help you is patch in the MSI test.

Andreas Piesk:
IBM 3850 is known to be incompatible with MSI interrupts generated by 82571/2. 
The same patch to disable MSI interrupts on non-working MSI systems would be the
solution for you.
82573L had quite a few issues, most of them eeprom related, or related to ASPM
(specifically this problem on the t60) being enabled but not working.
We have a patch to fix the driver to disable ASPM.  I can post that here if Andy
would like to think about merging that back to in-kernel e1000.

Comment 17 Andreas Piesk 2007-11-12 09:07:04 UTC
thanks Jesse,

i suspected MSI problems on this particular machine (because the Intel module
reports falling back to legacy interrupts due to MSI problems) and tried kernel
parameter 'pci=nomsi' but it didn't make any change.


Comment 18 Seán O Sullivan 2007-11-12 10:11:10 UTC
Jesse,

Thanks a lot for the insight into the problem.
I checked /proc/interrupts, and MSI wasn't mentioned there.


Comment 19 Andy Gospodarek 2007-11-12 13:39:27 UTC
(In reply to comment #16)
> We have a patch to fix the driver to disable ASPM.  I can post that here if Andy
> would like to think about merging that back to in-kernel e1000.

Jesse, I'd be happy to check it out and see about getting it included upstream.
 Many thanks for the assistance on this one too!

Comment 20 Auke Kok 2007-11-12 17:00:49 UTC
Andy,

I posted the L1 ASPM disable patch upstream last week (for inclusion in e1000e). 



Comment 21 Andy Gospodarek 2007-11-12 18:57:27 UTC
Yeah, I saw those for e1000e -- I'll have to decide how we want to handle it since those devices are moving to e1000e upstream and I'm not sure how we want to handle it in RHEL.

Comment 22 Oli Wade 2007-12-31 12:42:43 UTC
The same error is seen with Fedora 8...

"ifup eth1" produces the following dmesg:
  APIC error on CPU0: 04(08)
  ADDRCONF(NETDEV_UP): eth1: link is not ready

ethtool eth1:
  Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: no

mii-tool eth1:
  eth1: negotiated 100baseTx-FD flow-control, link ok

lspci:
  03:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet
Controller (Copper) (rev 06)

cat /proc/interrupts:
           CPU0       
  0:        140    XT-PIC-XT        timer
  1:      17460    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  4:     483289    XT-PIC-XT        ehci_hcd:usb1, uhci_hcd:usb4
  5:    5349969    XT-PIC-XT        uhci_hcd:usb3, uhci_hcd:usb5, sata_via,
cx88[0], cx88[0], cx88[0]
  6:          6    XT-PIC-XT        floppy
  7:    1218566    XT-PIC-XT        parport0
  8:          0    XT-PIC-XT        rtc
  9:          0    XT-PIC-XT        acpi
 10:    1234603    XT-PIC-XT        nvidia
 11:    1682464    XT-PIC-XT        uhci_hcd:usb2, CS46XX, eth0
 14:     189004    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata
2301:          0   PCI-MSI-edge      eth1
NMI:          0 
LOC:   31985838 
ERR:       6094

uname -a:
Linux lizard 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:36 EST 2007 x86_64 x86_64
x86_64 GNU/Linux
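
The /proc/interrupts listing above shows eth1 on a PCI-MSI-edge vector with a count of zero, which is the pattern to look for when MSI delivery is suspected to be broken. Below is a minimal Python sketch of that check. The function name silent_msi_devices is made up for illustration, and the parsing assumes the column layout shown above (one count column per CPU, then the interrupt type, then the device names). A zero count is only a hint, since an interface that is down also shows no interrupts.

```python
import re


def silent_msi_devices(interrupts_text):
    """Return (irq, devices) pairs for MSI vectors whose counts are all zero.

    Hypothetical helper: assumes the /proc/interrupts layout seen in this
    report (header row of CPU names, then "IRQ: counts... type devices").
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    # The header row names one column per CPU (CPU0, CPU1, ...).
    ncpus = len(lines[0].split())
    silent = []
    for line in lines[1:]:
        match = re.match(r"\s*(\d+):\s*(.*)", line)
        if not match:
            continue  # skips NMI/LOC/ERR rows and wrapped device lists
        fields = match.group(2).split()
        counts, tail = fields[:ncpus], fields[ncpus:]
        if not tail or "MSI" not in tail[0]:
            continue  # not an MSI vector
        if all(count == "0" for count in counts):
            silent.append((int(match.group(1)), " ".join(tail[1:])))
    return silent
```

On an affected machine it could be fed the live file, for example silent_msi_devices(open("/proc/interrupts").read()).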

Comment 23 Andy Gospodarek 2008-02-26 20:47:15 UTC
Is this a problem with my latest test kernels?  Support for that hardware should
have moved to e1000e and it should contain the needed patches.

http://people.redhat.com/agospoda/#rhel5

Comment 24 Seán O Sullivan 2008-02-27 13:07:37 UTC
Still no luck - though behavior has changed.

According to ethtool there is a link (1000/FD), and dmesg shows no errors.
Attempting to get IP via DHCP, the client (machine in question) sends out
DHCPDISCOVER's, however the server does not see them (previously server did see
them & sent DHCPOFFER's, which the client didn't see/receive).

It automatically uses the e1000e driver.

Comment 25 Andy Gospodarek 2008-02-27 15:09:02 UTC
(In reply to comment #24)
> Still no luck - though behavior has changed.
> 
> According to ethtool there is a link (1000/FD), and dmesg shows no errors.
> Attempting to get IP via DHCP, the client (machine in question) sends out
> DHCPDISCOVER's, however the server does not see them (previously server did see
> them & sent DHCPOFFER's, which the client didn't see/receive).
> 
> It automatically uses the e1000e driver.

So if the system is sending out DHCPDISCOVERs but the server doesn't respond to
them, I'm not sure what I can do.  Can you capture them and try to figure out
why they are not getting answered?  Do they have invalid checksums or something?

What about using a static IP?  Does that work for passing traffic?

Comment 26 Seán O Sullivan 2008-02-27 17:21:06 UTC
Sorry - to clarify, the server never sees the DHCPDISCOVERs (used tethereal to
verify), it's not that it ignores them.

I also tried static IP, no luck with that either.

Comment 27 Andy Gospodarek 2008-02-28 15:15:40 UTC
That is interesting.  I heard a few complain that e1000e fixed problems with
82572's, so this one is puzzling.

Is this an add-on card or is this an on-board card?  I can try and dig up an
82572EI card, and there is a chance we have the system you have if it's
on-board, so I'd like to test it out.

Also, have you tried it with a different switch by any chance?  That might let
me know that this is a phy issue, so I can consider looking at the sourceforge
driver for phy fixes that may not be included in the current upstream version.

Comment 28 Seán O Sullivan 2008-02-28 15:37:31 UTC
It's an Add-on PCIe card.

I haven't tried it with any other switches, I can do, but won't be till the weekend.

Comment 29 Andy Gospodarek 2008-02-28 16:48:28 UTC
Ok, thanks for the update.  I'll see if I can track down one of these cards for
testing.

Comment 30 Andy Gospodarek 2008-02-28 20:17:14 UTC
(In reply to comment #16)
> Sean O Sullivan:
> your system has some compatibility issue with MSI interrupts.  This is a
> depressingly common issue on Hypertransport based systems.  A bios upgrade may
> fix it, but mostly I think there has been lots of kernel work to recognize these
> broken systems and disable MSI, in the latest kernels.  Do you have any devices
> that *are* working with MSI (check cat /proc/interrupts) Best thing we can do in
> e1000 to help you is patch in the MSI test.
> 

Sean, 

Let's not forget Jesse's comment from above.  Out of curiosity can you try
booting with pci=nomsi and see if that makes a difference?  If there are some
MSI problems on your system we will need to know about them and try to
workaround them in the e1000e driver.  This is probably better than trying
another switch.

Thanks!

Comment 31 Andy Gospodarek 2008-02-28 20:18:35 UTC
I forgot to mention, that you should paste the contents of /proc/interrupts
before and after adding pci=nomsi to the kernel command line.

Comment 32 Oli Wade 2008-02-29 19:05:39 UTC
Adding "pci=nomsi" has fixed the problem for me on F8 (kernel 2.6.23.15-137.fc8).

Comment 33 Andy Gospodarek 2008-02-29 19:17:35 UTC
Thanks for the feedback, Oli!  This seems to confirm Jesse's statement in comment
#16.  

Jesse, do you have some code not upstream right now (the 'MSI test' you
referenced) that could help out with this and disable MSI when it's not working
properly or are you just referencing adding an MSI test case to the ethtool
interrupt test routine(s) to help debug this?

If there are not bios updates that will fix this, I would like to see pci quirks
added to disable msi on the bridge chips connected to the network hardware if
they are going to be problematic.

Comment 34 Seán O Sullivan 2008-02-29 21:08:00 UTC
Excellent,
booting with pci=nomsi resolves the issue.

Comment 35 Andy Gospodarek 2008-03-03 16:12:45 UTC
Sean, that's good news -- to me this sounds like something that needs to be
resolved with possibly a bios update or some quirk to account for the fact that
MSI doesn't work well with that bridge.

Can you attach the lspci -vvv output to this bugzilla?

Comment 36 Seán O Sullivan 2008-03-03 16:36:23 UTC
Created attachment 296633 [details]
lspci -vvv output

Comment 37 Seán O Sullivan 2008-03-03 16:38:06 UTC
Output attached in previous comment.

BIOS is currently up-to-date, and wouldn't hold breath for Dell to release fix
for this.
Hopefully whatever workarounds Intel put in their drivers can be merged into the
RHEL e1000 (or e1000e) driver.

Comment 38 Seán O Sullivan 2008-03-03 20:07:10 UTC
Created attachment 296673 [details]
lspci -vvv output (root)

Sorry - last lspci -vvv output done as non-root user, rectified.

Comment 39 Oli Wade 2008-03-04 17:14:31 UTC
Created attachment 296764 [details]
lspci -vvv output

My lspci -vvv output (Asus A8V-VM SE).

Comment 40 Jesse Brandeburg 2008-03-04 20:56:20 UTC
Created attachment 296801 [details]
backport msi test to RHEL5

this is a patch that was ONLY compile tested.  I was not able to quickly test
on RHEL5+MSI enabled e1000, but I wanted to post the patch anyway.  Patch was
generated on RHEL5 2.6.18-53.

Comment 41 Andy Gospodarek 2008-03-04 21:32:04 UTC
Jesse,

The logic of this patch makes sense to me.  I'm glad someone familiar with the
hardware can write a good interrupt test case.

I can integrate this into my test kernels (for both e1000 and e1000e drivers)
and if all works well it would be good to push this upstream.

-andy

Comment 42 Andy Gospodarek 2008-04-14 13:33:40 UTC
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel4

Please test them and report back your results.

Comment 43 Seán O Sullivan 2008-04-14 17:41:56 UTC
Tried out your latest el5 kernel, still no luck.

As before, sends out DHCPDISCOVER's, however server never receives them.

Comment 44 Andy Gospodarek 2008-04-14 18:23:06 UTC
Sean,

So does your system report that you are using MSI or INTx on your e1000 cards
that have been problematic?

Please post output from /proc/interrupts on the system running my latest test
kernel if you can.

I would have expected this patch:

http://people.redhat.com/agospoda/rhel4/0005-e1000-msi-test-and-switch-to-intx.patch

to detect that MSI was not working well and continue in INTx mode.


Comment 45 Seán O Sullivan 2008-04-14 18:57:15 UTC
Created attachment 302381 [details]
/proc/interrupts

Comment 46 Seán O Sullivan 2008-04-14 18:58:40 UTC
Created attachment 302382 [details]
dmesg output

# dmesg | grep -i msi
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
assign_interrupt_mode Found MSI capability
0000:00:02.0: eth0: MSI interrupt test failed, using legacy interrupt.
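
The fallback reported above can be picked out of a saved dmesg log mechanically. Here is a small Python sketch, assuming the message wording seen in this report ("MSI interrupt test failed") and interface names of the form ethN. msi_fallback_interfaces is a hypothetical helper name, not part of any driver or tool.

```python
import re

# Assumed pattern: an ethN name followed later on the same line by the
# failure text quoted in this report.
FALLBACK = re.compile(r"\b(eth\d+)\b.*MSI interrupt test failed")


def msi_fallback_interfaces(dmesg_text):
    """Return interface names whose driver reported a failed MSI test."""
    found = []
    for line in dmesg_text.splitlines():
        match = FALLBACK.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```

It only reports what the driver logged; it does not tell you which interrupt mode is actually in use afterwards.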

Comment 47 Andy Gospodarek 2008-04-15 12:31:57 UTC
From the output in comment #46 it seems that MSI is not being used for the
e1000e card, but I find it interesting that the output from /proc/interrupts in
comment #45 does not show any IRQs for eth0 -- I'm hoping this was because you
had the interface down when checking /proc/interrupts.  Can you check for sure
when the system is running that you are using INTx rather than MSI?

It seems odd that your system works fine with pci=nomsi, but with MSI supposedly
disabled in the e1000e driver it doesn't work well at all.

Comment 48 Seán O Sullivan 2008-04-24 20:14:08 UTC
Created attachment 303683 [details]
lspci -vvv and /proc/interrupts

Sorry about delay - attached is the requested.

Comment 49 Andy Gospodarek 2008-04-24 22:46:22 UTC
Thanks, Sean.  It certainly seems like one of your bridges must still have
problems and needs to have MSI disabled or something is still wrong with the
e1000e driver we are using.

Auke and Jesse, are there more changes from the sourceforge driver that we can
backport to rhel for testing, or that were missed?  I've seen Auke mention to
others on netdev that problems with some of the HT chipsets (specifically the
ones mentioned in this BZ) would be fixed in 2.6.25 with e1000e, but was
the fix in e1000e or somewhere else in the kernel?

Our driver is pretty close to upstream e1000e, so I wonder what I'm missing.  

Comment 51 Oli Wade 2008-06-05 14:58:47 UTC
I have upgraded to F9 (2.6.25/e1000e) and the problem has gone - even after
removing the "pci=nomsi" kernel argument.

Comment 52 Andy Gospodarek 2008-06-09 20:04:01 UTC
Thanks for the feedback, Oli.  Good to know it's working for someone! :)

Comment 53 Andy Gospodarek 2008-10-07 15:28:39 UTC
Created attachment 319653 [details]
0001-pci-quirk-set-En-bit-of-MSI-mapping-for-device-on.patch

Sean, There is a patch that first appeared in 2.6.25 that may address your issue:

commit 9dc625e72309e1c919ea3e7f51d0ffca96123787
Author: Peer Chen <pchen>
Date:   Mon Feb 4 23:50:13 2008 -0800

    PCI: quirks: set 'En' bit of MSI Mapping for devices onHT-based nvidia platform

    According to HT spec, to get message interrupt from devices mapped to HT
    interrupt message, the 'En' bit of MSI Mapping capability need to be set.
    The patch do this setting in quirks code for the devices on HT-based nvidia

If you've tested on a recent upstream (fedora or otherwise) kernel it might be a good indication whether or not this patch will fix it.  I've done a backport to make this work on RHEL5, and attached the patch as it's probably worth testing.  I'll also try and build some test kernels later this week with the patch included, but any testing you can do in the mean-time would be helpful.  Thanks!
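
To show what "setting the En bit" means at the register level, here is a rough Python model that walks a copy of a device's PCI configuration space, finds the HyperTransport MSI Mapping capability and sets its enable bit. It is not the attached patch and not kernel code: the function names are invented for illustration, the buffer is an in-memory bytearray, and the offsets and masks are written from memory of the Linux pci_regs.h header, so they should be checked against that header and the HT spec before being relied on.

```python
# Constants believed to match Linux's pci_regs.h (assumption, verify first).
PCI_STATUS = 0x06               # low byte of the status register
PCI_STATUS_CAP_LIST = 0x10      # device has a capability list
PCI_CAPABILITY_LIST = 0x34      # offset of the first capability pointer
PCI_CAP_ID_HT = 0x08            # HyperTransport capability ID
HT_5BIT_CAP_MASK = 0xF8         # capability type lives in the top 5 bits
HT_CAPTYPE_MSI_MAPPING = 0xA8   # MSI Mapping capability type
HT_MSI_FLAGS = 0x02             # flags byte, relative to the capability
HT_MSI_FLAGS_ENABLE = 0x01      # the 'En' bit


def find_ht_msi_mapping(cfg):
    """Return the offset of the HT MSI Mapping capability, or None."""
    if not cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping capability chain
    while pos >= 0x40 and pos + 3 < len(cfg) and pos not in seen:
        seen.add(pos)
        is_ht = cfg[pos] == PCI_CAP_ID_HT
        if is_ht and (cfg[pos + 3] & HT_5BIT_CAP_MASK) == HT_CAPTYPE_MSI_MAPPING:
            return pos
        pos = cfg[pos + 1] & 0xFC  # follow the next-capability pointer
    return None


def enable_ht_msi_mapping(cfg):
    """Set the En bit in place; return True only if it was changed."""
    pos = find_ht_msi_mapping(cfg)
    if pos is None:
        return False
    flags = cfg[pos + HT_MSI_FLAGS]
    if flags & HT_MSI_FLAGS_ENABLE:
        return False  # already enabled, nothing to do
    cfg[pos + HT_MSI_FLAGS] = flags | HT_MSI_FLAGS_ENABLE
    return True
```

The model works on a plain buffer so it can be tried without touching hardware; the kernel quirk does the equivalent through the PCI config accessors at boot.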

Comment 54 Andy Gospodarek 2008-10-21 12:42:56 UTC
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.

Comment 55 Seán O Sullivan 2008-10-21 13:48:41 UTC
Excellent, that seems to have done it!

Tested kernel-2.6.18-120.el5.gtest.59.i686.rpm

Thanks a lot.

Comment 56 Andy Gospodarek 2008-10-21 14:29:47 UTC
Thanks for the quick feedback, Sean!  Glad to hear it's working well for you.

Comment 57 RHEL Program Management 2009-01-27 20:43:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 59 RHEL Program Management 2009-02-16 15:39:21 UTC
Updating PM score.

Comment 60 Don Zickus 2009-02-23 20:00:40 UTC
in kernel-2.6.18-132.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 64 errata-xmlrpc 2009-09-02 08:38:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

