nVidia MCP55 Ethernet (rev a3) not functional on kernels 2.6.18-53.1.4.*

Description of problem:
The forcedeth driver found on 2.6.18-53.1.4.* does not properly support the Ethernet adapter on Supermicro H8DME-2 motherboards. This motherboard uses the nVidia MCP55 chipset. The previous kernel release (2.6.18-53) works flawlessly on the same hardware.

The main symptom is lack of connectivity because the software cannot detect or establish a link. This issue has been reproduced using many different combinations of Cat5e cables and Ethernet devices, such as switches and other servers. All possible combinations of link speed and duplex have been tried already. Autonegotiation has also been attempted, to no avail.

Here is the output of "ethtool eth0" with autonegotiation:

Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: Unknown! (65535)
        Duplex: Unknown! (255)
        Port: MII
        PHYAD: 2
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: no

Here is the output of "ethtool eth0" with speed 100 duplex full manually set:

Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (65535)
        Duplex: Unknown! (255)
        Port: MII
        PHYAD: 2
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: no

Here is what is captured by syslog when the forcedeth module is inserted:

Jan 15 04:47:57 ipdaez000mia kernel: forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
Jan 15 04:47:57 ipdaez000mia kernel: PCI: Enabling device 0000:00:08.0 (0000 -> 0003)
Jan 15 04:47:57 ipdaez000mia kernel: ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LMAC] -> GSI 21 (level, low) -> IRQ 233
Jan 15 04:47:57 ipdaez000mia kernel: forcedeth: using HIGHDMA
Jan 15 04:47:57 ipdaez000mia kernel: eth0: forcedeth.c: subsystem: 015d9:1611 bound to 0000:00:08.0
Jan 15 04:47:57 ipdaez000mia kernel: PCI: Enabling device 0000:00:09.0 (0000 -> 0003)
Jan 15 04:47:57 ipdaez000mia kernel: ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LMAD] -> GSI 20 (level, low) -> IRQ 50
Jan 15 04:47:57 ipdaez000mia kernel: forcedeth: using HIGHDMA
Jan 15 04:47:57 ipdaez000mia kernel: eth1: forcedeth.c: subsystem: 015d9:1611 bound to 0000:00:09.0
Jan 15 04:47:58 ipdaez000mia kernel: eth0: no link during initialization.
Jan 15 04:47:58 ipdaez000mia kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Jan 15 04:47:58 ipdaez000mia kernel: eth1: no link during initialization.
Jan 15 04:47:58 ipdaez000mia kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Jan 15 04:47:58 ipdaez000mia kernel: eth1: link down.

Version-Release number of selected component (if applicable):
All 2.6.18-53.1.4 kernels for x86 architectures, 32 and 64 bit versions.

How reproducible:
Install RHEL5 on an H8DME-2 motherboard (possibly other motherboards using the nVidia MCP55 too). Upgrade the kernel to version 2.6.18-53.1.4.el5PAE.

Steps to Reproduce:
1. Install RHEL5 on an H8DME-2 motherboard
2. Upgrade the kernel to version 2.6.18-53.1.4.el5PAE
3. Connect the primary NIC to an Ethernet device.
4. Run "ethtool eth0" or try pinging any other device on the network.

Actual results:
The NIC is enabled but not capable of establishing a link with any other Ethernet device.

Expected results:
The NIC is enabled and capable of establishing a link with any other Ethernet device.

Additional info:
Attached is the output of "lspci -vv".
Created attachment 291604 [details] Output of "lspci -vv"
2.6.18-53.1.3 added the following patch to forcedeth: http://people.redhat.com/agospoda/rhel5/forcedeth-msi-interrupt.patch but I have a hard time imagining that this is causing problems getting the board to allow link-up -- unless this change managed to cause problems with interrupt collection on your board. Here are a few things that might help me out since I don't have your specific board. Can you send the output from /proc/interrupts on the working and non-working kernel? I'm curious if you aren't getting any interrupts on your 2.6.18-53.1.4-based kernel for the forcedeth cards. I'm also curious if you get a link light (on the hardware) when connecting a cable even though ethtool et al. don't report the link status as 'UP'. Can you post the relevant bits from syslog during forcedeth init on a working kernel (so I can compare it to what you have attached for the non-working one)? Also, there are some PHY fixes out there for forcedeth that we haven't integrated into RHEL. I haven't put them in a test kernel, but a patch to include them is here if you would like to try it: http://people.redhat.com/agospoda/rhel5/forcedeth-unapplied-phy-fixes.patch I don't think it will resolve your issue as I'm now skeptical of the patch, but I figured I would make it available just in case.
Created attachment 291745 [details] Information from a non working kernel.
Created attachment 291746 [details] Information from a working kernel.
I have attached all the information requested about the non working kernel on the file 'broken_kernel.txt' I have also attached all the information requested about the working kernel on the file 'working_kernel.txt' Regarding the lights on the NIC, yes the light is blinking. It also seems to respond to traffic, i.e. it blinks wildly when I try to ping from another machine. I'll try to apply the patch you provided as soon as I can.
Thanks, I'll take a look at these and the patch that caused this and see what I can find. I find it interesting that only 1 interrupt event occurred on your non-working system -- this is exactly what I expected to see.
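The comparison made here (how many interrupts the forcedeth devices received on each kernel) can be scripted. The following is only a minimal sketch, not something posted in this thread: the function name irq_total and its optional file argument are made up for illustration, and it assumes the usual /proc/interrupts layout of an IRQ label, one count column per CPU, then the controller type and device names.

```shell
# irq_total DEVICE [FILE]
# Hypothetical helper: print the interrupt count for DEVICE summed over all
# CPU columns. FILE defaults to /proc/interrupts; passing a saved copy lets
# you compare captures taken on different kernels.
irq_total() {
    awk -v dev="$1" '
        {
            # Does any field (ignoring a trailing comma used on shared IRQs)
            # name the device we are looking for?
            hit = 0
            for (i = 2; i <= NF; i++) {
                f = $i
                sub(/,$/, "", f)
                if (f == dev) hit = 1
            }
            if (!hit) next
            # Per-CPU counters are the run of purely numeric fields that
            # follows the "NN:" label in field 1.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
            found = 1
        }
        END {
            if (!found) exit 1   # device not listed at all
            print sum + 0
        }
    ' "${2:-/proc/interrupts}"
}

# Example (commented out so that sourcing this file has no side effects):
#   irq_total eth0; sleep 5; irq_total eth0
```

Running it twice a few seconds apart while pinging the machine shows whether the counter is still moving; a value that stays fixed matches the "no more interrupts" symptom.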
I found out that ethtool is triggering this issue. When the device is enabled on 2.6.18-53.1.4 without ethtool options it works properly. A static IP or DHCP can be used with no problems. The moment ethtool is used, regardless of the options passed, the NIC is unable to establish a link again. Kernel 2.6.18-53 is not affected by the use of ethtool. As I mentioned before, we have tried all possible configurations with ethtool to no avail, from autonegotiation to all possible manual duplex/speed combinations. Let me know if there is more you need to know about these new findings.
Interesting fact about ethtool. I managed to get my hands on a system that has what appears to be an identical MCP55 chip (even down to the rev), but was unable to reproduce the problem before. I'm also unable to reproduce it with ethtool, but will keep working at it to see if I can make it happen.
So I tried a little harder and reproduced it with ethtool. It seems that I can reload the forcedeth module and everything is back to normal (so at least the hardware isn't dead or anything). I'll take a look at the patch in question and see if I can come up with a solution for this.
So now that we have included this patch from upstream:

commit a7475906bc496456ded9e4b062f94067fb93057a
Author: Manfred Spraul <manfred>
Date:   Wed Oct 17 21:52:33 2007 +0200

    forcedeth msi bugfix

    pci_enable_msi() replaces the INTx irq number in pci_dev->irq with the new MSI irq number. The forcedeth driver did not update the copy in netdevice->irq and parts of the driver used the stale copy. See bugzilla.kernel.org, bug 9047.

We are in an interesting spot because it appears that the 2.6.18-based kernel that we are shipping for rhel5 isn't quite as good at handling interrupt enable/disable for MSI interrupts. The calls to enable_irq/disable_irq in nv_enable_irq/nv_disable_irq that get called in the case where someone is using MSI don't work well anymore when calling with the correct interrupt. :-)

Right now I can't get the device to reliably come back after running the following set of ethtool commands:

ethtool -s eth0 autoneg off speed 100 duplex full
ethtool -s eth0 autoneg on speed 100 duplex full

Ethtool will report the correct output, but we never get link up. No more interrupts seem to show up when watching /proc/interrupts either. I've tried replacing the guts of nv_enable/disable_irq with calls to nv_open/close and of course that works fine, but I've yet to find any smaller set of calls that reliably works.

If the goal here is to block interrupts from happening, it might be better to simply disable the hardware interrupts like most other drivers do. It has less of an effect on the entire system and won't be dependent on other irq changes that sneak into the kernel. Any thoughts on this, Ayaz?
Yes, you can use nv_disable_hw_interrupts and nv_enable_hw_interrupts. However, I believe the reason to use the system calls (disable_irq, enable_irq) was to ensure the ISR function is not executing at the time you attempt to make any changes from the ethtool handlers.
I'm pretty sure a call to synchronize_irq() will make sure all handlers have completed before continuing.
Hi, I would like to know if there is any more information regarding this issue. Will the fixes be included in the next kernel update? If so, is there an ETA for that RPM? We currently have a number of clients running on this hardware and we would like to plan our course of action.
I can't give you any great idea of when this is going to be fixed since the problem right now is that I don't have a great solution to the problem. As it stands right now, there are some problems with MSI on RHEL5 and calls to enable_irq and/or disable_irq don't quite work as expected. Upstream it seems like the kinks with MSI were worked out so this isn't an issue. My hope was to replace those calls in the driver with nv_disable_hw_interrupts and nv_enable_hw_interrupts (and the necessary other calls like synchronize_irq) so that we are simply turning off the interrupts on the hardware and make sure all have been handled rather than masking them off at the OS level, but I don't quite have that working right now. Hopefully I can come up with something soon, though.
Created attachment 294922 [details] SOSREPORT This snapshot was taken from a RHEL5 system running the latest kernel 2.6.18-53.1.3.el5PAE
The new kernel seems to have the problem when autoneg is set to "OFF" using ethtool.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Ayaz, I've been doing some testing and found that this patch works ok on my system. http://people.redhat.com/agospoda/rhel5/forcedeth-msi-ethtool-fix.patch It does enable interrupts earlier than the previous code, but the main problem seems to be that having interrupts disabled when writing to BMCR stops any interrupts from ever happening again.
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results.
kernel-2.6.18-83.el5.gtest.42.i686.rpm
kernel-PAE-2.6.18-83.el5.gtest.42.i686.rpm

We have tested these kernels and so far they work for the most part. I found the bug can be triggered by following a specific sequence. In all other situations, though, the NIC works as expected, even when autonegotiation is turned off through ethtool.

Here is the sequence:

-/etc/init.d/network start
-ethtool -s eth0 autoneg on
-ethtool -s eth0 speed 100 duplex full autoneg off

I'll be testing kernel-2.6.18-83.el5.gtest.42.x86_64.rpm soon.
Thank you for the quick feedback. The tests you performed were quite similar to the ones I used to reproduce the issue. I look forward to the further results of your testing.
OK, x86_64 is behaving quite differently from the other test kernels. As soon as 'autoneg off' is passed to ethtool the NIC goes offline. But if you configure the interface using ethtool before bringing it up, the configuration will stick and the NIC stays online, i.e.:

#/etc/init.d/network stop
#ethtool -s eth0 speed 100 duplex full autoneg off
#/etc/init.d/network start

This will work, assuming that ifcfg-* is not passing any ethtool options of course. Let me know if you need anything.
(In reply to comment #20)
> kernel-2.6.18-83.el5.gtest.42.i686.rpm
> kernel-PAE-2.6.18-83.el5.gtest.42.i686.rpm
>
> We have tested these kernels and so far they work for the most part. I found the
> bug can be triggered by following a specific sequence. In all other situations
> though, the NIC works as expected; even when autonegotiation is turned off
> through ethtool.
>
> Here is the sequence:
>
> -/etc/init.d/network start
> -ethtool -s eth0 autoneg on
> -ethtool -s eth0 speed 100 duplex full autoneg off
>
> I'll be testing kernel-2.6.18-83.el5.gtest.42.x86_64.rpm soon.
>

I do not see the same results when I test x86_64 running my test kernel. It takes a few (maybe 5) seconds for the link to be re-established, but the link does go up (and I'm posting to this bugzilla from that system with the link speed currently at 100Mbps full-duplex). I'm quite sure the delay exists because the switch I'm using goes through a series of tests to detect carrier at the various speeds it supports. Does the device (switch) you have connected to the forcedeth card with autoneg disabled play nicely with cards that have it disabled?
# uname -a
Linux localhost.localdomain 2.6.18-83.el5.gtest.42 #1 SMP Tue Feb 26 09:14:24 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost network-scripts]# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
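When repeating this check across several kernels it can help to reduce the ethtool report to the three fields that matter for this bug. The following is a rough sketch only: the function name link_summary is invented for illustration, and it assumes the "Speed:", "Duplex:" and "Link detected:" labels appear exactly as in the outputs pasted in this report.

```shell
# link_summary
# Hypothetical helper: read "ethtool <dev>" output on stdin and print a
# one-line summary of speed, duplex and link state.
link_summary() {
    # Split each line at the first ": " so that $2 holds the value.
    awk -F': *' '
        /^[ \t]*Speed:/         { speed  = $2 }
        /^[ \t]*Duplex:/        { duplex = $2 }
        /^[ \t]*Link detected:/ { link   = $2 }
        END { printf "speed=%s duplex=%s link=%s\n", speed, duplex, link }
    '
}

# Example (commented out so that sourcing this file has no side effects):
#   ethtool eth0 | link_summary
```

On an affected kernel the summary would be expected to show the "Unknown!" speed and duplex values and "link=no", as in the original description.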
Andy, You were correct. We had the 32bit/64bit builds connected to different switches and it seems to be causing hiccups. I was able to test kernel-2.6.18-83.el5.gtest.42.x86_64.rpm and verify that it works properly.
Created attachment 296659 [details] Eth0 configuration
Results for:
-kernel-2.6.18-83.el5.gtest.42.i686.rpm
-kernel-PAE-2.6.18-83.el5.gtest.42.i686.rpm

I'm still able to trigger the bug on these two kernels. It happens 1 out of every 5 or 6 times you try the following sequence:

-Reboot the system
-ethtool -s eth0 autoneg on (wait until it gets a link)
-ethtool -s eth0 speed 100 duplex full autoneg off

To emulate the system reboot, from the NIC's point of view, the following seems to work:

/etc/init.d/network stop;
modprobe -r forcedeth;
/etc/init.d/network start;
ethtool -s eth0 autoneg on;
sleep 10;
ethtool -s eth0 speed 100 duplex full autoneg off;

`cat /proc/interrupts` reveals no increments on interrupts to the controller when the interface is unable to link.

It also seems that the driver is hard locking the chipset. Neither a reboot nor a power loss seems to help get the NIC back online. At this point, not even the LEDs on the socket work. I've been able to resolve this by trying to PXE boot the system, which seems to re-enable the interface.

I've attached the configuration for eth0 that I'm using. Let me know if there is anything else that you need.
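Because the failure only shows up about once in every five or six attempts, the emulated-reboot sequence from this comment is easier to repeat when wrapped in a function. This is a sketch under assumptions, not a script from the thread: the function name repro_cycle and the RUN variable are hypothetical, with RUN=echo giving a dry run that prints the commands instead of executing them.

```shell
# repro_cycle DEVICE
# Hypothetical wrapper around the emulated-reboot sequence quoted above.
# RUN is an illustrative hook: leave it unset to really execute the commands
# (needs root and the forcedeth hardware), or set RUN=echo to only print them.
repro_cycle() {
    dev=$1
    ${RUN:-} /etc/init.d/network stop
    ${RUN:-} modprobe -r forcedeth
    ${RUN:-} /etc/init.d/network start
    ${RUN:-} ethtool -s "$dev" autoneg on
    ${RUN:-} sleep 10     # give autonegotiation time to bring the link up
    ${RUN:-} ethtool -s "$dev" speed 100 duplex full autoneg off
}

# Example (commented out; repeats the cycle six times and shows link state):
#   for n in 1 2 3 4 5 6; do
#       repro_cycle eth0
#       sleep 10
#       ethtool eth0 | grep 'Link detected'
#   done
```

The ten-second pause before checking the link in the example loop is a guess; a switch that probes each supported speed may need longer.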
Thanks for the update. I'm bisecting changes now and it appears that something that changed during the 2.6.19 development process fixes this. I'm hoping I can narrow it down to a specific patch (or patches) that will make this work on our 2.6.18-based kernels.
I dropped the patch that I originally included to try and fix this bugzilla and made a new patch that should fix the problem (it does for me on an x86_64 system). As I suspected, updates to the code responsible for handling MSI resolved this. I'm not sure exactly how large the patch will be, but this is a minimum amount that I want to start trying. Please test kernels here: http://people.redhat.com/agospoda/#rhel5 and let me know how they work for you. Thanks!
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Results for:
-kernel-2.6.18-84.el5.gtest.44.i686.rpm
-kernel-2.6.18-84.el5.gtest.44.x86_64.rpm
-kernel-PAE-2.6.18-84.el5.gtest.44.i686.rpm

We were able to test and verify that the issue has been fixed in these kernels. Let me know if there is anything I can do to help.
The patch touches common interrupt code paths and will void QE testing to date for RHEL 5.2. We need to move this out to R5.3 and have the patch accepted early in the 5.3 development cycle to allow for sufficient testing.
Can we get some more details on the impact of this bug? 1. What is affected - all MCP55? MCP55 rev A3? One specific model of SuperMicro motherboard? 2. Does the lack of connectivity occur when the system is booted, or only after ethtool is run?
This is a problem for anyone that chooses to set link speed and duplex manually rather than relying on auto-negotiation. I still don't completely understand why people refuse to use auto-negotiation, but there are some.
Currently we only have MCP55 rev A3 in our environment, so I can't provide information on other revisions. If the system is configured with ethtool options in the network scripts, as ours currently are, the system will boot with no link. Customers don't trust auto-negotiation because it is not reliable when used in conjunction with many different models of Cisco switches and some Foundry switches. We won't be able to offer RHEL5 on the recently launched lineup until we can offer a stable network connection.
Carlos, as a workaround you can put something like this; 'ifconfig eth0 down && ifconfig eth0 up' in rc.local and combined with ETHTOOL_OPTS in ifcfg-eth0 the card will do what you want, right?
Unfortunately it will not work because any option passed to ETHTOOL_OPTS that is not 'autoneg on' by itself will trigger the bug. Regarding ifconfig; in some cases, you have to remove the forcedeth module and re-insert it to get the NIC to respond again. In some other cases the only solution is to reboot the system.
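For readers unfamiliar with the setup being discussed, an interface file that forces speed and duplex would look roughly like the fragment below. It is an illustrative sketch rather than the reporter's actual configuration (that is in the "Eth0 configuration" attachment): the BOOTPROTO and ONBOOT values are assumptions, and only the ETHTOOL_OPTS line reflects settings mentioned in this thread.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative values only)
DEVICE=eth0
ONBOOT=yes          # assumption: bring the interface up at boot
BOOTPROTO=dhcp      # assumption: a static IP is reported to behave the same
# The network scripts hand this string to "ethtool -s eth0 ..." when the
# interface comes up, so a forced setting like this one is what triggers
# the bug on the affected kernels.
ETHTOOL_OPTS="speed 100 duplex full autoneg off"
```

According to the comment above, the only value that avoids the problem on those kernels is 'autoneg on' by itself.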
Created attachment 302347 [details] rhel5-forcedeth-fix.patch I think this patch should work as a forcedeth specific fix. You should be able to rebuild a kernel pretty easy to test this out. Unfortunately you will probably end up with different irq numbers each time you run ethtool to change speed/duplex/etc, but that shouldn't be a big deal. If you need this in a test kernel I can probably provide it later this week, but I have a few other fixes I need to add first.
Hi Andy, A test kernel would be a good idea since our environment doesn't permit the use of custom kernels. Thank you
Hi Andy, This issue has been re-assigned to me. I will be testing the latest "test" kernel that you have posted and providing feedback through bugzilla as we continue in the resolution process.
Created attachment 311508 [details] sha1sum and ll -h output
Comment on attachment 311508 [details] sha1sum and ll -h output It seems that the following kernels are zero bytes in size: kernel-2.6.18-94.el5.gtest.49.i686.rpm kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm
Created attachment 311976 [details] Output of dmesg when USB key plugged in.

I have completed testing the following kernel: kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm

Test Performed
==============
-Reboot the system
-ethtool -s eth0 autoneg on (wait until it gets a link)
-ethtool -s eth0 speed 100 duplex full autoneg off

To emulate the system reboot, from the NIC's point of view, the following seems to work:

/etc/init.d/network stop;
modprobe -r forcedeth;
/etc/init.d/network start;
ethtool -s eth0 autoneg on;
sleep 10;
ethtool -s eth0 speed 100 duplex full autoneg off;

Expected results: the nVidia NIC is able to maintain network connectivity after the above test.

Simulated Reboot
================
I have run the above test 20 times. The NIC was able to get a link and IP after every emulated reboot. `cat /proc/interrupts` reveals that eth0 is getting a new IRQ after every emulated reboot. Network activity was tested using DNS lookups (dig) and scp file transfers. `cat /proc/interrupts` shows the counters for eth0 are incrementing.

CAT5 cable connect and disconnect
=================================
Part of the QA process tested unplugging and plugging in the CAT5 cable to simulate a possible condition in the data center. After 12 tests there were no issues with normal network traffic or connectivity.

Simulated reboot with CAT5 disconnected
=======================================
After 12 tests there were no issues with normal network traffic or connectivity.

Issues and concerns
===================
I ran into the following error when connecting a USB key to the server running the test kernel (kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm):

Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after no-IRQ? Controller is probably using the wrong IRQ.

Complete output attached as: qa-errors-kernel-PAE.txt
I have completed testing all 3 test kernels: kernel-2.6.18-94.el5.gtest.49.i686.rpm kernel-2.6.18-94.el5.gtest.49.x86_64.rpm kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm The results are the same as Comment #53 Looks like the patch works, except for the issue with trying to plug in a USB storage device while running the test kernel. Error: Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after no-IRQ? Controller is probably using the wrong IRQ.
Thanks for the feedback -- I'm glad to know this is working. It does not seem likely that the other message you saw: Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after no-IRQ? Controller is probably using the wrong IRQ. is related to the code change I made, and my initial searching leads me to think that you would have seen this error on an older kernel as well. Did you happen to notice if the USB storage device still worked correctly despite this message?
I was not able to access the USB storage device when this error was displayed. The only thing listed under /dev was /dev/sdb. The normal /dev/sdb1 was not there. To be sure this was isolated to the test kernel, I rebooted into the default kernel under RHEL 5.1, and the drive mounted without issues. If you would like me to run any tests please let me know and I will post the results.
Interesting. I'm not sure this is related to my kernels, so if you want to try one from here: http://people.redhat.com/dzickus/el5/97.el5/ as well as the RHEL5.2 kernel those would be nice data points. Just so you know, I'm also planning to update my test kernels with a new test patch later today (mostly to focus on the MSI issue). I'll post here when they are ready.
Thanks, I will be on the look out for the updated Kernels.
I have tested the following test kernel and it seems that the issue with the USB storage key has been resolved. The Nvidia nic also passes our internal QA process. -kernel-2.6.18-98.el5.gtest.50.i686.rpm I will test the other 2 remaining kernels and post my results.
Glad to hear this kernel is working and the usb storage issue is now resolved as well. I'm a bit curious if the removal of my hacks in the forcedeth driver caused those to be fixed, but I would be quite surprised if that did it. I don't see anything in the changelog that jumps out either.
I have tested the remaining kernels, and all is working. I am now able to use a USB storage device without any of the previous issues. My final question would be, will this fix be included in an upcoming kernel update or RHEL 5.3?
This fix is suitable for 5.3 (since it's a large change), not a 5.2 update. We will be making one more tweak to it and I'll let you know when those kernels are available.
in kernel-2.6.18-104.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 433667 has been marked as a duplicate of this bug. ***
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: the forcedeth-msi driver has been updated to fix a bug that prevented proper link-up detection.
~~ Attention ~~ We Need Testing Feedback Soon ~~ We're nearing the end of the Red Hat Enterprise Linux 5.3 Testing Phase and this bug has not yet been VERIFIED. This bug should be fixed in the latest RHEL53 Beta Snapshot. It is critical that we receive your feedback ASAP. Otherwise, this bug is at risk of being dropped from the release. If you encounter any new issues, CLONE this bug and describe the new issues you are facing. We are no longer accepting NEW bugs into the release, bar critical regressions and blocker issues. If you have VERIFIED this fix, add CustomerVerified to the Bugzilla Keywords, along with a description of the test results.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html