Bug 428696 - nVidia MCP55 MCP55 Ethernet (rev a3) not functional on kernel 2.6.18-53.1.4
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: i686 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Martin Jenner
Keywords: OtherQA, Regression, ZStream
Duplicates: 433667
Blocks: 391501 KernelPrio5.3 RHEL5u3_relnotes 461894
Reported: 2008-01-14 12:15 EST by Carlos Avila
Modified: 2010-10-22 17:44 EDT
Doc Type: Bug Fix
Doc Text: The forcedeth-msi driver has been updated to fix a bug that prevented proper link-up detection.
Last Closed: 2009-01-20 14:32:49 EST


Attachments
Output of "lspci -vv" (21.63 KB, text/plain), 2008-01-14 12:15 EST, Carlos Avila
Information from a non working kernel. (2.49 KB, text/plain), 2008-01-15 15:40 EST, Carlos Avila
Information from a working kernel. (2.62 KB, text/plain), 2008-01-15 15:41 EST, Carlos Avila
SOSREPORT (384.90 KB, application/octet-stream), 2008-02-14 12:19 EST, Andres Barcenas
Eth0 configuration (141 bytes, text/plain), 2008-03-03 13:40 EST, Carlos Avila
rhel5-forcedeth-fix.patch (2.30 KB, patch), 2008-04-14 10:55 EDT, Andy Gospodarek
sha1sum and ll -h output (1.05 KB, application/octet-stream), 2008-07-10 15:28 EDT, Kelsey Hightower
Output of dmesg when USB key plugged in. (46.36 KB, text/plain), 2008-07-16 14:14 EDT, Kelsey Hightower

Description Carlos Avila 2008-01-14 12:15:00 EST
nVidia MCP55 Ethernet (rev a3) not functional on kernels 2.6.18-53.1.4.*

Description of problem:
The forcedeth driver found on 2.6.18-53.1.4.* does not properly support the
Ethernet adapter on Supermicro H8DME-2 motherboards. This motherboard uses the
nVidia MCP55 chipset. The previous kernel release (2.6.18-53) works flawlessly
on the same hardware. The main symptom is lack of connectivity due to the
inability of the software to detect or establish a link. This issue has been
reproduced using many different combinations of Cat5e cables and Ethernet
devices, such as switches and other servers. Every combination of link speed
and duplex setting has been tried, as has autonegotiation, to no avail.

Here is the output of "ethtool eth0" with autonegotiation:
Settings for eth0:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: Unknown! (65535)
	Duplex: Unknown! (255)
	Port: MII
	PHYAD: 2
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: g
	Wake-on: d
	Link detected: no


Here is the output of "ethtool eth0" with speed 100 duplex full manually set:
Settings for eth0:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  Not reported
	Advertised auto-negotiation: No
	Speed: Unknown! (65535)
	Duplex: Unknown! (255)
	Port: MII
	PHYAD: 2
	Transceiver: external
	Auto-negotiation: off
	Supports Wake-on: g
	Wake-on: d
	Link detected: no


Here is what is captured by syslog when the forcedeth module is inserted:
Jan 15 04:47:57 ipdaez000mia kernel: forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
Jan 15 04:47:57 ipdaez000mia kernel: PCI: Enabling device 0000:00:08.0 (0000 -> 0003)
Jan 15 04:47:57 ipdaez000mia kernel: ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LMAC] -> GSI 21 (level, low) -> IRQ 233
Jan 15 04:47:57 ipdaez000mia kernel: forcedeth: using HIGHDMA
Jan 15 04:47:57 ipdaez000mia kernel: eth0: forcedeth.c: subsystem: 015d9:1611 bound to 0000:00:08.0
Jan 15 04:47:57 ipdaez000mia kernel: PCI: Enabling device 0000:00:09.0 (0000 -> 0003)
Jan 15 04:47:57 ipdaez000mia kernel: ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LMAD] -> GSI 20 (level, low) -> IRQ 50
Jan 15 04:47:57 ipdaez000mia kernel: forcedeth: using HIGHDMA
Jan 15 04:47:57 ipdaez000mia kernel: eth1: forcedeth.c: subsystem: 015d9:1611 bound to 0000:00:09.0
Jan 15 04:47:58 ipdaez000mia kernel: eth0: no link during initialization.
Jan 15 04:47:58 ipdaez000mia kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Jan 15 04:47:58 ipdaez000mia kernel: eth1: no link during initialization.
Jan 15 04:47:58 ipdaez000mia kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Jan 15 04:47:58 ipdaez000mia kernel: eth1: link down.


Version-Release number of selected component (if applicable): 

All 2.6.18-53.1.4 Kernels for x86 architectures, 32 and 64 bit versions.


How reproducible:
Install RHEL5 using a H8DME-2 motherboard (other motherboards using the nVidia
MCP55 possibly too). Upgrade the kernel to version 2.6.18-53.1.4.el5PAE. 

Steps to Reproduce:
1. Install RHEL5 using a H8DME-2 motherboard.
2. Upgrade the kernel to version 2.6.18-53.1.4.el5PAE.
3. Connect the primary NIC to an Ethernet device.
4. Run "ethtool eth0" or try pinging any other device on the network.

Actual results:
The NIC is enabled but not capable of establishing a link with any other Ethernet
device.

Expected results:
The NIC is enabled and capable of establishing a link with any other Ethernet device.

Additional info:

Attached is the output of "lspci -vv".
Comment 1 Carlos Avila 2008-01-14 12:15:00 EST
Created attachment 291604 [details]
Output of "lspci -vv"
Comment 2 Andy Gospodarek 2008-01-15 10:30:51 EST
2.6.18-53.1.3 added the following patch to forcedeth:

http://people.redhat.com/agospoda/rhel5/forcedeth-msi-interrupt.patch

but I have a hard time imagining that this is causing problems getting the board
to allow link-up -- unless this change managed to cause problems with interrupt
collection on your board.  

Here are a few things that might help me out since I don't have your specific board.

Can you send the output from /proc/interrupts on the working and non-working
kernel?  I'm curious if you aren't getting any interrupts on your
2.6.18-53.1.4-based kernel for the forcedeth cards.  I'm also curious if you get
a link light (on the hardware) when connecting a cable even though ethtool et al
don't report the link status as 'UP'.

Can you post the relevant bits from syslog during forcedeth init on a working
kernel (so I can compare it to what you have attached on the non-working one)?

Also, there are some PHY fixes out there for forcedeth that we haven't
integrated into RHEL.  I haven't put them in a test kernel, but a patch to include
them is here if you would like to try it:

http://people.redhat.com/agospoda/rhel5/forcedeth-unapplied-phy-fixes.patch

I don't think it will resolve your issue as I'm now skeptical of the patch, but
I figured I would make it available just in case.
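
A minimal sketch of one way to gather the /proc/interrupts comparison requested above, assuming the interface is named eth0 and that the per-CPU counters are the numeric columns that follow the IRQ number. The function name irq_totals is made up for illustration and is not part of any Red Hat tooling.

```shell
#!/bin/sh
# Sum the per-CPU interrupt counts for every /proc/interrupts row that
# mentions the given device name (assumed here: eth0 or eth1).
# Reads /proc/interrupts-formatted text on stdin; prints "<irq> <total>".
irq_totals() {
    awk -v dev="$1" '
        NR == 1 { next }                 # skip the "CPU0 CPU1 ..." header
        index($0, dev) {
            irq = $1
            sub(/:$/, "", irq)           # "233:" -> "233"
            total = 0
            # Per-CPU counters are the purely numeric fields after the IRQ.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print irq, total
        }
    '
}
```

Running `irq_totals eth0 < /proc/interrupts` twice, a few seconds apart, on both the working and the non-working kernel should show whether the count is still advancing on each.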
Comment 3 Carlos Avila 2008-01-15 15:40:31 EST
Created attachment 291745 [details]
Information from a non working kernel.
Comment 4 Carlos Avila 2008-01-15 15:41:08 EST
Created attachment 291746 [details]
Information from a working kernel.
Comment 5 Carlos Avila 2008-01-15 15:42:36 EST
I have attached all the information requested about the non-working kernel in
the file 'broken_kernel.txt'.

I have also attached all the information requested about the working kernel in
the file 'working_kernel.txt'.

Regarding the lights on the NIC, yes the light is blinking. It also seems to
respond to traffic, i.e. it blinks wildly when I try to ping from another machine.  

I'll try to apply the patch you provided as soon as I can. 
Comment 6 Andy Gospodarek 2008-01-15 16:50:43 EST
Thanks, I'll take a look at these and the patch that caused this and see what I
can find.  I find it interesting that only 1 interrupt event occurred on your
non-working system -- this is exactly what I expected to see.
Comment 7 Carlos Avila 2008-01-16 10:14:53 EST
I found out that ethtool is triggering this issue. When the device is enabled on
2.6.18-53.1.4 without ethtool options it works properly. A static IP or DHCP can
be used with no problems. The moment ethtool is used, regardless of the options
passed, the NIC is unable to re-establish a link. Kernel 2.6.18-53 is
not affected by the use of ethtool.

As I mentioned before, we have tried all possible configurations with ethtool to
no avail, from autonegotiation to every manual duplex/speed
combination.

Let me know if there is more you need to know about these new findings.
Comment 8 Andy Gospodarek 2008-01-16 10:34:43 EST
Interesting fact about ethtool.  I managed to get my hands on a system that has what appears to be an identical MCP55 chip (even down to the rev), but was unable to reproduce the problem before.  I'm also unable to reproduce it with ethtool, but will keep working at it to see if I can make it happen.
Comment 9 Andy Gospodarek 2008-01-16 10:41:31 EST
So I tried a little harder and reproduced it with ethtool.  It seems that I can reload the forcedeth module and everything is back to normal (so at least the hardware isn't dead or anything).  I'll take a look at the patch in question and see if I can come up with a solution for this.
Comment 10 Andy Gospodarek 2008-01-18 15:17:09 EST
So now that we have included this patch from upstream:

commit a7475906bc496456ded9e4b062f94067fb93057a
Author: Manfred Spraul <manfred@colorfullife.com>
Date:   Wed Oct 17 21:52:33 2007 +0200

    forcedeth msi bugfix

    pci_enable_msi() replaces the INTx irq number in pci_dev->irq with the
    new MSI irq number.
    The forcedeth driver did not update the copy in netdevice->irq and
    parts of the driver used the stale copy.
    See bugzilla.kernel.org, bug 9047.

We are in an interesting spot because it appears that the 2.6.18-based kernel
that we are shipping for rhel5 isn't quite as good at handling interrupt
enable/disable for MSI interrupts.  The calls to enable_irq/disable_irq in
nv_enable_irq/nv_disable_irq that get called in the case where someone is using
MSI don't work well anymore when calling with the correct interrupt. :-)

Right now I can't get the device to reliably come back after running the
following set of ethtool commands:

ethtool -s eth0 autoneg off speed 100 duplex full
ethtool -s eth0 autoneg on speed 100 duplex full

Ethtool will report the correct output, but we never get link up.  No more
interrupts seem to show up when watching /proc/interrupts either.

I've tried replacing the guts of nv_enable/disable_irq with calls to
nv_open/close and of course that works fine, but I've yet to find any smaller
set of calls that reliably works.

If the goal here is to block interrupts from happening it might be better to
simply disable the hardware interrupts like most other drivers do.  It has less
of an effect on the entire system and won't be dependent on other irq changes
that sneak into the kernel.

Any thoughts on this, Ayaz?
Comment 11 Ayaz 2008-01-18 16:58:11 EST
Yes, you can use nv_disable_hw_interrupts and nv_enable_hw_interrupts. 
However, I believe the reason to use the system calls (disable_irq, 
enable_irq) was to ensure the ISR function is not executing at the time you 
attempt to make any changes from the ethtool handlers.
Comment 12 Andy Gospodarek 2008-01-21 09:41:33 EST
I'm pretty sure a call to synchronize_irq() will make sure all handlers have completed before continuing.
Comment 13 Carlos Avila 2008-02-05 16:41:28 EST
Hi,

I would like to know if there is any more information regarding this issue. Will
the fixes be included in the next kernel update? If so, is there an ETA on that RPM?

We currently have a number of clients running on this hardware and we would like
to plan our course of action. 
Comment 14 Andy Gospodarek 2008-02-05 17:01:56 EST
I can't give you any great idea of when this is going to be fixed since the
problem right now is that I don't have a great solution to the problem.  As it
stands right now, there are some problems with MSI on RHEL5 and calls to
enable_irq and/or disable_irq don't quite work as expected.  Upstream it seems
like the kinks with MSI were worked out so this isn't an issue.

My hope was to replace those calls in the driver with nv_disable_hw_interrupts
and nv_enable_hw_interrupts (and the necessary other calls like synchronize_irq)
so that we are simply turning off the interrupts on the hardware and make sure
all have been handled rather than masking them off at the OS level, but I don't
quite have that working right now.


Hopefully I can come up with something soon, though.
Comment 15 Andres Barcenas 2008-02-14 12:19:19 EST
Created attachment 294922 [details]
SOSREPORT

This snapshot was taken from a RHEL5 system running the latest kernel
2.6.18-53.1.3.el5PAE
Comment 16 Andres Barcenas 2008-02-14 12:20:35 EST
The new kernel seems to have the problem when autoneg is set to "OFF" using
ethtool. 
Comment 17 RHEL Product and Program Management 2008-02-21 13:28:49 EST
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 18 Andy Gospodarek 2008-02-26 09:16:10 EST
Ayaz,

I've been doing some testing and found that this patch works ok on my system.

http://people.redhat.com/agospoda/rhel5/forcedeth-msi-ethtool-fix.patch

It does enable interrupts earlier than the previous code, but the main problem
seems to be that having interrupts disabled when writing to BMCR stops any
interrupts from ever happening again.
Comment 19 Andy Gospodarek 2008-02-26 13:06:56 EST
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.
Comment 20 Carlos Avila 2008-02-26 15:29:03 EST
kernel-2.6.18-83.el5.gtest.42.i686.rpm
kernel-PAE-2.6.18-83.el5.gtest.42.i686.rpm

We have tested these kernels and so far they work for the most part. I found the
bug can be triggered by following a specific sequence. In all other situations
though, the NIC works as expected, even when autonegotiation is turned off
through ethtool.

Here is the sequence:

-/etc/init.d/network start 
-ethtool -s eth0 autoneg on
-ethtool -s eth0 speed 100 duplex full autoneg off

I'll be testing kernel-2.6.18-83.el5.gtest.42.x86_64.rpm soon.

Comment 21 Andy Gospodarek 2008-02-26 15:51:50 EST
Thank you for the quick feedback.  The tests you performed were quite similar to the ones I used to reproduce the issue.  I look forward to the further results of your testing.
Comment 22 Carlos Avila 2008-02-26 16:47:51 EST
OK, x86_64 is behaving quite differently from the other test kernels. As soon as
'autoneg off' is passed to ethtool the NIC goes offline. But, if you configure
the interface using ethtool before bringing it up, the configuration will stick
and the NIC stays online. i.e.:

#/etc/init.d/network stop
#ethtool -s eth0 speed 100 duplex full autoneg off
#/etc/init.d/network start

This will work, assuming that ifcfg-* is not passing any ethtool options of course. 

Let me know if you need anything.
Comment 23 Andy Gospodarek 2008-02-26 17:33:53 EST
(In reply to comment #20)

I do not see the same results when I test x86_64 running my test kernel.
It takes a few (maybe 5) seconds for the link to be re-established, but the
link does go up (and I'm posting to this bugzilla from that system with the link
speed currently at 100Mbps full-duplex).  I'm quite sure the delay exists
because the switch I'm using goes through a series of tests to detect carrier at
the various speeds it supports.  Does the device (switch) you have connected to
the forcedeth card play nicely with cards that have autoneg disabled?

Comment 24 Andy Gospodarek 2008-02-26 17:34:19 EST
# uname -a
Linux localhost.localdomain 2.6.18-83.el5.gtest.42 #1 SMP Tue Feb 26 09:14:24
EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost network-scripts]# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
Comment 25 Carlos Avila 2008-02-27 12:20:50 EST
Andy,

You were correct. We had the 32bit/64bit builds connected to different switches
and it seems to be causing hiccups. 

I was able to test kernel-2.6.18-83.el5.gtest.42.x86_64.rpm and verify that it
works properly. 
Comment 26 Carlos Avila 2008-03-03 13:40:20 EST
Created attachment 296659 [details]
Eth0 configuration
Comment 27 Carlos Avila 2008-03-03 13:40:55 EST
Results for:

-kernel-2.6.18-83.el5.gtest.42.i686.rpm
-kernel-PAE-2.6.18-83.el5.gtest.42.i686.rpm

I'm still able to trigger the bug on these two kernels. It happens 1 out of
every 5 or 6 times you try the following sequence:

-Reboot the system
-ethtool -s eth0 autoneg on (wait until it gets a link)
-ethtool -s eth0 speed 100 duplex full autoneg off

To emulate the system reboot, from the NIC's point of view, the following seems
to work:

/etc/init.d/network stop; modprobe -r forcedeth; /etc/init.d/network start;
ethtool -s eth0 autoneg on; sleep 10; ethtool -s eth0 speed 100 duplex full
autoneg off;

`cat /proc/interrupts` reveals no increments on interrupts to the controller
when the interface is unable to link. It also seems that the driver is hard
locking the chipset. Neither a reboot nor a power loss seems to help get the NIC
back online. At this point, not even the LEDs on the socket work. I've been able
to resolve this by trying to PXE boot the system, which seems to re-enable the
interface.

I've attached the configuration for eth0 that I'm using. Let me know if there is
anything else that you need. 
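
Since the failure only shows up about 1 time in 5 or 6, the sequence above could be looped and the failures counted. The following is a rough sketch only: the function names (link_detected, repro_once, repro_loop), the interface name eth0, and the second 10-second wait before checking the link are assumptions for illustration. It must run as root, it takes the network down, and on affected hardware a failed run may leave the NIC unusable as described above.

```shell
#!/bin/sh
# Succeeds if ethtool-style output on stdin reports "Link detected: yes".
link_detected() {
    grep -q 'Link detected: yes'
}

# One emulated reboot followed by the ethtool sequence from this comment.
# eth0 and both 10-second waits are assumptions, not measured values.
repro_once() {
    /etc/init.d/network stop
    modprobe -r forcedeth
    /etc/init.d/network start
    ethtool -s eth0 autoneg on
    sleep 10
    ethtool -s eth0 speed 100 duplex full autoneg off
    sleep 10
    ethtool eth0 | link_detected
}

# Run the sequence N times (default 6) and report how many runs lost link.
repro_loop() {
    runs=${1:-6}
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        repro_once || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures of $runs runs ended without link"
}
```

For example, `repro_loop 20` would run the sequence 20 times and print the failure count at the end.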
Comment 28 Andy Gospodarek 2008-03-03 14:37:53 EST
Thanks for the update.  I'm bisecting changes now and it appears that something
that changed during the 2.6.19 development process fixes this.  I'm hoping I can
narrow it down to a specific patch (or patches) that will make this work on our
2.6.18-based kernels.
Comment 30 Andy Gospodarek 2008-03-03 23:27:49 EST
I dropped the patch that I originally included to try and fix this bugzilla and
made a new patch that should fix the problem (it does for me on an x86_64
system).  As I suspected, updates to the code responsible for handling MSI
resolved this.  I'm not sure exactly how large the patch will be, but this is a
minimum amount that I want to start trying.

Please test kernels here:

http://people.redhat.com/agospoda/#rhel5

and let me know how they work for you.  Thanks!
Comment 31 RHEL Product and Program Management 2008-03-03 23:28:46 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 32 Carlos Avila 2008-03-05 13:55:50 EST
Results for:

-kernel-2.6.18-84.el5.gtest.44.i686.rpm
-kernel-2.6.18-84.el5.gtest.44.x86_64.rpm
-kernel-PAE-2.6.18-84.el5.gtest.44.i686.rpm

We were able to test and verify that the issue has been fixed in these kernels.

Let me know if there is anything I can do to help.
Comment 35 Peter Martuccelli 2008-04-01 09:17:57 EDT
The patch touches common interrupt code paths and will void QE testing to date
for RHEL 5.2.  We need to move this out to R5.3 and have the patch accepted
early in the 5.3 development cycle to allow for sufficient testing.
Comment 37 Russell Doty 2008-04-01 11:51:09 EDT
Can we get some more details on the impact of this bug?

1. What is affected - all MCP55? MCP55 rev A3? One specific model of SuperMicro
motherboard?

2. Does the lack of connectivity occur when the system is booted, or only after
ethtool is run? 
Comment 38 Andy Gospodarek 2008-04-01 15:18:50 EDT
This is a problem for anyone that chooses to set link speed and duplex manually
rather than relying on auto-negotiation.  I still don't completely understand
why people refuse to use auto-negotiation, but there are some.
Comment 39 Carlos Avila 2008-04-01 15:28:31 EDT
Currently we only have MCP55 rev A3 on our environment so I can't provide
information on other revisions. 

If the system is configured with ethtool options on the network scripts, like we
currently do, the system will boot with no link. 

Customers don't trust auto-negotiation because it is not reliable when used in
conjunction with lots of different models of Cisco switches and some Foundry models.

We won't be able to offer RHEL5 on the recently launched lineup until we can
offer a stable network connection.
Comment 41 Andy Gospodarek 2008-04-01 15:37:57 EDT
Carlos, as a workaround you can put something like this:

'ifconfig eth0 down && ifconfig eth0 up'

in rc.local and combined with ETHTOOL_OPTS in ifcfg-eth0 the card will do what
you want, right?

Comment 42 Carlos Avila 2008-04-01 15:52:04 EDT
Unfortunately it will not work because any option passed to ETHTOOL_OPTS that is
not 'autoneg on' by itself will trigger the bug. 

Regarding ifconfig; in some cases, you have to remove the forcedeth module and
re-insert it to get the NIC to respond again. In some other cases the only
solution is to reboot the system.
Comment 43 Andy Gospodarek 2008-04-14 10:55:37 EDT
Created attachment 302347 [details]
rhel5-forcedeth-fix.patch

I think this patch should work as a forcedeth-specific fix.  You should be able
to rebuild a kernel pretty easily to test this out.

Unfortunately you will probably end up with different irq numbers each time you
run ethtool to change speed/duplex/etc, but that shouldn't be a big deal.

If you need this in a test kernel I can probably provide it later this week,
but I have a few other fixes I need to add first.
Comment 44 Carlos Avila 2008-04-14 11:08:43 EDT
Hi Andy,

A test kernel would be a good idea since our environment doesn't permit the use
of custom kernels. 

Thank you
Comment 46 Andy Gospodarek 2008-04-28 22:33:32 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.
Comment 50 Kelsey Hightower 2008-07-10 14:59:53 EDT
Hi Andy,

This issue has been re-assigned to me. I will be testing the latest "test"
kernel that you have posted and providing feedback through Bugzilla as we
continue the resolution process.
Comment 51 Kelsey Hightower 2008-07-10 15:28:22 EDT
Created attachment 311508 [details]
sha1sum and ll -h output
Comment 52 Kelsey Hightower 2008-07-10 15:30:23 EDT
Comment on attachment 311508 [details]
sha1sum and ll -h output

It seems that the following kernels are zero bytes in size:

kernel-2.6.18-94.el5.gtest.49.i686.rpm
kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm
Comment 53 Kelsey Hightower 2008-07-16 14:14:25 EDT
Created attachment 311976 [details]
Output of dmesg when USB key plugged in.

I have completed testing the following kernel:
kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm 

Test Performed
==============
-Reboot the system
-ethtool -s eth0 autoneg on (wait until it gets a link)
-ethtool -s eth0 speed 100 duplex full autoneg off

To emulate the system reboot, from the NIC's point of view, the following seems
to work:

/etc/init.d/network stop; modprobe -r forcedeth; /etc/init.d/network start;
ethtool -s eth0 autoneg on; sleep 10; ethtool -s eth0 speed 100 duplex full
autoneg off;

Expected results:
The Nvidia NIC is able to maintain network connectivity after the above test.



Simulated Reboot
================
I have run the above test 20 times. The NIC was able to get a link and IP after
every emulated reboot.

`cat /proc/interrupts` reveals that eth0 is getting a new IRQ after every
emulated reboot. Network activity was tested using DNS lookups (dig) and scp
file transfers.

`cat /proc/interrupts` shows the counters for eth0 are incrementing.
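
The IRQ observation above could be recorded per iteration with a small helper along these lines. This is a sketch under assumptions: the name irq_of is hypothetical, and it expects the device name to be the last column of its /proc/interrupts row, which would not hold for every shared interrupt line.

```shell
#!/bin/sh
# Print the IRQ number of the /proc/interrupts row whose last column is
# exactly the given device name. Reads /proc/interrupts text on stdin.
irq_of() {
    awk -v dev="$1" '$NF == dev { sub(/:$/, "", $1); print $1 }'
}
```

Calling `irq_of eth0 < /proc/interrupts` after each emulated reboot and logging the result gives a simple record of the IRQ changing from run to run.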


CAT5 cable connect and disconnect
=================================
Part of the QA process tested unplugging and plugging in the CAT5 cable to
simulate a possible condition in the data center. After 12 tests there were no
issues with normal network traffic or connectivity.


Simulated reboot with CAT5 disconnected
=======================================
After 12 tests there were no issues with normal network traffic or connectivity.



Issues and concerns
===================
I ran into the following error when connecting a USB key to the server running
the test kernel (kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm):

Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after
no-IRQ? Controller is probably using the wrong IRQ.

Complete output attached as:
qa-errors-kernel-PAE.txt
Comment 54 Kelsey Hightower 2008-07-17 09:38:26 EDT
I have completed testing all 3 test kernels:
kernel-2.6.18-94.el5.gtest.49.i686.rpm
kernel-2.6.18-94.el5.gtest.49.x86_64.rpm
kernel-PAE-2.6.18-94.el5.gtest.49.i686.rpm

The results are the same as Comment #53

Looks like the patch works, except for the issue with trying to plug in a USB
storage device while running the test kernel.

Error:
Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after
no-IRQ? Controller is probably using the wrong IRQ.
Comment 55 Andy Gospodarek 2008-07-17 15:25:08 EDT
Thanks for the feedback -- I'm glad to know this is working.

It does not seem likely that the other message you saw:

Jul 16 03:10:16 ipdmz0000atl2 kernel: ehci_hcd 0000:00:02.1: Unlink after
no-IRQ? Controller is probably using the wrong IRQ.

is related to the code change I made and my initial searching leads me to think
that you would have seen this error on an older kernel as well.

Did you happen to notice if the USB storage device still worked correctly
despite this message?
Comment 56 Kelsey Hightower 2008-07-18 11:07:56 EDT
I was not able to access the USB storage device when this error was displayed. The
only thing listed under /dev was /dev/sdb. The usual /dev/sdb1 was not there.

To be sure this was isolated to the test kernel, I rebooted into the default
kernel under RHEL 5.1, and the drive mounted without issues.

If you would like me to run any tests please let me know and I will post the
results.
Comment 57 Andy Gospodarek 2008-07-18 14:15:59 EDT
Interesting.  I'm not sure this is related to my kernels, so if you want to try
one from here:

http://people.redhat.com/dzickus/el5/97.el5/

as well as the RHEL5.2 kernel those would be nice data points.

Just so you know, I'm also planning to update my test kernels with a new test
patch later today (mostly to focus on the MSI issue).  I'll post here when they
are ready.
Comment 58 Kelsey Hightower 2008-07-18 16:29:57 EDT
Thanks, I will be on the look out for the updated Kernels.
Comment 59 Andy Gospodarek 2008-07-20 22:35:52 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.
Comment 60 Kelsey Hightower 2008-07-24 15:55:46 EDT
I have tested the following test kernel and it seems that the issue with the USB storage key has been
resolved. The Nvidia NIC also passes our internal QA process.

-kernel-2.6.18-98.el5.gtest.50.i686.rpm

I will test the other 2 remaining kernels and post my results.
Comment 61 Andy Gospodarek 2008-07-25 09:13:40 EDT
Glad to hear this kernel is working and the usb storage issue is now resolved as
well.  I'm a bit curious if the removal of my hacks in the forcedeth driver
caused those to be fixed, but I would be quite surprised if that did it.  I
don't see anything in the changelog that jumps out either.
Comment 62 Kelsey Hightower 2008-07-28 16:09:57 EDT
I have tested the remaining kernels, and all is working. I am now able to use a
usb storage device without any of the previous issues.

My final question would be, will this fix be included in an upcoming kernel
update or RHEL 5.3?
Comment 63 Andy Gospodarek 2008-07-28 16:28:01 EDT
This fix is suitable for 5.3 (since it's a large change), not a 5.2 update.  We
will be making one more tweak to it and I'll let you know when those kernels are
available.
Comment 65 Don Zickus 2008-08-13 12:06:42 EDT
in kernel-2.6.18-104.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 66 Andy Gospodarek 2008-09-10 12:48:46 EDT
*** Bug 433667 has been marked as a duplicate of this bug. ***
Comment 74 Don Domingo 2008-11-11 23:49:09 EST
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
the forcedeth-msi driver has been updated to fix a bug that prevented proper link-up detection.
Comment 76 Chris Ward 2008-11-28 02:01:29 EST
~~ Attention ~~ We Need Testing Feedback Soon ~~

We're nearing the end of the Red Hat Enterprise Linux 5.3 Testing Phase and this bug has not yet been VERIFIED. This bug should be fixed in the latest RHEL53 Beta Snapshot. It is critical that we receive your feedback ASAP. Otherwise, this bug is at risk of being dropped from the release. 

If you encounter any new issues, CLONE this bug and describe the new issues you are facing. We are no longer accepting NEW bugs into the release, bar critical regressions and blocker issues.

If you have VERIFIED this fix, add CustomerVerified to the Bugzilla Keywords, along with a description of the test results.
Comment 84 errata-xmlrpc 2009-01-20 14:32:49 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
