Bug 433667
| Summary: | [RHEL5 U2] Kernel forcedeth driver message | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jeff Burke <jburke> | ||||||
| Component: | kernel | Assignee: | Andy Gospodarek <agospoda> | ||||||
| Status: | CLOSED DUPLICATE | QA Contact: | Martin Jenner <mjenner> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | urgent | ||||||||
| Version: | 5.2 | CC: | anton, dhoward, dmair, dzickus, fleitner, james.brown, mgahagan, peterm, tao | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| URL: | http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=1964892 | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2008-09-10 16:48:46 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
Description
Jeff Burke
2008-02-20 19:11:13 UTC
Created attachment 295441 [details]
Full log
Reverse Engineered nForce ethernet driver
RHEL driver based on upstream driver version 0.60
Also includes additional upstream commits:
3ba4d093fe8a26f5f2da94411bf8732fa6e9da86 forcedeth: fix tx timeout
fcc5f2665c81e087fb95143325ed769a41128d50 forcedeth: fix nic poll
6fedae1f6e66ab5f169bf58064e23e015fc1307d forcedeth: fix checksum feature in mcp65
caf96469e8ab57170cc8ca9c59809132d38e529e forcedeth: disable msix
e0379a14fc80cb98978fa86989dab77b522a8106 forcedeth: fixed missing call in napi poll
a7475906bc496456ded9e4b062f94067fb93057a forcedeth: msi bugfix

Can I have access to the machine?

Of the patches included in this forcedeth update, this is the one that was designed to fix this problem upstream:
a7475906bc496456ded9e4b062f94067fb93057a forcedeth: msi bugfix

What's interesting is that on rhel5 it doesn't have the desired effect -- that interrupts are correctly disabled when we hope they are. I've been looking at another interesting forcedeth problem that seems to be related to this, so I'd like to see if this can be tested with pci=nomsi on the kernel command line. I'm guessing MSI is enabled right now.

As I suspected, the patch that was added in 2.6.18-50.1.3 for the forcedeth msi bugfix seems to be giving us problems here. NFS connectathon test results:
2.6.18-53.1.2 -- pass
2.6.18-53.1.3 -- fail
2.6.18-53.1.3, with pci=nomsi on kernel cmd line -- pass

I'm hoping I can do some work on the forcedeth driver to resolve this, since I'm a bit worried that trying to pull all the MSI fixes from the latest upstream will be too much.

I would strongly encourage us to NOT revert this patch from rhel5. I would rather see us apply a patch on top to resolve this issue. Without this patch there will be problems, since we are not really enabling and disabling the correct interrupts. I would rather correct the issue than paper over a new problem by removing the needed patch.
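For anyone repeating the pci=nomsi comparison above, it helps to confirm that a given boot really had MSI disabled before recording a pass or fail. The following is only a rough sketch, not part of any fix: the helper names `msi_disabled` and `device_uses_msi` are made up for illustration, the sample strings are hypothetical, and the assumption that an MSI interrupt shows the text "MSI" in its /proc/interrupts line should be checked against the kernel under test.

```python
def msi_disabled(cmdline):
    """Return True if 'nomsi' is among the pci= options on a kernel command line."""
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= takes a comma-separated option list, e.g. pci=noacpi,nomsi
            if "nomsi" in token[len("pci="):].split(","):
                return True
    return False


def device_uses_msi(interrupts_text, device):
    """Return True if a /proc/interrupts line naming `device` also mentions MSI.

    Assumption: the interrupt type column contains the substring "MSI"
    (for example "PCI-MSI") when the device is using an MSI vector.
    """
    for line in interrupts_text.splitlines():
        # Shared IRQs list several devices separated by commas.
        fields = line.replace(",", " ").split()
        if device in fields and any("MSI" in field for field in fields):
            return True
    return False


# Hypothetical sample data; on a live system read /proc/cmdline and
# /proc/interrupts instead.
sample_cmdline = "ro root=/dev/VolGroup00/LogVol00 pci=nomsi"
sample_interrupts = " 17:      12345   IO-APIC-level  eth0, ohci_hcd"

print("pci=nomsi on command line:", msi_disabled(sample_cmdline))
print("eth0 using MSI:", device_uses_msi(sample_interrupts, "eth0"))
```

On a test machine the two inputs would come from reading /proc/cmdline and /proc/interrupts after boot, so that each connectathon result can be recorded together with the interrupt mode actually in use.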
I've started to notice what I feel are problems with enable_irq and disable_irq calls in the forcedeth driver. I recently patched the ethtool_set_settings function because I determined that writing to the BMCR register while interrupts were disabled resulted in no interrupts ever coming back out of the hardware. My guess is that upstream changes to interrupt handling, such as dropping pending interrupts (or saving them so they can be posted later), may be somewhat related, but this is just a hunch based on what I've observed.

After a small patch to the MSI subsystem I can now run the NFS connectathon tests on the same system used in the original test (hp-xw9400-01.rhts.boston.redhat.com), and it appears that no tests failed. There isn't any great output indicating that, but I see nothing but 'PASS' messages on the screen and none of the original messages:
eth0: too many iterations (6) in nv_nic_irq.

There were a few messages in the test output like this:
Mar 7 09:23:29 hp-xw9400-01 kernel: nfs: server sol9-nfs not responding, still trying
Mar 7 09:23:29 hp-xw9400-01 kernel: nfs: server sol9-nfs OK
Mar 7 09:23:34 hp-xw9400-01 kernel: nfs: server sol9-nfs not responding, still trying
Mar 7 09:23:34 hp-xw9400-01 kernel: nfs: server sol9-nfs not responding, still trying
Mar 7 09:23:34 hp-xw9400-01 kernel: nfs: server sol9-nfs OK
Mar 7 09:23:34 hp-xw9400-01 kernel: nfs: server sol9-nfs OK
Mar 7 09:23:39 hp-xw9400-01 kernel: nfs: server sol9-nfs not responding, still trying
but I don't know if that was caused by the test or not. I also looked at /mnt/tests/kernel/filesystems/nfs/connectathon/cthon04/result.txt and it appears to be zero length -- hopefully that's good.

Oh yeah, test kernels are available here: http://people.redhat.com/agospoda/

This patch (or something similar) is what I would like to consider for rhel5.2 (if possible).
http://people.redhat.com/agospoda/rhel5/irq-msi-upstream-fixes.patch

The problem I currently see is that this is only a few (5-6) of the patches needed to make this work whereas there are close to a dozen patches in the original upstream set. I can look over the changes, but am not an expert on this, so I will probably need to get someone to at least keep me in check (whether they are an expert or not).

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 312109 [details]
RHEL5.2 Forcedeth Failure
We have experienced a similar NIC crash with RHEL 5.2 running on the Nvidia chipset:
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
The full log details have been attached to this bug report. Has any progress been made on further testing and integrating the suggested fixes in RHEL5.2? Is it known whether RHEL5.0 was prone to the same bug?

RHEL5 should not be problematic, but later kernels will have problems. What is unfortunate is that a small set of users had problems that appeared in 5.2 from a patch that fixed problems that all users would have on 5.1. The root of the 5.2 issues is some MSI problems in 2.6.18 that were fixed in 2.6.19 and later. Those patches will soon be added to my test kernels and will appear in the kernel version 2.6.18-94.el5.gtest.50, which will be available here later today: http://people.redhat.com/agospoda/#rhel5

*** This bug has been marked as a duplicate of bug 428696 ***