Bug 474479
Summary: | RHEL4.8 kernel crashed in net_rx_action() on IA64 machine in RHTS connectathon test
---|---
Product: | Red Hat Enterprise Linux 4
Component: | kernel
Version: | 4.8
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | medium
Priority: | urgent
Target Milestone: | rc
Keywords: | ZStream
Doc Type: | Bug Fix
Reporter: | Vivek Goyal <vgoyal>
Assignee: | Neil Horman <nhorman>
QA Contact: | Martin Jenner <mjenner>
CC: | agospoda, clalance, jplans, jtluka, qcai
Bug Blocks: | 480741
Last Closed: | 2009-05-18 19:10:24 UTC
Description
Vivek Goyal
2008-12-03 23:18:49 UTC
Created attachment 325735 [details]
patch to only allow owning cpu to manipulate poll_list entries
Vivek, gospo and I discussed this, and while we still need to hash out some of the specifics (we're not really happy about adding a new state bit), generally this is an approach to solve the problem. Would you mind trying this out on your test system please? Thanks!
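For readers following along, here is a minimal userspace sketch of the rule named in the attachment title: once a device is queued on one CPU's poll list, only that CPU may unlink it. This is not the attached patch. The names `poll_dev`, `cpu_queue`, `sched_poll` and `complete_poll` are hypothetical, the `scheduled` flag only stands in for a scheduling state bit, and the model is single-threaded, so it leaves out the atomic bit operations and interrupt masking that real kernel code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for a network device that can be queued for polling. */
struct poll_dev {
    struct poll_dev *next;  /* link in the owning CPU's poll list */
    bool scheduled;         /* models a "scheduled for polling" state bit */
    int owner_cpu;          /* CPU whose list holds the device, or NO_OWNER */
};

/* Hypothetical stand-in for one CPU's private poll list. */
struct cpu_queue {
    int cpu;
    struct poll_dev *head;
};

/* Queue dev on q's poll list unless it is already scheduled somewhere. */
static bool sched_poll(struct cpu_queue *q, struct poll_dev *dev)
{
    assert(q != NULL && dev != NULL);
    if (dev->scheduled)
        return false;
    dev->scheduled = true;
    dev->owner_cpu = q->cpu;
    dev->next = q->head;
    q->head = dev;
    return true;
}

/* Unlink dev, but only when called on behalf of the CPU that queued it. */
static bool complete_poll(struct cpu_queue *q, struct poll_dev *dev)
{
    struct poll_dev **pp;

    assert(q != NULL && dev != NULL);
    if (!dev->scheduled || dev->owner_cpu != q->cpu)
        return false;  /* a non-owning CPU must not touch the list entry */

    for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == dev) {
            *pp = dev->next;
            dev->next = NULL;
            dev->owner_cpu = NO_OWNER;
            dev->scheduled = false;
            return true;
        }
    }
    return false;
}
```

In this model, a `complete_poll()` call made against a queue other than the one that scheduled the device is refused and leaves the owner's list intact, which is the behaviour the attachment title describes.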
(In reply to comment #1)
> Created an attachment (id=325735) [details]
> patch to only allow owning cpu to manipulate poll_list entries
>
> Vivek, gospo and I discussed this, and while we still need to hash out some of
> the specifics (we're not really happy about adding a new state bit), generally
> this is an approach to solve the problem. Would you mind trying this out on
> your test system please? Thanks!

Sure Neil, I will reserve the system again and test it.

This time I noticed the issue on an x86 system.
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL.vgoyal.test3&arch=i386&jobids=38628
I never noticed these issues before, so the probability of the race condition actually occurring has probably increased in our test setup.

It would seem so, yes. Let me know what the test results are. Thanks!

Noticed it one more time.
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL%20smp&arch=x86_64&jobids=38805

Neil, I have done a scratch build with your patch, which is currently running through RHTS. Somehow RHTS is very slow. So far things seem to be fine. As I am seeing this issue, bumping up the priority to high.

Ok, so I assume that from comment #4 you mean to say that you saw it prior to the patch, and now with the patch, it seems to be running well? If so, I'll post this shortly. Let me know if the bug re-occurs.

I've sent a copy of this patch upstream, since it appears the problem exists there as well.

Neil, I ran an RHTS job with your patch built in. I have not noticed any new issues. As you know, I don't have a definite method of reproducing the issue; it appears randomly on some machines during an RHTS run. With your patch it did not appear. At the very least, that means I did not observe any undesired behavior in RHTS with your patch. Maybe it is a good idea to post this patch, include it in RHEL4, and see how it does.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Just so we have a better record than the email on the internal list: I can still hit this, even with the patch applied. I can also do it pretty reliably; I start up a RHEL-4 FV guest on a RHEL-5 Xen dom0, and then start a large network transfer from the host to the guest, and the guest will eventually OOPS (usually pretty quickly). I've also attempted this follow-on patch from Neil:

    diff -up linux-2.6.9/drivers/net/8139cp.c.clalance linux-2.6.9/drivers/net/8139cp.c
    --- linux-2.6.9/drivers/net/8139cp.c.clalance 2008-12-15 13:45:03.000000000 -0500
    +++ linux-2.6.9/drivers/net/8139cp.c 2008-12-15 13:48:31.000000000 -0500
    @@ -619,9 +619,9 @@ rx_next:
     if (cpr16(IntrStatus) & cp_rx_intr_mask)
     goto rx_status_loop;
    + netif_rx_complete(dev);
     local_irq_save(flags);
     cpw16_f(IntrMask, cp_intr_mask);
    - __netif_rx_complete(dev);
     local_irq_restore(flags);
     return 0; /* done */

But the problem persists with that in place.

Chris Lalancette

Committed in 78.22.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

RHEL4.8 QA ACK. Reproducer available in comment 10.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-1024.html