Red Hat Bugzilla – Bug 474479
RHEL4.8 kernel crashed in net_rx_action() on IA64 machine in RHTS connectathon test
Last modified: 2012-03-25 22:52:00 EDT
Description of problem:
Observed one crash with the RHEL4.8 78.20 kernel on an IA64 machine while running the RHTS connectathon test.
Logs of the crash are here.
Version-Release number of selected component (if applicable):
How reproducible:
Noticed only once.
Steps to Reproduce:
1. Run the RHTS tests.
Unable to handle kernel paging request at virtual address 0000000000100108
swapper: Oops 11003706212352 
Modules linked in: lp(U) nfs lockd nfs_acl md5 ipv6 parport_pc parport
netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat
loop button tg3 sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss
sym53c8xx scsi_transport_spi sd_mod scsi_mod
Pid: 0, CPU 5, comm: swapper
psr : 0000121008022018 ifs : 8000000000000710 ip : [<a0000001004ada11>]
ip is at net_rx_action+0x271/0x500
unat: 0000000000000000 pfs : 0000000000000710 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 80000000ff719aa5
ldrs: 0000000000000000 ccv : 0000000000000026 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001004ad9e0 b6 : a00000010026b080 b7 : a00000020030c900
f6 : 1003e0000000000d8475c f7 : 1003e20c49ba5e353f7cf
f8 : 1003e000000000699ad81 f9 : 1003e000000000ff00000
f10 : 1003e000000003b5f2d38 f11 : 1003e44b831eee7285baf
r1 : a0000001009e1080 r2 : e00000007e802178 r3 : 0000000000100100
r8 : 0000000000100108 r9 : a000000200353ec8 r10 : a000000200346a40
r11 : a00000020030c900 r12 : e0000001001efbb0 r13 : e0000001001e8000
r14 : e000001efe532684 r15 : e00000007e802180 r16 : 0000000000200200
r17 : e00000007e802030 r18 : e00000007e8023a8 r19 : 0000001008026018
r20 : a000000200353ec8 r21 : 0000000000000000 r22 : 0000000000000088
r23 : 0000000082400405 r24 : 00000000000000dd r25 : 00000000000000dd
r26 : e00000007c4a4012 r27 : 0000000000000348 r28 : 0000000000000348
r29 : e00000007c4a4010 r30 : 0000000000000026 r31 : e00000007e802030
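One possible reading of the register dump above, offered as an assumption rather than a confirmed diagnosis: r3 (0000000000100100) and r16 (0000000000200200) match the kernel's list poison values, and the faulting address 0000000000100108 is the first poison value plus 8, which is the offset of `prev` in a `struct list_head` on a 64-bit machine. That pattern is consistent with a second `list_del()` on an entry that had already been removed from its list. The user-space sketch below reproduces only the arithmetic; `second_del_fault_address` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* List poison values used by the kernel's list implementation. */
#define LIST_POISON1 ((void *)0x00100100)
#define LIST_POISON2 ((void *)0x00200200)

struct list_head {
    struct list_head *next, *prev;
};

/* Same effect as the kernel's list_del(): unlink the entry, then poison it. */
static void list_del(struct list_head *entry)
{
    assert(entry != NULL);
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: the address that a second list_del() on the same
 * entry would store to first, i.e. &entry->next->prev. It is computed
 * here without being dereferenced.
 */
static uintptr_t second_del_fault_address(const struct list_head *entry)
{
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}
```

On an LP64 target, calling `second_del_fault_address()` on an entry that has already been through `list_del()` gives 0x100108, the address reported in the oops.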
Created attachment 325735 [details]
patch to only allow owning cpu to manipulate poll_list entries
Vivek, gospo and I discussed this, and while we still need to hash out some of the specifics (we're not really happy about adding a new state bit), generally this is an approach to solve the problem. Would you mind trying this out on your test system please? Thanks!
(In reply to comment #1)
> Created an attachment (id=325735) [details]
> patch to only allow owning cpu to manipulate poll_list entries
> Vivek, gospo and I discussed this, and while we still need to hash out some of
> the specifics (we're not really happy about adding a new state bit), generally
> this is an approach to solve the problem. Would you mind trying this out on
> your test system please? Thanks!
Sure Neil, I will reserve the system again and test it. This time I noticed the issue on an x86 system.
I never noticed these issues before, so the probability of the race condition actually occurring has probably increased in our test setup.
It would seem so, yes. Let me know what the test results are. Thanks!
Noticed it one more time.
Neil, I have done a scratch build with your patch which is currently running through RHTS. Somehow RHTS is very slow. So far things seem to be fine.
As I am seeing this issue, bumping up the priority to high.
OK, so I assume from comment #4 that you saw it prior to the patch, and that with the patch it now seems to be running well? If so, I'll post this shortly. Let me know if the bug recurs.
I've sent a copy of this patch upstream, since it appears the problem exists there as well.
I ran an RHTS job with your patch built in. I have not noticed any new issues. As you know, I don't have a definite method of reproducing the issue. It appears randomly on some machine during an RHTS run. With your patch it did not appear.
It does at least mean that with your patch I did not observe any undesired behavior in RHTS.
Maybe it is a good idea to post this patch, include it in RHEL4, and see how it does.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Just so we have a better record than the email on the internal list:
I can still hit this, even with the patch applied. I can also do it pretty reliably; I start up a RHEL-4 FV guest on a RHEL-5 Xen dom0, and then start a large network transfer from the host to the guest, and the guest will eventually OOPS (usually pretty quickly). I've also attempted this follow-on patch from Neil:
diff -up linux-2.6.9/drivers/net/8139cp.c.clalance linux-2.6.9/drivers/net/8139cp.c
--- linux-2.6.9/drivers/net/8139cp.c.clalance 2008-12-15 13:45:03.000000000 -0500
+++ linux-2.6.9/drivers/net/8139cp.c 2008-12-15 13:48:31.000000000 -0500
@@ -619,9 +619,9 @@ rx_next:
if (cpr16(IntrStatus) & cp_rx_intr_mask)
return 0; /* done */
But the problem persists with that in place.
Committed in 78.22.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
RHEL4.8 QA ACK. Reproducer available in comment 10.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.