Description of problem:

Observed one crash with the RHEL4.8 78.20 kernel on an IA64 machine while running the RHTS connectathon test. Logs of the crash are here:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5319550

Version-Release number of selected component (if applicable):
RHEL4 78.20

How reproducible:
Noticed only once.

Steps to Reproduce:
1. Run RHTS tests

Actual results:

Expected results:

Additional info:

Pasting the backtrace:

Unable to handle kernel paging request at virtual address 0000000000100108
swapper[0]: Oops 11003706212352 [1]
Modules linked in: lp(U) nfs lockd nfs_acl md5 ipv6 parport_pc parport netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat loop button tg3 sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sym53c8xx scsi_transport_spi sd_mod scsi_mod

Pid: 0, CPU 5, comm: swapper
psr : 0000121008022018 ifs : 8000000000000710 ip : [<a0000001004ada11>] Not tainted
ip is at net_rx_action+0x271/0x500
unat: 0000000000000000 pfs : 0000000000000710 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 80000000ff719aa5
ldrs: 0000000000000000 ccv : 0000000000000026 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001004ad9e0 b6 : a00000010026b080 b7 : a00000020030c900
f6 : 1003e0000000000d8475c f7 : 1003e20c49ba5e353f7cf
f8 : 1003e000000000699ad81 f9 : 1003e000000000ff00000
f10 : 1003e000000003b5f2d38 f11 : 1003e44b831eee7285baf
r1 : a0000001009e1080 r2 : e00000007e802178 r3 : 0000000000100100
r8 : 0000000000100108 r9 : a000000200353ec8 r10 : a000000200346a40
r11 : a00000020030c900 r12 : e0000001001efbb0 r13 : e0000001001e8000
r14 : e000001efe532684 r15 : e00000007e802180 r16 : 0000000000200200
r17 : e00000007e802030 r18 : e00000007e8023a8 r19 : 0000001008026018
r20 : a000000200353ec8 r21 : 0000000000000000 r22 : 0000000000000088
r23 : 0000000082400405 r24 : 00000000000000dd r25 : 00000000000000dd
r26 : e00000007c4a4012 r27 : 0000000000000348 r28 : 0000000000000348
r29 : e00000007c4a4010 r30 : 0000000000000026 r31 : e00000007e802030

Call Trace:
[<a000000100016e40>] show_stack+0x80/0xa0            sp=e0000001001ef740 bsp=e0000001001e91f0
[<a000000100017750>] show_regs+0x890/0x8c0           sp=e0000001001ef910 bsp=e0000001001e91a8
[<a00000010003e9b0>] die+0x150/0x240                 sp=e0000001001ef930 bsp=e0000001001e9168
[<a000000100064920>] ia64_do_page_fault+0x8e0/0xbe0  sp=e0000001001ef930 bsp=e0000001001e9100
[<a00000010000f600>] ia64_leave_kernel+0x0/0x260     sp=e0000001001ef9e0 bsp=e0000001001e9100
[<a0000001004ada10>] net_rx_action+0x270/0x500       sp=e0000001001efbb0 bsp=e0000001001e9080
[<a000000100086070>] __do_softirq+0x1f0/0x240        sp=e0000001001efbc0 bsp=e0000001001e8fe8
[<a000000100086130>] do_softirq+0x70/0xc0            sp=e0000001001efbc0 bsp=e0000001001e8f88
[<a000000100015e70>] ia64_handle_irq+0x1b0/0x1e0     sp=e0000001001efbc0 bsp=e0000001001e8f40
[<a00000010000f600>] ia64_leave_kernel+0x0/0x260     sp=e0000001001efbc0 bsp=e0000001001e8f40
[<a000000100016360>] ia64_pal_call_static+0xa0/0xc0  sp=e0000001001efd90 bsp=e0000001001e8ef0
[<a0000001000179e0>] default_idle+0x140/0x1e0        sp=e0000001001efd90 bsp=e0000001001e8ea0
[<a000000100017ba0>] cpu_idle+0x120/0x2c0            sp=e0000001001efe30 bsp=e0000001001e8e58
[<a00000010005d550>] start_secondary+0x2b0/0x2e0     sp=e0000001001efe30 bsp=e0000001001e8e20
[<a000000100008180>] __end_ivt_text+0x260/0x290      sp=e0000001001efe30 bsp=e0000001001e8e20
Created attachment 325735 [details] patch to only allow owning cpu to manipulate poll_list entries Vivek, gospo and I discussed this, and while we still need to hash out some of the specifics (we're not really happy about adding a new state bit), generally this is an approach to solve the problem. Would you mind trying this out on your test system please? Thanks!
(In reply to comment #1)
> Created an attachment (id=325735) [details]
> patch to only allow owning cpu to manipulate poll_list entries
>
> Vivek, gospo and I discussed this, and while we still need to hash out some of
> the specifics (we're not really happy about adding a new state bit), generally
> this is an approach to solve the problem. Would you mind trying this out on
> your test system please? Thanks!

Sure Neil, I will reserve the system again and test it. This time I noticed the issue on an x86 system:

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL.vgoyal.test3&arch=i386&jobids=38628

I never noticed these issues before, so the probability of the race condition actually occurring has probably increased in our test setup.
It would seem so, yes. Let me know what the test results are. Thanks!
Noticed it one more time:

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL%20smp&arch=x86_64&jobids=38805

Neil, I have done a scratch build with your patch, which is currently running through RHTS. RHTS is very slow for some reason; so far things seem to be fine.
Since I am seeing this issue, I am bumping the priority up to high.
Ok, so I assume from comment #4 that you mean you saw the issue prior to the patch, and now with the patch it seems to be running well? If so, I'll post this shortly. Let me know if the bug re-occurs.
I've sent a copy of this patch upstream, since it appears the problem exists there as well.
Neil, I ran an RHTS job with your patch built in and have not noticed any new issues. As you know, I don't have a definite method of reproducing the issue; it appears randomly on some machine during an RHTS run. With your patch it did not appear. At the very least, that means I did not observe any undesired behavior in RHTS with your patch applied. It may be a good idea to post this patch, include it in RHEL4, and see how it does.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Just so we have a better record than the email on the internal list: I can still hit this, even with the patch applied. I can also do it pretty reliably; I start up a RHEL-4 FV guest on a RHEL-5 Xen dom0, and then start a large network transfer from the host to the guest, and the guest will eventually OOPS (usually pretty quickly).

I've also attempted this follow-on patch from Neil:

diff -up linux-2.6.9/drivers/net/8139cp.c.clalance linux-2.6.9/drivers/net/8139cp.c
--- linux-2.6.9/drivers/net/8139cp.c.clalance	2008-12-15 13:45:03.000000000 -0500
+++ linux-2.6.9/drivers/net/8139cp.c	2008-12-15 13:48:31.000000000 -0500
@@ -619,9 +619,9 @@ rx_next:
 	if (cpr16(IntrStatus) & cp_rx_intr_mask)
 		goto rx_status_loop;
 
+	netif_rx_complete(dev);
 	local_irq_save(flags);
 	cpw16_f(IntrMask, cp_intr_mask);
-	__netif_rx_complete(dev);
 	local_irq_restore(flags);
 
 	return 0;	/* done */

But the problem persists with that in place.

Chris Lalancette
Committed in 78.22.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
RHEL4.8 QA ACK. Reproducer available in comment 10.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html