Bug 474479

Summary: RHEL4.8 kernel crashed in net_rx_action() on IA64 machine in RHTS connectathon test
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.8
Hardware: All
OS: Linux
Reporter: Vivek Goyal <vgoyal>
Assignee: Neil Horman <nhorman>
QA Contact: Martin Jenner <mjenner>
CC: agospoda, clalance, jplans, jtluka, qcai
Status: CLOSED ERRATA
Severity: medium
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Doc Type: Bug Fix
Bug Blocks: 480741
Last Closed: 2009-05-18 19:10:24 UTC

Attachments:
patch to only allow owning cpu to manipulate poll_list entries (flags: none)

Description Vivek Goyal 2008-12-03 23:18:49 UTC
Description of problem:

Observed one crash with the RHEL4.8 78.20 kernel on an IA64 machine while running the RHTS connectathon test.

Logs of the crash are here:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5319550

Version-Release number of selected component (if applicable):

RHEL4 78.20

How reproducible:

Noticed only once.

Steps to Reproduce:
1. Run the RHTS connectathon tests.
  
Actual results:

Kernel oops in net_rx_action() (backtrace below).

Expected results:

Connectathon test completes without a kernel crash.

Additional info:

Pasting the backtrace:

Unable to handle kernel paging request at virtual address 0000000000100108
swapper[0]: Oops 11003706212352 [1]
Modules linked in: lp(U) nfs lockd nfs_acl md5 ipv6 parport_pc parport
netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat
loop button tg3 sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss
sym53c8xx scsi_transport_spi sd_mod scsi_mod

Pid: 0, CPU 5, comm:              swapper
psr : 0000121008022018 ifs : 8000000000000710 ip  : [<a0000001004ada11>]
Not tainted
ip is at net_rx_action+0x271/0x500
unat: 0000000000000000 pfs : 0000000000000710 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 80000000ff719aa5
ldrs: 0000000000000000 ccv : 0000000000000026 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001004ad9e0 b6  : a00000010026b080 b7  : a00000020030c900
f6  : 1003e0000000000d8475c f7  : 1003e20c49ba5e353f7cf
f8  : 1003e000000000699ad81 f9  : 1003e000000000ff00000
f10 : 1003e000000003b5f2d38 f11 : 1003e44b831eee7285baf
r1  : a0000001009e1080 r2  : e00000007e802178 r3  : 0000000000100100
r8  : 0000000000100108 r9  : a000000200353ec8 r10 : a000000200346a40
r11 : a00000020030c900 r12 : e0000001001efbb0 r13 : e0000001001e8000
r14 : e000001efe532684 r15 : e00000007e802180 r16 : 0000000000200200
r17 : e00000007e802030 r18 : e00000007e8023a8 r19 : 0000001008026018
r20 : a000000200353ec8 r21 : 0000000000000000 r22 : 0000000000000088
r23 : 0000000082400405 r24 : 00000000000000dd r25 : 00000000000000dd
r26 : e00000007c4a4012 r27 : 0000000000000348 r28 : 0000000000000348
r29 : e00000007c4a4010 r30 : 0000000000000026 r31 : e00000007e802030

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e0000001001ef740 bsp=e0000001001e91f0
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e0000001001ef910 bsp=e0000001001e91a8
 [<a00000010003e9b0>] die+0x150/0x240
                                sp=e0000001001ef930 bsp=e0000001001e9168
 [<a000000100064920>] ia64_do_page_fault+0x8e0/0xbe0
                                sp=e0000001001ef930 bsp=e0000001001e9100
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e0000001001ef9e0 bsp=e0000001001e9100
 [<a0000001004ada10>] net_rx_action+0x270/0x500
                                sp=e0000001001efbb0 bsp=e0000001001e9080
 [<a000000100086070>] __do_softirq+0x1f0/0x240
                                sp=e0000001001efbc0 bsp=e0000001001e8fe8
 [<a000000100086130>] do_softirq+0x70/0xc0
                                sp=e0000001001efbc0 bsp=e0000001001e8f88
 [<a000000100015e70>] ia64_handle_irq+0x1b0/0x1e0
                                sp=e0000001001efbc0 bsp=e0000001001e8f40
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e0000001001efbc0 bsp=e0000001001e8f40
 [<a000000100016360>] ia64_pal_call_static+0xa0/0xc0
                                sp=e0000001001efd90 bsp=e0000001001e8ef0
 [<a0000001000179e0>] default_idle+0x140/0x1e0
                                sp=e0000001001efd90 bsp=e0000001001e8ea0
 [<a000000100017ba0>] cpu_idle+0x120/0x2c0
                                sp=e0000001001efe30 bsp=e0000001001e8e58
 [<a00000010005d550>] start_secondary+0x2b0/0x2e0
                                sp=e0000001001efe30 bsp=e0000001001e8e20
 [<a000000100008180>] __end_ivt_text+0x260/0x290
                                sp=e0000001001efe30 bsp=e0000001001e8e20
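
For what it's worth, the faulting address itself hints at the nature of the bug: 0x0000000000100108 is LIST_POISON1 (0x00100100) plus 8, the offset of ->prev in a 64-bit struct list_head, and r3/r8/r16 above hold 0x00100100, 0x00100108 and 0x00200200 (LIST_POISON2). That suggests net_rx_action() followed the ->next pointer of a poll_list entry that had already been list_del()'ed, presumably by another CPU. For reference, list_del() in 2.6-era kernels poisons the unlinked entry roughly like this (stock include/linux/list.h, quoted from memory, not RHEL4-specific):

/*
 * Non-NULL poison values that are not normally mapped, so a stale
 * user of a deleted list entry faults immediately instead of
 * silently corrupting memory.
 */
#define LIST_POISON1  ((void *) 0x00100100)
#define LIST_POISON2  ((void *) 0x00200200)

static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;	/* walking ->next now faults near 0x00100100 */
	entry->prev = LIST_POISON2;	/* walking ->prev now faults near 0x00200200 */
}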

Comment 1 Neil Horman 2008-12-04 19:49:54 UTC
Created attachment 325735 [details]
patch to only allow owning cpu to manipulate poll_list entries

Vivek, gospo and I discussed this, and while we still need to hash out some of the specifics (we're not really happy about adding a new state bit), generally this is an approach to solve the problem.  Would you mind trying this out on your test system please?  Thanks!
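
For anyone reading along without the attachment, the shape of the fix is roughly this (a sketch only, with hypothetical names -- dev->poll_owner and __LINK_STATE_POLL_DONE are made up for illustration, and the actual patch uses a new device state bit and differs in detail): an entry on a per-CPU poll_list may only be unlinked by the CPU that owns that list, and a completion arriving on any other CPU just flags the device and leaves the unlink to the owner:

/* Sketch of the approach, not the attached patch. */
static inline void netif_rx_complete(struct net_device *dev)
{
	unsigned long flags;

	local_irq_save(flags);
	if (dev->poll_owner == smp_processor_id()) {
		/* Owning CPU: safe to list_del() from our own poll_list. */
		__netif_rx_complete(dev);
	} else {
		/* Foreign CPU: never touch another CPU's poll_list; just
		 * mark the poll complete and let the owner unlink the
		 * entry the next time its net_rx_action() runs. */
		set_bit(__LINK_STATE_POLL_DONE, &dev->state);
	}
	local_irq_restore(flags);
}

Either way, the invariant is the same: list_del(&dev->poll_list) only ever runs on the CPU whose softnet_data the entry is linked into.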

Comment 2 Vivek Goyal 2008-12-08 13:46:20 UTC
(In reply to comment #1)
> Created an attachment (id=325735) [details]
> patch to only allow owning cpu to manipulate poll_list entries
> 
> Vivek, gospo and I discussed this, and while we still need to hash out some of
> the specifics (we're not really happy about adding a new state bit), generally
> this is an approach to solve the problem.  Would you mind trying this out on
> your test system please?  Thanks!

Sure, Neil, I will reserve the system again and test it. This time I noticed the issue on an x86 system.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL.vgoyal.test3&arch=i386&jobids=38628

I never noticed these issues before, so the probability of the race condition actually happening has probably increased in our test setup.

Comment 3 Neil Horman 2008-12-08 15:58:08 UTC
It would seem so, yes.  Let me know what the test results are.  Thanks!

Comment 4 Vivek Goyal 2008-12-09 14:43:14 UTC
Noticed it one more time.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/filesystems/nfs/connectathon&result=Fail&rwhiteboard=kernel%202.6.9-78.21.EL%20smp&arch=x86_64&jobids=38805

Neil, I have done a scratch build with your patch, which is currently running through RHTS. Somehow RHTS is very slow. So far things seem to be fine.

Comment 5 Vivek Goyal 2008-12-09 14:44:08 UTC
As I am seeing this issue, I am bumping the priority up to high.

Comment 6 Neil Horman 2008-12-09 15:54:57 UTC
Ok, so I assume from comment #4 that you mean you saw it prior to the patch, and now with the patch it seems to be running well? If so, I'll post this shortly. Let me know if the bug re-occurs.

Comment 7 Neil Horman 2008-12-09 21:07:10 UTC
I've sent a copy of this patch upstream, since it appears the problem exists there as well.

Comment 8 Vivek Goyal 2008-12-10 13:37:05 UTC
Neil,

I ran an RHTS job with your patch built in and have not noticed any new issues. As you know, I don't have a definite method of reproducing the issue; it appears randomly on some machine during an RHTS run. With your patch it did not appear.

At the very least this means that with your patch I did not observe any undesired behavior in RHTS.

Maybe it is a good idea to post this patch, include it in RHEL4, and see how it does.

Comment 9 RHEL Program Management 2008-12-10 13:59:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Chris Lalancette 2008-12-16 15:26:56 UTC
Just so we have a better record than the email on the internal list:

I can still hit this, even with the patch applied.  I can also do it pretty reliably; I start up a RHEL-4 FV guest on a RHEL-5 Xen dom0, and then start a large network transfer from the host to the guest, and the guest will eventually OOPS (usually pretty quickly).  I've also attempted this follow-on patch from Neil:

diff -up linux-2.6.9/drivers/net/8139cp.c.clalance linux-2.6.9/drivers/net/8139cp.c
--- linux-2.6.9/drivers/net/8139cp.c.clalance	2008-12-15 13:45:03.000000000 -0500
+++ linux-2.6.9/drivers/net/8139cp.c	2008-12-15 13:48:31.000000000 -0500
@@ -619,9 +619,9 @@ rx_next:
 		if (cpr16(IntrStatus) & cp_rx_intr_mask)
 			goto rx_status_loop;
 
+		netif_rx_complete(dev);
 		local_irq_save(flags);
 		cpw16_f(IntrMask, cp_intr_mask);
-		__netif_rx_complete(dev);
 		local_irq_restore(flags);
 
 		return 0;	/* done */
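
For context on the hunk above: in 2.6.9-era kernels, netif_rx_complete() is essentially just the interrupt-safe wrapper around __netif_rx_complete() (approximate definitions from memory, not the exact RHEL4 source):

static inline void __netif_rx_complete(struct net_device *dev)
{
	/* Caller must already have interrupts disabled. */
	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
	list_del(&dev->poll_list);
	smp_mb__before_clear_bit();
	clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
}

static inline void netif_rx_complete(struct net_device *dev)
{
	unsigned long flags;

	local_irq_save(flags);
	__netif_rx_complete(dev);
	local_irq_restore(flags);
}

So the hunk only changes the ordering of the unlink relative to re-enabling the NIC's interrupt mask; the same CPU still performs the list_del().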


But the problem persists with that in place.

Chris Lalancette

Comment 11 Vivek Goyal 2008-12-17 16:08:47 UTC
Committed in 78.22.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 12 Jan Tluka 2009-01-08 10:10:33 UTC
RHEL4.8 QA ACK. Reproducer available in comment 10.

Comment 20 errata-xmlrpc 2009-05-18 19:10:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html