173146 – Panic in serial.c


Summary: Panic in serial.c

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Don Howard
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-11-14 17:06 UTC by Tom "spot" Callaway
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 18:51:14 UTC
Target Upstream Version:
Embargoed:

Attachments
Patch to prevent tty NULL in tty_wakeup (321 bytes, patch) 2005-11-14 17:06 UTC, Tom "spot" Callaway	no flags

Description Tom "spot" Callaway 2005-11-14 17:06:08 UTC

Description of problem:

A customer is experiencing intermittent panics in serial.c on heavily loaded
systems:

Oops: 0000
ppp_async smbfs loop nfs lockd sunrpc dgrp
ppp_generic slhc st aic79xx netconsole bcm5700 audit floppy sg microcode
nls_iso8859-1 jfs keybdev mousedev hid inp
CPU: 3
EIP: 0060:[<021b209f>] Not tainted
EFLAGS: 00010296

EIP is at tty_wakeup [kernel] 0xf (2.4.21-23.EL.3.ttyhugemem/i686)
eax: 00000000 ebx: 00000000 ecx: c3088980 edx: 021c7d90
esi: a34d7dfc edi: 00000180 ebp: 00000001 esp: a34d7de8
ds: 0068 es: 0068 ss: 0068
Process lsof (pid: 31373, stackpage=a34d7000)
000002a2 a34d7dfc a34d7dfc 021304aa 00000000 c3088a00 c3088a00 00000001
00000000 021c7d8d 023a928c 021303c4 0247b4bc 02130262 00000003 0244e400
00000009 00000003 0000000a 0212fff5 0244e400 00000246 a34d7e40 627c6400
Call Trace: [<021304aa>] __run_task_queue [kernel] 0x6a (0xa34d7df4)
            [<021c7d8d>] do_serial_bh [kernel] 0x1d (0xa34d7e0c)
            [<021303c4>] bh_action [kernel] 0x54 (0xa34d7e14)
            [<02130262>] tasklet_hi_action [kernel] 0x62 (0xa34d7e1c)
            [<0212fff5>] do_softirq [kernel] 0x105 (0xa34d7e34)
            [<02269417>] .text.lock.tcp_ipv4 [kernel] 0x1dd (0xa34d7e54)
            [<02195c46>] proc_file_read [kernel] 0x1a6 (0xa34d7f54)
            [<02164eb3>] sys_read [kernel] 0xa3 (0xa34d7f94)

Code: Bad EIP value.

CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
< netdump activated - performing handshake with the client. >

The trace includes a call to do_serial_bh(), which is defined only in the
built-in COM port driver (serial.c). From a quick browse of
drivers/char/serial.c:

/*
 * This routine is used to handle the "bottom half" processing for the
 * serial driver, known also the "software interrupt" processing.
 * This processing is done at the kernel interrupt level, after the
 * rs_interrupt() has returned, BUT WITH INTERRUPTS TURNED ON. This
 * is where time-consuming activities which can not be done in the
 * interrupt driver proper are done; the interrupt driver schedules
 * them using rs_sched_event(), and they get done here.
 */
static void
do_serial_bh(void)
{
	run_task_queue(&tq_serial);
}

Since the fault is in tty_wakeup (EIP is at tty_wakeup+0xf), it's very
likely that the serial.c driver queued up a tty_wakeup() task in
tq_serial by calling rs_sched_event(). It's that tty_wakeup() call that's
dereferencing a NULL tty pointer, which results in the crash.
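
To make the suspected sequence concrete, here is a minimal userspace sketch
in C. Everything in it is hypothetical: struct tty_model,
struct serial_info_model, sched_event_model(), serial_bh_model() and
softint_model() are simplified stand-ins for the tty struct,
rs_sched_event(), do_serial_bh() and the queued handler, not the kernel's
actual code. It only models the ordering: work is queued while the port is
open, but runs after the tty pointer has been cleared by a close.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's real definitions. */
struct tty_model { int write_wakeups; };

struct serial_info_model {
    struct tty_model *tty;   /* set on open, cleared on close */
};

/* One-slot stand-in for the tq_serial task queue. */
struct task_model {
    void (*routine)(void *);
    void *data;
    int queued;
};

static struct task_model tq_serial_model;
static struct tty_model *tty_seen_by_handler;

/* Models rs_sched_event(): only queues the work. The tty pointer is
 * read later, when the handler runs, not captured here. */
static void sched_event_model(struct serial_info_model *info,
                              void (*routine)(void *))
{
    assert(info->tty != NULL);   /* work is only queued while the port is open */
    tq_serial_model.routine = routine;
    tq_serial_model.data = info;
    tq_serial_model.queued = 1;
}

/* Models do_serial_bh() -> run_task_queue(&tq_serial). */
static void serial_bh_model(void)
{
    if (tq_serial_model.queued) {
        tq_serial_model.queued = 0;
        tq_serial_model.routine(tq_serial_model.data);
    }
}

/* Deferred handler: records the pointer it would pass to tty_wakeup()
 * instead of dereferencing it, so the model itself never crashes. */
static void softint_model(void *data)
{
    struct serial_info_model *info = data;
    tty_seen_by_handler = info->tty;
}
```

Queuing the work, setting info.tty to NULL (a close), and only then running
serial_bh_model() leaves tty_seen_by_handler equal to NULL, which is the
situation in which the real tty_wakeup() call would oops.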

Attached is a patch that adds a check at the beginning of tty_wakeup() to
verify that the tty pointer it is passed is valid (non-NULL).
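
The kind of guard described can be sketched as follows. This is not the
attached patch (its contents are not reproduced here); struct tty_model and
tty_wakeup_checked() are hypothetical userspace stand-ins, with a counter in
place of the real line-discipline wakeup.

```c
#include <stddef.h>

/* Hypothetical stand-in for the tty struct. */
struct tty_model { int wakeups; };

/* Returns 1 if a wakeup was delivered, 0 if the call was ignored
 * because no tty was attached. */
static int tty_wakeup_checked(struct tty_model *tty)
{
    if (tty == NULL)
        return 0;          /* bail out instead of dereferencing NULL */
    tty->wakeups++;        /* stands in for the line discipline's write wakeup */
    return 1;
}
```

Such a check only helps when the pointer has actually been set to NULL; it
cannot detect a tty that has been freed while a stale, non-NULL pointer to
it is still in use.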

Comment 1 Tom "spot" Callaway 2005-11-14 17:06:08 UTC

Created attachment 121027
Patch to prevent tty NULL in tty_wakeup

Comment 2 Ernie Petrides 2005-11-15 01:06:59 UTC

This problem was fixed in RHEL3 U5.  Please upgrade to U6 (2.4.21-37.EL).


*** This bug has been marked as a duplicate of 131674 ***

Comment 3 Tom "spot" Callaway 2005-11-15 15:58:29 UTC

Please look again at U6. This is a new patch that fixes an issue related to
bug 131674, but it is not a dupe.

This patch was generated off the U6 kernel tree. :)

Comment 4 Ernie Petrides 2005-11-15 21:21:38 UTC

Hi, Tom.  The reason that I thought that this might be a dup is that the
tty changes in U5 should prevent this problem from occurring.  Before
investing any time in this, I think we should have confirmation that
this problem exists on U5 (or U6).  Please verify this (and provide
the oops output on a more recent kernel).

Thanks in advance.

Comment 6 Ernie Petrides 2005-11-15 22:03:28 UTC

Hi, Tom.  This is getting to be a difficult issue.  Basically, the customer
is running an unsupported kernel.  We can't verify for certain that the tty
fixes committed to U5 are exactly what they're running.  (There were multiple
versions of the very large and complex tty patch.)  Further, at least one
other tty change went into U5 that could be related (dealing with races in
forking and controlling tty assignment).  Lastly, I don't feel that the
check in tty_wakeup() in comment #1 is appropriate, since if there's an
open/close race in drivers/char/serial.c, the problem should be fixed in
that driver.
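
A driver-side fix of the kind suggested here could, for example, cancel any
pending deferred work for a port before its tty pointer is dropped, so the
bottom half never runs against a closed port. The sketch below is a
hypothetical, single-threaded userspace model of that ordering only;
struct port_model, port_sched_wakeup(), port_bh() and port_close() are
invented names, not serial.c functions, and this is not a fix that was
applied for this bug.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the RHEL3 serial.c structures. */
struct tty_model { int wakeups; };

struct port_model {
    struct tty_model *tty;
    int event_pending;     /* set by the interrupt path, consumed by the bottom half */
};

/* Interrupt path: note that a write wakeup is needed. */
static void port_sched_wakeup(struct port_model *p)
{
    p->event_pending = 1;
}

/* Bottom half: acts only on events that are still pending. */
static void port_bh(struct port_model *p)
{
    if (!p->event_pending)
        return;
    p->event_pending = 0;
    assert(p->tty != NULL);   /* guaranteed by port_close() cancelling work first */
    p->tty->wakeups++;
}

/* Close path: cancel pending work, then drop the tty (no locking in this model). */
static void port_close(struct port_model *p)
{
    p->event_pending = 0;
    p->tty = NULL;
}
```

In the real driver the interrupt, bottom-half and close paths can run
concurrently on different CPUs, so the cancel-then-clear step would have to
be done under the same lock those paths take.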

Thus, to make progress on resolving this issue, I think we need to have the
problem reproduced on stock U6.

Reassigning to Don and reverting to NEEDINFO (requesting a U6-based oops or
reproducer).

Comment 8 RHEL Program Management 2007-10-19 18:51:14 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
