Description of problem:

A customer is experiencing intermittent panics in serial.c on heavily loaded systems:

Oops: 0000
ppp_async smbfs loop nfs lockd sunrpc dgrp ppp_generic slhc st aic79xx netconsole bcm5700 audit floppy sg microcode nls_iso8859-1 jfs keybdev mousedev hid inp
CPU: 3
EIP: 0060:[<021b209f>] Not tainted
EFLAGS: 00010296
EIP is at tty_wakeup [kernel] 0xf (2.4.21-23.EL.3.ttyhugemem/i686)
eax: 00000000 ebx: 00000000 ecx: c3088980 edx: 021c7d90
esi: a34d7dfc edi: 00000180 ebp: 00000001 esp: a34d7de8
ds: 0068 es: 0068 ss: 0068
Process lsof (pid: 31373, stackpage=a34d7000)
000002a2 a34d7dfc a34d7dfc 021304aa 00000000 c3088a00 c3088a00 00000001
00000000 021c7d8d 023a928c 021303c4 0247b4bc 02130262 00000003 0244e400
00000009 00000003 0000000a 0212fff5 0244e400 00000246 a34d7e40 627c6400
Call Trace:
[<021304aa>] __run_task_queue [kernel] 0x6a (0xa34d7df4)
[<021c7d8d>] do_serial_bh [kernel] 0x1d (0xa34d7e0c)
[<021303c4>] bh_action [kernel] 0x54 (0xa34d7e14)
[<02130262>] tasklet_hi_action [kernel] 0x62 (0xa34d7e1c)
[<0212fff5>] do_softirq [kernel] 0x105 (0xa34d7e34)
[<02269417>] .text.lock.tcp_ipv4 [kernel] 0x1dd (0xa34d7e54)
[<02195c46>] proc_file_read [kernel] 0x1a6 (0xa34d7f54)
[<02164eb3>] sys_read [kernel] 0xa3 (0xa34d7f94)
Code: Bad EIP value.
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
< netdump activated - performing handshake with the client. >

The trace contains a do_serial_bh call, which only the built-in COM port driver (serial.c) makes. From a quick browse of "drivers/char/serial.c":

/*
 * This routine is used to handle the "bottom half" processing for the
 * serial driver, known also the "software interrupt" processing.
 * This processing is done at the kernel interrupt level, after the
 * rs_interrupt() has returned, BUT WITH INTERRUPTS TURNED ON.  This
 * is where time-consuming activities which can not be done in the
 * interrupt driver proper are done; the interrupt driver schedules
 * them using rs_sched_event(), and they get done here.
 */
static void do_serial_bh(void)
{
	run_task_queue(&tq_serial);
}

Since tty_wakeup is the culprit in the stack trace, it's very likely that the serial.c driver queued a tty_wakeup() task in tq_serial by calling rs_sched_event(). That tty_wakeup() call then dereferences a NULL tty pointer, which results in the crash.

Attached is a patch that adds a check at the beginning of tty_wakeup() to verify that the tty struct is valid.
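
For readers without access to the attachment, the kind of guard described above can be sketched in user space. This is only an illustrative model, not the attached patch: struct tty_model, struct task_model, tty_wakeup_model() and run_task_queue_model() are hypothetical stand-ins for the kernel's tty_struct, a tq_serial entry, tty_wakeup() and run_task_queue().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct tty_struct. */
struct tty_model {
    int wakeups;               /* how many times this tty was woken */
};

/* Hypothetical stand-in for a tq_serial entry: a routine plus its argument. */
struct task_model {
    void (*routine)(void *);
    void *data;
};

/* Model of tty_wakeup() with the guard: return early on a NULL tty
 * instead of dereferencing it. */
void tty_wakeup_model(void *data)
{
    struct tty_model *tty = data;

    if (tty == NULL)           /* assumed shape of the proposed check */
        return;
    tty->wakeups++;
}

/* Model of run_task_queue(): run every queued routine once, in order. */
void run_task_queue_model(struct task_model *queue, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(queue[i].routine != NULL);
        queue[i].routine(queue[i].data);
    }
}
```

Queuing one entry with a valid tty and one with a NULL tty, then running the queue, wakes the valid tty once and skips the NULL entry rather than faulting. The guard only hides the symptom; it does not explain how a task with a stale or NULL tty got queued in the first place.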
Created attachment 121027: Patch to prevent tty NULL in tty_wakeup
This problem was fixed in RHEL3 U5. Please upgrade to U6 (2.4.21-37.EL). *** This bug has been marked as a duplicate of 131674 ***
Please look again at U6. This is a new patch that fixes an issue related to 131674, but it is not a dupe. The patch was generated against the U6 kernel tree. :)
Hi, Tom. The reason that I thought that this might be a dup is that the tty changes in U5 should prevent this problem from occurring. Before investing any time on this, I think we should have confirmation that this problem exists on U5 (or U6). Please verify this (and provide the oops output on a more recent kernel). Thanks in advance.
Hi, Tom. This is getting to be a difficult issue. Basically, the customer is running an unsupported kernel. We can't verify for certain that the tty fixes committed to U5 are exactly what they're running. (There were multiple versions of the very large and complex tty patch.) Further, at least one other tty change went into U5 that could be related (dealing with races in forking and controlling tty assignment). Lastly, I don't feel that the check in tty_wakeup() in comment #1 is appropriate, since if there's an open/close race in drivers/char/serial.c, the problem should be fixed in that driver. Thus, to make progress on resolving this issue, I think we need to have the problem reproduced on stock U6. Reassigning to Don and reverting to NEEDINFO (requesting a U6-based oops or reproducer).
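
To illustrate what fixing an open/close race "in that driver" could look like, here is a minimal single-threaded user-space sketch. It is an assumption-laden model, not code from drivers/char/serial.c: struct port_model, port_sched_event(), port_close() and port_softint() are hypothetical names, and a real driver would also need the appropriate locking or interrupt masking around these updates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct tty_struct. */
struct tty_model {
    int wakeups;
};

#define EVENT_WRITE_WAKEUP 0x1u    /* hypothetical deferred-event bit */

/* Hypothetical per-port driver state. */
struct port_model {
    struct tty_model *tty;     /* NULL while the port is closed */
    unsigned int pending;      /* bitmask of deferred events */
};

/* Interrupt side: record that bottom-half work is wanted. */
void port_sched_event(struct port_model *port, unsigned int event)
{
    assert(port != NULL);
    port->pending |= event;
}

/* Close path: cancel deferred work before dropping the tty reference. */
void port_close(struct port_model *port)
{
    assert(port != NULL);
    port->pending = 0;
    port->tty = NULL;
}

/* Bottom half: act only if the port still has a tty attached. */
void port_softint(struct port_model *port)
{
    struct tty_model *tty;

    assert(port != NULL);
    tty = port->tty;
    if (tty == NULL)
        return;
    if (port->pending & EVENT_WRITE_WAKEUP) {
        port->pending &= ~EVENT_WRITE_WAKEUP;
        tty->wakeups++;        /* stands in for waking the tty */
    }
}
```

In this model, an event scheduled just before close is discarded by port_close(), so a bottom half that runs afterwards never reaches the wakeup. The generic wakeup routine then needs no NULL check of its own.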
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.