Bug 144059 (IT_52345_64378)

Summary:

CAN-2005-0403 panic in tty init_dev

Product:

Red Hat Enterprise Linux 3

Reporter:

Wendy Cheng <nobody+wcheng>

Component:

kernel

Assignee:

Jason Baron <jbaron>

Status:

CLOSED ERRATA

Severity:

high

Priority:

medium

Version:

3.0

CC:

alan, andrewj, aschultz, bnocera, ckloiber, dhoward, greg.marsden, hfuchi, juanino, kmori, knoel, mb, mid-rangesupport, mjc, mmesser, mwesley, peterm, peter, petrides, raimondi, riel, tao, tburke, vanhoof, vkanakas

Hardware:

All

OS:

Linux

Whiteboard:

impact=important,public=20050308

Doc Type:

Bug Fix

Last Closed:

2005-04-22 20:17:33 UTC

Attachments:

Description	Flags
ttty patch	none
tty_open_untainted1.txt	none
tty_open_untainted2.txt	none
lsmod output on our most OOPS-y machine (xi)...	none
tty debugging patch	none
kysmopps output	none
EIP from panic	none

Description Wendy Cheng 2005-01-04 08:21:06 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030922

Description of problem:
EIP is at init_dev [kernel] 0x55 (2.4.21-20-ELsmp/i686)
eax: 00000000 ebx: 00000500 ecx: 00000000 edx: 00000000
esi: 00000000 edi: e898fb00 ebp: c04d7400 esp: c5e8be80
ds: 0068 es: 0068 ss: 0068
Stack: 00000000 00000000 dbf0edc0 00000000 00000000 00000000 e2930280
df997010
      00000000 00595820 f280ee48 3b7e7680 00030002 c5e8a000 00000000
e898fb00
      f0cbc880 c01b1436 00000500 c5e8bee4 f7fdda00 e78b3008 00000000
c017c890
Call Trace: [<c01b1436>] tty_open [kernel] 0x66 (0xc5e8bec4)
[<c017c890>] dput [kernel] 0x30 (0xc5e8bedc)
[<c0172266>] link_path_walk [kernel] 0x656 (0xc5e8bef0)
[<c0163f88>] get_chrfops [kernel] 0x98 (0xc5e8bf00)
[<c013a513>] in_group_p [kernel] 0x23 (0xc5e8bf08)
[<f88947ca>] ext3_permission [ext3] 0xaa (0xc5e8bf10)
[<c0164241>] chrdev_open [kernel] 0x71 (0xc5e8bf38)
[<c0162490>] dentry_open [kernel] 0x110 (0xc5e8bf54)
[<c0162378>] filp_open [kernel] 0x68 (0xc5e8bf70)
[<c0162783>] sys_open [kernel] 0x53 (0xc5e8bfa8)
Code: 8b 04 88 89 44 24 30 85 c0 0f 84 9c 00 00 00 8b 54 24 30 8b
Kernel panic: Fatal exception

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Couldn't Reproduce


Additional info:

Broken out from bugzilla 131674.

Comment 1 Wendy Cheng 2005-01-04 08:41:20 UTC

The problem is getting serious on the support front: 4 more dumps with
identical stack traces. Jason Baron's latest patch seems to be on
target. We plan to ship it to customers to try out.

Comment 2 Jeff Needle 2005-01-04 14:21:53 UTC

Please attach "Jason Baron's latest patch".  It's not obvious from reading bug
131674 which patch you are talking about.

Comment 5 Peter Levart 2005-01-04 15:34:26 UTC

I have added some printk statements at the point where this NULL
pointer dereference happens and got the following printout:

Jan  3 13:59:32 xk kernel: init_dev: device=(major:5, minor:0):
driver->table is NULL (this would OOPS)
Jan  3 13:59:32 xk kernel:   current->pid = 28162
Jan  3 13:59:32 xk kernel:   current->pgrp = 18059
Jan  3 13:59:32 xk kernel:   current->tty_old_pgrp = 0
Jan  3 13:59:32 xk kernel:   current->session = 16467
Jan  3 13:59:32 xk kernel:   current->tgid = 28162
Jan  3 13:59:32 xk kernel:   current->leader = 0
Jan  3 13:59:32 xk kernel:     ...->tty->magic = 538976288
Jan  3 13:59:32 xk kernel:     ...->tty->pgrp = 1734702177
Jan  3 13:59:32 xk kernel:     ...->tty->session = 108622447
Jan  3 13:59:32 xk kernel:     ...->tty->device = (major:5, minor:0)
Jan  3 13:59:32 xk kernel:   current->parent->pid = 28159
Jan  3 13:59:32 xk kernel:   current->parent->pgrp = 18059
Jan  3 13:59:32 xk kernel:   current->parent->tty_old_pgrp = 0
Jan  3 13:59:32 xk kernel:   current->parent->session = 16467
Jan  3 13:59:32 xk kernel:   current->parent->tgid = 28159
Jan  3 13:59:32 xk kernel:   current->parent->leader = 0
Jan  3 13:59:32 xk kernel:     ...->tty->magic = 538976288
Jan  3 13:59:32 xk kernel:     ...->tty->pgrp = 1734702177
Jan  3 13:59:32 xk kernel:     ...->tty->session = 108622447
Jan  3 13:59:32 xk kernel:     ...->tty->device = (major:5, minor:0)


Here current->tty points to something that does not appear to be a tty
structure at all (see the strange tty->magic, tty->pgrp, tty->session
values). In every occurrence of this event that I have seen (there are
several), the parent's tty is the same as the child's, and no parent
process (of the process that triggered the event) has itself triggered
it. This suggests that the current->tty pointer (or the structure it
points to) is already garbled in the running parent and is then
inherited by the child, which in turn calls init_dev and triggers this OOPS.
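
A minimal userland sketch of how the garbled fields in the printout can be
inspected. It assumes a little-endian i686 memory layout and that a valid
tty_struct carries TTY_MAGIC (0x5401) in its magic field; the helper names
le32_to_text and tty_magic_ok are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define TTY_MAGIC 0x5401u  /* value a valid tty_struct carries in ->magic */

/*
 * Render a 32-bit value as the four bytes it occupies in memory on a
 * little-endian machine (lowest byte first). Non-printable bytes are
 * shown as '.'. "out" must have room for 4 characters plus the NUL.
 */
void le32_to_text(uint32_t value, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char byte = (unsigned char)((value >> (8 * i)) & 0xffu);
        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[4] = '\0';
}

/* Returns 1 if the magic field looks like a live tty_struct, else 0. */
int tty_magic_ok(uint32_t magic)
{
    return magic == TTY_MAGIC;
}
```

Fed the values from the printout, tty->magic = 538976288 (0x20202020) renders
as four spaces, tty->pgrp = 1734702177 renders as "ateg" and tty->session =
108622447 renders as "ory.", so none of the three looks like tty bookkeeping
and the magic check fails.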

I am suspecting that the following is a race condition. In init_dev
when re-opening an existing tty:

        /* check whether we're reopening an existing tty */
        tty = driver->table[idx];
        if (tty) goto fast_track;

...
...

fast_track:
        if (test_bit(TTY_CLOSING, &tty->flags)) {
                retval = -EIO;
                goto end_init;
        }
        if (driver->type == TTY_DRIVER_TYPE_PTY &&
            driver->subtype == PTY_TYPE_MASTER) {
                /*
                 * special case for PTY masters: only one open permitted,
                 * and the slave side open count is incremented as well.
                 */
                if (atomic_read(&tty->count)) {
                        retval = -EIO;
                        goto end_init;
                }
                atomic_inc(&tty->link->count);
        }
        atomic_inc(&tty->count);
        tty->driver = *driver; /* N.B. why do this every time?? */

success:


... there's a test for the TTY_CLOSING bit in the flags and *after*
this test there's a simple incrementing of count(s) and the init_dev
succeeds. What if, between the test for the TTY_CLOSING bit and the
incrementing of count(s), some other task is executing release_dev,
finds the count dropping to 0, sets the TTY_CLOSING bit (too late
already) and releases the structure...

This could explain the garbled tty structure (already used by
something else).

Since init_dev is already protecting its main part by
down_tty_sem/up_tty_sem calls, the same mutex could be used in
release_dev:

***************
*** 1076,1082 ****
  {
        struct tty_struct *tty, *o_tty;
        int     pty_master, tty_closing, o_tty_closing, do_sleep;
!       int     idx;
        char    buf[64];

        tty = (struct tty_struct *)filp->private_data;
--- 1118,1124 ----
  {
        struct tty_struct *tty, *o_tty;
        int     pty_master, tty_closing, o_tty_closing, do_sleep;
!       int     idx, o_idx;
        char    buf[64];

        tty = (struct tty_struct *)filp->private_data;
***************
*** 1091,1096 ****
--- 1133,1139 ----
        pty_master = (tty->driver.type == TTY_DRIVER_TYPE_PTY &&
                      tty->driver.subtype == PTY_TYPE_MASTER);
        o_tty = tty->link;
+       o_idx = o_tty? MINOR(o_tty->device) - o_tty->driver.minor_start : -1;

  #ifdef TTY_PARANOIA_CHECK
        if (idx < 0 || idx >= tty->driver.num) {
***************
*** 1150,1155 ****
--- 1193,1204 ----
        }
  #endif

+       /* protect the concurrent access to tty by init_dev */
+       down_tty_sem(idx);
+       /* when per tty semaphores are ready, uncomment this: */
+       /* if (o_tty && idx != o_idx)
+               down_tty_sem(o_idx); */
+
        if (tty->driver.close)
                tty->driver.close(tty, filp);

***************
*** 1269,1275 ****

        /* check whether both sides are closing ... */
        if (!tty_closing || (o_tty && !o_tty_closing))
!               return;

  #ifdef TTY_DEBUG_HANGUP
        printk(KERN_DEBUG "freeing tty structure...");
--- 1318,1324 ----

        /* check whether both sides are closing ... */
        if (!tty_closing || (o_tty && !o_tty_closing))
!               goto end_release_dev;

  #ifdef TTY_DEBUG_HANGUP
        printk(KERN_DEBUG "freeing tty structure...");
***************
*** 1300,1305 ****
--- 1349,1361 ----
         * the slots and preserving the termios structure.
         */
        release_mem(tty, idx);
+
+ end_release_dev:
+
+       /* when per tty semaphores are ready, uncomment this: */
+       /* if (o_tty && idx != o_idx)
+               up_tty_sem(o_idx); */
+       up_tty_sem(idx);
  }

  /*
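
The structural change in the patch above (taking a semaphore on entry and
turning the early "return" into a jump to one exit label so the semaphore is
released on every path) can be sketched in userland as follows. This is only
a model under stated assumptions: toy_sem, toy_down, toy_up, demo_tty and
demo_release are hypothetical stand-ins, not the kernel's down_tty_sem,
up_tty_sem or release_dev, and the sketch says nothing about whether the
locking is actually needed.

```c
#include <assert.h>

/* Hypothetical stand-in for the tty semaphore; not the kernel's type. */
struct toy_sem { int held; };

/* Take the lock; asserts that it is not already held. */
void toy_down(struct toy_sem *s) { assert(!s->held); s->held = 1; }

/* Drop the lock; asserts that it is currently held. */
void toy_up(struct toy_sem *s) { assert(s->held); s->held = 0; }

/* Hypothetical stand-in for the tty structure: open count plus a freed flag. */
struct demo_tty { int count; int freed; };

/*
 * Drop one reference. Returns 1 if the structure was released, 0 otherwise.
 * Every path leaves through end_release, so the lock is always dropped.
 */
int demo_release(struct toy_sem *sem, struct demo_tty *tty)
{
    int released = 0;

    toy_down(sem);

    if (--tty->count > 0)
        goto end_release;   /* other openers remain; was a bare "return" */

    tty->freed = 1;         /* stands in for release_mem() */
    released = 1;

end_release:
    toy_up(sem);
    return released;
}
```

With a bare "return" in place of the goto, the first call on a structure with
count 2 would leave the lock held and the second call would trip the
assertion in toy_down.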


What do you think about this?

Regarding Jason's last patch (commented in bug #131674), I couldn't
quite get it. Can someone explain what the race is in this part of
the code (I don't know what the kill_pg functions are doing)...

Regards, Peter

Comment 6 Jason Baron 2005-01-04 15:59:08 UTC

Peter, the race that you suggest above is protected against by the big kernel
lock, or "lock_kernel". This lock is taken upon opening a file, and tty_release
also takes it. Thus, the scenario that you describe is not possible.

kill_pg, I believe, sends a signal to an entire process group.

Comment 7 Wendy Cheng 2005-01-04 16:34:43 UTC

The reason Jason's last patch makes sense (to me) is that, based on the dumps
obtained from the customers (that matches with Peter's printk result in comment
#5), the memory that hosted the tty structure seems to get released and used for
other purpose. In one of the dumps, it contained text data, except the
tty->device field. If we check the code, the memory is released in
do_tty_hangup() by fput():


    916         set_bit(TTY_HUPPED, &tty->flags);
    917         if(ld) {
    918                 tty_ldisc_enable(tty);
    919                 tty_ldisc_deref(ld);
    920         }
    921         unlock_kernel();
    922         if (f)
    923                 fput(f);
    924 }

that also matched with the faulty script sent in by one of our customers where
the script was piping the screen output to a text file (but I'm not able to
recreate the issue using the very same script):

#!/bin/sh
/usr/bin/iostat -d -x 60  2  >/usr/local/lotus/notesdata/Y8648038.TMP

I was working on a debug trace kernel that logged the entries between tty_open
and hangup code and was planning to move the zeroing of p->tty (#994-#995 in
disassociate_ctty()) to an earlier part of the routine until I saw Jason's patch. I
think Jason's patch is safer than mine.

    989         current->tty_old_pgrp = 0;
    990         tty->session = 0;
    991         tty->pgrp = -1;
    992
    993         read_lock(&tasklist_lock);
    994         for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
    995                 p->tty = NULL;
    996         read_unlock(&tasklist_lock);
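
A minimal userland model of the loop at #994-#995, assuming a flat task array
in place of the kernel's for_each_task_pid walk. The types sk_tty and sk_task
and both helpers are hypothetical. The point of the sketch is only that any
task not visited by the loop keeps a pointer to the tty structure, which
dangles once that structure is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for tty_struct and task_struct. */
struct sk_tty { int minor; };
struct sk_task { int pid; int session; struct sk_tty *tty; };

/* Clear ->tty for every task in the given session; returns how many were cleared. */
size_t sk_clear_session_tty(struct sk_task *tasks, size_t n, int session)
{
    size_t cleared = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].session == session && tasks[i].tty != NULL) {
            tasks[i].tty = NULL;
            cleared++;
        }
    }
    return cleared;
}

/* Count the tasks that still point at the given tty. */
size_t sk_count_users(const struct sk_task *tasks, size_t n,
                      const struct sk_tty *tty)
{
    size_t users = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].tty == tty)
            users++;
    }
    return users;
}
```

If sk_count_users still returns a non-zero value after the clearing pass, the
tty must not be released yet; in this model that is the condition under which
a later open through the stale pointer would read reused memory.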

Comment 9 Wendy Cheng 2005-01-04 16:42:26 UTC

Per Jeff's request, adding Jason's patch here for reference:

--- linux-2.4.21/drivers/char/tty_io.c.bak	Mon Jan  3 19:14:55 2005
+++ linux-2.4.21/drivers/char/tty_io.c	Mon Jan  3 19:15:39 2005
@@ -589,6 +589,8 @@ void disassociate_ctty(int on_exit)
 	struct list_head *l;
 	struct pid *pid;
 
+	lock_kernel();
+
 	if (tty) {
 		tty_pgrp = tty->pgrp;
 		if (on_exit && tty->driver.type != TTY_DRIVER_TYPE_PTY)
@@ -598,6 +600,7 @@ void disassociate_ctty(int on_exit)
 			kill_pg(current->tty_old_pgrp, SIGHUP, on_exit);
 			kill_pg(current->tty_old_pgrp, SIGCONT, on_exit);
 		}
+		unlock_kernel();
 		return;
 	}
 	if (tty_pgrp > 0) {
@@ -614,6 +617,7 @@ void disassociate_ctty(int on_exit)
 	for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
 		p->tty = NULL;
 	read_unlock(&tasklist_lock);
+	unlock_kernel();
 }
 
 void stop_tty(struct tty_struct *tty)

Comment 13 Jason Baron 2005-01-04 21:53:36 UTC

*** Bug 130774 has been marked as a duplicate of this bug. ***

Comment 14 Jason Baron 2005-01-05 02:06:56 UTC

As an update, the patch posted in comment #9 is, I think, along the
right lines to fix this, but I don't think it does exactly what was
intended. The problem here is that the tty_open path can sleep, thus
giving up the BKL and opening up the tty_open path to all sorts of
races against exit, release, and even ioctls as Peter suggested. I
hope to cook up a test patch for this tomorrow, and hopefully Peter
can help us test it. thanks.

Comment 23 Jason Baron 2005-01-05 23:43:19 UTC

Created attachment 109406 [details]
ttty patch

Ok, here is a patch which might address this issue, that I've done some testing
on. The only weirdness that I've seen with it is 'pidof' failing to read the sid
intermittently during bootup. I'm not sure if this is related to the patch or
not.

This patch closes most of the holes i've seen in the tty_open vs. hangup,
disassociate, release, ioctls, etc. It leaves some smaller holes still open,
and could use some cleanup, but i think this prototype might be worth testing
if it can pass basic smoke tests.

Comment 26 Peter Levart 2005-01-07 07:59:35 UTC

I deployed Jason's latest patch to our 7 machines 18 hours
ago and the bug hasn't shown up yet. From my experience with our
setup we should wait at least a month...

Regards, Peter

Comment 27 Wendy Cheng 2005-01-07 16:27:48 UTC

There is a new dump (without Jason's fix) - tentatively tie that ticket with
this bugzilla.  The panic route is different: 

PID: 3215   TASK: f76b6000  CPU: 0   COMMAND: "sendmail"
#0 [f76b7d9c] die at c010c5df
#1 [f76b7dac] do_page_fault at c011ff09
#2 [f76b7e70] error_code (via page_fault) at c03f21c0
  EAX: 00000001  EBX: 00280000  ECX: c711decc  EDX: 00000000  EBP: c73dbca4
  DS:  0068      ESI: f76b6000  ES:  0068      EDI: 00000001
  CS:  0060      EIP: c01b0211  ERR: ffffffff  EFLAGS: 00010202
#3 [f76b7eac] disassociate_ctty at c01b0211
#4 [f76b7ec8] do_exit at c012d71d
#5 [f76b7ee4] do_group_exit at c012d926
#6 [f76b7ef8] get_signal_to_deliver at c01372bb
#7 [f76b7f20] do_signal at c010beef
#8 [f76b7fc0] signal_return at c03f20a3
  EAX: fffffffc  EBX: 00300000  ECX: 00000001  EDX: 00000001
  DS:  002b      ESI: 0000000e  ES:  002b      EDI: 00000000
  SS:  002b      ESP: bfffa480  EBP: bfffa4ac
  CS:  0023      EIP: 00c6dc30  ERR: ffffffff  EFLAGS: 00010296

Disassembling disassociate_ctty at c01b0211 shows it crashed at
drivers/char/tty_io.c: 593
0xc01b0211 <disassociate_ctty+33>:      mov    0x108(%ebx),%esi

  584 void disassociate_ctty(int on_exit)
  585 {
  586         struct tty_struct *tty = current->tty;
  587         struct task_struct *p;
  588         int tty_pgrp = -1;
  589         struct list_head *l;
  590         struct pid *pid;
  591
  592         if (tty) {
  593                 tty_pgrp = tty->pgrp;
  594                 if (on_exit && tty->driver.type != TTY_DRIVER_TYPE_PTY)
  595                         tty_vhangup(tty);

crash> struct task_struct f76b6000 | grep tty
tty_old_pgrp = 0,
tty = 0x280000,

The tty address (0x280000) matches %ebx. This looks like a user-mode address.

We need to keep an eye on this panic route.
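
The address arithmetic behind that observation can be written down as a small
sketch. It assumes the conventional i386 3G/1G split, where kernel virtual
addresses start at 0xC0000000; that constant is an assumption about this
kernel build, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed start of kernel virtual addresses (3G/1G split on i386). */
#define ASSUMED_PAGE_OFFSET 0xC0000000u

/* Address touched by "mov disp(%base),%reg" style operands. */
uint32_t effective_address(uint32_t base, uint32_t displacement)
{
    /* The sketch does not model 32-bit wraparound. */
    assert(displacement <= UINT32_MAX - base);
    return base + displacement;
}

/* Returns 1 if addr lies in the assumed kernel range, else 0. */
int is_kernel_address(uint32_t addr)
{
    return addr >= ASSUMED_PAGE_OFFSET;
}
```

For the faulting instruction mov 0x108(%ebx),%esi with EBX = 0x00280000, the
load targets 0x00280108, which is below the assumed kernel range, while the
task address f76b6000 from the same dump is inside it.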

Comment 28 Jason Baron 2005-01-07 17:04:00 UTC

Peter, you may have mentioned this already, but out of curiosity what is the
primary workload for these boxes? Does /sbin/lsmod show any tainted modules?

thanks

Comment 29 Peter Levart 2005-01-08 12:11:16 UTC

Well, the 3 machines with most of these panics are running Oracle 10g
database cluster. We are also using VERITAS Volume Manager which has
non-GPL modules that taint the kernel. But the modules are certified
(by VERITAS) to work with RHEL 3 update 2 kernel...
Nevertheless, Jason's patch looks promising. We've now been running for 2
days without stomping on this bug.

Comment 30 Jason Baron 2005-01-08 17:20:18 UTC

hi Peter,

after reviewing this some more, the patch in #23 that I suggested is
not complete and has some problems. Nevertheless, it does fix some
things, and if you haven't observed any issues, I would suggest
leaving it running. I probably won't have a more complete patch
until early next week.

thanks.

Comment 31 Jason Baron 2005-01-11 21:21:59 UTC

hi Peter,

any updates?

thanks,

-jason

Comment 34 Peter Levart 2005-01-13 16:43:23 UTC

Sorry, I've been (still am) off for some days.

I'm sorry to inform you that the patch does not
seem to work for our problems. After installing it 7 days ago on 7 machines,
one of them had 5 occurrences of the bug:
  
Jan 11 17:36:05 xi kernel: init_dev: device=(major:5, minor:0): driver->table  
is NULL (this would OOPS)  
Jan 11 18:39:33 xi last message repeated 2 times  
Jan 12 20:39:06 xi last message repeated 2 times  
  
What were the problems you still saw with your patch that were mentioned in #30?  
  
Peter

Comment 35 Jason Baron 2005-01-14 16:08:16 UTC

The patch that I sent actually has some deadlocks and it doesn't
entirely close all the holes I saw. I could post an updated patch, but
without an easy way to reproduce this, I'm not sure how much value we'd
get out of it. I'm going to concentrate now on getting a reproducer
for this in house. thanks for the feedback.

Comment 39 Peter Levart 2005-01-15 10:14:59 UTC

Well, if you don't find one (a reproducer) you can always ask me to try the
patch and see if it has some effect on our system.

Regards, Peter

Comment 41 Ernie Petrides 2005-01-21 05:50:36 UTC

A proposed patch to fix possible data corruption due to /proc/kcore access
has been attached to bug 141394 in comment #56.  That patch has been shown
to resolve one particular scenario, but it is still undergoing code review
and further testing to verify whether it addresses other data corruption
scenarios (possibly this one).

Comment 42 Jason Baron 2005-01-21 14:19:07 UTC

Peter, the patch that Ernie is suggesting is low risk, has a chance of
resolving this issue, and so we would really like to know if it
resolves this issue for you. Bugzilla #141394, comment #56, contains
the patch, and comment #58 contains kernel RPMS with this patch
included. thanks.

Comment 43 Peter Levart 2005-01-21 18:36:50 UTC

Ok, I have applied the patch to 2.4.21-20.ELsmp kernel on our 7 servers and am 
waiting...

Comment 44 Peter Levart 2005-01-24 12:44:33 UTC

So far, so good. 3 days without the OOPS. Should wait at least for a couple of 
weeks to be confident.

Comment 45 Peter Levart 2005-01-26 19:29:16 UTC

Unfortunately, the OOPS is still here in spite of the patch from Bugzilla 
#141394 comment #56. 
 
Jan 25 12:22:05 xi kernel: init_dev: device=(major:5, minor:0): driver->table 
is NULL (this would OOPS) 
Jan 25 12:22:41 xi kernel: init_dev: device=(major:5, minor:0): driver->table 
is NULL (this would OOPS) 
Jan 26 17:07:32 xi kernel: init_dev: device=(major:5, minor:0): driver->table 
is NULL (this would OOPS) 
 
Peter

Comment 46 Ernie Petrides 2005-01-26 20:42:47 UTC

Thanks for the update, Peter.  I guess it's back to the drawing board
for Jason.

Comment 47 Jason Baron 2005-01-26 22:05:37 UTC

Peter, are you running any HP monitoring agents? Could we please get
the output of /sbin/lsmod posted. thanks.

Comment 50 Bastien Nocera 2005-01-27 09:44:49 UTC

Created attachment 110283 [details]
tty_open_untainted1.txt

Comment 51 Bastien Nocera 2005-01-27 09:46:36 UTC

Created attachment 110284 [details]
tty_open_untainted2.txt

Incomplete untainted trace.

Comment 53 Peter Levart 2005-01-27 10:50:21 UTC

Created attachment 110286 [details]
lsmod output on our most OOPS-y machine (xi)...

Comment 57 Peter Levart 2005-01-30 19:26:58 UTC

No, I'm not running any HP software.

Comment 58 Jason Baron 2005-02-01 17:05:26 UTC

Created attachment 110502 [details]
tty debugging patch

here is a testing patch to try and catch somebody freeing the tty structure
when it really shouldn't be getting freed. test kernels with this patch can be
found at:

http://people.redhat.com/jbaron/tty-debug/

Comment 59 Peter Levart 2005-02-03 16:28:34 UTC

I will apply this patch to our kernel that already has the patch that printks in
the event when the default kernel would OOPS. When this tty debugging patch prints
something, what would you know? Would it print enough to pinpoint the problem code?

Peter

Comment 60 Jason Baron 2005-02-03 17:00:45 UTC

This printk is intended to trigger before the printks that you added.
Given the traces we have, it would appear that the tty structure is
being freed while somebody still has a reference to it. This patch
should hopefully help confirm this suspicion. Then, we could further
investigate how the system could get into this erroneous state.

Comment 62 Jason Baron 2005-02-10 19:13:23 UTC

hi Peter,

As we're still not to the bottom of this, i'm wondering about the patch you
posted in bug 131674, comment #44, which returns -ENODEV when the driver table
is NULL. My question is basically does that resolve this issue for you? do you
notice any other corruption as a result? 

Also, any luck with the comment #58 patch?

thanks.

Comment 63 Peter Levart 2005-02-11 11:13:38 UTC

Sorry, I was in bed with a temperature for the whole week, so I managed to deploy
the patch from comment #58 only today. Regarding my "workaround": yes, we have
managed to suppress kernel panics with it. I haven't yet found any other
side effect, but I have this scary feeling that something might go wrong
sometime and corrupt our database etc. if we don't find the real cause of this bug.
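
The workaround patch itself lives in bug 131674, comment #44, and is not
reproduced in this thread. As a rough illustration of the idea described in
comment 62 (refuse the open with -ENODEV instead of indexing a NULL driver
table), here is a hypothetical userland model. The types, the function name
and the extra index range check are assumptions made for the sketch, not the
contents of the real patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for tty_struct and tty_driver. */
struct model_tty { int index; };
struct model_driver { struct model_tty **table; int num; };

/*
 * Look up slot idx. On success, stores the slot contents in *out (NULL
 * means the tty has not been opened yet) and returns 0. Returns -ENODEV
 * when the driver has no table or idx is out of range.
 */
int model_lookup(const struct model_driver *driver, int idx,
                 struct model_tty **out)
{
    assert(driver != NULL && out != NULL);

    if (driver->table == NULL || idx < 0 || idx >= driver->num)
        return -ENODEV;     /* the unguarded code would dereference NULL here */

    *out = driver->table[idx];
    return 0;
}
```

A guard of this kind only turns the crash into an error return. It does not
explain why the table pointer is NULL, which is the concern raised above
about corruption that may still be happening unnoticed.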

I'll keep you posted about the comment #58 patch findings as they appear.

Regards, Peter

Comment 64 Peter Levart 2005-02-13 13:45:23 UTC

Ok, I have one occurrence of my OOPS but your printk has not been triggered:

Feb 12 21:14:21 xk kernel: init_dev: device=(major:5, minor:0): driver->table is
NULL (this would OOPS)
Feb 12 21:14:21 xk last message repeated 8 times


... but no sign of your: printk("%s: line: %d, o_tty->count is: %i!!!\n",
__FILE__, __LINE__, count); ... being triggered before or after or anywhere...


Regards, Peter

Comment 76 Jason Baron 2005-02-18 19:22:29 UTC

Peter, out of curiosity, do you find anything like the following in any of your logs?
They could come out very early on, not necessarily coinciding with the oops.
thanks.

tty_io.c: process 12296 (sh) used obsolete /dev/cua - update software to
use /dev/ttyS13

Comment 80 Peter Levart 2005-02-23 12:50:49 UTC

Sorry for the delay. 
 
Yes, we have them. Exactly on the machines that experience my OOPS. Here's on 
one of them: 
 
Dec 27 21:22:04 xk kernel: tty_io.c: process 24934 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS31 
Dec 27 21:22:04 xk kernel: tty_io.c: process 24947 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS31 
Dec 31 11:02:25 xk kernel: tty_io.c: process 12421 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS30 
Dec 31 11:02:25 xk kernel: tty_io.c: process 12421 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS30 
Dec 31 11:02:25 xk kernel: tty_io.c: process 12429 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS30 
Dec 31 19:14:25 xk kernel: tty_io.c: process 2383 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS1 
Dec 31 19:14:25 xk kernel: tty_io.c: process 2394 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS1 
Jan 25 10:10:44 xk kernel: tty_io.c: process 29618 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS33 
Jan 25 10:10:44 xk kernel: tty_io.c: process 29622 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS33 
Jan 25 10:10:44 xk kernel: tty_io.c: process 29626 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS33 
Jan 25 10:10:44 xk kernel: tty_io.c: process 29630 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS33 
Jan 25 10:10:44 xk kernel: tty_io.c: process 29634 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS33 
Feb 13 12:14:21 xk kernel: tty_io.c: process 7757 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS47 
Feb 13 12:14:21 xk kernel: tty_io.c: process 7761 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS47 
Feb 13 12:14:21 xk kernel: tty_io.c: process 7765 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS47 
Feb 13 12:14:21 xk kernel: tty_io.c: process 7769 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS47 
Feb 13 12:14:21 xk kernel: tty_io.c: process 7773 (sh) used obsolete /dev/cua 
- update software to use /dev/ttyS47 
 
Are they related to this bug?
 
Regards, Peter

Comment 81 Jason Baron 2005-02-23 16:20:51 UTC

that's my question as well...it seems like a promising lead....

Comment 82 Jerry Uanino 2005-02-25 16:10:39 UTC

Created attachment 111426 [details]
kysmopps output

Comment 83 Jerry Uanino 2005-02-25 16:14:10 UTC

I am having similar output, my comments were lost when I created the 
attachment.  I'm using LVM and ext3 on a new install that was 
up2date'd and has been only running for a short period of time.

Comment 84 Jason Baron 2005-02-25 22:08:04 UTC

it looks like the beginning of the oops is missing...the part with the EIP, do
you have that? thanks.

Comment 85 Jerry Uanino 2005-02-26 02:01:17 UTC

Created attachment 111453 [details]
EIP from panic

Console dump with EIP

Comment 87 Mike Peaster 2005-02-28 16:49:22 UTC

I think i am experiencing the same bug. I get the same dump screen,
I'm running the same kernel and I'm also running an Oracle 10g
database cluster. My nodes have been up for about three weeks and
I've only had one occurrence of this bug. How can I help?

Comment 88 Michael Bischof 2005-02-28 21:01:19 UTC

Same here. We have a 2 node Oracle 10g RAC which crashes every few weeks.
How can I help? Any special kernel to "try" ?

Crash logs and memory dumps available.

Comment 90 Jason Baron 2005-03-01 20:48:40 UTC

Still haven't found a reproducer. I don't remember, and it's not clear to me
now: was the patch in comment #9 tested?

Comment 91 Jason Baron 2005-03-01 23:01:57 UTC

The comment #9 patch was covered by what was in comment #23, which apparently
did not fix this issue. 

In an effort to fix this, since no reproducer is yet known, i'm going to post
ongoing test kernels at: http://people.redhat.com/~jbaron/.private/tty-debug
These kernels will have fixes and debugging patches. Anybody who wants to add
patches to these kernels, please suggest them here. People testing them, should
provide feedback here as well.

The kernel changelog will have a description of the patches added. I'll keep a
copy of that at:  http://people.redhat.com/~jbaron/.private/tty-debug/changelog,
for easy reference.

thanks.

Comment 94 Jason Baron 2005-03-02 16:08:10 UTC

oops, proper URLs are: 

http://people.redhat.com/~jbaron/tty-debug/
http://people.redhat.com/~jbaron/tty-debug/changelog

Comment 95 Jerry Uanino 2005-03-02 18:09:41 UTC

Would a netdump help?  I'm attempting to configure netdump and catch the
crash, but I'm not sure if this is better than installing the debugging kernel 
or a waste of time.

Comment 96 Peter Levart 2005-03-03 11:47:53 UTC

I have already captured netdump of this panic some time ago. If anybody needs it
it can be downloaded from:

FTP server: ftp.select-tech.si
username: redhat
password: netdump
file name: 372036_xi_2004-11-08_vmcore+log.tar.bz2

Peter

Comment 100 Jason Baron 2005-03-07 01:45:38 UTC

I believe we have gotten to the bottom of this issue. The basic problem was that
controlling ttys were not being properly cleared in multi-threaded applications.
I've built testing kernels with the fix. thanks.

http://people.redhat.com/jbaron/2.4.21-28.EL.session.5/

Comment 101 juan r andrew 2005-03-07 13:51:56 UTC

I just got another crash; here is the output, I hope it helps

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01ad375
*pde = 0c094001
*pte = 00000000
Oops: 0000
netconsole hangcheck-timer lp parport oracleasm autofs audit sr_mod cdrom st 
iscsi_sfnet iptable_filter ip_tables e1000 tg3 floppy sg keybdev mousedev hid inp
CPU:    0
EIP:    0060:[<c01ad375>]    Tainted: GF
EFLAGS: 00010246

EIP is at init_dev [kernel] 0x55 (2.4.21-15.0.2.ELsmp/i686)
eax: 00000000   ebx: 00000500   ecx: 00000000   edx: 00000000
esi: 00000000   edi: f5cca200   ebp: c04cfe00   esp: ccafde80
ds: 0068   es: 0068   ss: 0068
Process sh (pid: 31537, stackpage=ccafd000)
Stack: 00000000 00000000 c0140ab2 7e3d5025 00000000 00000000 73edb025 
c2d96618 
       f5241100 00000000 b756f540 e5108e00 00030002 ccafc000 00000000 f5cca200 
       f3d4e680 c01adfd6 00000500 ccafdee4 f7fdda00 eee3b008 00000000 c0179a50 
Call Trace:   [<c0140ab2>] do_anonymous_page [kernel] 0x252 (0xccafde88)
[<c01adfd6>] tty_open [kernel] 0x66 (0xccafdec4)
[<c0179a50>] dput [kernel] 0x30 (0xccafdedc)
[<c016f436>] link_path_walk [kernel] 0x656 (0xccafdef0)
[<c01612d8>] get_chrfops [kernel] 0x98 (0xccafdf00)
[<c01397c3>] in_group_p [kernel] 0x23 (0xccafdf08)
[<f887d7ea>] ext3_permission [ext3] 0xaa (0xccafdf10)
[<c0161591>] chrdev_open [kernel] 0x71 (0xccafdf38)
[<c015f7e0>] dentry_open [kernel] 0x110 (0xccafdf54)
[<c015f6c8>] filp_open [kernel] 0x68 (0xccafdf70)
[<c015fad3>] sys_open [kernel] 0x53 (0xccafdfa8)

Code: 8b 04 88 89 44 24 30 85 c0 0f 84 9c 00 00 00 8b 54 24 30 8b

CPU#1 is frozen.
< netdump activated - performing handshake with the client. >

Pid/TGid: 31537/31537, comm:                   sh
EIP: 0060:[<c01ad375>] CPU: 0
EIP is at init_dev [kernel] 0x55 (2.4.21-15.0.2.ELsmp)
 ESP: 0000:00000000 EFLAGS: 00010246    Tainted: GF
EAX: 00000000 EBX: 00000500 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: f5cca200 EBP: c04cfe00 DS: 0068 ES: 0068 FS: 0000 GS: 0033
CR0: 8005003b CR2: 00000000 CR3: 29cd1100 CR4: 000006f0
Call Trace:   [<c0140ab2>] do_anonymous_page [kernel] 0x252 (0xccafde88)
[<c01adfd6>] tty_open [kernel] 0x66 (0xccafdec4)
[<c0179a50>] dput [kernel] 0x30 (0xccafdedc)
[<c016f436>] link_path_walk [kernel] 0x656 (0xccafdef0)
[<c01612d8>] get_chrfops [kernel] 0x98 (0xccafdf00)
[<c01397c3>] in_group_p [kernel] 0x23 (0xccafdf08)
[<f887d7ea>] ext3_permission [ext3] 0xaa (0xccafdf10)
[<c0161591>] chrdev_open [kernel] 0x71 (0xccafdf38)
[<c015f7e0>] dentry_open [kernel] 0x110 (0xccafdf54)
[<c015f6c8>] filp_open [kernel] 0x68 (0xccafdf70)
[<c015fad3>] sys_open [kernel] 0x53 (0xccafdfa8)


                         free                        sibling
  task             PC    stack   pid father child younger older
init          S 00000001  2628     1      0     4       2       (NOTLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa5ea0)
[<c0134105>] schedule_timeout [kernel] 0x65 (0xf7fa5ee4)
[<c015609c>] __get_free_pages [kernel] 0x1c (0xf7fa5eec)
[<c0175391>] __pollwait [kernel] 0x31 (0xf7fa5ef0)
[<c0134090>] process_timeout [kernel] 0x0 (0xf7fa5f04)
[<c017565b>] do_select [kernel] 0x13b (0xf7fa5f1c)
[<c0175afe>] sys_select [kernel] 0x34e (0xf7fa5f60)
[<c016acb9>] sys_fstat64 [kernel] 0x49 (0xf7fa5fa8)

migration/0   S 00000000  5500     2      0             3     1 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa3f68)
[<c0124ebb>] migration_task [kernel] 0x30b (0xf7fa3fac)
[<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa3fc4)
[<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa3fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7fa3ff0)

migration/1   S 00000001  5488     3      0                   2 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa1f68)
[<c0124ebb>] migration_task [kernel] 0x30b (0xf7fa1fac)
[<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa1fc4)
[<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa1fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7fa1ff0)

keventd       S 00000000  5124     4      1             5       (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc48ddf48)
[<c013b007>] context_thread [kernel] 0x117 (0xc48ddf8c)
[<c013aef0>] context_thread [kernel] 0x0 (0xc48ddfe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc48ddff0)

ksoftirqd/0   R 00000000  4964     5      1             6     4 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc2e3ff88)
[<c012f16f>] ksoftirqd [kernel] 0xbf (0xc2e3ffcc)
[<c012f0b0>] ksoftirqd [kernel] 0x0 (0xc2e3ffe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc2e3fff0)

ksoftirqd/1   S 00000001  4688     6      1             9     5 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc2e3df88)
[<c012f16f>] ksoftirqd [kernel] 0xbf (0xc2e3dfcc)
[<c012f0b0>] ksoftirqd [kernel] 0x0 (0xc2e3dfe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc2e3dff0)

bdflush       S 00000001  4612     9      1             7     6 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc46d3f60)
[<c0123972>] interruptible_sleep_on [kernel] 0x52 (0xc46d3fa4)
[<c012f09a>] __run_task_queue [kernel] 0x6a (0xc46d3fbc)
[<c01663bf>] bdflush [kernel] 0xff (0xc46d3fd4)
[<c01662c0>] bdflush [kernel] 0x0 (0xc46d3fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d3ff0)

kswapd        S 00000000  4492     7      1             8     9 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc46d7f00)
[<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d7f44)
[<c0134090>] process_timeout [kernel] 0x0 (0xc46d7f64)
[<c0153fea>] wakeup_memwaiters [kernel] 0xca (0xc46d7f7c)
[<c0153ce4>] kswapd [kernel] 0x84 (0xc46d7fd0)
[<c0153c60>] kswapd [kernel] 0x0 (0xc46d7fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d7ff0)

kscand        S 00000001  4560     8      1            10     7 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc46d5f4c)
[<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d5f90)
[<c0134090>] process_timeout [kernel] 0x0 (0xc46d5fb0)
[<c01541c5>] kscand [kernel] 0x55 (0xc46d5fc8)
[<c0154170>] kscand [kernel] 0x0 (0xc46d5fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d5ff0)

kupdated      S 00000000  4188    10      1            11     8 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc46d1f58)
[<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d1f9c)
[<c0167871>] sync_supers [kernel] 0x131 (0xc46d1fa4)
[<c0134090>] process_timeout [kernel] 0x0 (0xc46d1fbc)
[<c01664bf>] kupdate [kernel] 0x8f (0xc46d1fd4)
[<c0166430>] kupdate [kernel] 0x0 (0xc46d1fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d1ff0)

mdrecoveryd   S 00000000  5612    11      1            22    10 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf7ff7f48)
[<c0213545>] md_thread [kernel] 0x1c5 (0xf7ff7f8c)
[<c0213380>] md_thread [kernel] 0x0 (0xf7ff7fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7ff7ff0)

raid1d        S 00000001  5612    22      1            23    11 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf74a3f48)
[<c0213545>] md_thread [kernel] 0x1c5 (0xf74a3f8c)
[<f8858544>] .rodata.str1.1 [raid1] 0x75 (0xf74a3f94)
[<c0213380>] md_thread [kernel] 0x0 (0xf74a3fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf74a3ff0)

raid1d        S 00000001  5612    23      1            24    22 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf749df48)
[<c0213545>] md_thread [kernel] 0x1c5 (0xf749df8c)
[<f8858544>] .rodata.str1.1 [raid1] 0x75 (0xf749df94)
[<c0213380>] md_thread [kernel] 0x0 (0xf749dfe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf749dff0)

Comment 102 Jason Baron 2005-03-07 14:07:39 UTC

I am confident that this issue is fixed. Please see comment #100. I don't need
any more crashes, but thanks anyway.

Comment 108 Ernie Petrides 2005-03-10 00:32:14 UTC

*** Bug 130774 has been marked as a duplicate of this bug. ***

Comment 109 Ernie Petrides 2005-03-10 00:35:56 UTC

A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-30.EL).

Comment 110 Jerry Uanino 2005-03-10 14:10:15 UTC

Will this fix and new kernel be available in up2date or will it only be 
possible to install it if you are at update 5?

Comment 113 Ernie Petrides 2005-03-10 22:24:35 UTC

Jerry, the new kernel will be available in the beta channel in a week or
so.  It will become available in the main channel when U5 is officially
released.  Either way, "up2date" is one of the ways to get the new kernel.

Comment 114 Ernie Petrides 2005-03-14 21:44:27 UTC

*** Bug 151086 has been marked as a duplicate of this bug. ***

Comment 115 Jason Baron 2005-03-17 22:08:22 UTC

*** Bug 150334 has been marked as a duplicate of this bug. ***

Comment 119 Ernie Petrides 2005-04-14 00:15:13 UTC

A fix for this problem has also been committed to the RHEL3 E5
patch pool this evening (in kernel version 2.4.21-27.0.3.EL).

Comment 120 Eric Bursley 2005-04-22 14:37:26 UTC

eric_bursley

Comment 121 Eric Bursley 2005-04-22 14:43:51 UTC

We also have seen this problem with one of our customers, and if 
possible would like a copy of the beta kernel for testing.

Comment 122 Mark J. Cox 2005-04-22 14:59:50 UTC

An updated kernel containing a fix for this issue is currently in QA
and we are expecting to release it publicly either later today or in
the first half of next week.

Comment 123 Ernie Petrides 2005-04-22 20:00:35 UTC

Besides the imminent release of the 2.4.21-27.0.3.EL kernel in said
security erratum, the fix is already contained in the current
U5 beta kernel 2.4.21-31.EL that is in the RHN beta channels
(and thus you can get a fixed kernel *now*).

Comment 124 Josh Bressers 2005-04-22 20:17:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-293.html

Comment 125 Ernie Petrides 2005-04-22 21:55:36 UTC

Correction to comment #123: the security advisory listed
above contains the 2.4.21-27.0.4.EL kernel version.

Comment 126 Tim Powers 2005-05-18 13:28:59 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html

Comment 128 Manjunath 2006-10-17 06:00:29 UTC

Is this bug applicable to the linux-2.6.11 kernel?

Comment 129 Jason Baron 2006-10-17 19:56:59 UTC

No.