Bug 144059 (IT_52345_64378)
Summary: | CAN-2005-0403 panic in tty init_dev | |
---|---|---|---
Product: | Red Hat Enterprise Linux 3 | Reporter: | Wendy Cheng <nobody+wcheng>
Component: | kernel | Assignee: | Jason Baron <jbaron>
Status: | CLOSED ERRATA | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 3.0 | CC: | alan, andrewj, aschultz, bnocera, ckloiber, dhoward, greg.marsden, hfuchi, juanino, kmori, knoel, mb, mid-rangesupport, mjc, mmesser, mwesley, peterm, peter, petrides, raimondi, riel, tao, tburke, vanhoof, vkanakas
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | impact=important,public=20050308 | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2005-04-22 20:17:33 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Wendy Cheng
2005-01-04 08:21:06 UTC
The problem is getting serious from the support front end - 4 more dumps with an identical stack trace. Jason Baron's latest patch seems to be on target. We plan to ship it out for the customers to try.

Please attach "Jason Baron's latest patch". It's not obvious from reading bug 131674 which patch you are talking about.

I have added some printk statements at the point where this NULL pointer dereference happens and got the following printout:

Jan 3 13:59:32 xk kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS)
Jan 3 13:59:32 xk kernel: current->pid = 28162
Jan 3 13:59:32 xk kernel: current->pgrp = 18059
Jan 3 13:59:32 xk kernel: current->tty_old_pgrp = 0
Jan 3 13:59:32 xk kernel: current->session = 16467
Jan 3 13:59:32 xk kernel: current->tgid = 28162
Jan 3 13:59:32 xk kernel: current->leader = 0
Jan 3 13:59:32 xk kernel: ...->tty->magic = 538976288
Jan 3 13:59:32 xk kernel: ...->tty->pgrp = 1734702177
Jan 3 13:59:32 xk kernel: ...->tty->session = 108622447
Jan 3 13:59:32 xk kernel: ...->tty->device = (major:5, minor:0)
Jan 3 13:59:32 xk kernel: current->parent->pid = 28159
Jan 3 13:59:32 xk kernel: current->parent->pgrp = 18059
Jan 3 13:59:32 xk kernel: current->parent->tty_old_pgrp = 0
Jan 3 13:59:32 xk kernel: current->parent->session = 16467
Jan 3 13:59:32 xk kernel: current->parent->tgid = 28159
Jan 3 13:59:32 xk kernel: current->parent->leader = 0
Jan 3 13:59:32 xk kernel: ...->tty->magic = 538976288
Jan 3 13:59:32 xk kernel: ...->tty->pgrp = 1734702177
Jan 3 13:59:32 xk kernel: ...->tty->session = 108622447
Jan 3 13:59:32 xk kernel: ...->tty->device = (major:5, minor:0)

Here current->tty points to something that does not appear to be a tty structure at all (see the strange tty->magic, tty->pgrp and tty->session values).
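One quick way to test the suspicion that this memory holds something other than a tty structure is to look at the three fields as raw bytes. The sketch below does that for the values in the printout, assuming a little-endian x86 machine (an assumption; the printout itself does not state the architecture). The helper names decode_le32 and print_suspect_fields are made up for this illustration. For comparison, a valid tty_struct carries TTY_MAGIC (0x5401) in its magic field.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the four bytes of v, lowest byte first
 * (little-endian memory order), as characters into out[0..3] and
 * NUL-terminate. Non-printable bytes are shown as '.'. */
void decode_le32(uint32_t v, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char b = (unsigned char)((v >> (8 * i)) & 0xff);
        out[i] = isprint(b) ? (char)b : '.';
    }
    out[4] = '\0';
}

/* Print the three suspicious fields from the log above as text. */
void print_suspect_fields(void)
{
    const uint32_t vals[3] = { 538976288u, 1734702177u, 108622447u };
    const char *names[3] = { "magic", "pgrp", "session" };
    char buf[5];

    for (int i = 0; i < 3; i++) {
        decode_le32(vals[i], buf);
        printf("tty->%-7s = 0x%08x  \"%s\"\n",
               names[i], (unsigned)vals[i], buf);
    }
}
```

Calling print_suspect_fields() shows the magic field as four ASCII spaces (0x20202020) and the next two fields as "ateg" and "ory" followed by one non-printable byte. That looks like a fragment of ordinary text, not tty bookkeeping.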
The fact is (I have several occurences of this event) that the parent's tty is allways the same as the child's when this event happens and no parent process (of the process that triggered this event) has triggered this event. Suggesting that current->tty pointer (or structure) gets garbled already in the running parent and then it is inherited by child which in turn calls init_dev and triggers this OOPS. I am suspecting that the following is a race condition. In init_dev when re-opening an existing tty: /* check whether we're reopening an existing tty */ tty = driver->table[idx]; if (tty) goto fast_track; ... ... fast_track: if (test_bit(TTY_CLOSING, &tty->flags)) { retval = -EIO; goto end_init; } if (driver->type == TTY_DRIVER_TYPE_PTY && driver->subtype == PTY_TYPE_MASTER) { /* * special case for PTY masters: only one open permitted, * and the slave side open count is incremented as well. */ if (atomic_read(&tty->count)) { retval = -EIO; goto end_init; } atomic_inc(&tty->link->count); } atomic_inc(&tty->count); tty->driver = *driver; /* N.B. why do this every time?? */ success: ... there's a test for the TTY_CLOSING bit in the flags and *after* this test there's a simple incrementing of count(s) and the init_dev succeeds. What if between testing for TTY_CLOSING bit and incrementing of count(s) some other task is executing the release_dev and finds the count droping to 0 and thus sets TTY_CLOSING bit (to late already) and releases the structure... This could explain the garbled tty structure (already used by something else). Since init_dev is already protecting it's main part by down_tty_sem/up_tty_sem calls, the same mutex could be used in release_dev: *************** *** 1076,1082 **** { struct tty_struct *tty, *o_tty; int pty_master, tty_closing, o_tty_closing, do_sleep; ! int idx; char buf[64]; tty = (struct tty_struct *)filp->private_data; --- 1118,1124 ---- { struct tty_struct *tty, *o_tty; int pty_master, tty_closing, o_tty_closing, do_sleep; ! 
int idx, o_idx; char buf[64]; tty = (struct tty_struct *)filp->private_data; *************** *** 1091,1096 **** --- 1133,1139 ---- pty_master = (tty->driver.type == TTY_DRIVER_TYPE_PTY && tty->driver.subtype == PTY_TYPE_MASTER); o_tty = tty->link; + o_idx = o_tty? MINOR(o_tty->device) - o_tty->driver.minor_start : -1; #ifdef TTY_PARANOIA_CHECK if (idx < 0 || idx >= tty->driver.num) { *************** *** 1150,1155 **** --- 1193,1204 ---- } #endif + /* protect the concurent access to tty by init_dev */ + down_tty_sem(idx); + /* when per tty semaphores are ready, uncomment this: */ + /* if (o_tty && idx != o_idx) + down_tty_sem(o_idx); */ + if (tty->driver.close) tty->driver.close(tty, filp); *************** *** 1269,1275 **** /* check whether both sides are closing ... */ if (!tty_closing || (o_tty && !o_tty_closing)) ! return; #ifdef TTY_DEBUG_HANGUP printk(KERN_DEBUG "freeing tty structure..."); --- 1318,1324 ---- /* check whether both sides are closing ... */ if (!tty_closing || (o_tty && !o_tty_closing)) ! goto end_release_dev; #ifdef TTY_DEBUG_HANGUP printk(KERN_DEBUG "freeing tty structure..."); *************** *** 1300,1305 **** --- 1349,1361 ---- * the slots and preserving the termios structure. */ release_mem(tty, idx); + + end_release_dev: + + /* when per tty semaphores are ready, uncomment this: */ + /* if (o_tty && idx != o_idx) + up_tty_sem(o_idx); */ + up_tty_sem(idx); } /* What do wou think about this? Regarding Jasons last patch (commented in bug #131674) I couldn't quiite get it. Can someone explain what is the rare in this part of the code (I don't know what kill_pg functions are doing)... Regards, Peter Peter, the above race that you suggest, is protected against by the big kernel lock, or "lock_kernel". This lock is taken upon opening a file, and tty_release also takes it. Thus, the scenario that you describe is not possible. kill_pg, i blieve is sending a signal to an entire process group. 
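For readers who, like Peter, have not met kill_pg before: it is a kernel-internal helper that delivers a signal to every process in a process group. The closest user-space equivalent is kill() with a negative pid (or killpg()). The sketch below is a user-space illustration only, not kernel code, and the function name signal_child_group is invented here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child into its own process group, send `sig` to that whole
 * group, and report what happened.
 * Returns 1 if the child was terminated by `sig`, 0 if it ended some
 * other way, -1 on setup failure. */
int signal_child_group(int sig)
{
    int status = 0;
    pid_t pid = fork();

    assert(sig > 0);
    if (pid < 0)
        return -1;

    if (pid == 0) {
        setpgid(0, 0);      /* child: become leader of a new group */
        for (;;)
            pause();        /* sleep until a signal arrives */
    }

    /* The parent sets the group as well, so the group exists no matter
     * which side runs first after fork(). */
    if (setpgid(pid, pid) < 0) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    /* A negative pid addresses the process group, not a single process. */
    if (kill(-pid, sig) < 0) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == sig;
}
```

signal_child_group(SIGHUP) mirrors the SIGHUP that disassociate_ctty() sends to the old process group. Note that the call blocks forever if the chosen signal is ignored in the calling process (for example under nohup), because an ignored disposition is inherited across fork(); SIGKILL avoids that.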
The reason Jason's last patch makes sense (to me) is that, based on the dumps obtained from the customers (which match Peter's printk result in comment #5), the memory that hosted the tty structure seems to get released and used for another purpose. In one of the dumps, it contained text data, except for the tty->device field. If we check the code, the memory is released in do_tty_hangup() by fput():

916 set_bit(TTY_HUPPED, &tty->flags);
917 if(ld) {
918 tty_ldisc_enable(tty);
919 tty_ldisc_deref(ld);
920 }
921 unlock_kernel();
922 if (f)
923 fput(f);
924 }

That also matches the faulty script sent in by one of our customers, where the script was piping the screen output to a text file (but I'm not able to recreate the issue using the very same script):

#!/bin/sh
/usr/bin/iostat -d -x 60 2 >/usr/local/lotus/notesdata/Y8648038.TMP

I was working on a debug trace kernel that logged the entries between tty_open and the hangup code, and was planning to move the zeroing of p->tty (#994-#995 in disassociate_ctty()) to an earlier part of the routine until I saw Jason's patch.
I think Jason's patch is safer than mine.

989 current->tty_old_pgrp = 0;
990 tty->session = 0;
991 tty->pgrp = -1;
992
993 read_lock(&tasklist_lock);
994 for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
995 p->tty = NULL;
996 read_unlock(&tasklist_lock);

Per Jeff's request - adding Jason's patch here for reference purposes:

--- linux-2.4.21/drivers/char/tty_io.c.bak Mon Jan 3 19:14:55 2005
+++ linux-2.4.21/drivers/char/tty_io.c Mon Jan 3 19:15:39 2005
@@ -589,6 +589,8 @@ void disassociate_ctty(int on_exit)
 	struct list_head *l;
 	struct pid *pid;
 
+	lock_kernel();
+
 	if (tty) {
 		tty_pgrp = tty->pgrp;
 		if (on_exit && tty->driver.type != TTY_DRIVER_TYPE_PTY)
@@ -598,6 +600,7 @@ void disassociate_ctty(int on_exit)
 			kill_pg(current->tty_old_pgrp, SIGHUP, on_exit);
 			kill_pg(current->tty_old_pgrp, SIGCONT, on_exit);
 		}
+		unlock_kernel();
 		return;
 	}
 	if (tty_pgrp > 0) {
@@ -614,6 +617,7 @@ void disassociate_ctty(int on_exit)
 	for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
 		p->tty = NULL;
 	read_unlock(&tasklist_lock);
+	unlock_kernel();
 }
 
 void stop_tty(struct tty_struct *tty)

*** Bug 130774 has been marked as a duplicate of this bug. ***

As an update, the patch posted in comment #9 is, i think, along the right lines to fix this, but i don't think it does exactly what was intended. The problem here is that the tty_open path can sleep, thus giving up the BKL and opening up the tty_open path to all sorts of races against exit, release, and even ioctls, as Peter suggested. I hope to cook up a test patch for this tomorrow, and hopefully Peter can help us test it. thanks.

Created attachment 109406 [details]
ttty patch
Ok, here is a patch which might address this issue, that i've done some testing
on. The only weirdness that i've seen with it is 'pidof' failing to read the sid
intermittently during bootup. i'm not sure if this is related to the patch or
not.
This patch closes most of the holes i've seen in the tty_open vs. hangup,
disassociate, release, ioctls, etc. It leaves some smaller holes still open,
and could use some cleanup, but i think this prototype might be worth testing
if it can pass basic smoke tests.
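The races discussed so far share a check-then-use window: one path looks up the tty and takes a reference while another path drops the last reference and frees the structure. A minimal user-space sketch of the usual remedy follows, assuming a single lock that covers both the lookup-and-increment and the decrement-and-free. It is not the attached patch. The names slot, slot_open, slot_release and slot_is_live are invented for this illustration, and a pthread mutex stands in for whatever kernel lock would actually be used.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct obj {
    int count;                  /* number of open references */
};

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *slot;        /* stands in for driver->table[idx] */

/* Look up (or create) the object and take a reference. The lookup and
 * the increment happen under one lock, so a concurrent release cannot
 * free the object in between. Returns NULL on allocation failure. */
struct obj *slot_open(void)
{
    struct obj *o;

    pthread_mutex_lock(&slot_lock);
    if (slot == NULL)
        slot = calloc(1, sizeof(*slot));
    o = slot;
    if (o != NULL)
        o->count++;
    pthread_mutex_unlock(&slot_lock);
    return o;
}

/* Drop a reference; the last one clears the table slot and frees the
 * object while still holding the same lock. */
void slot_release(struct obj *o)
{
    assert(o != NULL);
    pthread_mutex_lock(&slot_lock);
    if (--o->count == 0) {
        slot = NULL;
        free(o);
    }
    pthread_mutex_unlock(&slot_lock);
}

/* Report whether the table slot currently holds an object. */
int slot_is_live(void)
{
    int live;

    pthread_mutex_lock(&slot_lock);
    live = (slot != NULL);
    pthread_mutex_unlock(&slot_lock);
    return live;
}
```

The design point is only that the open and release paths agree on one lock for the whole critical section. A lock that is silently dropped when a path sleeps, as described for the BKL above, does not give that guarantee.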
I have deployed this latest Jason's patch to our 7 machines 18 hours ago and the bug hasn't showed up yet. From my experience with our setup we should wait at least for a month... Regards, Peter There is a new dump (without Jason's fix) - tentatively tie that ticket with this bugzilla. The panic route is different: PID: 3215 TASK: f76b6000 CPU: 0 COMMAND: "sendmail" #0 [f76b7d9c] die at c010c5df #1 [f76b7dac] do_page_fault at c011ff09 #2 [f76b7e70] error_code (via page_fault) at c03f21c0 EAX: 00000001 EBX: 00280000 ECX: c711decc EDX: 00000000 EBP: c73dbca4 DS: 0068 ESI: f76b6000 ES: 0068 EDI: 00000001 CS: 0060 EIP: c01b0211 ERR: ffffffff EFLAGS: 00010202 #3 [f76b7eac] disassociate_ctty at c01b0211 #4 [f76b7ec8] do_exit at c012d71d #5 [f76b7ee4] do_group_exit at c012d926 #6 [f76b7ef8] get_signal_to_deliver at c01372bb #7 [f76b7f20] do_signal at c010beef #8 [f76b7fc0] signal_return at c03f20a3 EAX: fffffffc EBX: 00300000 ECX: 00000001 EDX: 00000001 DS: 002b ESI: 0000000e ES: 002b EDI: 00000000 SS: 002b ESP: bfffa480 EBP: bfffa4ac CS: 0023 EIP: 00c6dc30 ERR: ffffffff EFLAGS: 00010296 Dis-disassemble disassociate_ctty at c01b0211 shows it crashed at drivers/char/tty_io.c: 593 0xc01b0211 <disassociate_ctty+33>: mov 0x108(%ebx),%esi 584 void disassociate_ctty(int on_exit) 585 { 586 struct tty_struct *tty = current->tty; 587 struct task_struct *p; 588 int tty_pgrp = -1; 589 struct list_head *l; 590 struct pid *pid; 591 592 if (tty) { 593 tty_pgrp = tty->pgrp; 594 if (on_exit && tty->driver.type != TTY_DRIVER_TYPE_PTY) 595 tty_vhangup(tty); crash> struct task_struct f76b6000 | grep tty tty_old_pgrp = 0, tty = 0x280000, The tty address (0x280000) matches with %ebx. This looks like an user mode address. We need to keep an eye on this panic route. Peter, you may have metioned this already, but out of curiosity what is the primary workload for these boxes? Does /sbin/lsmod show any tainted modules? 
thanks

Well, the 3 machines with most of these panics are running an Oracle 10g database cluster. We are also using VERITAS Volume Manager, which has non-GPL modules that taint the kernel. But the modules are certified (by VERITAS) to work with the RHEL 3 update 2 kernel... Nevertheless, Jason's patch looks promising. We have now been running for 2 days without stomping on this bug.

hi Peter, after reviewing this some more, the patch in #23 that i suggested is not complete and has some problems. Nevertheless, it does fix some things, and if you haven't observed any issues, i would suggest leaving it running. I probably won't have a more complete patch until early next week. thanks.

hi Peter, any updates? thanks, -jason

Sorry, I've been (still am) off for some days. I'm sorry to inform you that the patch does not seem to work for our problems. After installing it 7 days ago on 7 machines, one of them had 5 occurrences of the bug:

Jan 11 17:36:05 xi kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS)
Jan 11 18:39:33 xi last message repeated 2 times
Jan 12 20:39:06 xi last message repeated 2 times

What were the problems you still saw with your patch that were mentioned in #30? Peter

The patch that i sent actually has some deadlocks and it doesn't entirely close all the holes i saw. I could post an updated patch, but without an easy way to reproduce this, I'm not sure how much value we'd get out of it. I'm going to concentrate now on getting a reproducer for this in house. thanks for the feedback.

Well, if you don't find one (a reproducer) you can always ask me to try the patch and see if it has some effect on our system. Regards, Peter

A proposed patch to fix possible data corruption due to /proc/kcore access has been attached to bug 141394 in comment #56.
That patch has been shown to resolve one particular scenario, but it is still undergoing code review and further testing to verify whether it addresses other data corruption scenarios (possibly this one). Peter, the patch that Ernie is suggesting is low risk, has a chance of resolving this issue, and so we would really like to know if it resolves this issue for you. Bugzilla #141394, comment #56, contains the patch, and comment #58 contains kernel RPMS with this patch included. thanks. Ok, I have applied the patch to 2.4.21-20.ELsmp kernel on our 7 servers and am waiting... So far, so good. 3 days without the OOPS. Should wait at least for a couple of weeks to be confident. Unfortunately, the OOPS is still here in spite of the patch from Bugzilla #141394 comment #56. Jan 25 12:22:05 xi kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS) Jan 25 12:22:41 xi kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS) Jan 26 17:07:32 xi kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS) Peter Thanks for the update, Peter. I guess it's back to the drawing board for Jason. Peter, are you running any HP monitoring agents? Could we please get the output of /sbin/lsmod posted. thanks. Created attachment 110283 [details]
tty_open_untainted1.txt
Created attachment 110284 [details]
tty_open_untainted2.txt
Incomplete untainted trace.
Created attachment 110286 [details]
lsmod output on our most OOPS-y machine (xi)...
No, I'm not running any HP software. Created attachment 110502 [details] tty debugging patch here is a testing patch to try and catch somebody freeing the tty structure when it really shouldn't be getting freed. test kernels with this patch can be found at: http://people.redhat.com/jbaron/tty-debug/ I will apply this patch to our kernel that already has the patch that kprints in the event when default kernel would OOPS. When this tty debugging patch prints something, what would you know? Would it print enough to pin-point the problem code? Peter This printk is intended to trigger before the printks that you added. Given the traces we have, it would appear that the tty structure is being freed while somebody still has a reference to it. This patch should hopefully help confirm this suspicion. Then, we could further investigate how the system could get into this erroneous state. hi Peter, As we're still not to the bottom of this, i'm wondering about the patch you posted in bug 131674, comment #44, which returns -ENODEV when the driver table is NULL. My question is basically does that resolve this issue for you? do you notice any other corruption as a result? Also, any luck with the comment #58 patch? thanks. Sorry, I was in bed with temperature for the whole week, so I managed to deploy the patch from comment #58 only today. Regarding my "workarround" we have managed to supress kernel panics with this yes. I haven't yet found any other side effect, but I have this scarry feeling that something might go wrong sometime and corrupt our database etc if we don't find the real cause of this bug. I'll keep you posted about the comment #58 patch findings as they apear. Regards, Peter Ok, I have one ocurence of my OOPS but your printk has not been triggered: Feb 12 21:14:21 xk kernel: init_dev: device=(major:5, minor:0): driver->table is NULL (this would OOPS) Feb 12 21:14:21 xk last message repeated 8 times ... 
but no sign of your: printk("%s: line: %d, o_tty->count is: %i!!!\n", __FILE__, __LINE__, count); ... being triggered before or after or anywhere... Regards, Peter Peter, out of curiosity find anything like the following in any of your logs. They could come out very early on, not necessarily coninciding with the oops. thanks. tty_io.c: process 12296 (sh) used obsolete /dev/cua - update software to use /dev/ttyS13 Sorry for the delay. Yes, we have them. Exactly on the machines that experience my OOPS. Here's on one of them: Dec 27 21:22:04 xk kernel: tty_io.c: process 24934 (sh) used obsolete /dev/cua - update software to use /dev/ttyS31 Dec 27 21:22:04 xk kernel: tty_io.c: process 24947 (sh) used obsolete /dev/cua - update software to use /dev/ttyS31 Dec 31 11:02:25 xk kernel: tty_io.c: process 12421 (sh) used obsolete /dev/cua - update software to use /dev/ttyS30 Dec 31 11:02:25 xk kernel: tty_io.c: process 12421 (sh) used obsolete /dev/cua - update software to use /dev/ttyS30 Dec 31 11:02:25 xk kernel: tty_io.c: process 12429 (sh) used obsolete /dev/cua - update software to use /dev/ttyS30 Dec 31 19:14:25 xk kernel: tty_io.c: process 2383 (sh) used obsolete /dev/cua - update software to use /dev/ttyS1 Dec 31 19:14:25 xk kernel: tty_io.c: process 2394 (sh) used obsolete /dev/cua - update software to use /dev/ttyS1 Jan 25 10:10:44 xk kernel: tty_io.c: process 29618 (sh) used obsolete /dev/cua - update software to use /dev/ttyS33 Jan 25 10:10:44 xk kernel: tty_io.c: process 29622 (sh) used obsolete /dev/cua - update software to use /dev/ttyS33 Jan 25 10:10:44 xk kernel: tty_io.c: process 29626 (sh) used obsolete /dev/cua - update software to use /dev/ttyS33 Jan 25 10:10:44 xk kernel: tty_io.c: process 29630 (sh) used obsolete /dev/cua - update software to use /dev/ttyS33 Jan 25 10:10:44 xk kernel: tty_io.c: process 29634 (sh) used obsolete /dev/cua - update software to use /dev/ttyS33 Feb 13 12:14:21 xk kernel: tty_io.c: process 7757 (sh) used obsolete /dev/cua - 
update software to use /dev/ttyS47 Feb 13 12:14:21 xk kernel: tty_io.c: process 7761 (sh) used obsolete /dev/cua - update software to use /dev/ttyS47 Feb 13 12:14:21 xk kernel: tty_io.c: process 7765 (sh) used obsolete /dev/cua - update software to use /dev/ttyS47 Feb 13 12:14:21 xk kernel: tty_io.c: process 7769 (sh) used obsolete /dev/cua - update software to use /dev/ttyS47 Feb 13 12:14:21 xk kernel: tty_io.c: process 7773 (sh) used obsolete /dev/cua - update software to use /dev/ttyS47 Are they in any relation to this bug? Regards, Peter that's my question as well...it seems like a promising lead.... Created attachment 111426 [details]
kysmopps output
I am having similar output; my comments were lost when I created the attachment. I'm using LVM and ext3 on a new install that was up2date'd and has been running for only a short period of time.

It looks like the beginning of the oops is missing...the part with the EIP. Do you have that? thanks.

Created attachment 111453 [details]
EIP from panic
Console dump with EIP
I think I am experiencing the same bug. I get the same dump screen, I'm running the same kernel and I'm also running an Oracle 10g database cluster. My nodes have been up for about three weeks and I've only had one occurrence of this bug. How can I help?

Same here. We have a 2 node Oracle 10g RAC which crashes every few weeks. How can I help? Any special kernel to "try"? Crash logs and memory dumps are available.

Still haven't found a reproducer. I don't remember, and it's not clear to me now: was the patch in comment #9 tested?

The comment #9 patch was covered by what was in comment #23, which apparently did not fix this issue.

In an effort to fix this, since no reproducer is yet known, i'm going to post ongoing test kernels at: http://people.redhat.com/~jbaron/.private/tty-debug These kernels will have fixes and debugging patches. Anybody who wants to add patches to these kernels, please suggest them here. People testing them should provide feedback here as well. The kernel changelog will have a description of the patches added. I'll keep a copy of that at: http://people.redhat.com/~jbaron/.private/tty-debug/changelog, for easy reference. thanks.

oops, proper URLs are:
http://people.redhat.com/~jbaron/tty-debug/
http://people.redhat.com/~jbaron/tty-debug/changelog

Would a netdump help? I'm attempting to configure netdump and catch the crash, but I'm not sure if this is better than installing the debugging kernel or a waste of time.

I have already captured a netdump of this panic some time ago. If anybody needs it, it can be downloaded from:
FTP server: ftp.select-tech.si
username: redhat
password: netdump
file name: 372036_xi_2004-11-08_vmcore+log.tar.bz2
Peter

I believe we have gotten to the bottom of this issue. The basic problem was that controlling ttys were not being properly cleared in multi-threaded applications. I've built testing kernels with the fix. thanks.
http://people.redhat.com/jbaron/2.4.21-28.EL.session.5/ I just got another crash, here is the output, I hope it help Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: c01ad375 *pde = 0c094001 *pte = 00000000 Oops: 0000 netconsole hangcheck-timer lp parport oracleasm autofs audit sr_mod cdrom st iscsi_sfnet iptable_filter ip_tables e1000 tg3 floppy sg keybdev mousedev hid inp CPU: 0 EIP: 0060:[<c01ad375>] Tainted: GF EFLAGS: 00010246 EIP is at init_dev [kernel] 0x55 (2.4.21-15.0.2.ELsmp/i686) eax: 00000000 ebx: 00000500 ecx: 00000000 edx: 00000000 esi: 00000000 edi: f5cca200 ebp: c04cfe00 esp: ccafde80 ds: 0068 es: 0068 ss: 0068 Process sh (pid: 31537, stackpage=ccafd000) Stack: 00000000 00000000 c0140ab2 7e3d5025 00000000 00000000 73edb025 c2d96618 f5241100 00000000 b756f540 e5108e00 00030002 ccafc000 00000000 f5cca200 f3d4e680 c01adfd6 00000500 ccafdee4 f7fdda00 eee3b008 00000000 c0179a50 Call Trace: [<c0140ab2>] do_anonymous_page [kernel] 0x252 (0xccafde88) [<c01adfd6>] tty_open [kernel] 0x66 (0xccafdec4) [<c0179a50>] dput [kernel] 0x30 (0xccafdedc) [<c016f436>] link_path_walk [kernel] 0x656 (0xccafdef0) [<c01612d8>] get_chrfops [kernel] 0x98 (0xccafdf00) [<c01397c3>] in_group_p [kernel] 0x23 (0xccafdf08) [<f887d7ea>] ext3_permission [ext3] 0xaa (0xccafdf10) [<c0161591>] chrdev_open [kernel] 0x71 (0xccafdf38) [<c015f7e0>] dentry_open [kernel] 0x110 (0xccafdf54) [<c015f6c8>] filp_open [kernel] 0x68 (0xccafdf70) [<c015fad3>] sys_open [kernel] 0x53 (0xccafdfa8) Code: 8b 04 88 89 44 24 30 85 c0 0f 84 9c 00 00 00 8b 54 24 30 8b CPU#1 is frozen. < netdump activated - performing handshake with the client. 
> Pid/TGid: 31537/31537, comm: sh EIP: 0060:[<c01ad375>] CPU: 0 EIP is at init_dev [kernel] 0x55 (2.4.21-15.0.2.ELsmp) ESP: 0000:00000000 EFLAGS: 00010246 Tainted: GF EAX: 00000000 EBX: 00000500 ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: f5cca200 EBP: c04cfe00 DS: 0068 ES: 0068 FS: 0000 GS: 0033 CR0: 8005003b CR2: 00000000 CR3: 29cd1100 CR4: 000006f0 Call Trace: [<c0140ab2>] do_anonymous_page [kernel] 0x252 (0xccafde88) [<c01adfd6>] tty_open [kernel] 0x66 (0xccafdec4) [<c0179a50>] dput [kernel] 0x30 (0xccafdedc) [<c016f436>] link_path_walk [kernel] 0x656 (0xccafdef0) [<c01612d8>] get_chrfops [kernel] 0x98 (0xccafdf00) [<c01397c3>] in_group_p [kernel] 0x23 (0xccafdf08) [<f887d7ea>] ext3_permission [ext3] 0xaa (0xccafdf10) [<c0161591>] chrdev_open [kernel] 0x71 (0xccafdf38) [<c015f7e0>] dentry_open [kernel] 0x110 (0xccafdf54) [<c015f6c8>] filp_open [kernel] 0x68 (0xccafdf70) [<c015fad3>] sys_open [kernel] 0x53 (0xccafdfa8) free sibling task PC stack pid father child younger older init S 00000001 2628 1 0 4 2 (NOTLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa5ea0) [<c0134105>] schedule_timeout [kernel] 0x65 (0xf7fa5ee4) [<c015609c>] __get_free_pages [kernel] 0x1c (0xf7fa5eec) [<c0175391>] __pollwait [kernel] 0x31 (0xf7fa5ef0) [<c0134090>] process_timeout [kernel] 0x0 (0xf7fa5f04) [<c017565b>] do_select [kernel] 0x13b (0xf7fa5f1c) [<c0175afe>] sys_select [kernel] 0x34e (0xf7fa5f60) [<c016acb9>] sys_fstat64 [kernel] 0x49 (0xf7fa5fa8) migration/0 S 00000000 5500 2 0 3 1 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa3f68) [<c0124ebb>] migration_task [kernel] 0x30b (0xf7fa3fac) [<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa3fc4) [<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa3fe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7fa3ff0) migration/1 S 00000001 5488 3 0 2 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf7fa1f68) [<c0124ebb>] migration_task [kernel] 0x30b (0xf7fa1fac) [<c0124bb0>] migration_task [kernel] 
0x0 (0xf7fa1fc4) [<c0124bb0>] migration_task [kernel] 0x0 (0xf7fa1fe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7fa1ff0) keventd S 00000000 5124 4 1 5 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc48ddf48) [<c013b007>] context_thread [kernel] 0x117 (0xc48ddf8c) [<c013aef0>] context_thread [kernel] 0x0 (0xc48ddfe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc48ddff0) ksoftirqd/0 R 00000000 4964 5 1 6 4 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc2e3ff88) [<c012f16f>] ksoftirqd [kernel] 0xbf (0xc2e3ffcc) [<c012f0b0>] ksoftirqd [kernel] 0x0 (0xc2e3ffe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc2e3fff0) ksoftirqd/1 S 00000001 4688 6 1 9 5 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc2e3df88) [<c012f16f>] ksoftirqd [kernel] 0xbf (0xc2e3dfcc) [<c012f0b0>] ksoftirqd [kernel] 0x0 (0xc2e3dfe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc2e3dff0) bdflush S 00000001 4612 9 1 7 6 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc46d3f60) [<c0123972>] interruptible_sleep_on [kernel] 0x52 (0xc46d3fa4) [<c012f09a>] __run_task_queue [kernel] 0x6a (0xc46d3fbc) [<c01663bf>] bdflush [kernel] 0xff (0xc46d3fd4) [<c01662c0>] bdflush [kernel] 0x0 (0xc46d3fe4) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d3ff0) kswapd S 00000000 4492 7 1 8 9 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc46d7f00) [<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d7f44) [<c0134090>] process_timeout [kernel] 0x0 (0xc46d7f64) [<c0153fea>] wakeup_memwaiters [kernel] 0xca (0xc46d7f7c) [<c0153ce4>] kswapd [kernel] 0x84 (0xc46d7fd0) [<c0153c60>] kswapd [kernel] 0x0 (0xc46d7fe4) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d7ff0) kscand S 00000001 4560 8 1 10 7 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc46d5f4c) [<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d5f90) [<c0134090>] process_timeout [kernel] 0x0 (0xc46d5fb0) [<c01541c5>] kscand [kernel] 0x55 (0xc46d5fc8) 
[<c0154170>] kscand [kernel] 0x0 (0xc46d5fe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d5ff0) kupdated S 00000000 4188 10 1 11 8 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xc46d1f58) [<c0134105>] schedule_timeout [kernel] 0x65 (0xc46d1f9c) [<c0167871>] sync_supers [kernel] 0x131 (0xc46d1fa4) [<c0134090>] process_timeout [kernel] 0x0 (0xc46d1fbc) [<c01664bf>] kupdate [kernel] 0x8f (0xc46d1fd4) [<c0166430>] kupdate [kernel] 0x0 (0xc46d1fe4) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc46d1ff0) mdrecoveryd S 00000000 5612 11 1 22 10 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf7ff7f48) [<c0213545>] md_thread [kernel] 0x1c5 (0xf7ff7f8c) [<c0213380>] md_thread [kernel] 0x0 (0xf7ff7fe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf7ff7ff0) raid1d S 00000001 5612 22 1 23 11 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf74a3f48) [<c0213545>] md_thread [kernel] 0x1c5 (0xf74a3f8c) [<f8858544>] .rodata.str1.1 [raid1] 0x75 (0xf74a3f94) [<c0213380>] md_thread [kernel] 0x0 (0xf74a3fe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf74a3ff0) raid1d S 00000001 5612 23 1 24 22 (L-TLB) Call Trace: [<c0123274>] schedule [kernel] 0x2f4 (0xf749df48) [<c0213545>] md_thread [kernel] 0x1c5 (0xf749df8c) [<f8858544>] .rodata.str1.1 [raid1] 0x75 (0xf749df94) [<c0213380>] md_thread [kernel] 0x0 (0xf749dfe0) [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf749dff0) i am confident that this issue is fixed. Please see comment #100. I don't need any more crashes, but thanks anyways. *** Bug 130774 has been marked as a duplicate of this bug. *** A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-30.EL). Will this fix and new kernel be available in up2date or will it only be possible to install it if you are at update 5? Jerry, the new kernel will be available in the beta channel in a week or so. 
It will become available in the main channel when U5 is officially released. Either way, "up2date" is one of the ways to get the new kernel. *** Bug 151086 has been marked as a duplicate of this bug. *** *** Bug 150334 has been marked as a duplicate of this bug. *** A fix for this problem has also been committed to the RHEL3 E5 patch pool this evening (in kernel version 2.4.21-27.0.3.EL). eric_bursley We also have seen this problem with one of our customers, and if possible would like a copy of the beta kernel for testing. An updated kernel containing a fix for this issue is currently in QA and we are expecting to release it publically either later today or in the first half of next week. Besides the imminent release of the 2.4.21-27.0.3.EL in said security erratum, the fix is already contained in the current U5 beta kernel 2.4.21-31.EL that is in the RHN beta channels (and thus you can get a fixed kernel *now*). An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-293.html Correction to comment #123: the security advisory listed above contains the 2.4.21-27.0.4.EL kernel version. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html Is this bug applicable to the linux-2.6.11 kernel. no. |