I haven't been able to reproduce this a second time yet, so I'll assign this to myself for now. After I installed kernel 549 from Arjan's page, I hit a kernel panic. I'll attach the output in a moment.
Created attachment 103600 [details] console output from panic including tracebacks
Traceback that just happened without a panic when I issued umount. I assume it's part of the same problem, so I'll drop the output here.

Badness in interruptible_sleep_on_timeout at kernel/sched.c:3004
Call Trace:
 [<a000000100016a80>] show_stack+0x80/0xa0 sp=e000003013837bb0 bsp=e000003013830fe0
 [<a000000100558370>] interruptible_sleep_on_timeout+0x1f0/0x300 sp=e000003013837d80 bsp=e000003013830fa0
 [<a00000020056faa0>] lockd_down+0x200/0x440 [lockd] sp=e000003013837db0 bsp=e000003013830f80
 [<a0000002006420d0>] nfs_kill_super+0x1b0/0x280 [nfs] sp=e000003013837db0 bsp=e000003013830f60
 [<a000000100117c90>] deactivate_super+0x130/0x180 sp=e000003013837db0 bsp=e000003013830f30
 [<a00000010014f370>] __mntput+0x50/0x80 sp=e000003013837db0 bsp=e000003013830f08
 [<a000000100129140>] path_release_on_umount+0x60/0x80 sp=e000003013837db0 bsp=e000003013830ee8
 [<a0000001001505e0>] sys_umount+0x540/0x9e0 sp=e000003013837db0 bsp=e000003013830e60
 [<a00000010000f320>] ia64_ret_from_syscall+0x0/0x20 sp=e000003013837e30 bsp=e000003013830e60
 [<a000000000010640>] 0xa000000000010640 sp=e000003013838000 bsp=e000003013830e60

[root@altix1 ~]# uname -a
Linux altix1.lab.boston.redhat.com 2.6.8-1.549 #1 SMP Mon Sep 6 16:10:54 EDT 2004 ia64 ia64 ia64 GNU/Linux
Created attachment 103602 [details] More tracebacks (it panicked shortly after). ext3 shows up in these too. I'll see if I can borrow a tiger box and check whether it has the same problems.
As you may have guessed from the tracebacks, this is not an Altix-specific problem. I installed a tiger SDV in the lab with RHEL4 0907, then put the 549 kernel on the system. Once I did that, I just mounted and unmounted a cdrom a few times and got a similar traceback. This system didn't have a serial console, but it looks very similar. I'm going to unassign this - not because I'm not interested or don't want to help, but because I'm probably not the best person to look at this. If no one more experienced in this area wants to look, I can take it back and try to fumble my way through it. I wanted to at least test to be sure it affected more than just Altix.
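For reference, the trigger amounts to cycling mount/umount on the drive until the kernel log shows trouble. A minimal sketch of the loop (the device and mountpoint paths, the cycle count, and the grep patterns are assumptions; adjust for the box under test):

```shell
#!/bin/sh
# Hypothetical repro loop: repeatedly mount and unmount the CD, watching
# dmesg for a traceback. DEV and MNT defaults are assumptions.
DEV=${DEV:-/dev/cdrom}
MNT=${MNT:-/mnt/cdrom}
i=0
while [ "$i" -lt 50 ]; do
    mount -t iso9660 "$DEV" "$MNT" || break   # stop if the mount itself fails
    umount "$MNT" || break
    i=$((i + 1))
    if dmesg | grep -q -e 'Oops' -e 'Badness'; then
        echo "hit a traceback after $i cycles"
        break
    fi
done
echo "finished after $i mount/umount cycles"
```

On the boxes above the traceback showed up within a handful of cycles, so 50 iterations should be more than enough.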
At first glance: in a couple of cases we've got a buffer_head data struct going haywire and oopsing processes which access it. In one other, it's an skbuf. That could be _anything_. Have you been trying other kernels on the same box with the same tests? Which was the last one to run correctly?
Using a McKinley SDV/Tiger, I can't seem to induce the failure using the fedora kernel 2.6.8-1.540.
Jesse asked me about this, so I guess it's worth mentioning. On an Altix at SGI in Eagan, I got a traceback when I unmounted/remounted the CD many, many times. It didn't panic. This was a linux-2.5 bk pull from earlier this morning, so I'm not yet sure whether it's truly related. I noticed that one of the differences between the 540 and 549 spec files is that one has patch-2.6.9-rc1-bk7.bz2 and the other bk13.bz2. Here is a diff between the spec files (I haven't carefully checked the actual %patch lines yet):

 Patch1: patch-2.6.9-rc1.bz2
-Patch2: patch-2.6.9-rc1-bk7.bz2
+Patch2: patch-2.6.9-rc1-bk13.bz2
 #
 # Patches 10 to 100 are upstream patches we want to back out
@@ -238,7 +238,6 @@
 #
 Patch1000: linux-2.4.0-test11-vidfail.patch
-Patch1010: linux-2.6.9-barrier.patch
 Patch1020: linux-2.6.4-stackusage.patch
 Patch1030: linux-2.6.5-ext3-reservations.patch
 Patch1031: linux-2.6.8-ext3-reservations-update.patch
@@ -251,9 +250,8 @@
 Patch1081: linux-2.6.7-early-schedule.patch
 Patch1090: linux-2.6.7-netdump.patch
 Patch1100: linux-2.6.7-i8042.patch
-Patch1110: linux-2.6.7-symlink.patch
+Patch1110: linux-2.6.9-irqfixup.patch
 Patch1120: linux-2.6.7-scsi-whitelist.patch
-Patch1130: linux-2.6.9-xattr.patch
 Patch1140: linux-2.6.9-blockfixes.patch
 Patch2000: linux-2.6.3-printopen.patch

So, if this possibility holds, I might be able to induce crashes with 540 by upgrading to the 2.6.9-rc1-bk13 patch? I'll give it a shot.
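If anyone wants to run the same experiment, the Patch2 swap would look roughly like the sketch below. To keep it safe to run anywhere, this operates on a scratch copy; the spec path, the spec name kernel-2.6.spec, and the rpmbuild invocation in the trailing comment are assumptions based on a stock RPM build tree, not the exact commands used here:

```shell
#!/bin/sh
# Sketch: point Patch2 at the newer bk13 snapshot in the kernel spec file.
# SPEC defaults to a throwaway demo copy; set SPEC to the real spec to do
# it for real (and copy patch-2.6.9-rc1-bk13.bz2 into SOURCES first).
SPEC=${SPEC:-/tmp/kernel-2.6.spec.demo}
cat > "$SPEC" <<'EOF'
Patch1: patch-2.6.9-rc1.bz2
Patch2: patch-2.6.9-rc1-bk7.bz2
EOF
# swap bk7 for bk13, exactly the one-line change seen in the spec diff
sed -i 's/patch-2\.6\.9-rc1-bk7\.bz2/patch-2.6.9-rc1-bk13.bz2/' "$SPEC"
grep '^Patch2:' "$SPEC"
# then rebuild, e.g.:  rpmbuild -ba --target ia64 kernel-2.6.spec
```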
There were some patch dependencies I didn't want to try to resolve, so instead I put on the kernel I had used in the SGI Eagan office (a bk pull from linux-2.5 this morning) and got an unable-to-handle-kernel-paging-request on my third mount attempt:

Unable to handle kernel paging request at virtual address 0100040600080118
kjournald[898]: Oops 8813272891392 [1]
Modules linked in:
Pid: 898, CPU 0, comm: kjournald
psr : 0000101008126030 ifs : 8000000000001025 ip : [<a000000100293250>] Not tainted
ip is at journal_commit_transaction+0x2b0/0x2ee0
unat: 0000000000000000 pfs : 0000000000001025 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000005541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001002931f0 b6  : a0000001000155e0 b7  : a000000100011b70
f6  : 1003e8080808080808081 f7  : 1003e0000000000001400
f8  : 1003e0000000000001400 f9  : 1003e00000000000027d8
f10 : 1003e000000000ff00000 f11 : 1003e000000003b5f2d38
r1  : a000000100c2c7b0 r2  : 0000000000000000 r3  : e00001b07b020e40
r8  : e00001b07b020e50 r9  : 0000000000000001 r10 : 0000000000000002
r11 : 0000000000000000 r12 : e00001b07b027b10 r13 : e00001b07b020000
r14 : 0000000000000000 r15 : a000000100a8a420 r16 : 0000000000000000
r17 : e00001b004a45c00 r18 : 0000000000000000 r19 : a000000100a67500
r20 : 0000000000000008 r21 : 0000000000000007 r22 : 0000000000014d9d
r23 : 0000000000000000 r24 : 0000000000000001 r25 : a000000100a2d9f0
r26 : e00001300518cc88 r27 : 0000001008126030 r28 : a0000001006bc150
r29 : 0000000000014d9e r30 : 0000000000000800 r31 : 0000000000000022

Call Trace:
 [<a000000100019fe0>] show_stack+0x80/0xa0 sp=e00001b07b0276a0 bsp=e00001b07b0210d0
 [<a00000010003f850>] die+0x170/0x200 sp=e00001b07b027870 bsp=e00001b07b021098
 [<a00000010005e140>] ia64_do_page_fault+0x200/0xa00 sp=e00001b07b027870 bsp=e00001b07b021038
 [<a000000100012320>] ia64_leave_kernel+0x0/0x270 sp=e00001b07b027940 bsp=e00001b07b021038
 [<a000000100293250>] journal_commit_transaction+0x2b0/0x2ee0 sp=e00001b07b027b10 bsp=e00001b07b020f08
 [<a00000010029b570>] kjournald+0x150/0x460 sp=e00001b07b027d80 bsp=e00001b07b020ea8
 [<a00000010001c100>] kernel_thread_helper+0xe0/0x100 sp=e00001b07b027e30 bsp=e00001b07b020e80
 [<a000000100009060>] start_kernel_thread+0x20/0x40 sp=e00001b07b027e30 bsp=e00001b07b020e80
note: kjournald[898] exited with preempt_count 1
So I tried this test on ia32. I installed an ia32 box in Boston (tiamat) with the RHEL4 re0909 nightly, then put the 549 arjan kernel on it. I mounted the cdrom; when I unmounted, the system hung. The graphics console said (no serial console, so this is copied by hand as best I could):

spin_is_locked on uninitialized spinlock 11fd9818

A bunch of these spew. Line numbers 165 and 167 of transaction.c are referenced, and the system is generally hung at this point. Switching to another vc and typing something just results in more spinlock messages. I'm guessing this is the same issue, but it sure has a different failure mode(?). Thoughts? I'll do a new bug search shortly, I guess.
So is this related to bug 132152 ?
In my comment above, I mentioned that transaction.c line numbers 165 and 167 were called out. I re-checked my hand-written notes: it's actually 165 and 177. This makes more sense - those two lines are the spin_lock and spin_unlock calls.
Since I see a failure in the same situation on x86, I've decided to adjust the summary slightly and mark it for all platforms. The way to induce the problem is the same but what happens to the system is different. I think it's likely related.
Umm, now I'm confused. The first report here was in the middle of a wget(1), with no mention of CD at all. Now we've got a set of CD reports, involving ext3, with no backtraces. Are you telling us that the initial one was involving CD too? And is it really an ext3 CD (!), or what? #132152 looks completely unrelated at first glance. Given that this is reproducible, please try to hook up a serial console and capture a trace.
Good point - sorry about that. Mounting/unmounting the CD is the only way I can make the problem happen easily. You're right that the wget was the first way I hit the problem; maybe I shouldn't have changed the subject like that. For the ia32 box, there isn't a backtrace - it hangs forever with spinlock messages. If you're sure that's a separate issue, we can file a different bug on it, but since the trigger is the same, I was thinking they could be related. Do you want me to force a backtrace using magic SysRq? For ia64, I've included several backtrace attachments.
I guess I'm too used to kdb. The closest we could do is dump the registers from SysRq. I could patch in KDB if that would help.
Again, what sort of fs is on the CD? And yes, for hangs, the more information the better: at a very minimum, alt-sysrq-t and -p will help. If you have time to get at it with kdb, then sure, the more information you can capture the better.
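For the record, the alt-sysrq-t / alt-sysrq-p requests above can also be issued through procfs if a shell is still alive on the hung box. A sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ and root access (the paths are the standard procfs ones, nothing specific to this bug):

```shell
#!/bin/sh
# Ask the kernel for task and register dumps via the magic SysRq interface.
# The output lands in the kernel ring buffer / console.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
    echo t > /proc/sysrq-trigger      # dump state of all tasks (alt-sysrq-t)
    echo p > /proc/sysrq-trigger      # dump registers/CPU state (alt-sysrq-p)
    dmesg | tail -n 100               # read back the traces
else
    echo "sysrq-trigger not writable (need root?)"
fi
```

On the keyboard of a hung box, alt-sysrq-t and alt-sysrq-p do the same thing without needing a shell, which is why they're the minimum ask here.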
For tiamat, the ia32 box, it's the rhel4 boot.iso and it mounts as iso9660. For ia64, it's the boot.iso for rhel4 ia64.
This problem seems to be gone. I tested with fedora core nightly Oct 12. Closing.