I haven't been able to reproduce this a second time yet, so I'll assign this to myself for now. After I installed kernel 549 from Arjan's page, I hit a kernel panic. I'll attach the output in a moment.
Created attachment 103600 [details] console output from panic including tracebacks
Traceback that just happened without a panic when I issued umount. I assume it's part of the same problem, so I'll drop the output here.

Badness in interruptible_sleep_on_timeout at kernel/sched.c:3004
Call Trace:
 [<a000000100016a80>] show_stack+0x80/0xa0 sp=e000003013837bb0 bsp=e000003013830fe0
 [<a000000100558370>] interruptible_sleep_on_timeout+0x1f0/0x300 sp=e000003013837d80 bsp=e000003013830fa0
 [<a00000020056faa0>] lockd_down+0x200/0x440 [lockd] sp=e000003013837db0 bsp=e000003013830f80
 [<a0000002006420d0>] nfs_kill_super+0x1b0/0x280 [nfs] sp=e000003013837db0 bsp=e000003013830f60
 [<a000000100117c90>] deactivate_super+0x130/0x180 sp=e000003013837db0 bsp=e000003013830f30
 [<a00000010014f370>] __mntput+0x50/0x80 sp=e000003013837db0 bsp=e000003013830f08
 [<a000000100129140>] path_release_on_umount+0x60/0x80 sp=e000003013837db0 bsp=e000003013830ee8
 [<a0000001001505e0>] sys_umount+0x540/0x9e0 sp=e000003013837db0 bsp=e000003013830e60
 [<a00000010000f320>] ia64_ret_from_syscall+0x0/0x20 sp=e000003013837e30 bsp=e000003013830e60
 [<a000000000010640>] 0xa000000000010640 sp=e000003013838000 bsp=e000003013830e60

[root@altix1 ~]# uname -a
Linux altix1.lab.boston.redhat.com 2.6.8-1.549 #1 SMP Mon Sep 6 16:10:54 EDT 2004 ia64 ia64 ia64 GNU/Linux
Created attachment 103602 [details] More tracebacks (it panicked shortly after). ext3 shows up in these too. I'll see if I can borrow a tiger box and check whether it has the same problems.
As you may have guessed from the tracebacks, this is not an Altix-specific problem. I installed a tiger SDV in the lab with RHEL4 0907, then put the 549 kernel on the system. Once I did that, I just mounted and unmounted a cdrom a few times and got a similar traceback. This system didn't have a serial console, but it looks very similar. I'm going to unassign this - not because I'm not interested or don't want to help, but because I'm probably not the best person to look at this. If no one more experienced in this area wants to look, I can take it back and try to fumble my way through it. I wanted to at least test to be sure it affected more than just Altix.
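For reference, the trigger amounts to cycling mount/umount on the drive until the kernel log shows trouble. A minimal sketch of the loop (the device and mountpoint paths, the cycle count, and the grep patterns are assumptions; adjust for the box under test):

```shell
#!/bin/sh
# Hypothetical repro loop: repeatedly mount and unmount the CD, watching
# dmesg for a traceback. DEV and MNT defaults are assumptions.
DEV=${DEV:-/dev/cdrom}
MNT=${MNT:-/mnt/cdrom}
i=0
while [ "$i" -lt 50 ]; do
    mount -t iso9660 "$DEV" "$MNT" || break   # stop if the mount itself fails
    umount "$MNT" || break
    i=$((i + 1))
    if dmesg | grep -q -e 'Oops' -e 'Badness'; then
        echo "hit a traceback after $i cycles"
        break
    fi
done
echo "finished after $i mount/umount cycles"
```

On the boxes above the traceback showed up within a handful of cycles, so 50 iterations should be more than enough.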
At first glance: in a couple of cases we've got a buffer_head data struct going haywire and oopsing processes which access it. In one other, it's an skbuf. That could be _anything_. Have you been trying other kernels on the same box with the same tests? Which was the last one to run correctly?
Using a McKinley SDV/Tiger, I can't seem to induce the failure using the fedora kernel 2.6.8-1.540.
Jesse asked me about this, so I guess it's worth mentioning. On an Altix at SGI in Eagan, I got a traceback when I unmounted/remounted the CD many, many times. It didn't panic. This was a linux-2.5 bk pull from earlier this morning, so I'm not yet sure whether it's truly related. I noticed that one of the differences between the 540 and 549 spec files is that one has patch-2.6.9-rc1-bk7.bz2 and the other bk13.bz2. Here is a diff between the spec files (I haven't carefully checked the actual %patch lines yet):

 Patch1: patch-2.6.9-rc1.bz2
-Patch2: patch-2.6.9-rc1-bk7.bz2
+Patch2: patch-2.6.9-rc1-bk13.bz2
 #
 # Patches 10 to 100 are upstream patches we want to back out
@@ -238,7 +238,6 @@
 #
 Patch1000: linux-2.4.0-test11-vidfail.patch
-Patch1010: linux-2.6.9-barrier.patch
 Patch1020: linux-2.6.4-stackusage.patch
 Patch1030: linux-2.6.5-ext3-reservations.patch
 Patch1031: linux-2.6.8-ext3-reservations-update.patch
@@ -251,9 +250,8 @@
 Patch1081: linux-2.6.7-early-schedule.patch
 Patch1090: linux-2.6.7-netdump.patch
 Patch1100: linux-2.6.7-i8042.patch
-Patch1110: linux-2.6.7-symlink.patch
+Patch1110: linux-2.6.9-irqfixup.patch
 Patch1120: linux-2.6.7-scsi-whitelist.patch
-Patch1130: linux-2.6.9-xattr.patch
 Patch1140: linux-2.6.9-blockfixes.patch
 Patch2000: linux-2.6.3-printopen.patch

So, if this possibility holds, I might be able to induce crashes with 540 by upgrading to the 2.6.9-rc1-bk13 patch? I'll give it a shot.
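If anyone wants to run the same experiment, the Patch2 swap would look roughly like the sketch below. To keep it safe to run anywhere, this operates on a scratch copy; the spec path, the spec name kernel-2.6.spec, and the rpmbuild invocation in the trailing comment are assumptions based on a stock RPM build tree, not the exact commands used here:

```shell
#!/bin/sh
# Sketch: point Patch2 at the newer bk13 snapshot in the kernel spec file.
# SPEC defaults to a throwaway demo copy; set SPEC to the real spec to do
# it for real (and copy patch-2.6.9-rc1-bk13.bz2 into SOURCES first).
SPEC=${SPEC:-/tmp/kernel-2.6.spec.demo}
cat > "$SPEC" <<'EOF'
Patch1: patch-2.6.9-rc1.bz2
Patch2: patch-2.6.9-rc1-bk7.bz2
EOF
# swap bk7 for bk13, exactly the one-line change seen in the spec diff
sed -i 's/patch-2\.6\.9-rc1-bk7\.bz2/patch-2.6.9-rc1-bk13.bz2/' "$SPEC"
grep '^Patch2:' "$SPEC"
# then rebuild, e.g.:  rpmbuild -ba --target ia64 kernel-2.6.spec
```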
There were some patch dependencies I didn't want to try to resolve, so instead I put on the kernel I had used in the SGI Eagan office (a bk pull from linux-2.5 this morning) and got an unable-to-handle-kernel-paging-request on my third mount attempt:

Unable to handle kernel paging request at virtual address 0100040600080118
kjournald[898]: Oops 8813272891392 [1]
Modules linked in:
Pid: 898, CPU 0, comm: kjournald
psr : 0000101008126030 ifs : 8000000000001025 ip : [<a000000100293250>] Not tainted
ip is at journal_commit_transaction+0x2b0/0x2ee0
unat: 0000000000000000 pfs : 0000000000001025 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000005541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001002931f0 b6  : a0000001000155e0 b7  : a000000100011b70
f6  : 1003e8080808080808081 f7  : 1003e0000000000001400
f8  : 1003e0000000000001400 f9  : 1003e00000000000027d8
f10 : 1003e000000000ff00000 f11 : 1003e000000003b5f2d38
r1  : a000000100c2c7b0 r2  : 0000000000000000 r3  : e00001b07b020e40
r8  : e00001b07b020e50 r9  : 0000000000000001 r10 : 0000000000000002
r11 : 0000000000000000 r12 : e00001b07b027b10 r13 : e00001b07b020000
r14 : 0000000000000000 r15 : a000000100a8a420 r16 : 0000000000000000
r17 : e00001b004a45c00 r18 : 0000000000000000 r19 : a000000100a67500
r20 : 0000000000000008 r21 : 0000000000000007 r22 : 0000000000014d9d
r23 : 0000000000000000 r24 : 0000000000000001 r25 : a000000100a2d9f0
r26 : e00001300518cc88 r27 : 0000001008126030 r28 : a0000001006bc150
r29 : 0000000000014d9e r30 : 0000000000000800 r31 : 0000000000000022

Call Trace:
 [<a000000100019fe0>] show_stack+0x80/0xa0 sp=e00001b07b0276a0 bsp=e00001b07b0210d0
 [<a00000010003f850>] die+0x170/0x200 sp=e00001b07b027870 bsp=e00001b07b021098
 [<a00000010005e140>] ia64_do_page_fault+0x200/0xa00 sp=e00001b07b027870 bsp=e00001b07b021038
 [<a000000100012320>] ia64_leave_kernel+0x0/0x270 sp=e00001b07b027940 bsp=e00001b07b021038
 [<a000000100293250>] journal_commit_transaction+0x2b0/0x2ee0 sp=e00001b07b027b10 bsp=e00001b07b020f08
 [<a00000010029b570>] kjournald+0x150/0x460 sp=e00001b07b027d80 bsp=e00001b07b020ea8
 [<a00000010001c100>] kernel_thread_helper+0xe0/0x100 sp=e00001b07b027e30 bsp=e00001b07b020e80
 [<a000000100009060>] start_kernel_thread+0x20/0x40 sp=e00001b07b027e30 bsp=e00001b07b020e80
note: kjournald[898] exited with preempt_count 1
So I tried this test on ia32. I installed an ia32 box in Boston (tiamat) with the RHEL4 re0909 nightly, then put the 549 arjan kernel on it. I mounted the cdrom; when I unmounted, the system hung. The graphics console said (no serial console, so this is copied by hand as best I could):

spin_is_locked on uninitialized spinlock 11fd9818

A bunch of these spew. Line numbers 165 and 167 of transaction.c are referenced, and the system is generally hung at this point. Switching to another vc and typing something just results in more spinlock messages. I'm guessing this is the same issue, but it sure has a different failure mode(?). Thoughts? I'll do a new bug search shortly, I guess.
So is this related to bug 132152 ?
In my comment above, I mentioned that transaction.c line numbers 165 and 167 were called out. I re-checked my hand-written notes: it's actually 165 and 177. This makes more sense - those two lines are the spin_lock and spin_unlock calls.
Since I see a failure in the same situation on x86, I've decided to adjust the summary slightly and mark it for all platforms. The way to induce the problem is the same but what happens to the system is different. I think it's likely related.
Umm, now I'm confused. The first report here was in the middle of a wget(1), with no mention of CD at all. Now we've got a set of CD reports, involving ext3, with no backtraces. Are you telling us that the initial one was involving CD too? And is it really an ext3 CD (!), or what? #132152 looks completely unrelated at first glance. Given that this is reproducible, please try to hook up a serial console and capture a trace.
Good point - sorry about that. Mounting/unmounting the CD is the only way I can make the problem happen easily. You're right that the wget was the first way I hit the problem; maybe I shouldn't have changed the subject like that. For the ia32 box, there isn't a backtrace - it hangs forever with spinlock messages. If you're sure that's a separate issue, we can file a different bug on it, but since the trigger is the same, I was thinking they could be related. Do you want me to force a backtrace using magic SysRq? For ia64, I've included several backtrace attachments.
I guess I'm too used to kdb. The closest we could do is dump the registers from SysRq. I could patch in KDB if that would help.
Again, what sort of fs is on the CD? And yes, for hangs, the more information the better: at a very minimum, alt-sysrq-t and -p will help. If you have time to get at it with kdb, then sure, the more information you can capture the better.
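For the record, the alt-sysrq-t / alt-sysrq-p requests above can also be issued through procfs if a shell is still alive on the hung box. A sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ and root access (the paths are the standard procfs ones, nothing specific to this bug):

```shell
#!/bin/sh
# Ask the kernel for task and register dumps via the magic SysRq interface.
# The output lands in the kernel ring buffer / console.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
    echo t > /proc/sysrq-trigger      # dump state of all tasks (alt-sysrq-t)
    echo p > /proc/sysrq-trigger      # dump registers/CPU state (alt-sysrq-p)
    dmesg | tail -n 100               # read back the traces
else
    echo "sysrq-trigger not writable (need root?)"
fi
```

On the keyboard of a hung box, alt-sysrq-t and alt-sysrq-p do the same thing without needing a shell, which is why they're the minimum ask here.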
For tiamat, the ia32 box, it's the rhel4 boot.iso and it mounts as iso9660. For ia64, it's the boot.iso for rhel4 ia64.
This problem seems to be gone. I tested with fedora core nightly Oct 12. Closing.