Bug 185655 - kernel crash with badness in panic
Summary: kernel crash with badness in panic
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-03-16 18:24 UTC by Erik A. Espinoza
Modified: 2015-01-04 22:25 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-24 21:58:16 UTC
Type: ---
Embargoed:



Description Erik A. Espinoza 2006-03-16 18:24:42 UTC
Description of problem:
We have 250 identical machines, all configured via kickstart. While running
jobs, five of them have crashed with the error below on the console.
We have not been able to reproduce the crash.

Version-Release number of selected component (if applicable):
2.6.15-1.1831_FC4smp

How reproducible:
Not reproducible

Steps to Reproduce:
1.
2.
3.
  
Actual results:
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: loop nfsd exportfs autofs4 nfs lockd nfs_acl sunrpc dm_mod
video button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3
jbd sata_mv libata 3w_9xxx sd_mod scsi_mod
Pid: 366, comm: kswapd0 Tainted: G   M  2.6.15-1.1831_FC4smp #1
RIP: 0010:[<ffffffff80219710>] <ffffffff80219710>{_raw_write_lock+185}
RSP: 0000:ffff81013f569c08  EFLAGS: 00000006
RAX: 0000000083702500 RBX: ffff8100005fa390 RCX: 00000000151acdaf
RDX: 00000000008697b4 RSI: 0000000000000001 RDI: ffff8100005fa390
RBP: ffff8100032db650 R08: ffff81013f569af8 R09: 00000000fffffffa
R10: ffff81013f569af8 R11: 0000000000000002 R12: ffff81000000ea80
R13: ffff81013f569e28 R14: ffff8100005fa390 R15: ffff81000000f400
FS:  00002aaaaaf11d00(0000) GS:ffffffff8059b080(0000) knlGS:00000000f7f1b9e0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000c19a000 CR3: 0000000101599000 CR4: 00000000000006e0
Process kswapd0 (pid: 366, threadinfo ffff81013f568000, task ffff8100082c60d0)
Stack: ffff8100005fa378 ffffffff80172d06 ffff81000000f418 ffff81000000f418
       ffff81000000f428 0000000000000000 0000000000000040 0000001c8016e030
       00000011ffffffff 0000000100000000
Call Trace:<ffffffff80172d06>{shrink_zone+2954}
<ffffffff801ba471>{mb_cache_shrink_fn+123}
       <ffffffff80357a88>{_spin_lock_irqsave+9}
<ffffffff801732aa>{balance_pgdat+598}
       <ffffffff801735ba>{kswapd+402} <ffffffff801527f0>{autoremove_wake_function+0}
       <ffffffff801387c8>{schedule_tail+70} <ffffffff80110c22>{child_rip+8}
       <ffffffff80173428>{kswapd+0} <ffffffff80110c1a>{child_rip+0}



Code: 48 8d 04 92 48 8d 04 80 48 8d 04 80 48 01 c0 48 39 c8 77 ca
console shuts up ...
<3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <NMI> <ffffffff8013d9e5>{profile_task_exit+21}
<ffffffff8013ef9e>{do_exit+34}
       <ffffffff80357a88>{_spin_lock_irqsave+9}
<ffffffff80228bf4>{vgacon_set_cursor_size+54}
       <ffffffff801117a3>{bad_intr+0} <ffffffff8011f4de>{nmi_watchdog_tick+242}
       <ffffffff80111a5b>{default_do_nmi+137} <ffffffff8011f607>{do_nmi+69}
       <ffffffff80110da3>{nmi+127} <ffffffff80219710>{_raw_write_lock+185}
       <EOE> <ffffffff80172d06>{shrink_zone+2954}
<ffffffff801ba471>{mb_cache_shrink_fn+123}
       <ffffffff80357a88>{_spin_lock_irqsave+9}
<ffffffff801732aa>{balance_pgdat+598}
       <ffffffff801735ba>{kswapd+402} <ffffffff801527f0>{autoremove_wake_function+0}
       <ffffffff801387c8>{schedule_tail+70} <ffffffff80110c22>{child_rip+8}
       <ffffffff80173428>{kswapd+0} <ffffffff80110c1a>{child_rip+0}

Kernel panic - not syncing: Aiee, killing interrupt handler!

Call Trace: <NMI> <ffffffff8013c1d1>{panic+133}
<ffffffff80357aea>{_spin_unlock_irq+14}
       <ffffffff8035746c>{__down_read+50} <ffffffff80357a88>{_spin_lock_irqsave+9}

Badness in panic at kernel/panic.c:139          (Tainted:  G   M )

Expected results:
I expected the jobs to continue running, as they do on the other 245 machines.

Additional info:

Comment 1 Dave Jones 2006-03-17 22:26:34 UTC
(Tainted:  G   M )

This bit is the key to this problem.  The M signifies that the machine has
taken a machine check exception.  These are typically hardware problems caused
by overheating or bad memory.

I suggest giving the affected machines a run with memtest86.
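
To check which of the other machines have logged the same flag, the taint
letters can be pulled out of saved console output. The following is a minimal
sketch, not an official tool: the function names and the regular expression
are assumptions made for illustration, and the flag table lists only the
taint letters known for 2.6-era kernels.

```python
import re

# Taint letters for 2.6-era kernels (illustrative subset, not exhaustive).
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "G": "only GPL-compatible modules loaded",
    "F": "module force-loaded",
    "S": "SMP-unsafe hardware configuration",
    "R": "module force-unloaded",
    "M": "machine check exception occurred",
    "B": "bad page state encountered",
}

# Matches the space-separated capital letters that follow "Tainted:".
# A line reading "Not tainted" does not match (lowercase "t", no colon).
TAINT_RE = re.compile(r"Tainted:\s*((?:[A-Z]\s+)*[A-Z]?)")


def taint_flags(line):
    """Return the set of taint letters found on one oops/console line."""
    match = TAINT_RE.search(line)
    if not match:
        return set()
    return {ch for ch in match.group(1) if ch.isalpha()}


def had_machine_check(line):
    """True if the line reports the M (machine check exception) flag."""
    return "M" in taint_flags(line)


if __name__ == "__main__":
    sample = "Pid: 366, comm: kswapd0 Tainted: G   M  2.6.15-1.1831_FC4smp #1"
    for flag in sorted(taint_flags(sample)):
        print(flag, "-", TAINT_FLAGS.get(flag, "unknown flag"))
```

Running had_machine_check() over the captured console logs of all 250 machines
would show whether the M flag appears only on the five that crashed.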

Comment 2 Erik A. Espinoza 2006-03-17 23:45:32 UTC
Any recommendations on how long memtest should be run on one of these systems?
They were all pre-tested at the factory for 48 hours and have been running jobs
reliably for 2 months here.

Comment 3 Dave Jones 2006-09-17 02:29:20 UTC
[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 4 Dave Jones 2006-10-16 21:37:10 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 5 Dave Jones 2006-11-24 21:58:16 UTC
This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.
