Description of problem:
We have 250 machines configured identically via kickstart. During the course of running jobs, five have crashed with a weird error on the console. We have not been able to replicate this.

Version-Release number of selected component (if applicable):
2.6.15-1.1831_FC4smp

How reproducible:
Not

Steps to Reproduce:
1.
2.
3.

Actual results:
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: loop nfsd exportfs autofs4 nfs lockd nfs_acl sunrpc dm_mod video button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3 jbd sata_mv libata 3w_9xxx sd_mod scsi_mod
Pid: 366, comm: kswapd0 Tainted: G M 2.6.15-1.1831_FC4smp #1
RIP: 0010:[<ffffffff80219710>] <ffffffff80219710>{_raw_write_lock+185}
RSP: 0000:ffff81013f569c08 EFLAGS: 00000006
RAX: 0000000083702500 RBX: ffff8100005fa390 RCX: 00000000151acdaf
RDX: 00000000008697b4 RSI: 0000000000000001 RDI: ffff8100005fa390
RBP: ffff8100032db650 R08: ffff81013f569af8 R09: 00000000fffffffa
R10: ffff81013f569af8 R11: 0000000000000002 R12: ffff81000000ea80
R13: ffff81013f569e28 R14: ffff8100005fa390 R15: ffff81000000f400
FS: 00002aaaaaf11d00(0000) GS:ffffffff8059b080(0000) knlGS:00000000f7f1b9e0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000c19a000 CR3: 0000000101599000 CR4: 00000000000006e0
Process kswapd0 (pid: 366, threadinfo ffff81013f568000, task ffff8100082c60d0)
Stack: ffff8100005fa378 ffffffff80172d06 ffff81000000f418 ffff81000000f418
       ffff81000000f428 0000000000000000 0000000000000040 0000001c8016e030
       00000011ffffffff 0000000100000000
Call Trace:<ffffffff80172d06>{shrink_zone+2954} <ffffffff801ba471>{mb_cache_shrink_fn+123}
       <ffffffff80357a88>{_spin_lock_irqsave+9} <ffffffff801732aa>{balance_pgdat+598}
       <ffffffff801735ba>{kswapd+402} <ffffffff801527f0>{autoremove_wake_function+0}
       <ffffffff801387c8>{schedule_tail+70} <ffffffff80110c22>{child_rip+8}
       <ffffffff80173428>{kswapd+0} <ffffffff80110c1a>{child_rip+0}
Code: 48 8d 04 92 48 8d 04 80 48 8d 04 80 48 01 c0 48 39 c8 77 ca
console shuts up ...
<3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1
Call Trace: <NMI> <ffffffff8013d9e5>{profile_task_exit+21} <ffffffff8013ef9e>{do_exit+34}
       <ffffffff80357a88>{_spin_lock_irqsave+9} <ffffffff80228bf4>{vgacon_set_cursor_size+54}
       <ffffffff801117a3>{bad_intr+0} <ffffffff8011f4de>{nmi_watchdog_tick+242}
       <ffffffff80111a5b>{default_do_nmi+137} <ffffffff8011f607>{do_nmi+69}
       <ffffffff80110da3>{nmi+127} <ffffffff80219710>{_raw_write_lock+185}
       <EOE> <ffffffff80172d06>{shrink_zone+2954} <ffffffff801ba471>{mb_cache_shrink_fn+123}
       <ffffffff80357a88>{_spin_lock_irqsave+9} <ffffffff801732aa>{balance_pgdat+598}
       <ffffffff801735ba>{kswapd+402} <ffffffff801527f0>{autoremove_wake_function+0}
       <ffffffff801387c8>{schedule_tail+70} <ffffffff80110c22>{child_rip+8}
       <ffffffff80173428>{kswapd+0} <ffffffff80110c1a>{child_rip+0}
Kernel panic - not syncing: Aiee, killing interrupt handler!
Call Trace: <NMI> <ffffffff8013c1d1>{panic+133} <ffffffff80357aea>{_spin_unlock_irq+14}
       <ffffffff8035746c>{__down_read+50} <ffffffff80357a88>{_spin_lock_irqsave+9}
Badness in panic at kernel/panic.c:139 (Tainted: G M )

Expected results:
Much like the 245 other machines, I expected the jobs to continue running.

Additional info:
(Tainted: G M ) is the key to this problem. The M flag means the machine has taken a machine check exception. These are typically hardware problems, caused by overheating or bad memory. I suggest running memtest86 on the affected machines.
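With 250 machines to watch, it may help to pull the taint letters out of saved console logs automatically. The following is a minimal sketch in Python, not an official tool: the function name `decode_taint` and the flag table are illustrative assumptions, and the table only covers letters commonly documented for 2.6-era kernels, so treat it as incomplete.

```python
import re

# Partial table of taint letters for 2.6-era kernels (assumed, not exhaustive).
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "G": "all loaded modules are GPL-compatible",
    "F": "module was force-loaded",
    "S": "SMP kernel running on hardware not designed for SMP",
    "R": "module was force-unloaded",
    "M": "machine check exception has occurred",
    "B": "bad page state was detected",
}

def decode_taint(line):
    """Return (letter, meaning) pairs for the taint flags found in an oops line.

    Returns an empty list when the line carries no "Tainted:" field,
    for example when the kernel prints "Not tainted".
    """
    # Match single upper-case letters separated by whitespace after "Tainted:".
    match = re.search(r"Tainted:((?:\s+[A-Z]\b)+)", line)
    if match is None:
        return []
    letters = match.group(1).split()
    return [(c, TAINT_FLAGS.get(c, "unknown flag")) for c in letters]

if __name__ == "__main__":
    sample = "Pid: 366, comm: kswapd0 Tainted: G M 2.6.15-1.1831_FC4smp #1"
    for letter, meaning in decode_taint(sample):
        print(letter, "-", meaning)
```

Run against the Pid line from this report, the sketch lists G and M, so a log scan could flag any machine whose oops carries the M letter as a candidate for hardware testing.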
Any recommendations on how long memtest should be run on one of these systems? They were all pre-tested at the factory for 48 hours and have been running jobs reliably for 2 months here.
[This comment added as part of a mass-update to all open FC4 kernel bugs] FC4 has now transitioned to the Fedora legacy project, which will continue to release security related updates for the kernel. As this bug is not security related, it is unlikely to be fixed in an update for FC4, and has been migrated to FC5. Please retest with Fedora Core 5. Thank you.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem, please check that you have only one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
This bug has been mass-closed along with all other bugs that have been in NEEDINFO state for several months. Due to the large volume of inactive bugs in bugzilla, this is the only method we have of cleaning out stale bug reports where the reporter has disappeared. If you can reproduce this bug after installing all the current updates, please reopen this bug. If you are not the reporter, you can add a comment requesting it be reopened, and someone will get to it asap. Thank you.