Description of problem:
We have a cluster of 250 Dual Core Dual Opterons running jobs in a grid. One crashed with an interesting kernel message that points to a software bug. All machines are installed via kickstart and are identical in both software/hardware. This problem has not been reproduced.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.15-1.1831_FC4

How reproducible:
not reproducible as of yet.

Steps to Reproduce:
None Yet

Actual results:
Kernel BUG at mm/rmap.c:493
invalid operand: 0000 [1] SMP
last sysfs file: /block/loop7/dev
CPU 0
Modules linked in: loop nfsd exportfs autofs4 nfs lockd nfs_acl sunrpc dm_mod video button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3 jbd sata_mv libata 3w_9xxx sd_mod scsi_mod
Pid: 2024, comm: condor_exec.287 Tainted: G M 2.6.15-1.1831_FC4smp #1
RIP: 0010:[<ffffffff8017dd87>] <ffffffff8017dd87>{page_remove_rmap+129}
RSP: 0000:ffff81006456dc38 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810001000000 RCX: ffffffff8044cf58
RDX: 0000000000000000 RSI: 0000000000000292 RDI: ffffffff8044cf40
RBP: 000000000d472000 R08: 0000000000000004 R09: 0000000000000004
R10: ffff81006456d908 R11: 0000000000000000 R12: ffff8100005fa390
R13: 000000000d600000 R14: ffff810001000000 R15: 0000000029fa1000
FS: 00002aaaab4cc1a0(0000) GS:ffffffff8059b000(0000) knlGS:00000000f7f259e0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000d472000 CR3: 0000000000101000 CR4: 00000000000006e0
Process condor_exec.287 (pid: 2024, threadinfo ffff81006456c000, task ffff81004f9e8820)
Stack: 0000000000000000 ffffffff801767c0 0000000000000000 ffff81006456dd38
       ffffffffffffffff 0000000000000000 ffff81013f465d18 ffff81006456dd40
       000000000036d000 0000000100000001
Call Trace:<ffffffff801767c0>{unmap_vmas+1071} <ffffffff80179938>{exit_mmap+124}
       <ffffffff80139f66>{mmput+37} <ffffffff8013f1c4>{do_exit+584}
       <ffffffff8013fcc1>{sys_exit_group+0} <ffffffff80149eb9>{get_signal_to_deliver+1594}
       <ffffffff8010f23a>{do_signal+116} <ffffffff80356a8b>{thread_return+158}
       <ffffffff8017bf34>{do_brk+474} <ffffffff8011017e>{retint_signal+61}

Code: 0f 0b 68 6e ff 37 80 c2 ed 01 48 c7 c6 ff ff ff ff bf 20 00
RIP <ffffffff8017dd87>{page_remove_rmap+129} RSP <ffff81006456dc38>
<1>Fixing recursive fault but reboot is needed!

Expected results:
Machine keeps running like the 250 neighbors in our grid

Additional info:
[This comment added as part of a mass-update to all open FC4 kernel bugs] FC4 has now transitioned to the Fedora legacy project, which will continue to release security related updates for the kernel. As this bug is not security related, it is unlikely to be fixed in an update for FC4, and has been migrated to FC5. Please retest with Fedora Core 5. Thank you.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.

Thank you.
(this is a mass-close to kernel bugs in NEEDINFO state) As indicated previously, there has been no update on the progress of this bug, so I am closing it as INSUFFICIENT_DATA. Please re-open if the issue still occurs for you and I will try to assist in its resolution. Thank you for taking the time to report the initial bug. If you believe that this bug was closed in error, please feel free to reopen this bug.