From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050405 Firefox/1.0 (Ubuntu package 1.0.2)

Description of problem:
Kernel crashes under heavy CPU load. Stack trace attached.

Version-Release number of selected component (if applicable):
kernel-2.6.12-1.1398_FC4

How reproducible:
Didn't try

Steps to Reproduce:
1. heavy cpu load
2. wait 1-2 days
3. wait some more

Actual Results: crash

Expected Results: continued service

Additional info:
Jul 30 10:50:05 map2 kernel: consoletype[21793]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffc002e0 error 14
Jul 30 10:50:05 map2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jul 30 10:50:05 map2 kernel: Kernel BUG at "mm/mmap.c":2026
Jul 30 10:50:05 map2 kernel: invalid operand: 0000 [1] SMP
Created attachment 117313 [details] full stack trace from logfile
Created attachment 117379 [details] Another trace (Dual Opteron)
Hi, we experienced the same problem today. The crash came out of nowhere; before that, the system had been running for 34 days without a problem. The system had basically no load when the crash occurred. We run a dual Opteron with 2GB RAM.
Looks like another variant of bug 155857. Is this on a Tyan board by any chance? Are you running the latest BIOS update?
In my case it's a Tyan board. It doesn't have the newest BIOS. It's a version from March. Does a BIOS update help in that case?
Quite possibly. There is a known CPU erratum which some vendors fixed in a BIOS update. It's feasible that this problem could affect kernels which use 4-level page tables (2.6.11 and higher) in certain situations. The bad pmd messages first started appearing just after the 4-level page table support got merged upstream.
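The version threshold mentioned above can be checked mechanically. The following is a minimal sketch, assuming the release string has the `major.minor.patch-...` shape seen in this report (e.g. `kernel-2.6.12-1.1398_FC4`). The function names are hypothetical, and the 2.6.11 cutoff is taken from the comment above rather than verified independently.

```python
import re

# Cutoff taken from the comment above (4-level page tables from 2.6.11 on);
# treat it as an assumption, not a verified fact.
FOUR_LEVEL_SINCE = (2, 6, 11)


def parse_kernel_version(release):
    """Return (major, minor, patch) from a release string such as
    'kernel-2.6.12-1.1398_FC4' or '2.6.12-1.1398_FC4smp'."""
    match = re.match(r"(?:kernel-)?(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def uses_four_level_page_tables(release):
    """True if the release is at or above the assumed 2.6.11 cutoff."""
    return parse_kernel_version(release) >= FOUR_LEVEL_SINCE
```

On a running system the release string could be taken from `platform.release()` in the standard library; this only says whether the kernel falls in the affected version range, not whether the machine actually hits the erratum.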
I just had another crash, using a Tyan 2882 dual-CPU Opteron motherboard. They just released a new BIOS update; I am going to try it out today. According to their site, the following issues are fixed in the latest BIOS revision (2882_303e):

* Fixed an IOMMU issue
* Fixed an issue where fan's lower than 1500RPM are not displayed correctly
* Fixed an issue where some CPUs report too high temperature (~70°C)

Could the IOMMU issue be related?
Actually, 2882_303e is the beta BIOS; the production version is 2882_303 (I have 2882_302 loaded right now). Here is the changelog for 2882_303. Is the "AMD erratum 123" possibly related?

* Fixed an issue where the Pepercon's USB KVM would hang in use
* Implemented AMD's recommendations for DDR400 speed settings when large loads (more than 4 dimms) were used at once
* Added a Auto detect feature and addded support for the M3289 & M3290 SMDC cards
* Implemented AMD erratum 123
* Added a IPMI Over Lan selectable option in the BIOS [82551]/[BCM5704]
* Fixed an issue where Bank Interleaving was not functioning properly
* Fixed an issue where AMD PowerNow! was not working correctly
* Fixed an issue where the reported values of CPU1 Vcore & CPU2 Vcore were not correct
No, erratum 122 is "TLB Flush Filter may cause coherency problem in multiprocessor systems", though 123 sounds quite nasty too. (Potential effect: Data corruption or system hang).
Can you try the latest errata kernel in updates-testing? It has a possible workaround for the erratum I referred to.
I would like to add that I have experienced the same problem under very heavy system load (motherboard Tyan 2882, dual AMD Opteron 1.2GHz, 2GB RAM). BIOS is 3.02. Linux FC4 2.6.12-1.1398_FC4smp. This happened at 4 in the morning, when all the cron jobs run their maintenance; I am also running 6 VMs (VMware Workstation) that contribute to the load. I am planning to upgrade the BIOS to 3.03 tonight and will report back the results. The crash dump appears as follows:

Aug 28 04:24:37 vmhost1 kernel: zcat[26892]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffa00250 error 14
Aug 28 04:24:37 vmhost1 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Aug 28 04:24:37 vmhost1 kernel: Kernel BUG at "mm/mmap.c":2026
Aug 28 04:24:37 vmhost1 kernel: invalid operand: 0000 [1] SMP
Aug 28 04:24:37 vmhost1 kernel: CPU 1
Aug 28 04:24:37 vmhost1 kernel: Modules linked in: vmnet(U) parport_pc parport vmmon(U) iscsi_trgt(U) crc32c libcrc32c autofs4 w83627hf eeprom lm85 i2c_sensor i2c_isa i2c_amd756 sunrpc md5 ipv6 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables video button battery ac ohci_hcd i2c_amd8111 i2c_core hw_random shpchp e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod 3w_9xxx sata_sil libata sd_mod scsi_mod
Aug 28 04:24:37 vmhost1 kernel: Pid: 26892, comm: zcat Tainted: P M 2.6.12-1.1398_FC4smp
Aug 28 04:24:37 vmhost1 kernel: RIP: 0010:[<ffffffff8017835f>] <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:37 vmhost1 kernel: RSP: 0018:ffff81002dbcfd28 EFLAGS: 00010202
Aug 28 04:24:37 vmhost1 kernel: RAX: 0000000000000037 RBX: 0000000000000000 RCX: ffff8100581afb90
Aug 28 04:24:37 vmhost1 kernel: RDX: 0000000000000036 RSI: ffff8100581afb20 RDI: ffff81003ffe6200
Aug 28 04:24:37 vmhost1 kernel: RBP: 0000000000000000 R08: ffff81007f767bc8 R09: ffff81002dbcfd30
Aug 28 04:24:37 vmhost1 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff810037c69b40
Aug 28 04:24:39 vmhost1 kernel: R13: ffff810037c69bb8 R14: 000000000000000b R15: ffff81006cd70dd8
Aug 28 04:24:42 vmhost1 kernel: FS: 00002aaaaaaba3e0(0000) GS:ffffffff8050d800(0000) knlGS:00000000ef5ecbb0
Aug 28 04:24:42 vmhost1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 28 04:24:42 vmhost1 kernel: CR2: 00000000006c4580 CR3: 0000000000101000 CR4: 00000000000006e0
Aug 28 04:24:42 vmhost1 kernel: Process zcat (pid: 26892, threadinfo ffff81002dbce000, task ffff81006cd70800)
Aug 28 04:24:42 vmhost1 kernel: Stack: 0000000000000000 0000000000000077 ffff810040e184a0 ffff810037c69b40
Aug 28 04:24:42 vmhost1 kernel: ffff810037c69bc0 ffff81006cd70800 0000000000000001 ffffffff80137134
Aug 28 04:24:42 vmhost1 kernel: ffff81006cd70e54 000000000000000b
Aug 28 04:24:42 vmhost1 kernel: Call Trace:<ffffffff80137134>{mmput+52} <ffffffff8013c15d>{do_exit+397}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8013ccbc>{do_group_exit+252} <ffffffff80147e4d>{get_signal_to_deliver+1565}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8010e1cd>{do_signal+157} <ffffffff80210ecc>{_atomic_dec_and_lock+44}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff80189d8e>{__fput+270} <ffffffff8035ac32>{thread_return+0}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8035ac84>{thread_return+82} <ffffffff8010f125>{retint_signal+62}
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel: Code: 0f 0b e6 b7 37 80 ff ff ff ff ea 07 66 66 90 66 90 48 83 c4
Aug 28 04:24:42 vmhost1 kernel: RIP <ffffffff8017835f>{exit_mmap+383} RSP <ffff81002dbcfd28>
Aug 28 04:24:42 vmhost1 kernel: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
Aug 28 04:24:42 vmhost1 kernel: in_atomic():0, irqs_disabled():1
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel: Call Trace:<ffffffff8013abd5>{profile_task_exit+21} <ffffffff8013bff2>{do_exit+34}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8022178d>{vgacon_cursor+221} <ffffffff8011066d>{die+77}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff80111203>{do_invalid_op+163} <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff80168f8a>{__pagevec_free+42} <ffffffff8016e990>{release_pages+368}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8010f5b5>{error_exit+0} <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8017834c>{exit_mmap+364} <ffffffff80137134>{mmput+52}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff8013c15d>{do_exit+397} <ffffffff8013ccbc>{do_group_exit+252}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff80147e4d>{get_signal_to_deliver+1565} <ffffffff8010e1cd>{do_signal+157}
Aug 28 04:24:42 vmhost1 kernel: <ffffffff80210ecc>{_atomic_dec_and_lock+44} <ffffffff80189d8e>{__fput+270}
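For comparing traces like the ones in this report, it can help to reduce each dump to its BUG location and the sequence of call-trace symbols. The following is a minimal sketch, assuming the `<address>{symbol+offset}` frame format and the `Kernel BUG at "file":line` message shown in these logs; the name `summarize_oops` is hypothetical, and other kernel versions format their traces differently.

```python
import re

# Matches: Kernel BUG at "mm/mmap.c":2026
BUG_RE = re.compile(r'Kernel BUG at "([^"]+)":(\d+)')

# Matches frames such as <ffffffff80137134>{mmput+52}; optional whitespace
# between the address and the brace tolerates wrapped log lines.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\s*\{([A-Za-z_][\w.]*)\+(\d+)\}")


def summarize_oops(log_text):
    """Return ((file, line) or None, [(symbol, offset), ...]) for a log excerpt."""
    bug = BUG_RE.search(log_text)
    location = (bug.group(1), int(bug.group(2))) if bug else None
    frames = [(sym, int(off)) for _addr, sym, off in FRAME_RE.findall(log_text)]
    return location, frames


if __name__ == "__main__":
    sample = (
        'Aug 28 04:24:37 vmhost1 kernel: Kernel BUG at "mm/mmap.c":2026\n'
        "Aug 28 04:24:42 vmhost1 kernel: Call Trace:"
        "<ffffffff80137134>{mmput+52} <ffffffff8013c15d>{do_exit+397}\n"
    )
    print(summarize_oops(sample))
```

Two reports that yield the same location and a matching leading run of symbols (here `mm/mmap.c`:2026 via `mmput` and `do_exit`) are likely hitting the same assertion, though that alone does not establish a common root cause.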
Mass update to all FC4 bugs: An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream kernel (2.6.13.2). As there were ~3500 changes upstream between this and the previous kernel, it's possible your bug has been fixed already. Please retest with this update, and update this bug if necessary. Thanks.
2.6.14-1.1637_FC4 has been released as an update for FC4. Please retest with this update, as a large amount of code has been changed in this release, which may have fixed your problem. Thank you.
I can confirm that the problem was resolved with kernel 2.6.13-1.1526_FC4.