164712 – Kernel BUG at "mm/mmap.c":2026

Summary: Kernel BUG at "mm/mmap.c":2026

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-07-30 20:10 UTC by Rob Helmer
Modified:	2015-01-04 22:21 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-11-10 21:49:59 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
full stack trace from logfile (173.73 KB, text/plain) 2005-07-30 20:11 UTC, Rob Helmer	no flags
Another trace (Dual Opteron) (366.72 KB, text/plain) 2005-08-03 00:41 UTC, Thomas Schwanhäuser	no flags

Description Rob Helmer 2005-07-30 20:10:09 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050405 Firefox/1.0 (Ubuntu package 1.0.2)

Description of problem:
Kernel crashes under heavy CPU load. Stack trace attached.

Version-Release number of selected component (if applicable):
kernel-2.6.12-1.1398_FC4

How reproducible:
Didn't try

Steps to Reproduce:
1. Heavy CPU load
2. Wait 1-2 days
3. Wait some more
  

Actual Results:  crash

Expected Results:  continued service

Additional info:

Jul 30 10:50:05 map2 kernel: consoletype[21793]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffc002e0 error 14
Jul 30 10:50:05 map2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jul 30 10:50:05 map2 kernel: Kernel BUG at "mm/mmap.c":2026
Jul 30 10:50:05 map2 kernel: invalid operand: 0000 [1] SMP
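The `error 14` field on the segfault line is the x86 page-fault error code. Assuming the x86-64 fault handler of this kernel prints it in hexadecimal (so the value is 0x14), it decodes to a user-mode instruction fetch from a non-present page, which fits `rip 0000000000000000`: the process jumped to address zero. A minimal sketch of the decoding, using the architectural bit positions; the macro, struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdio.h>

/* Architectural x86 page-fault error-code bits (Intel SDM / AMD APM). */
#define PFERR_PROT  (1ul << 0) /* 0: page not present, 1: protection violation */
#define PFERR_WRITE (1ul << 1) /* 0: read access, 1: write access */
#define PFERR_USER  (1ul << 2) /* 0: kernel mode, 1: user mode */
#define PFERR_RSVD  (1ul << 3) /* reserved bit set in a paging-structure entry */
#define PFERR_INSTR (1ul << 4) /* instruction fetch; only reported when NX is enabled */

/* Hypothetical helper type: the error code split into named flags. */
struct pf_error {
    bool protection, write, user, reserved, ifetch;
};

static struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e = {
        .protection = (code & PFERR_PROT) != 0,
        .write      = (code & PFERR_WRITE) != 0,
        .user       = (code & PFERR_USER) != 0,
        .reserved   = (code & PFERR_RSVD) != 0,
        .ifetch     = (code & PFERR_INSTR) != 0,
    };
    return e;
}

/* Hypothetical helper: print a one-line summary of the fault. */
static void print_pf_error(unsigned long code)
{
    struct pf_error e = decode_pf_error(code);
    printf("%s-mode %s, %s%s\n",
           e.user ? "user" : "kernel",
           e.ifetch ? "instruction fetch" : (e.write ? "write" : "read"),
           e.protection ? "protection violation" : "page not present",
           e.reserved ? ", reserved bit set" : "");
}
```

For example, `print_pf_error(0x14)` prints `user-mode instruction fetch, page not present`.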

Comment 1 Rob Helmer 2005-07-30 20:11:53 UTC

Created attachment 117313
full stack trace from logfile

Comment 2 Thomas Schwanhäuser 2005-08-03 00:41:29 UTC

Created attachment 117379
Another trace (Dual Opteron)

Comment 3 Thomas Schwanhäuser 2005-08-03 00:43:31 UTC

Hi,

We experienced the same problem today. The crash came out of nowhere; before
that, the system had been running for 34 days without a problem. The system was
under almost no load when the crash occurred. We run a dual Opteron with 2GB RAM.

Comment 4 Dave Jones 2005-08-04 06:24:43 UTC

Looks like another variant of bug 155857.
Is this on a Tyan board, by any chance?

Are you running the latest BIOS update?

Comment 5 Thomas Schwanhäuser 2005-08-04 07:20:29 UTC

In my case it's a Tyan board. It doesn't have the newest BIOS. It's a version 
from March.

Does a BIOS update help in that case?

Comment 6 Dave Jones 2005-08-04 17:26:16 UTC

Quite possibly. There is a known CPU erratum which some vendors fixed in a BIOS
update. It's feasible that this problem could affect kernels which use 4-level
page tables (2.6.11 and higher) in certain situations.

The bad pmd messages first started appearing just after the 4-level page table
support got merged upstream.
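For reference, with 4-level paging on x86-64 and 4 KiB pages, a 48-bit virtual address is split into four 9-bit table indices (PGD, PUD, PMD, PTE) plus a 12-bit page offset; the PMD is the third level, the one named in the bad pmd messages. A minimal sketch of that split, assuming the same shift values as the kernel's PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT and PGDIR_SHIFT on x86-64; the macro, struct and function names are hypothetical:

```c
#include <stdint.h>

/* x86-64 4-level paging with 4 KiB pages: 512 entries (9 bits) per level. */
#define VA_PAGE_SHIFT  12
#define VA_PMD_SHIFT   21
#define VA_PUD_SHIFT   30
#define VA_PGDIR_SHIFT 39
#define VA_PTRS_PER_TABLE 512u

/* Hypothetical container for the per-level indices of one address. */
struct va_split {
    unsigned pgd, pud, pmd, pte; /* table indices, 0..511 */
    unsigned offset;             /* byte offset inside the 4 KiB page */
};

static struct va_split split_va(uint64_t va)
{
    struct va_split s;
    s.pgd    = (unsigned)((va >> VA_PGDIR_SHIFT) & (VA_PTRS_PER_TABLE - 1));
    s.pud    = (unsigned)((va >> VA_PUD_SHIFT)   & (VA_PTRS_PER_TABLE - 1));
    s.pmd    = (unsigned)((va >> VA_PMD_SHIFT)   & (VA_PTRS_PER_TABLE - 1));
    s.pte    = (unsigned)((va >> VA_PAGE_SHIFT)  & (VA_PTRS_PER_TABLE - 1));
    s.offset = (unsigned)(va & ((1u << VA_PAGE_SHIFT) - 1));
    return s;
}
```

Applied to the stack pointer `00007fffffa00250` from the segfault line in the later oops, `split_va` yields PGD index 255, PUD 511, PMD 509, PTE 0 and offset 0x250.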

Comment 7 Rob Helmer 2005-08-04 20:09:10 UTC

I just had another crash, on a Tyan 2882 dual-CPU Opteron motherboard. Tyan
just released a new BIOS update; I am going to try it out today.

According to their site, the following issues are fixed in the latest BIOS
revision (2882_303e):

* Fixed an IOMMU issue
* Fixed an issue where fans slower than 1500 RPM were not displayed correctly
* Fixed an issue where some CPUs reported too high a temperature (~70°C)

Could the IOMMU issue be related?

Comment 8 Rob Helmer 2005-08-04 20:22:16 UTC

Actually, 2882_303e is the beta BIOS; the production version is 2882_303 (I have
2882_302 loaded right now). Here is the changelog for 2882_303. Could "AMD
erratum 123" be related?

* Fixed an issue where the Peppercon's USB KVM would hang in use
* Implemented AMD's recommendations for DDR400 speed settings when large loads
  (more than 4 DIMMs) were used at once
* Added an auto-detect feature and added support for the M3289 & M3290 SMDC cards
* Implemented AMD erratum 123
* Added an IPMI over LAN selectable option in the BIOS [82551]/[BCM5704]
* Fixed an issue where Bank Interleaving was not functioning properly
* Fixed an issue where AMD PowerNow! was not working correctly
* Fixed an issue where the reported values of CPU1 Vcore & CPU2 Vcore were not
  correct

Comment 9 Dave Jones 2005-08-04 22:47:32 UTC

No, the erratum I was referring to is 122, "TLB Flush Filter may cause coherency
problem in multiprocessor systems", though 123 sounds quite nasty too (potential
effect: data corruption or system hang).

Comment 10 Dave Jones 2005-08-26 21:49:46 UTC

Can you try the latest errata kernel in updates-testing?
It has a possible workaround for the erratum I referred to.
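This bug does not say what the updates-testing workaround actually changed. For erratum 122, one mitigation found in later upstream kernels' AMD CPU setup code is to set bit 6 (FFDIS) of the K8 Hardware Configuration Register, MSR 0xC0010015, which disables the TLB flush filter. A minimal user-space sketch for inspecting that bit, assuming the Linux msr driver (`/dev/cpu/N/msr`, where the file offset selects the MSR) is loaded and the caller is root; the helper names are hypothetical, and this is not necessarily the change shipped in the Fedora kernel:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* AMD K8 Hardware Configuration Register and its flush-filter-disable bit. */
#define MSR_K8_HWCR 0xc0010015u
#define HWCR_FFDIS  (1ull << 6)

/* True if the TLB flush filter is disabled in this HWCR value. */
static bool flush_filter_disabled(uint64_t hwcr)
{
    return (hwcr & HWCR_FFDIS) != 0;
}

/* HWCR value with FFDIS set; writing it back requires privileged MSR access. */
static uint64_t hwcr_with_ffdis(uint64_t hwcr)
{
    return hwcr | HWCR_FFDIS;
}

/* Read one MSR through the Linux msr driver: the file offset selects the
 * register. Needs root and the msr module. Returns 0 on success, -1 on error. */
static int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, (off_t)msr, SEEK_SET) == (off_t)msr &&
             read(fd, value, sizeof *value) == (ssize_t)sizeof *value;
    close(fd);
    return ok ? 0 : -1;
}
```

On an affected machine, calling `read_msr(0, MSR_K8_HWCR, &v)` and then `flush_filter_disabled(v)` would report whether the flush filter is already disabled on CPU 0.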

Comment 11 Steffen Plotner 2005-08-28 20:59:41 UTC

I would like to add that I have experienced the same problem under very heavy
system load (motherboard Tyan 2882, dual AMD Opteron 1.2GHz, 2GB RAM). The BIOS
is 3.02. Linux FC4 2.6.12-1.1398_FC4smp.

This happened at 4 in the morning, when all the cron jobs run their maintenance
tasks; I am also running 6 VMs (VMware Workstation) that add to the load.

I am planning on upgrading the BIOS to 3.03 tonight and will report back with the results.

The crash dump appears as follows:


Aug 28 04:24:37 vmhost1 kernel: zcat[26892]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffa00250 error 14
Aug 28 04:24:37 vmhost1 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Aug 28 04:24:37 vmhost1 kernel: Kernel BUG at "mm/mmap.c":2026
Aug 28 04:24:37 vmhost1 kernel: invalid operand: 0000 [1] SMP
Aug 28 04:24:37 vmhost1 kernel: CPU 1
Aug 28 04:24:37 vmhost1 kernel: Modules linked in: vmnet(U) parport_pc parport vmmon(U) iscsi_trgt(U) crc32c libcrc32c autofs4 w83627hf eeprom lm85 i2c_sensor i2c_isa i2c_amd756 sunrpc md5 ipv6 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables video button battery ac ohci_hcd i2c_amd8111 i2c_core hw_random shpchp e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod 3w_9xxx sata_sil libata sd_mod scsi_mod
Aug 28 04:24:37 vmhost1 kernel: Pid: 26892, comm: zcat Tainted: P   M  2.6.12-1.1398_FC4smp
Aug 28 04:24:37 vmhost1 kernel: RIP: 0010:[<ffffffff8017835f>] <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:37 vmhost1 kernel: RSP: 0018:ffff81002dbcfd28  EFLAGS: 00010202
Aug 28 04:24:37 vmhost1 kernel: RAX: 0000000000000037 RBX: 0000000000000000 RCX: ffff8100581afb90
Aug 28 04:24:37 vmhost1 kernel: RDX: 0000000000000036 RSI: ffff8100581afb20 RDI: ffff81003ffe6200
Aug 28 04:24:37 vmhost1 kernel: RBP: 0000000000000000 R08: ffff81007f767bc8 R09: ffff81002dbcfd30
Aug 28 04:24:37 vmhost1 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff810037c69b40
Aug 28 04:24:39 vmhost1 kernel: R13: ffff810037c69bb8 R14: 000000000000000b R15: ffff81006cd70dd8
Aug 28 04:24:42 vmhost1 kernel: FS:  00002aaaaaaba3e0(0000) GS:ffffffff8050d800(0000) knlGS:00000000ef5ecbb0
Aug 28 04:24:42 vmhost1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 28 04:24:42 vmhost1 kernel: CR2: 00000000006c4580 CR3: 0000000000101000 CR4: 00000000000006e0
Aug 28 04:24:42 vmhost1 kernel: Process zcat (pid: 26892, threadinfo ffff81002dbce000, task ffff81006cd70800)
Aug 28 04:24:42 vmhost1 kernel: Stack: 0000000000000000 0000000000000077 ffff810040e184a0 ffff810037c69b40
Aug 28 04:24:42 vmhost1 kernel:        ffff810037c69bc0 ffff81006cd70800 0000000000000001 ffffffff80137134
Aug 28 04:24:42 vmhost1 kernel:        ffff81006cd70e54 000000000000000b
Aug 28 04:24:42 vmhost1 kernel: Call Trace:<ffffffff80137134>{mmput+52} <ffffffff8013c15d>{do_exit+397}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8013ccbc>{do_group_exit+252} <ffffffff80147e4d>{get_signal_to_deliver+1565}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8010e1cd>{do_signal+157} <ffffffff80210ecc>{_atomic_dec_and_lock+44}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff80189d8e>{__fput+270} <ffffffff8035ac32>{thread_return+0}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8035ac84>{thread_return+82} <ffffffff8010f125>{retint_signal+62}
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel: Code: 0f 0b e6 b7 37 80 ff ff ff ff ea 07 66 66 90 66 90 48 83 c4
Aug 28 04:24:42 vmhost1 kernel: RIP <ffffffff8017835f>{exit_mmap+383} RSP <ffff81002dbcfd28>
Aug 28 04:24:42 vmhost1 kernel:  <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
Aug 28 04:24:42 vmhost1 kernel: in_atomic():0, irqs_disabled():1
Aug 28 04:24:42 vmhost1 kernel:
Aug 28 04:24:42 vmhost1 kernel: Call Trace:<ffffffff8013abd5>{profile_task_exit+21} <ffffffff8013bff2>{do_exit+34}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8022178d>{vgacon_cursor+221} <ffffffff8011066d>{die+77}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff80111203>{do_invalid_op+163} <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff80168f8a>{__pagevec_free+42} <ffffffff8016e990>{release_pages+368}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8010f5b5>{error_exit+0} <ffffffff8017835f>{exit_mmap+383}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8017834c>{exit_mmap+364} <ffffffff80137134>{mmput+52}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff8013c15d>{do_exit+397} <ffffffff8013ccbc>{do_group_exit+252}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff80147e4d>{get_signal_to_deliver+1565} <ffffffff8010e1cd>{do_signal+157}
Aug 28 04:24:42 vmhost1 kernel:        <ffffffff80210ecc>{_atomic_dec_and_lock+44} <ffffffff80189d8e>{__fput+270}
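Two fields in this oops can be decoded by hand. The `Code:` line begins with `0f 0b`, the `ud2` instruction that BUG() uses to raise the invalid-opcode trap reported as `invalid operand`. Assuming the bytes that follow are an 8-byte little-endian file-name pointer and a 2-byte little-endian line number (an assumption the data supports, since `ea 07` read little-endian is 0x07ea = 2026, matching `"mm/mmap.c":2026`), the frame can be unpacked as below. The sketch also decodes the `Tainted: P   M` field, assuming the six-flag layout (P/G, F, S, R, M, B) used by print_tainted() in kernels of this period. All function and struct names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 BUG() frame at RIP: ud2 (0f 0b), an 8-byte little-endian
 * pointer to the file-name string, then a 2-byte little-endian line number. */
static bool decode_bug_frame(const uint8_t *code, size_t len,
                             uint64_t *file_ptr, unsigned *line)
{
    if (len < 12 || code[0] != 0x0f || code[1] != 0x0b)
        return false; /* not a ud2-based BUG() frame */
    uint64_t p = 0;
    for (int i = 7; i >= 0; i--) /* most significant byte is stored last */
        p = (p << 8) | code[2 + i];
    *file_ptr = p;
    *line = (unsigned)code[10] | ((unsigned)code[11] << 8);
    return true;
}

/* Assumed taint field layout: six characters, a space for each unset flag. */
struct taint {
    bool proprietary;   /* 'P': non-GPL module loaded ('G' otherwise) */
    bool forced_load;   /* 'F': module loaded with force */
    bool smp_unsafe;    /* 'S': SMP kernel on CPUs not designed for SMP */
    bool forced_rmmod;  /* 'R': module unloaded with force */
    bool machine_check; /* 'M': a machine check exception occurred */
    bool bad_page;      /* 'B': a bad page reference was detected */
};

/* flags must point to at least six characters. */
static struct taint decode_taint(const char *flags)
{
    struct taint t = {
        .proprietary   = flags[0] == 'P',
        .forced_load   = flags[1] == 'F',
        .smp_unsafe    = flags[2] == 'S',
        .forced_rmmod  = flags[3] == 'R',
        .machine_check = flags[4] == 'M',
        .bad_page      = flags[5] == 'B',
    };
    return t;
}
```

Under these assumptions, the bytes `0f 0b e6 b7 37 80 ff ff ff ff ea 07` decode to a file-name pointer of 0xffffffff8037b7e6 and line 2026, and the taint field `P   M ` reports that a proprietary module was loaded and that a machine check exception had been recorded before the oops.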

Comment 12 Dave Jones 2005-09-30 06:16:50 UTC

Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.

Comment 13 Dave Jones 2005-11-10 19:15:25 UTC

2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.

Comment 14 Thomas Schwanhäuser 2005-11-10 21:02:41 UTC

I can confirm that the problem was fixed with kernel 2.6.13-1.1526_FC4.
