443651 – Kernel Panic: kernel BUG at mm/rmap:590!

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Larry Woodman
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	457458 465193 479142 533192

Reported:	2008-04-22 18:04 UTC by Scott LaCroix
Modified:	2018-11-14 18:30 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-08-12 01:25:11 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Scott LaCroix 2008-04-22 18:04:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

Description of problem:
This has happened twice now. I haven't written a short program to force the situation yet, but I will work on that now that I know it's not a one-off problem.

I have to retype the kernel crash info by hand, so it may be incomplete. Please let me know if you need more specific information.

Kernel BUG at mm/rmap.c:590!
invalid opcode: 0000 [#1]
SMP
.
.
.
CPU: 1
EIP: 0060:[<c0463166>] not tainted VLI
EFLAGS: 00210286 (2.6.18-53el5PAE #1)
Process <our application> (pid 18897 ... )
Call Trace:
 [<c045d30b>] unmap_vmas+0x2f5/0x58f
 [<c046044f>] exit_mmap+0x68/0xdf
 [<c0428be7>] do_exit+0x1eb/0x734
 [<c0605f48>] do_page_fault+0x54f/0x5d3
 [<c04e4381>] copy_to_user+0x31/0x48
 [<c06059f9>] do_page_fault+0x0/0x5d3
 [<c0405a71>] error_code+0x39/0x40

Code: ...
EIP: [<c0463166>] page_remove_rmap+0x16/0x6d SS:ESP 0068:e5133ea8


I will work on a way to test this. If I find something easily reproducible, I will update this bug.

Version-Release number of selected component (if applicable):
kernel-2.6.18-53el5PAE

How reproducible:
Didn't try


Comment 1 Larry Woodman 2008-04-24 16:05:49 UTC

Nothing obvious here; the page->_mapcount went negative!!!  Could be a real
logic bug or corruption in the page structure.  Please let me know ASAP if this
is reproducible.

Larry Woodman

Comment 2 Larry Woodman 2008-04-24 16:08:45 UTC


Also, if you can get a crashdump file that would be great.

Larry

Comment 3 Larry Woodman 2008-08-26 17:54:18 UTC

I added debug code in RHEL5-U3 to the BUG() statement encountered above so that more debugging data gets printed to the console.  While this will not fix the problem, it will help us debug it if it is encountered again.

---------------------------------------------------------------------------------
void page_remove_rmap(struct page *page)
{
        if (atomic_add_negative(-1, &page->_mapcount)) {
                if (unlikely(page_mapcount(page) < 0)) {
                        printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
                        printk (KERN_EMERG "  page->flags = %lx\n", page->flags);
                        printk (KERN_EMERG "  page->count = %x\n", page_count(page));
                        printk (KERN_EMERG "  page->mapping = %p\n", page->mapping);
                        BUG();
                }
                /* ... remainder of page_remove_rmap() not shown ... */
        }
}
---------------------------------------------------------------------------------
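To illustrate the condition this debug code reports, here is a minimal user-space sketch. It relies on the kernel convention that `_mapcount` starts at -1 for an unmapped page and that `page_mapcount()` returns `_mapcount + 1`. The `model_*` names are hypothetical stand-ins, not kernel code, and the decrement is not atomic as it is in the kernel.

```c
#include <stdio.h>

/* Hypothetical user-space stand-in for struct page; only the counter is modeled. */
struct model_page {
    int mapcount_raw;   /* plays the role of page->_mapcount; -1 means "not mapped" */
};

/* Visible map count is the raw counter plus one, as in the kernel convention. */
static int model_page_mapcount(const struct model_page *page)
{
    return page->mapcount_raw + 1;
}

/*
 * Drop one mapping. Returns 0 on a normal removal and -1 when the counter
 * underflows, which is the condition that reaches BUG() in the excerpt above.
 */
static int model_page_remove_rmap(struct model_page *page)
{
    page->mapcount_raw -= 1;        /* non-atomic stand-in for atomic_add_negative(-1, ...) */
    if (page->mapcount_raw < 0) {   /* negative result: last mapping gone, or underflow */
        if (model_page_mapcount(page) < 0) {
            fprintf(stderr, "mapcount went negative (%d)\n",
                    model_page_mapcount(page));
            return -1;
        }
    }
    return 0;
}
```

Starting from a page that is mapped once (`mapcount_raw == 0`), the first call succeeds and leaves the visible count at 0. A second call, which removes a mapping that was never accounted for, reports a visible count of -1. That matches the "(-1)" printed in the console output later in this bug.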

Comment 5 Guy Streeter 2009-02-13 20:13:49 UTC

Customer saw this in the kdump kernel:

Eeek! page_mapcount(page) went negative! (-1)
 page->flags = 414
 page->count = 1
 page->mapping = 0000000000000000
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/rmap.c:587
invalid opcode: 0000 [1] SMP
last sysfs file: /block/sda/range
CPU 0
Modules linked in: nfs nfs_acl fscache lockd sunrpc e1000e ext3 jbd usb_storage ata_piix libata megaraid_sas sd_mod scsi_mod
Pid: 21344, comm: exe Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff8000ad22>]  [<ffffffff8000ad22>] page_remove_rmap+0x79/0xdd
RSP: 0018:ffff810005169de8  EFLAGS: 00010282
RAX: 0000000000000026 RBX: ffff8100015ae348 RCX: 0000000000000286
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802f7adc
RBP: ffff8100015ae348 R08: 0000000000000000 R09: ffff8100015003d4
R10: 0000000000000001 R11: ffffffff80161742 R12: 000000000000ffff
R13: 000000000052f000 R14: ffff81000009f978 R15: ffff8100014e9400
FS:  0000000000000000(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000594f10 CR3: 000000000516a000 CR4: 00000000000006e0
Process exe (pid: 21344, threadinfo ffff810005168000, task ffff810008f007a0)
Stack:  000000000000ff20 ffffffff80007bd2 00000000000200d2 0000000000000000
ffff810005169ed8 ffffffffffffffff 0000000000000000 ffff81000830a1e8
ffff810005169ee0 00000000003d4efb 0000000000000000 000000018002df6f
Call Trace:
[<ffffffff80007bd2>] unmap_vmas+0x4ed/0x848
[<ffffffff80039aad>] exit_mmap+0x78/0xf3
[<ffffffff8003bc07>] mmput+0x30/0x83
[<ffffffff800152f8>] do_exit+0x2b1/0x91f
[<ffffffff80048c18>] cpuset_exit+0x0/0x6c
[<ffffffff8005d116>] system_call+0x7e/0x83


Code: 0f 0b 68 24 c3 29 80 c2 4b 02 8b 73 18 48 89 df 83 f6 01 83
RIP  [<ffffffff8000ad22>] page_remove_rmap+0x79/0xdd
RSP <ffff810005169de8>
<0>Kernel panic - not syncing: Fatal exception
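The `page->flags` value in the dump above is printed in hexadecimal (the debug code uses `%lx`). Which flag each bit stands for depends on the kernel build's flag definitions, which are not reproduced in this bug, so the following sketch only lists the bit positions that are set. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * List the set bit positions of a flags word, lowest bit first.
 * Writes up to max_bits positions into bits[] and returns how many bits are set.
 */
static size_t set_bit_positions(unsigned long long flags, int *bits, size_t max_bits)
{
    size_t count = 0;

    for (int bit = 0; bit < 64; bit++) {
        if (flags & (1ULL << bit)) {
            if (count < max_bits)
                bits[count] = bit;
            count++;
        }
    }
    return count;
}

/* Print the positions so they can be compared against the kernel's flag definitions. */
static void print_set_bits(unsigned long long flags)
{
    int bits[64];
    size_t n = set_bit_positions(flags, bits, 64);

    printf("flags 0x%llx:", flags);
    for (size_t i = 0; i < n; i++)
        printf(" bit %d", bits[i]);
    printf("\n");
}
```

Calling `print_set_bits(0x414ULL)` lists bits 2, 4 and 10. Mapping those positions to flag names requires the page-flags header of the exact kernel that produced the dump.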

Comment 6 RHEL Program Management 2009-02-16 15:43:28 UTC

Updating PM score.

Comment 7 Martin Wilck 2009-02-17 12:26:07 UTC

The problem is quite reliably reproducible here. Capturing a kdump isn't feasible because this happens during kdump.

But if we need to try a custom debug kernel, we'd be able to.
Martin

Comment 11 Larry Woodman 2009-05-19 14:45:24 UTC

This looks like a duplicate of BZ497884.  The MTU of eth1 (bnx2 driver) is set to 9000 bytes, as configured in the ifcfg file.

crash> net_device.mtu ffff810429f04000
  mtu = 9000, 

This has caused corruption, which in this case looks like it trashed the pgd page.  When the victim process incurs a page fault, it fails with an OOM because pud_offset() returned NULL.  When the victim process exits, the system panics while tearing down the corrupted page tables.

------------------------------------------------------------------------------
The driver shipped with RHEL5.2 causes a kernel panic inside of 1 minute when jumbo frames are enabled - seemed serious enough for a hotfix.  As there was already work done to update the driver, it seemed reasonable enough to piggy-back on that BZ for the HotFix so that the customer can run it in production.
------------------------------------------------------------------------------

To work around this issue, please either don't configure jumbo frames or upgrade the kernel to this hotfix:

http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=3691
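Before applying the workaround, it can help to confirm whether any interface actually has a jumbo MTU configured. The sketch below assumes the interface exposes its MTU as text in `/sys/class/net/<ifname>/mtu` and treats anything above 1500 bytes as "jumbo". Both the threshold constant and the helper names are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

#define STANDARD_ETHERNET_MTU 1500  /* assumption: anything above this counts as "jumbo" */

/* Parse MTU text such as "9000\n"; returns -1 if it is not a positive number. */
static long parse_mtu(const char *text)
{
    char *end = NULL;
    long mtu = strtol(text, &end, 10);

    if (end == text || mtu <= 0)
        return -1;
    return mtu;
}

/* Nonzero when the MTU exceeds the standard Ethernet payload size. */
static int mtu_is_jumbo(long mtu)
{
    return mtu > STANDARD_ETHERNET_MTU;
}

/* Read /sys/class/net/<ifname>/mtu; returns -1 if the file cannot be read or parsed. */
static long read_interface_mtu(const char *ifname)
{
    char path[256];
    char buf[32];
    long mtu = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/sys/class/net/%s/mtu", ifname);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) != NULL)
        mtu = parse_mtu(buf);
    fclose(fp);
    return mtu;
}
```

With the configuration described above, `mtu_is_jumbo(read_interface_mtu("eth1"))` would be expected to return nonzero, since the MTU is 9000.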


Larry Woodman & Dave Anderson (don't shoot, we're only the messengers)...

Comment 12 Chris Van Hoof 2009-05-21 14:38:03 UTC

(In reply to comment #11)
[...]
> Larry Woodman & Dave Anderson (don't shoot, we're only the messengers)...

Thanks Larry -- This is perfect!

--chris

Comment 17 Scooter Morris 2010-05-05 18:51:52 UTC

We just had this same crash, but we're not running jumbo frames.  Here is our backtrace:

[2010-05-05 10:46:17]Eeek! page_mapcount(page) went negative! (-1)
[2010-05-05 10:46:17]  page->flags = 18080000000004
[2010-05-05 10:46:17]  page->count = 0
[2010-05-05 10:46:17]  page->mapping = 0000000000000000
[2010-05-05 10:46:17]----------- [cut here ] --------- [please bite here ] ---------
[2010-05-05 10:46:17]Kernel BUG at mm/rmap.c:590
[2010-05-05 10:46:17]invalid opcode: 0000 [1] SMP
[2010-05-05 10:46:17]last sysfs file: /devices/pci0000:00/0000:00:06.0/0000:0b:00.0/0000:0c:09.0/0000:0d:00.0/host0/rport-0:0-4/target0:0:4/0:0:4:7/timeout
[2010-05-05 10:46:17]CPU 0
[2010-05-05 10:46:17]Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle arptable_filter arp_tables x_tables ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st ide_cd sg cdrom hpilo bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
[2010-05-05 10:46:18]Pid: 19600, comm: imap-login Not tainted 2.6.18-194.el5 #1
[2010-05-05 10:46:18]RIP: 0010:[<ffffffff8000afd7>]  [<ffffffff8000afd7>] page_remove_rmap+0x79/0xe7
[2010-05-05 10:46:18]RSP: 0018:ffff810737445de8  EFLAGS: 00010282
[2010-05-05 10:46:18]RAX: 0000000000000026 RBX: ffff81010064cc68 RCX: ffffffff80312da8
[2010-05-05 10:46:19]RDX: ffffffff80312da8 RSI: 0000000000000000 RDI: ffffffff80312da0
[2010-05-05 10:46:19]RBP: ffff81010064cc68 R08: ffffffff80312da8 R09: 0000000000000001
[2010-05-05 10:46:19]R10: 0000000000000000 R11: 000000000000027f R12: 000000001cccb025
[2010-05-05 10:46:19]R13: 000000338c400000 R14: ffff81071975e000 R15: ffff810809c8a440
[2010-05-05 10:46:19]FS:  0000000000000000(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
[2010-05-05 10:46:19]CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2010-05-05 10:46:19]CR2: 000000338f403350 CR3: 00000007e6480000 CR4: 00000000000006e0
[2010-05-05 10:46:19]Process imap-login (pid: 19600, threadinfo ffff810737444000, task ffff81077ea1c860)
[2010-05-05 10:46:19]Stack:  000000001cccb020 ffffffff80007c5c 0000000000000000 ffff810737445ed8
[2010-05-05 10:46:19] ffffffffffffffff 0000000000000000 ffff8107f84251e8 ffff810737445ee0
[2010-05-05 10:46:19] 00000000002ede7a 0000000000000000 0000000109c8a440 ffff810809c8a440
[2010-05-05 10:46:19]Call Trace:
[2010-05-05 10:46:19] [<ffffffff80007c5c>] unmap_vmas+0x563/0x904
[2010-05-05 10:46:19] [<ffffffff8003a31e>] exit_mmap+0x87/0x102
[2010-05-05 10:46:20] [<ffffffff8003c498>] mmput+0x30/0x84
[2010-05-05 10:46:20] [<ffffffff800157be>] do_exit+0x2b1/0x911
[2010-05-05 10:46:20] [<ffffffff800496a1>] cpuset_exit+0x0/0x88
[2010-05-05 10:46:20] [<ffffffff8005e28d>] tracesys+0xd5/0xe0
[2010-05-05 10:46:20]
[2010-05-05 10:46:20]
[2010-05-05 10:46:20]Code: 0f 0b 68 dc 24 2b 80 c2 4e 02 48 8b 53 18 f6 c2 01 74 02 8b
[2010-05-05 10:46:20]RIP  [<ffffffff8000afd7>] page_remove_rmap+0x79/0xe7
[2010-05-05 10:46:20] RSP <ffff810737445de8>


And here are our interfaces:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet xxx.xxx.xx.xx/32 scope global lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:d0:3e:34 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xx.xx/24 brd 169.230.27.255 scope global eth0
    inet xxx.xxx.xx.xx/24 scope global secondary eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:d0:3e:36 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xx.xx/25 brd 169.230.126.127 scope global eth1
    inet xxx.xxx.xx.xx/25 scope global secondary eth1
4: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0

Comment 20 James 2010-10-05 17:07:24 UTC

We're experiencing a very similar problem: http://i.imgur.com/n4VUs.png

The call trace is slightly different but the majority of it is the same.

Comment 22 RHEL Program Management 2010-12-07 09:50:30 UTC

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 27 Larry Woodman 2011-02-03 20:12:29 UTC

This is looking more like corruption in a pte when a process exits.  Both BUGs occur during exit, when zap_pte_range() goes through the pte page and calls vm_normal_page() to extract the pfn and convert it to a page structure.  In the first case the pfn was that of a valid page, but it BUG'd in page_remove_rmap() because the page_mapcount() was negative (probably because the pfn was clobbered).  In the second case vm_normal_page() BUG'd because the pfn wasn't that of a valid page of RAM.
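The two failure modes described above can be modeled in user space. This is only a sketch under stated assumptions: it uses the architectural x86-64 PTE layout (physical address in bits 12..51, 4 KiB pages) and a single hypothetical `max_pfn` limit in place of the kernel's real validity checks. The `model_*` and `classify_pte` names are invented for illustration.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
/* Assumption: architectural x86-64 layout, physical address in PTE bits 12..51. */
#define MODEL_PTE_ADDR_MASK 0x000ffffffffff000ULL

enum pfn_class {
    PFN_EXPECTED,     /* the page this mapping really referenced */
    PFN_OTHER_VALID,  /* a different page in RAM: its map count is decremented wrongly */
    PFN_INVALID       /* no such page frame: nothing sane to convert to a page structure */
};

/* Extract the page frame number from a PTE value. */
static uint64_t model_pte_to_pfn(uint64_t pte)
{
    return (pte & MODEL_PTE_ADDR_MASK) >> MODEL_PAGE_SHIFT;
}

/* Classify what a (possibly corrupted) PTE points at. */
static enum pfn_class classify_pte(uint64_t pte, uint64_t expected_pfn, uint64_t max_pfn)
{
    uint64_t pfn = model_pte_to_pfn(pte);

    if (pfn >= max_pfn)
        return PFN_INVALID;
    return pfn == expected_pfn ? PFN_EXPECTED : PFN_OTHER_VALID;
}
```

In this model, flipping a low address bit of a good PTE yields `PFN_OTHER_VALID`, which corresponds to the first case (an unrelated page has its map count driven negative). Flipping a high address bit yields `PFN_INVALID`, which corresponds to the second case.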

What's so special about this hardware/configuration that you see this problem but nobody else sees it???

Larry Woodman

Comment 28 James 2011-02-04 08:11:19 UTC

Hi Larry,
We see this issue on a wide variety of setups, but I haven't found a way to reproduce it. However, we haven't seen it for about 2 or 3 months now.

Comment 29 Martin Wilck 2011-02-04 09:14:33 UTC

(In reply to comment #27)

> What's so special about this hardware/configuration that you see this problem
> but nobody else sees it???

Frankly, I hope that you will be able to tell ...

AFAICS, there are 4 different reporting parties linked to this BZ, so this phenomenon is not unique to our environment.

For clarification: Our problem (Fujitsu support case 00321582) has been added to this BZ because of the similarity in BUG messages. We have only seen this problem while doing kdump over network (nfs, not 100% sure about ssh), usually around the time when the dump is almost finished. It has been seen on several different systems, but so far only in a certain environment (our QA lab in Augsburg, Germany). Attempts to reproduce it at Red Hat Farnborough and in our development lab in Paderborn have failed. We have been unable to identify why this is so.

Comment 30 Martin Wilck 2011-05-03 09:15:09 UTC

FYI - the Fujitsu support case 00321582 has been closed as restriction. We acknowledge that we'll probably never see a solution for this problem.

Comment 33 RHEL Program Management 2011-06-20 21:08:39 UTC

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 34 RHEL Program Management 2011-08-12 01:25:11 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.