From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

Description of problem:
This has happened twice now. I haven't written a short program to force the situation, but I will work on that now that I know it's not a one-off problem. I have to retype the kernel crash info by hand, so it may be incomplete. Please let me know if you need more specific information.

Kernel BUG at mm/rmap.c:590!
invalid opcode: 0000 [#1]
SMP
. . .
CPU: 1
EIP: 0060:[<c0463166>] not tainted VLI
EFLAGS: 00210286 (2.6.18-53el5PAE #1)
Process <our application> (pid 18897 ... )
Call Trace:
 [<c045d30b>] unmap_vmas+0x2f5/0x58f
 [<c046044f>] exit_mmap+0x68/0xdf
 [<c0428be7>] do_exit+0x1eb/0x734
 [<c0605f48>] do_page_fault+0x54f/0x5d3
 [<c04e4381>] copy_to_user+0x31/0x48
 [<c06059f9>] do_page_fault+0x0/0x5d3
 [<c0405a71>] error_code+0x39/0x40
Code: ...
EIP: [<c0463166>] page_remove_rmap+0x16/0x6d SS:ESP 0068:e5133ea8

I will work on a way to test this. If I find something easily reproducible, I will update this bug.

Version-Release number of selected component (if applicable):
kernel-2.6.18-53el5PAE

How reproducible:
Didn't try

Steps to Reproduce:
1.
2.
3.

Actual Results:

Expected Results:

Additional info:
Nothing obvious here; the page->_mapcount went negative! This could be a real logic bug or corruption in the page structure. Please let me know ASAP if this is reproducible.

Larry Woodman
Also, if you can get a crashdump file that would be great. Larry
I added debug code in RHEL5-U3 to the BUG() statement encountered above so that more debugging data gets printed to the console. While this will not fix the problem, it will help us debug it if it is encountered again.
---------------------------------------------------------------------------------
void page_remove_rmap(struct page *page)
{
	if (atomic_add_negative(-1, &page->_mapcount)) {
		if (unlikely(page_mapcount(page) < 0)) {
			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
			printk (KERN_EMERG " page->flags = %lx\n", page->flags);
			printk (KERN_EMERG " page->count = %x\n", page_count(page));
			printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
			BUG();
		}
---------------------------------------------------------------------------------
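For readers following the check above, here is a minimal userspace sketch of the counting convention it relies on: `_mapcount` starts at -1 for an unmapped page, and `page_mapcount()` reports `_mapcount + 1`. The `toy_*` names are hypothetical stand-ins written for this illustration, not kernel code, and the sketch is single-threaded, so nothing in it is actually atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for struct page: only the counter discussed above. */
struct toy_page {
	int mapcount;	/* models page->_mapcount; -1 means "no mappings" */
};

/* Models atomic_add_negative(): add, then report whether the result is < 0.
 * Hypothetical helper; not atomic, since this sketch is single-threaded. */
static bool toy_add_negative(int delta, int *v)
{
	*v += delta;
	return *v < 0;
}

/* Models page_mapcount(): the number of ptes currently mapping the page. */
static int toy_page_mapcount(const struct toy_page *page)
{
	return page->mapcount + 1;
}

/* Models adding one mapping of the page. */
static void toy_page_add_rmap(struct toy_page *page)
{
	assert(page != NULL);
	page->mapcount++;
}

/* Models the check in page_remove_rmap().
 * Returns true where the kernel code would have reached BUG(). */
static bool toy_page_remove_rmap(struct toy_page *page)
{
	if (toy_add_negative(-1, &page->mapcount)) {
		/* Reaching -1 is the normal "last mapping removed" case.
		 * Anything lower means more unmaps than maps were seen. */
		if (toy_page_mapcount(page) < 0) {
			fprintf(stderr, "toy: mapcount went negative (%d)\n",
				toy_page_mapcount(page));
			return true;
		}
	}
	return false;
}
```

Starting from `mapcount = -1`, one `toy_page_add_rmap()` followed by one `toy_page_remove_rmap()` is the normal case and returns false. A second remove without a matching add leaves `toy_page_mapcount()` at -1 and returns true, which is the same "(-1)" value the debug printk reports.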
Customer saw this in the kdump kernel:

Eeek! page_mapcount(page) went negative! (-1)
 page->flags = 414
 page->count = 1
 page->mapping = 0000000000000000
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/rmap.c:587
invalid opcode: 0000 [1] SMP
last sysfs file: /block/sda/range
CPU 0
Modules linked in: nfs nfs_acl fscache lockd sunrpc e1000e ext3 jbd usb_storage ata_piix libata megaraid_sas sd_mod scsi_mod
Pid: 21344, comm: exe Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff8000ad22>] [<ffffffff8000ad22>] page_remove_rmap+0x79/0xdd
RSP: 0018:ffff810005169de8 EFLAGS: 00010282
RAX: 0000000000000026 RBX: ffff8100015ae348 RCX: 0000000000000286
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802f7adc
RBP: ffff8100015ae348 R08: 0000000000000000 R09: ffff8100015003d4
R10: 0000000000000001 R11: ffffffff80161742 R12: 000000000000ffff
R13: 000000000052f000 R14: ffff81000009f978 R15: ffff8100014e9400
FS: 0000000000000000(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000594f10 CR3: 000000000516a000 CR4: 00000000000006e0
Process exe (pid: 21344, threadinfo ffff810005168000, task ffff810008f007a0)
Stack: 000000000000ff20 ffffffff80007bd2 00000000000200d2 0000000000000000
 ffff810005169ed8 ffffffffffffffff 0000000000000000 ffff81000830a1e8
 ffff810005169ee0 00000000003d4efb 0000000000000000 000000018002df6f
Call Trace:
 [<ffffffff80007bd2>] unmap_vmas+0x4ed/0x848
 [<ffffffff80039aad>] exit_mmap+0x78/0xf3
 [<ffffffff8003bc07>] mmput+0x30/0x83
 [<ffffffff800152f8>] do_exit+0x2b1/0x91f
 [<ffffffff80048c18>] cpuset_exit+0x0/0x6c
 [<ffffffff8005d116>] system_call+0x7e/0x83

Code: 0f 0b 68 24 c3 29 80 c2 4b 02 8b 73 18 48 89 df 83 f6 01 83
RIP [<ffffffff8000ad22>] page_remove_rmap+0x79/0xdd
 RSP <ffff810005169de8>
 <0>Kernel panic - not syncing: Fatal exception
Updating PM score.
The problem is quite reproducible here. Writing a kdump isn't feasible because this happens during kdump. But if we need to try a custom debug kernel, we'd be able to.

Martin
This looks like a duplicate of BZ497884. The MTU of the bnx2/eth1 interface is set to 9000 bytes, as configured in the ifcfg file:

crash> net_device.mtu ffff810429f04000
  mtu = 9000,

This has caused corruption, which in this case looks like it trashed the pgd page. When the victim process incurs a page fault, it fails with an OOM because pud_offset() returned NULL. When the victim process exits, the system panics while tearing down the corrupted page tables.
------------------------------------------------------------------------------
The driver shipped with RHEL5.2 causes a kernel panic inside of 1 minute when jumbo frames are enabled - seemed serious enough for a hotfix. As there was already work done to update the driver, it seemed reasonable enough to piggy-back on that BZ for the HotFix so that the customer can run it in production.
------------------------------------------------------------------------------
To work around this issue, please either don't configure jumbo frames or upgrade the kernel to this hotfix:
http://seg.rdu.redhat.com/scripts/hotfix/edit.pl?id=3691

Larry Woodman & Dave Anderson (don't shoot, we're only the messengers)...
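To check from a running system whether an interface has jumbo frames configured, something like the following sketch could be used. It reads the MTU through the standard Linux `SIOCGIFMTU` ioctl. The helper names `read_mtu()` and `is_jumbo_mtu()` are hypothetical, and treating anything above 1500 bytes as "jumbo" is the usual convention rather than something stated in this BZ.

```c
#define _DEFAULT_SOURCE	/* expose struct ifreq under strict -std= modes (glibc) */
#include <assert.h>
#include <net/if.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Conventional Ethernet MTU; larger values are treated as jumbo frames here. */
#define STANDARD_ETH_MTU 1500

/* Hypothetical helper: true if the given MTU implies jumbo frames. */
static bool is_jumbo_mtu(int mtu)
{
	return mtu > STANDARD_ETH_MTU;
}

/* Hypothetical helper: returns the MTU of ifname, or -1 on any error. */
static int read_mtu(const char *ifname)
{
	struct ifreq ifr;
	int fd;
	int mtu = -1;

	assert(ifname != NULL);

	/* Any socket will do as a handle for interface ioctls. */
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);	/* stays NUL-terminated */

	if (ioctl(fd, SIOCGIFMTU, &ifr) == 0)
		mtu = ifr.ifr_mtu;

	close(fd);
	return mtu;
}
```

For the configuration described above, `is_jumbo_mtu(read_mtu("eth1"))` would be expected to be true while the interface is at MTU 9000 and false once it is back at 1500.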
(In reply to comment #11)
[...]
> Larry Woodman & Dave Anderson (don't shoot, we're only the messengers)...

Thanks Larry -- This is perfect!

--chris
We just had this same crash, but we're not running jumbo frames. Here is our backtrace:

[2010-05-05 10:46:17]Eeek! page_mapcount(page) went negative! (-1)
[2010-05-05 10:46:17] page->flags = 18080000000004
[2010-05-05 10:46:17] page->count = 0
[2010-05-05 10:46:17] page->mapping = 0000000000000000
[2010-05-05 10:46:17]----------- [cut here ] --------- [please bite here ] ---------
[2010-05-05 10:46:17]Kernel BUG at mm/rmap.c:590
[2010-05-05 10:46:17]invalid opcode: 0000 [1] SMP
[2010-05-05 10:46:17]last sysfs file: /devices/pci0000:00/0000:00:06.0/0000:0b:00.0/0000:0c:09.0/0000:0d:00.0/host0/rport-0:0-4/target0:0:4/0:0:4:7/timeout
[2010-05-05 10:46:17]CPU 0
[2010-05-05 10:46:17]Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle arptable_filter arp_tables x_tables ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st ide_cd sg cdrom hpilo bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
[2010-05-05 10:46:18]Pid: 19600, comm: imap-login Not tainted 2.6.18-194.el5 #1
[2010-05-05 10:46:18]RIP: 0010:[<ffffffff8000afd7>] [<ffffffff8000afd7>] page_remove_rmap+0x79/0xe7
[2010-05-05 10:46:18]RSP: 0018:ffff810737445de8 EFLAGS: 00010282
[2010-05-05 10:46:18]RAX: 0000000000000026 RBX: ffff81010064cc68 RCX: ffffffff80312da8
[2010-05-05 10:46:19]RDX: ffffffff80312da8 RSI: 0000000000000000 RDI: ffffffff80312da0
[2010-05-05 10:46:19]RBP: ffff81010064cc68 R08: ffffffff80312da8 R09: 0000000000000001
[2010-05-05 10:46:19]R10: 0000000000000000 R11: 000000000000027f R12: 000000001cccb025
[2010-05-05 10:46:19]R13: 000000338c400000 R14: ffff81071975e000 R15: ffff810809c8a440
[2010-05-05 10:46:19]FS: 0000000000000000(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
[2010-05-05 10:46:19]CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2010-05-05 10:46:19]CR2: 000000338f403350 CR3: 00000007e6480000 CR4: 00000000000006e0
[2010-05-05 10:46:19]Process imap-login (pid: 19600, threadinfo ffff810737444000, task ffff81077ea1c860)
[2010-05-05 10:46:19]Stack: 000000001cccb020 ffffffff80007c5c 0000000000000000 ffff810737445ed8
[2010-05-05 10:46:19] ffffffffffffffff 0000000000000000 ffff8107f84251e8 ffff810737445ee0
[2010-05-05 10:46:19] 00000000002ede7a 0000000000000000 0000000109c8a440 ffff810809c8a440
[2010-05-05 10:46:19]Call Trace:
[2010-05-05 10:46:19] [<ffffffff80007c5c>] unmap_vmas+0x563/0x904
[2010-05-05 10:46:19] [<ffffffff8003a31e>] exit_mmap+0x87/0x102
[2010-05-05 10:46:20] [<ffffffff8003c498>] mmput+0x30/0x84
[2010-05-05 10:46:20] [<ffffffff800157be>] do_exit+0x2b1/0x911
[2010-05-05 10:46:20] [<ffffffff800496a1>] cpuset_exit+0x0/0x88
[2010-05-05 10:46:20] [<ffffffff8005e28d>] tracesys+0xd5/0xe0
[2010-05-05 10:46:20]
[2010-05-05 10:46:20]
[2010-05-05 10:46:20]Code: 0f 0b 68 dc 24 2b 80 c2 4e 02 48 8b 53 18 f6 c2 01 74 02 8b
[2010-05-05 10:46:20]RIP [<ffffffff8000afd7>] page_remove_rmap+0x79/0xe7
[2010-05-05 10:46:20] RSP <ffff810737445de8>

And here are our interfaces:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet xxx.xxx.xx.xx/32 scope global lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:d0:3e:34 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xx.xx/24 brd 169.230.27.255 scope global eth0
    inet xxx.xxx.xx.xx/24 scope global secondary eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:d0:3e:36 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xx.xx/25 brd 169.230.126.127 scope global eth1
    inet xxx.xxx.xx.xx/25 scope global secondary eth1
4: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
We're experiencing a very similar problem: http://i.imgur.com/n4VUs.png The call trace is slightly different but the majority of it is the same.
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
This is looking more like corruption in a pte when a process exits. Both BUGs occur during exit, when zap_pte_range() goes through the pte page and calls vm_normal_page() to extract the pfn and convert it to a page structure.

In the first case the pfn was that of a valid page, but it BUG'd in page_remove_rmap() because the page_mapcount() was negative (probably because the pfn was clobbered).

In the second case vm_normal_page() BUG'd because the pfn wasn't that of a valid page of RAM.

What's so special about this hardware/configuration that you see this problem but nobody else sees it?

Larry Woodman
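As a rough illustration of how one clobbered pte can lead to either of the two cases, here is a simplified userspace model. It assumes 4 KiB pages and a pte whose physical frame number sits in bits 12..51 (an assumption about the x86_64 layout, not taken from this BZ). The `toy_*` names are hypothetical, and the model is told the "expected" pfn only for illustration; the real kernel has no such reference value, which is why the first case only shows up indirectly as a bad mapcount.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12			/* assumed 4 KiB pages */
#define TOY_PFN_MASK	0x000ffffffffff000ULL	/* assumed frame bits 12..51 */

/* Hypothetical model of extracting the page frame number from a pte. */
static uint64_t toy_pte_pfn(uint64_t pte)
{
	return (pte & TOY_PFN_MASK) >> TOY_PAGE_SHIFT;
}

enum toy_outcome {
	TOY_OK,			/* pfn is the page that was really mapped */
	TOY_WRONG_PAGE,		/* pfn is valid RAM, but some other page */
	TOY_INVALID_PFN		/* pfn is not a valid page of RAM */
};

/* Classify a pte against the pfn it should hold and the highest valid pfn. */
static enum toy_outcome toy_classify(uint64_t pte, uint64_t expected_pfn,
				     uint64_t max_pfn)
{
	uint64_t pfn = toy_pte_pfn(pte);

	assert(expected_pfn < max_pfn);

	if (pfn >= max_pfn)
		return TOY_INVALID_PFN;	/* like the second case above */
	if (pfn != expected_pfn)
		return TOY_WRONG_PAGE;	/* like the first case above */
	return TOY_OK;
}
```

With `max_pfn = 0x100000` (4 GiB of RAM) and a pte of `(0x12345ULL << 12) | 0x25`, flipping a low frame bit such as bit 12 yields another in-range pfn (`TOY_WRONG_PAGE`), so the unmap would be charged to an unrelated page. Flipping a high bit such as bit 40 pushes the pfn past `max_pfn` (`TOY_INVALID_PFN`).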
Hi Larry, we see this issue on a wide variety of setups, but I haven't found a way to reproduce it. However we haven't seen it for about 2 or 3 months now.
(In reply to comment #27)
> What's so special about this hardware/configuration that you see this problem
> but nobody else sees it???

Frankly, I hope that you will be able to tell ... AFAICS, there are 4 different reporting parties linked to this BZ, so this phenomenon is not unique to our environment.

For clarification: Our problem (Fujitsu support case 00321582) has been added to this BZ because of the similarity in BUG messages. We have only seen this problem while doing kdump over network (nfs, not 100% sure about ssh), usually around the time when the dump is almost finished. It has been seen on several different systems, but so far only in a certain environment (our QA lab in Augsburg, Germany). Attempts to reproduce it at Red Hat Farnborough and in our development lab in Paderborn have failed. We have been unable to identify why this is so.
FYI - the Fujitsu support case 00321582 has been closed as restriction. We acknowledge that we'll probably never see a solution for this problem.
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.