Bug 646384

Summary: kernel BUG at mm/migrate.c:113!
Product: Red Hat Enterprise Linux 6
Reporter: Qian Cai <qcai>
Component: kernel
Assignee: Andrea Arcangeli <aarcange>
Status: CLOSED ERRATA
QA Contact: Caspar Zhang <czhang>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.0
CC: aarcange, czhang, dhoward, fhrbata, plyons, qcai, syeghiay, tburke
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-81.el6
Doc Type: Bug Fix
Doc Text:
Running certain workload tests on a Non-Uniform Memory Architecture (NUMA) system could cause a kernel panic at mm/migrate.c:113. This was due to a false positive BUG_ON. With this update, the false positive BUG_ON has been removed.
Last Closed: 2011-05-19 12:01:44 UTC
Bug Depends On: 580951    
Bug Blocks: 647391    

Description Qian Cai 2010-10-25 09:42:48 UTC
Description of problem:
kernel BUG at mm/migrate.c:113!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu63/cache/index2/shared_cpu_map
CPU 0 
Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_filter ip_tables bridge stp llc kvm_intel kvm autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]

Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_filter ip_tables bridge stp llc kvm_intel kvm autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
Pid: 28103, comm: largepages15 Tainted: G        W  ----------------  2.6.32-76.el6.test.x86_64 #1 QSSC-S4R
RIP: 0010:[<ffffffff8115b4ea>]  [<ffffffff8115b4ea>] remove_migration_pte+0x20a/0x2f0
RSP: 0000:ffff88105d7c99a8  EFLAGS: 00010246
RAX: 8000000937e000e5 RBX: ffff880c6cc2cdc0 RCX: ffffea000732f1e0
RDX: ffff880bf61d9000 RSI: ffff8809021d2d40 RDI: 0000000000000000
RBP: ffff88105d7c9a08 R08: 00003ffffffff000 R09: ffff880000000000
R10: ffffc00000000fff R11: ffff880894b0edf0 R12: 00007ffff7ce4000
R13: ffff8809021d2d40 R14: ffffea000f717138 R15: ffffffff8115b2e0
FS:  00007ffff7ff1700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffff7df1000 CR3: 000000086bc54000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process largepages15 (pid: 28103, threadinfo ffff88105d7c8000, task ffff88105b2c2080)
Stack:
 0000000000000000 00003ffffffff000 ffff88046c6891b8 ffffc00000000fff
<0> ffffea000000001e ffffea000f5075c0 800000020e8e4045 ffff880c6c8c52b8
<0> ffffea000f717138 ffffea000732f1e0 ffff880c6c8cd4d8 ffffffff8115b2e0
Call Trace:
 [<ffffffff8115b2e0>] ? remove_migration_pte+0x0/0x2f0
 [<ffffffff8113e8fe>] rmap_walk+0x16e/0x1c0
 [<ffffffff8115b892>] ? migrate_page_copy+0x102/0x1c0
 [<ffffffff8115c08d>] migrate_pages+0x48d/0x5d0
 [<ffffffff81152750>] ? compaction_alloc+0x0/0x370
 [<ffffffff811521ac>] compact_zone+0x4ec/0x630
 [<ffffffff81152591>] compact_zone_order+0xa1/0xe0
 [<ffffffff811526db>] try_to_compact_pages+0x10b/0x180
 [<ffffffff8111e6cc>] __alloc_pages_nodemask+0x55c/0x810
 [<ffffffff811505f4>] alloc_pages_vma+0x84/0x110
 [<ffffffff8113f1c0>] ? anon_vma_prepare+0x30/0x160
 [<ffffffff81167995>] do_huge_pmd_anonymous_page+0x135/0x340
 [<ffffffff811365b5>] handle_mm_fault+0x245/0x2b0
 [<ffffffff814cd8d3>] do_page_fault+0x123/0x3a0
 [<ffffffff814cb345>] page_fault+0x25/0x30
Code: 48 09 c6 48 89 f2 48 c1 ea 3b 83 fa 1e 74 24 83 fa 1f 74 1f 48 8b 45 c8 66 ff 00 66 66 90 e9 06 ff ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 40 00 eb fa 48 b8 ff ff ff ff ff ff ff 07 48 21 c6 
RIP  [<ffffffff8115b4ea>] remove_migration_pte+0x20a/0x2f0
 RSP <ffff88105d7c99a8>
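
A note on reading the trace above: the "invalid opcode" line, together with the marked bytes <0f> 0b in the Code: line, is what a triggered BUG()/BUG_ON() looks like on x86, where the macro emits a ud2 instruction (bytes 0f 0b). The following is a minimal sketch in C of that check. The function name oops_code_faulted_on_ud2 is made up for illustration and is not part of any kernel tool.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the bug report): returns true when the byte
 * marked <xx> in an oops "Code:" line, together with the byte after it,
 * encodes ud2 (0f 0b). Returns false when there is no marker or the bytes
 * differ. Pass only the "Code:" line; other oops lines may contain '<'.
 */
bool oops_code_faulted_on_ud2(const char *code_line)
{
    const char *mark = strchr(code_line, '<');
    if (mark == NULL)
        return false;

    /* Parse the marked byte, which must be closed by '>'. */
    char *end = NULL;
    unsigned long first = strtoul(mark + 1, &end, 16);
    if (end == mark + 1 || *end != '>')
        return false;

    /* Parse the byte that follows the marker (strtoul skips the space). */
    const char *next = end + 1;
    unsigned long second = strtoul(next, &end, 16);
    if (end == next)
        return false;

    return first == 0x0f && second == 0x0b;
}
```

Passing the Code: line from this report to the helper returns true, because the marked byte is 0f and the next byte is 0b.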

Version-Release number of selected component (if applicable):
kernel from RHBZ#622327#c81.

How reproducible:
unknown

Steps to Reproduce:
1. Prepare a NUMA system (reproduced on a Nehalem-EX system).
2. Run the threaded_memtest + oom + kernelbuild + kvm workloads.
3. Take the reproducer from RHBZ#642570, modify largepages15.c to use KSM, and run 60 instances:
# for i in `seq 1 60`; do ./largepages15 & done
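
The modified largepages15.c is not attached to this report, so the exact change is unknown. On Linux, a process opts a memory range into KSM with madvise(MADV_MERGEABLE), and a change along the following lines would be one way to do it. This is a sketch under that assumption only; alloc_mergeable and its ksm_ok parameter are hypothetical names, not taken from the reproducer.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not the actual reproducer code): map `len` bytes of
 * anonymous memory and ask KSM to consider the range for merging.
 *
 * Returns the mapping, or NULL if mmap() fails. *ksm_ok is set to 1 when
 * madvise(MADV_MERGEABLE) succeeded and to 0 when it failed, for example on
 * a kernel built without CONFIG_KSM.
 */
void *alloc_mergeable(size_t len, int *ksm_ok)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    *ksm_ok = (madvise(buf, len, MADV_MERGEABLE) == 0);
    return buf;
}
```

The madvise() call only marks the range as a merge candidate. The KSM daemon must also be running (for example, after writing 1 to /sys/kernel/mm/ksm/run) before any pages are scanned and merged.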
  
Actual results:
panic

Expected results:
No panic.

Additional info:
Unfortunately, kdump did not work in this case, so no vmcore was captured.

Comment 3 Andrea Arcangeli 2010-10-25 17:39:23 UTC
Fix posted to rhkernel-list with Message-ID: <20101025173439.GM910>

I removed the false positive BUG_ON and introduced one new VM_BUG_ON in a place related to the s/!pmd_present/pmd_none/ change. In the build that I will provide to QA, the new VM_BUG_ON will be converted to a BUG_ON so that it is exercised.
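
For readers unfamiliar with the two macros: BUG_ON() is always compiled in, while VM_BUG_ON() is a real check only when the kernel is built with CONFIG_DEBUG_VM and otherwise expands to an empty statement. A build without that option would never execute the new check, which is presumably why it is converted for the QA build. The userspace sketch below imitates that behaviour. The macros are simplified stand-ins rather than the kernel definitions, and counted() and count_active_checks() are hypothetical names used only for the demonstration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's BUG_ON(): always evaluated. */
#define BUG_ON(cond) \
    do { \
        if (cond) { \
            fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); \
            abort(); \
        } \
    } while (0)

/*
 * Stand-in for VM_BUG_ON(): a real check only when CONFIG_DEBUG_VM is
 * defined; otherwise the condition is not even evaluated.
 */
#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(cond) do { } while (0)
#endif

/* Records how many conditions were actually evaluated at run time. */
static int checks_evaluated;

static int counted(int cond)
{
    checks_evaluated++;
    return cond;
}

/* Returns the number of checks that were compiled in and executed. */
int count_active_checks(void)
{
    checks_evaluated = 0;
    BUG_ON(counted(0));     /* always compiled in */
    VM_BUG_ON(counted(0));  /* compiled in only with CONFIG_DEBUG_VM */
    return checks_evaluated;
}
```

Compiled as is, count_active_checks() reports one evaluated check. Compiled with -DCONFIG_DEBUG_VM, it reports two.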

The build system I use has a disk-full problem; as soon as it's fixed I'll provide a build with the patch included. Thanks!

Comment 4 Andrea Arcangeli 2010-10-25 18:11:58 UTC
A build with the fix from comment #3 included (with VM_BUG_ON converted to BUG_ON) is here:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2850419

Comment 5 RHEL Program Management 2010-10-26 10:49:26 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 12 Aristeu Rozanski 2010-11-12 19:14:19 UTC
Patch(es) available on kernel-2.6.32-82.el6

Comment 18 Martin Prpič 2011-05-09 12:21:56 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Running certain workload tests on a Non-Uniform Memory Architecture (NUMA) system could cause a kernel panic at mm/migrate.c:113. This was due to a false positive BUG_ON. With this update, the false positive BUG_ON has been removed.

Comment 19 errata-xmlrpc 2011-05-19 12:01:44 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html