Bug 1099985 - [abrt] BUG: Bad page map in process java-abrt pte:00000320 pmd:3a241067
Summary: [abrt] BUG: Bad page map in process java-abrt pte:00000320 pmd:3a241067
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL: https://retrace.fedoraproject.org/faf...
Whiteboard: abrt_hash:7337c85337d0fef252982e292fe...
Depends On:
Blocks:
Reported: 2014-05-21 17:50 UTC by Alan Hamilton
Modified: 2014-06-06 02:11 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1101274 (view as bug list)
Environment:
Last Closed: 2014-06-06 02:11:22 UTC
Type: ---
Embargoed:


Attachments
File: dmesg (23.22 KB, text/plain)
2014-05-21 17:50 UTC, Alan Hamilton
Testcase (539 bytes, text/x-c)
2014-05-23 15:44 UTC, Alan Hamilton
NUMA patch (1.90 KB, patch)
2014-05-27 03:33 UTC, Alan Hamilton

Description Alan Hamilton 2014-05-21 17:50:32 UTC
Description of problem:


Additional info:
reporter:       libreport-2.2.2
BUG: Bad page map in process java-abrt  pte:00000320 pmd:3a241067
addr:00007ff62b957000 vm_flags:08000070 anon_vma:          (null) mapping:          (null) index:7ff62b957
CPU: 0 PID: 17174 Comm: java-abrt Not tainted 3.14.4-200.fc20.x86_64 #1
0000000000000000 00000000067cd5e0 ffff880040fbda90 ffffffff816ef1d2
00007ff62b957000 ffff880040fbdad8 ffffffff8119dd6a 0000000000000320
00000007ff62b957 ffff88003a241ab8 0000000000000320 00007ff62b957000
Call Trace:
[<ffffffff816ef1d2>] dump_stack+0x45/0x56
[<ffffffff8119dd6a>] print_bad_pte+0x1aa/0x250
[<ffffffff8119f089>] vm_normal_page+0x69/0x80
[<ffffffff8119f50b>] unmap_single_vma+0x46b/0x950
[<ffffffff811a0ab9>] unmap_vmas+0x49/0x90
[<ffffffff811aa0bc>] exit_mmap+0xac/0x1a0
[<ffffffff816f5ee3>] ? _raw_spin_unlock_irqrestore+0x13/0x20
[<ffffffff81087323>] mmput+0x63/0x100
[<ffffffff8108c708>] do_exit+0x278/0xa30
[<ffffffff810bc945>] ? check_preempt_curr+0x85/0xa0
[<ffffffff810bc979>] ? ttwu_do_wakeup+0x19/0xc0
[<ffffffff8108cf3f>] do_group_exit+0x3f/0xa0
[<ffffffff8109ceab>] get_signal_to_deliver+0x1cb/0x5d0
[<ffffffff81014477>] do_signal+0x57/0x640
[<ffffffff81004f42>] ? xen_mc_flush+0x182/0x1b0
[<ffffffff81003d5b>] ? xen_write_msr_safe+0x7b/0xc0
[<ffffffff81013619>] ? __switch_to+0x169/0x4c0
[<ffffffff810b8fbd>] ? finish_task_switch+0x4d/0x100
[<ffffffff81014ad0>] do_notify_resume+0x70/0xa0
[<ffffffff816ff822>] int_signal+0x12/0x17
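
For anyone reading the `pte:00000320` value in the BUG line above, here is a minimal sketch of a flag decoder. It uses the architectural x86-64 names for bits 0-8 only; the `SW1`..`SW3` labels for the software-available bits and the function names are made up for illustration, and the kernel gives some of these bits other meanings in non-present entries, so treat the output as a reading aid rather than a diagnosis.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Architectural x86-64 names for PTE bits 0-8. Bits 9-11 are available to
 * software; SW1..SW3 are placeholder labels, not kernel identifiers. */
static const char *const pte_bit_names[12] = {
    "PRESENT", "RW", "USER", "PWT", "PCD", "ACCESSED",
    "DIRTY", "PAT", "GLOBAL", "SW1", "SW2", "SW3"
};

/* Fill `out` with the names of the low 12 bits that are set, separated by
 * spaces, and return the frame number held in bits 12-51. */
static uint64_t decode_pte(uint64_t pte, char *out, size_t outlen)
{
    out[0] = '\0';
    for (int bit = 0; bit < 12; bit++) {
        if (!(pte & (1ULL << bit)))
            continue;
        if (out[0] != '\0')
            strncat(out, " ", outlen - strlen(out) - 1);
        strncat(out, pte_bit_names[bit], outlen - strlen(out) - 1);
    }
    return (pte & 0x000ffffffffff000ULL) >> 12;
}

/* Self-check against the value from the BUG line (pte:00000320). */
static void decode_pte_selfcheck(void)
{
    char flags[128];
    uint64_t frame = decode_pte(0x320, flags, sizeof flags);

    assert(frame == 0);                       /* no frame number at all */
    assert(strstr(flags, "PRESENT") == NULL); /* present bit is clear */
    assert(strcmp(flags, "ACCESSED GLOBAL SW1") == 0);
}
```

Decoding 0x320 this way gives bits 5, 8 and 9 set, the present bit clear, and a frame number of zero.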

Comment 1 Alan Hamilton 2014-05-21 17:50:38 UTC
Created attachment 898080 [details]
File: dmesg

Comment 2 Alan Hamilton 2014-05-23 15:42:18 UTC
This appears to be related to the issue discussed in this thread: https://lkml.org/lkml/2014/1/21/544

The box getting this error is indeed running in a virtualized Xen environment. When I ran the testcase from that thread, it also crashed. The testcase runs without error on a physical instance.

I wasn't able to find an official upstream bug report, though.

Comment 3 Alan Hamilton 2014-05-23 15:44:53 UTC
Created attachment 898715 [details]
Testcase

Runs without error on a physical instance, but crashes in a Xen environment.

Comment 4 Andrew Jones 2014-05-23 16:14:22 UTC
Hi Alan,

(In reply to Alan Hamilton from comment #2)
> This appears to be related to the issue discussed in this thread:
> https://lkml.org/lkml/2014/1/21/544
> 

This issue now has a patch:

https://lkml.org/lkml/2014/2/4/148

Have you tried running a kernel with that patch to see if the BUG (comment 0) on your machine goes away?

Thanks,
drew

Comment 5 Vitaly Kuznetsov 2014-05-26 14:01:20 UTC
Testcase crashes Dom0 kernel (3.14.4-200.fc20.x86_64) as well:
[15432.582805] ------------[ cut here ]------------
[15432.584110] kernel BUG at include/linux/mm.h:307!
[15432.585427] invalid opcode: 0000 [#1] SMP 
[15432.586616] Modules linked in: rfcomm fuse ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnep iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core x86_pkg_temp_thermal coretemp btusb microcode videodev media iwldvm bluetooth mac80211 joydev serio_raw 6lowpan_iphc iwlwifi snd_hda_intel snd_hda_codec sdhci_pci sdhci cfg80211 lpc_ich snd_hwdep mmc_core snd_seq i2c_i801 snd_seq_device mfd_core
[15432.604024]  snd_pcm tpm_tis wmi thinkpad_acpi tpm rfkill snd_timer e1000e mei_me snd ptp shpchp soundcore mei pps_core xen_acpi_processor xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss nfs_acl lockd sunrpc dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel i915 firewire_ohci firewire_core crc_itu_t i2c_algo_bit drm_kms_helper video drm i2c_core
[15432.607336] CPU: 0 PID: 13348 Comm: b1099985 Tainted: G        W    3.14.4-200.fc20.x86_64 #1
[15432.608149] Hardware name: LENOVO 4243BQ9/4243BQ9, BIOS 8AET46WW (1.26 ) 05/18/2011
[15432.608873] task: ffff880092061300 ti: ffff88007983e000 task.ti: ffff88007983e000
[15432.609488] RIP: e030:[<ffffffff816ec5c5>]  [<ffffffff816ec5c5>] put_page_testzero.part.16+0xb/0xd
[15432.610232] RSP: e02b:ffff88007983fd20  EFLAGS: 00010246
[15432.610663] RAX: 0000000000000000 RBX: ffff88007983fe60 RCX: 0000000000000006
[15432.611238] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[15432.611801] RBP: ffff88007983fd20 R08: ffffffff81ee377c R09: 0000000000000451
[15432.612351] R10: 0000000000000450 R11: 0000000000000003 R12: 0000000000000000
[15432.612914] R13: ffff88007983fe68 R14: 0000000000000001 R15: ffffea0000000000
[15432.613494] FS:  00007f0ddcfd1740(0000) GS:ffff88010d400000(0000) knlGS:0000000000000000
[15432.614106] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[15432.614580] CR2: 00007f0ddcfe9000 CR3: 000000009c8bb000 CR4: 0000000000042660
[15432.615126] Stack:
[15432.615278]  ffff88007983fd90 ffffffff81182c6c 0000000081004f42 0000000000000000
[15432.615917]  00007f0ddcfe9000 00007f0ddcfe9000 ffff88007983fd50 ffff88007983fd50
[15432.616569]  00000000df3f5ecb 0000000000000001 ffffea0000000000 0000000000000001
[15432.617277] Call Trace:
[15432.617477]  [<ffffffff81182c6c>] release_pages+0x23c/0x260
[15432.617936]  [<ffffffff811b4e65>] free_pages_and_swap_cache+0x95/0xb0
[15432.618483]  [<ffffffff8119db4c>] tlb_flush_mmu.part.57+0x4c/0x90
[15432.618987]  [<ffffffff8119e715>] tlb_finish_mmu+0x55/0x60
[15432.619427]  [<ffffffff811a6b02>] unmap_region+0xe2/0x130
[15432.619860]  [<ffffffff811a7101>] ? vma_rb_erase+0x121/0x220
[15432.620313]  [<ffffffff811a8f26>] do_munmap+0x226/0x3b0
[15432.620727]  [<ffffffff811a90f1>] vm_munmap+0x41/0x60
[15432.621126]  [<ffffffff811aa002>] SyS_munmap+0x22/0x30
[15432.621535]  [<ffffffff816ff569>] system_call_fastpath+0x16/0x1b
[15432.622007] Code: 48 85 c9 48 0f 48 c8 48 85 d2 48 0f 49 c2 48 01 c8 49 89 06 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 31 f6 48 89 e5 e8 fb f1 a8 ff <0f> 0b 55 31 f6 48 89 e5 e8 ee f1 a8 ff 0f 0b 55 31 f6 48 89 e5 
[15432.623912] RIP  [<ffffffff816ec5c5>] put_page_testzero.part.16+0xb/0xd
[15432.624437]  RSP <ffff88007983fd20>
[15432.839647] ---[ end trace f9cc0ac90b72f5c1 ]---

Comment 6 Vitaly Kuznetsov 2014-05-26 16:26:12 UTC
Upstream v3.15-rc2 has a fix for this issue:

commit 29c7787075c92ca8af353acd5301481e6f37082f
Author: Mel Gorman <mgorman>
Date:   Fri Apr 18 15:07:21 2014 -0700

    mm: use paravirt friendly ops for NUMA hinting ptes
    
    David Vrabel identified a regression when using automatic NUMA balancing
    under Xen whereby page table entries were getting corrupted due to the
    use of native PTE operations.  Quoting him
    
    	Xen PV guest page tables require that their entries use machine
    	addresses if the preset bit (_PAGE_PRESENT) is set, and (for
    	successful migration) non-present PTEs must use pseudo-physical
    	addresses.  This is because on migration MFNs in present PTEs are
    	translated to PFNs (canonicalised) so they may be translated back
    	to the new MFN in the destination domain (uncanonicalised).
    
    	pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
    	set and clear the _PAGE_PRESENT bit using pte_set_flags(),
    	pte_clear_flags(), etc.
    
    	In a Xen PV guest, these functions must translate MFNs to PFNs
    	when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
    	_PAGE_PRESENT.
    
    His suggested fix converted p[te|md]_[set|clear]_flags to using
    paravirt-friendly ops but this is overkill.  He suggested an alternative
    of using p[te|md]_modify in the NUMA page table operations but this is
    does more work than necessary and would require looking up a VMA for
    protections.
    
    This patch modifies the NUMA page table operations to use paravirt
    friendly operations to set/clear the flags of interest.  Unfortunately
    this will take a performance hit when updating the PTEs on
    CONFIG_PARAVIRT but I do not see a way around it that does not break
    Xen.
    
    Signed-off-by: Mel Gorman <mgorman>
    Acked-by: David Vrabel <david.vrabel>
    Tested-by: David Vrabel <david.vrabel>
    Cc: Ingo Molnar <mingo>
    Cc: Peter Anvin <hpa>
    Cc: Fengguang Wu <fengguang.wu>
    Cc: Linus Torvalds <torvalds>
    Cc: Steven Noonan <steven>
    Cc: Rik van Riel <riel>
    Cc: Peter Zijlstra <peterz>
    Cc: Andrea Arcangeli <aarcange>
    Cc: Dave Hansen <dave.hansen>
    Cc: Srikar Dronamraju <srikar.ibm.com>
    Cc: Cyrill Gorcunov <gorcunov>
    Cc: <stable.org>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1ec08c1..a8015a7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -693,24 +693,35 @@ static inline int pmd_numa(pmd_t pmd)
 #ifndef pte_mknonnuma
 static inline pte_t pte_mknonnuma(pte_t pte)
 {
-	pte = pte_clear_flags(pte, _PAGE_NUMA);
-	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+	pteval_t val = pte_val(pte);
+
+	val &= ~_PAGE_NUMA;
+	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+	return __pte(val);
 }
 #endif
 
 #ifndef pmd_mknonnuma
 static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 {
-	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
-	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+	pmdval_t val = pmd_val(pmd);
+
+	val &= ~_PAGE_NUMA;
+	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+
+	return __pmd(val);
 }
 #endif
 
 #ifndef pte_mknuma
 static inline pte_t pte_mknuma(pte_t pte)
 {
-	pte = pte_set_flags(pte, _PAGE_NUMA);
-	return pte_clear_flags(pte, _PAGE_PRESENT);
+	pteval_t val = pte_val(pte);
+
+	val &= ~_PAGE_PRESENT;
+	val |= _PAGE_NUMA;
+
+	return __pte(val);
 }
 #endif
 
@@ -729,8 +740,12 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 #ifndef pmd_mknuma
 static inline pmd_t pmd_mknuma(pmd_t pmd)
 {
-	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+	pmdval_t val = pmd_val(pmd);
+
+	val &= ~_PAGE_PRESENT;
+	val |= _PAGE_NUMA;
+
+	return __pmd(val);
 }
 #endif
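
The effect described in the commit message can be modeled in user space. The following is only a sketch under stated assumptions: the flag values, the `toy_` names, and the PFN/MFN mapping are all invented for illustration and are not taken from kernel or Xen headers. It contrasts flipping flag bits directly in the stored entry (the shape of the removed `pte_set_flags()`/`pte_clear_flags()` calls) with reading and writing through translating accessors (the shape of the added `pte_val()`/`__pte()` calls).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values and layout; not copied from kernel headers. */
#define TOY_PRESENT   0x001ULL
#define TOY_ACCESSED  0x020ULL
#define TOY_NUMA      0x100ULL
#define TOY_FLAG_MASK 0xfffULL
#define TOY_SHIFT     12

typedef struct { uint64_t raw; } toy_pte_t; /* raw = bits stored in the table */

/* Hypothetical PFN <-> MFN mapping standing in for the guest's real one. */
static uint64_t toy_pfn_to_mfn(uint64_t pfn) { return pfn + 0x1000; }
static uint64_t toy_mfn_to_pfn(uint64_t mfn) { return mfn - 0x1000; }

/* Models pte_val() under paravirt: present entries store an MFN, so hand
 * the caller a PFN-based value. */
static uint64_t toy_pte_val(toy_pte_t pte)
{
    uint64_t flags = pte.raw & TOY_FLAG_MASK;
    uint64_t frame = pte.raw >> TOY_SHIFT;

    if (flags & TOY_PRESENT)
        frame = toy_mfn_to_pfn(frame);
    return (frame << TOY_SHIFT) | flags;
}

/* Models __pte() under paravirt: store an MFN only if the entry is present. */
static toy_pte_t toy_make_pte(uint64_t val)
{
    uint64_t flags = val & TOY_FLAG_MASK;
    uint64_t frame = val >> TOY_SHIFT;
    toy_pte_t pte;

    if (flags & TOY_PRESENT)
        frame = toy_pfn_to_mfn(frame);
    pte.raw = (frame << TOY_SHIFT) | flags;
    return pte;
}

/* Pre-patch shape: flip flag bits in the stored entry, no translation. */
static toy_pte_t toy_mknuma_native(toy_pte_t pte)
{
    pte.raw = (pte.raw | TOY_NUMA) & ~TOY_PRESENT;
    return pte;
}

/* Post-patch shape: read via toy_pte_val(), write via toy_make_pte(). */
static toy_pte_t toy_mknuma_pv(toy_pte_t pte)
{
    uint64_t val = toy_pte_val(pte);

    val &= ~TOY_PRESENT;
    val |= TOY_NUMA;
    return toy_make_pte(val);
}

static toy_pte_t toy_mknonnuma_pv(toy_pte_t pte)
{
    uint64_t val = toy_pte_val(pte);

    val &= ~TOY_NUMA;
    val |= TOY_PRESENT | TOY_ACCESSED;
    return toy_make_pte(val);
}

static void toy_selfcheck(void)
{
    toy_pte_t pte = toy_make_pte((0x42ULL << TOY_SHIFT) | TOY_PRESENT);

    /* Native ops leave the MFN (0x1042) in a now non-present entry. */
    assert((toy_mknuma_native(pte).raw >> TOY_SHIFT) == 0x1042);
    /* Translating ops store the PFN (0x42) once the entry is non-present. */
    assert((toy_mknuma_pv(pte).raw >> TOY_SHIFT) == 0x42);
    /* The round trip restores the original entry plus the accessed bit. */
    assert(toy_mknonnuma_pv(toy_mknuma_pv(pte)).raw == (pte.raw | TOY_ACCESSED));
}
```

In this toy model the native variant leaves a machine frame number in an entry whose present bit is clear, which is the state the quoted explanation says a Xen PV guest must avoid. The translating variant keeps the frame number consistent with the present bit in both directions.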

Comment 7 Alan Hamilton 2014-05-27 03:32:03 UTC
I did find the same patch. Apparently there have been other commits to that file since the current Fedora release of kernel-3.14.4-200, so I used the full file as of that patch. That does seem to fix the issue. The testcase no longer crashes in Xen. I've attached the patch I used.

Comment 8 Alan Hamilton 2014-05-27 03:33:09 UTC
Created attachment 899375 [details]
NUMA patch

Comment 9 Josh Boyer 2014-05-27 12:26:01 UTC
(In reply to Vitaly Kuznetsov from comment #6)
> upstream v3.15-rc2 has fix for this issue:
> 
> commit 29c7787075c92ca8af353acd5301481e6f37082f
> Author: Mel Gorman <mgorman>
> Date:   Fri Apr 18 15:07:21 2014 -0700
> 
>     mm: use paravirt friendly ops for NUMA hinting ptes

That's currently queued for 3.14.5 stable.  We'll pick it up with that release.

Comment 10 Alan Hamilton 2014-06-06 02:11:22 UTC
3.14.5 does resolve the issue. I'm not seeing the bad page map crashes, and the testcase no longer crashes either.

