Bug 727700
Summary: | Anomaly in mbind memory map causing Java Hotspot JVM Seg fault with NUMA aware ParallelScavange GC | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Chris Phillips @ TO <chphilli> | ||||||
Component: | kernel | Assignee: | KOSAKI Motohiro <mkosaki> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Caspar Zhang <czhang> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 6.3 | CC: | bugproxy, czhang, dhoward, jjarvis, jkachuck, jraju, jweiner, jwest, liaobin7360049, lokesh.gidra, lwang, qcai, sbest | ||||||
Target Milestone: | rc | Keywords: | Regression, ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | kernel-2.6.32-229.el6 | Doc Type: | Bug Fix | ||||||
Doc Text: |
An anomaly in the memory map created by the mbind() function caused a segmentation fault in Hotspot Java Virtual Machines with the NUMA-aware Parallel Scavenge garbage collector. A backported upstream patch that fixes mbind() has been provided and the crashes no longer occur in the described scenario.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2012-06-20 07:46:57 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 711169, 767187, 802379, 804141 | ||||||||
Attachments: |
|
Hi, I tried to find the culprit. It seems to be the call to vma_merge in mbind_range, which was added to bound the number of VMAs. When I disabled the call to vma_merge, the problem no longer reproduced with the attached test program. Also, it seems that the initial implementation of do_mbind didn't have the merge feature, and it worked fine back then. 2.6.27 doesn't have vma_merge and works fine.

Lokesh

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

*** Bug 752867 has been marked as a duplicate of this bug. ***

Created attachment 550879 [details]
strace output of mmap, munmap, and mbind calls with OpenCL stderr prints (prefixed by "DAB") of allocation and error information.

------- Comment From tpnoonan.com 2012-01-06 09:34 EDT-------
can we now request for 6.2.z?

Hello IBM,
Please provide a client impact statement for the Z request.
Thank You
Joe Kachuck

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

------- Comment From tpnoonan.com 2012-01-29 20:45 EDT-------
hi red hat, justification for z-stream:
This seems like a pretty serious and fundamental problem given the pervasiveness of multi-core processors and multi-threaded applications. We won't be able to announce support on RHEL 6.x for our SW* (which is planned to be released in Jan 2012) if a patch is not publicly available.

*It is OpenCL for Power. OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software. For more information on OpenCL, see
http://en.wikipedia.org/wiki/OpenCL
http://www.khronos.org/

------- Comment From tpnoonan.com 2012-01-30 11:22 EDT-------
regression, works in 6.0 fails in 6.1/6.2

Patch(es) available on kernel-2.6.32-229.el6

Hello,
This bug has been copied as 6.2 z-stream (EUS) bug #802379
Thank You
Joe Kachuck

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
An anomaly in the memory map created by the mbind() function caused a segmentation fault in Hotspot Java Virtual Machines with the NUMA-aware Parallel Scavenge garbage collector. A backported upstream patch that fixes mbind() has been provided and the crashes no longer occur in the described scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-0862.html

Removing external tracker bug with the id 'https://access.redhat.com/site/solutions/352973' as it is not valid for this tracker |
Created attachment 516402 [details]
Small C test that demonstrates the issue.

Description of problem:
An anomaly in the memory map created by mbind causes a segmentation fault in Hotspot JVMs with the NUMA-aware ParallelScavenge GC, as demonstrated by the attached program. The program mmaps a 256MB anonymous region, then calls mbind on this region at random locations, with random nodes and random sizes, until the problem is hit. When the problem is hit, it prints the memory map of the region. The problem reproduces almost instantly.

Version-Release number of selected component (if applicable):
Linux kernels starting from 2.6.32

How reproducible:
Easily reproduces with the provided test on x86_64 NUMA hardware.

Steps to Reproduce:
1. gcc -g test_numa_mbind.c -lnuma -o test_numa_mbind
2. ./test_numa_mbind

Actual results:
Start addr= 758200000 end addr= 768200000 pid= 14044 num_nodes: 2
addr: 758200000 len: 144879616 node:0
addr: 760c2b000 len: 54956032 node:1
addr: 764094000 len: 10043392 node:1
addr: 764a28000 len: 12087296 node:0
addr: 7655af000 len: 8683520 node:0
addr: 765df7000 len: 33890304 node:1
addr: 767e49000 len: 1572864 node:1
addr: 767fc9000 len: 1040384 node:0
addr: 7680c7000 len: 602112 node:0
addr: 76815a000 len: 73728 node:1
addr: 76816c000 len: 450560 node:1
addr: 7681da000 len: 8192 node:1
addr: 7681dc000 len: 81920 node:1
addr: 7681f0000 len: 49152 node:0
addr: 7681fc000 len: 4096 node:1
addr: 7681fd000 len: 4096 node:0
addr: 7681fe000 len: 4096 node:1
addr: 758200000 len: 182714368 node:0
addr: 763040000 len: 73404416 node:0
addr: 767641000 len: 8613888 node:1
addr: 767e78000 len: 835584 node:0
addr: 767f44000 len: 872448 node:1
addr: 768019000 len: 712704 node:1
addr: 7680c7000 len: 1122304 node:1
addr: 7681d9000 len: 61440 node:1
addr: 7681e8000 len: 40960 node:1
addr: 7681f2000 len: 40960 node:1
addr: 7681fc000 len: 4096 node:0
Hit the bug!!
758200000 - 760c2b000
760c2b000 - 765df7000
765df7000 - 767641000
767641000 - 767e78000
767e78000 - 767f44000
767f44000 - 767fc9000
767fc9000 - 7681d9000
7681d9000 - 7681f0000
This is where the problem is
7681fc000 - 7681fe000
7681fe000 - 7681ff000
7681ff000 - 768200000

Expected results:
Loop forever...
Start addr= 758200000 end addr= 768200000 pid= 21510 num_nodes: 1
addr: 758200000 len: 220610560 node:0
addr: 765464000 len: 5795840 node:0
addr: 7659eb000 len: 7446528 node:0
addr: 766105000 len: 5173248 node:0
addr: 7665f4000 len: 7860224 node:0
addr: 766d73000 len: 17272832 node:0
addr: 767dec000 len: 2998272 node:0
addr: 7680c8000 len: 942080 node:0
addr: 7681ae000 len: 278528 node:0
addr: 7681f2000 len: 4096 node:0
addr: 7681f3000 len: 8192 node:0
addr: 7681f5000 len: 32768 node:0
addr: 7681fd000 len: 8192 node:0
addr: 758200000 len: 233336832 node:0
addr: 766087000 len: 8790016 node:0
addr: 7668e9000 len: 8929280 node:0
addr: 76716d000 len: 14655488 node:0
addr: 767f67000 len: 618496 node:0
addr: 767ffe000 len: 667648 node:0
addr: 7680a1000 len: 503808 node:0
addr: 76811c000 len: 548864 node:0
addr: 7681a2000 len: 204800 node:0
addr: 7681d4000 len: 176128 node:0
...

Additional info:
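
The anomaly in the memory map printed after "Hit the bug!!" can also be located mechanically. The following is a minimal sketch, not the attached test program: it takes the eleven ranges exactly as reported under "Actual results" and looks for the first range that does not begin where the previous one ended. The names struct range, find_first_gap, reported_map and reported_map_len are hypothetical and chosen only for this illustration. The sketch assumes the ranges are half-open [start, end) intervals listed in ascending order, as the output suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One [start, end) address range, as printed in the reported memory map. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Return the index of the first range that does not begin exactly where
 * the previous range ended, or -1 if the first n ranges are contiguous.
 */
int find_first_gap(const struct range *r, size_t n)
{
    assert(r != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (r[i].start != r[i - 1].end)
            return (int)i;
    }
    return -1;
}

/* Ranges copied from the "Actual results" output (hypothetical array name). */
const struct range reported_map[] = {
    {0x758200000ULL, 0x760c2b000ULL},
    {0x760c2b000ULL, 0x765df7000ULL},
    {0x765df7000ULL, 0x767641000ULL},
    {0x767641000ULL, 0x767e78000ULL},
    {0x767e78000ULL, 0x767f44000ULL},
    {0x767f44000ULL, 0x767fc9000ULL},
    {0x767fc9000ULL, 0x7681d9000ULL},
    {0x7681d9000ULL, 0x7681f0000ULL},
    {0x7681fc000ULL, 0x7681fe000ULL}, /* follows "This is where the problem is" */
    {0x7681fe000ULL, 0x7681ff000ULL},
    {0x7681ff000ULL, 0x768200000ULL},
};

const size_t reported_map_len = sizeof reported_map / sizeof reported_map[0];
```

Calling find_first_gap(reported_map, reported_map_len) returns index 8, the entry starting at 7681fc000. The uncovered span runs from 7681f0000 to 7681fc000, which is 0xc000 (49152) bytes. That matches the address and length of the "addr: 7681f0000 len: 49152 node:0" call in the first pass of the output. Whether that particular call is related to the hole is not established by this report.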