Bug 727700 - Anomaly in mbind memory map causing Java Hotspot JVM Seg fault with NUMA aware ParallelScavange GC
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.3
Hardware: x86_64 Linux
Priority: urgent Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: KOSAKI Motohiro
QA Contact: Caspar Zhang
Keywords: Regression, ZStream
Duplicates: 752867
Depends On:
Blocks: 767187 711169 802379 804141
Reported: 2011-08-02 18:35 EDT by Chris Phillips @ TO
Modified: 2014-07-25 00:17 EDT

Fixed In Version: kernel-2.6.32-229.el6
Doc Type: Bug Fix
Doc Text:
An anomaly in the memory map created by the mbind() function caused a segmentation fault in Hotspot Java Virtual Machines with the NUMA-aware Parallel Scavenge garbage collector. A backported upstream patch that fixes mbind() has been provided and the crashes no longer occur in the described scenario.
Last Closed: 2012-06-20 03:46:57 EDT


Attachments
Small C test that demonstrates the issue. (2.50 KB, text/x-csrc)
2011-08-02 18:35 EDT, Chris Phillips @ TO
strace output of mmap, munmap, and mbind calls with OpenCL stderr prints (prefixed by "DAB") of allocation and error information. (96.08 KB, text/plain)
2012-01-05 05:55 EST, IBM Bug Proxy

Description Chris Phillips @ TO 2011-08-02 18:35:06 EDT
Created attachment 516402
Small C test that demonstrates the issue.

Description of problem:

An anomaly in the memory map created by mbind causes a segmentation fault in Hotspot JVMs with the NUMA-aware ParallelScavenge GC, as demonstrated by the attached program. The program mmaps a 256MB anonymous region, then calls mbind on ranges within this region, at random locations with random nodes and random sizes, until the problem is hit. When the problem is hit, it prints the memory map of the region. The problem reproduces almost instantly.
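
The attachment itself is not reproduced in this report, so the following is only a minimal sketch of the procedure described above, not the attached test. It assumes the failure is detected as an unmapped hole in /proc/self/maps (consistent with the gap printed under "Actual results"), that ranges are walked front to back in consecutive random-length chunks (the pattern the log shows), and that MPOL_BIND is the policy used. The names region_has_hole, walk_and_bind and SKETCH_MPOL_BIND are hypothetical. The raw mbind system call is used so that the sketch builds without -lnuma.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define REGION_SIZE (256UL << 20)  /* 256MB, as in the description */
#define SKETCH_MPOL_BIND 2L        /* value of MPOL_BIND; the policy is an assumption */

/* Return 1 if /proc/self/maps shows an unmapped hole inside [start, end),
 * 0 if the range is fully covered, -1 if the maps file cannot be read. */
static int region_has_hole(unsigned long start, unsigned long end)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    unsigned long lo, hi, cursor = start;

    if (!maps)
        return -1;
    /* Entries are sorted by address, so one forward pass is enough. */
    while (cursor < end && fgets(line, sizeof line, maps)) {
        if (sscanf(line, "%lx-%lx", &lo, &hi) != 2 || hi <= cursor)
            continue;          /* unparsable line, or mapping below the cursor */
        if (lo > cursor)
            break;             /* next mapping starts past the cursor: a hole */
        cursor = hi;
    }
    fclose(maps);
    return cursor < end;
}

/* Walk a fresh 256MB anonymous mapping `passes` times, binding consecutive
 * random-length chunks to random nodes in [0, num_nodes).
 * Returns 1 if a hole appeared, 0 if not, -1 if the mmap failed. */
static int walk_and_bind(int passes, int num_nodes)
{
    unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned long start, end;
    int hit = 0;

    if (base == MAP_FAILED)
        return -1;
    start = (unsigned long)base;
    end = start + REGION_SIZE;

    for (int p = 0; p < passes && !hit; p++) {
        unsigned long addr = start;
        while (addr < end && !hit) {
            unsigned long pages_left = (end - addr) / page;
            unsigned long len = (1 + (unsigned long)rand() % pages_left) * page;
            unsigned long nodemask = 1UL << (rand() % num_nodes);

            /* Errors (for example EPERM in a sandbox, or EINVAL for a node
             * that does not exist) are deliberately ignored in this sketch. */
            syscall(SYS_mbind, addr, len, SKETCH_MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, 0UL);
            hit = region_has_hole(start, end) == 1;
            addr += len;
        }
    }
    munmap(base, REGION_SIZE);
    return hit;
}
```

On a two-node machine, a call such as walk_and_bind(1000, 2) would be expected to return 0 on a fixed kernel; a return value of 1 would correspond to the "Hit the bug!!" case in the output below.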

Version-Release number of selected component (if applicable):
Linux kernels starting from 2.6.32

How reproducible:

Easily reproduces with the provided test on x86_64 NUMA hardware.

Steps to Reproduce:
1. gcc -g test_numa_mbind.c -lnuma -o test_numa_mbind
2. ./test_numa_mbind

  
Actual results:

Start addr= 758200000 end addr= 768200000 pid= 14044 num_nodes: 2
addr: 758200000 len: 144879616 node:0
addr: 760c2b000 len: 54956032 node:1
addr: 764094000 len: 10043392 node:1
addr: 764a28000 len: 12087296 node:0
addr: 7655af000 len: 8683520 node:0
addr: 765df7000 len: 33890304 node:1
addr: 767e49000 len: 1572864 node:1
addr: 767fc9000 len: 1040384 node:0
addr: 7680c7000 len: 602112 node:0
addr: 76815a000 len: 73728 node:1
addr: 76816c000 len: 450560 node:1
addr: 7681da000 len: 8192 node:1
addr: 7681dc000 len: 81920 node:1
addr: 7681f0000 len: 49152 node:0
addr: 7681fc000 len: 4096 node:1
addr: 7681fd000 len: 4096 node:0
addr: 7681fe000 len: 4096 node:1
addr: 758200000 len: 182714368 node:0
addr: 763040000 len: 73404416 node:0
addr: 767641000 len: 8613888 node:1
addr: 767e78000 len: 835584 node:0
addr: 767f44000 len: 872448 node:1
addr: 768019000 len: 712704 node:1
addr: 7680c7000 len: 1122304 node:1
addr: 7681d9000 len: 61440 node:1
addr: 7681e8000 len: 40960 node:1
addr: 7681f2000 len: 40960 node:1
addr: 7681fc000 len: 4096 node:0
Hit the bug!!
758200000 - 760c2b000
760c2b000 - 765df7000
765df7000 - 767641000
767641000 - 767e78000
767e78000 - 767f44000
767f44000 - 767fc9000
767fc9000 - 7681d9000
7681d9000 - 7681f0000
This is where the problem is
7681fc000 - 7681fe000
7681fe000 - 7681ff000
7681ff000 - 768200000
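
To make the anomaly in the printed map explicit, the ranges above can be checked for contiguity. The sketch below copies the eleven ranges printed after "Hit the bug!!" and looks for the first range that does not start where its predecessor ends. The struct and function names are hypothetical and are not taken from the attached test.

```c
#include <stddef.h>

struct vma_range {
    unsigned long long start;
    unsigned long long end;
};

/* Ranges copied from the map printed after "Hit the bug!!". */
static const struct vma_range reported_map[] = {
    {0x758200000ULL, 0x760c2b000ULL},
    {0x760c2b000ULL, 0x765df7000ULL},
    {0x765df7000ULL, 0x767641000ULL},
    {0x767641000ULL, 0x767e78000ULL},
    {0x767e78000ULL, 0x767f44000ULL},
    {0x767f44000ULL, 0x767fc9000ULL},
    {0x767fc9000ULL, 0x7681d9000ULL},
    {0x7681d9000ULL, 0x7681f0000ULL},
    {0x7681fc000ULL, 0x7681fe000ULL},
    {0x7681fe000ULL, 0x7681ff000ULL},
    {0x7681ff000ULL, 0x768200000ULL},
};

#define REPORTED_MAP_LEN (sizeof reported_map / sizeof reported_map[0])

/* Return the index of the first range that does not begin where the
 * previous one ended, or 0 if all ranges are contiguous. */
static size_t first_gap(const struct vma_range *r, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (r[i].start != r[i - 1].end)
            return i;
    return 0;
}

/* Size in bytes of the hole in front of range i (0 if there is none). */
static unsigned long long gap_bytes(const struct vma_range *r, size_t i)
{
    return i == 0 ? 0 : r[i].start - r[i - 1].end;
}
```

For the reported map, first_gap(reported_map, REPORTED_MAP_LEN) returns 8, and the hole in front of that range spans 7681f0000 to 7681fc000, which is 49152 bytes (12 pages of 4096 bytes). That is the same address and length as the first-pass binding "addr: 7681f0000 len: 49152 node:0" in the log.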


Expected results:
Loop forever... 
Start addr= 758200000 end addr= 768200000 pid= 21510 num_nodes: 1
addr: 758200000 len: 220610560 node:0
addr: 765464000 len: 5795840 node:0
addr: 7659eb000 len: 7446528 node:0
addr: 766105000 len: 5173248 node:0
addr: 7665f4000 len: 7860224 node:0
addr: 766d73000 len: 17272832 node:0
addr: 767dec000 len: 2998272 node:0
addr: 7680c8000 len: 942080 node:0
addr: 7681ae000 len: 278528 node:0
addr: 7681f2000 len: 4096 node:0
addr: 7681f3000 len: 8192 node:0
addr: 7681f5000 len: 32768 node:0
addr: 7681fd000 len: 8192 node:0
addr: 758200000 len: 233336832 node:0
addr: 766087000 len: 8790016 node:0
addr: 7668e9000 len: 8929280 node:0
addr: 76716d000 len: 14655488 node:0
addr: 767f67000 len: 618496 node:0
addr: 767ffe000 len: 667648 node:0
addr: 7680a1000 len: 503808 node:0
addr: 76811c000 len: 548864 node:0
addr: 7681a2000 len: 204800 node:0
addr: 7681d4000 len: 176128 node:0
...

Additional info:
Comment 2 Lokesh Gidra 2011-09-02 06:55:36 EDT
Hi,

I tried to find the culprit. It seems to be the call to vma_merge in mbind_range, which was added to bound the number of VMAs. When I disabled the call to vma_merge, the problem did not reproduce with the attached test program. Also, it seems that the initial implementation of do_mbind didn't have the merge feature, and it worked fine back then. 2.6.27 doesn't have the vma_merge call and works fine.

Lokesh
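
As a rough illustration of why merging bounds the number of VMAs, here is a user-space toy model. It is explicitly not the kernel's vma_merge, which applies further compatibility checks that are not modeled here. The model simply coalesces adjacent ranges that carry the same node; the struct and function names are hypothetical.

```c
#include <stddef.h>

struct policy_range {
    unsigned long long start;
    unsigned long long end;
    int node;
};

/* Coalesce, in place, neighbouring ranges that touch and share a node.
 * The input must be sorted by start address. Returns the new range count. */
static size_t coalesce_ranges(struct policy_range *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].start == r[out].end && r[i].node == r[out].node)
            r[out].end = r[i].end;   /* same node, adjacent: extend */
        else
            r[++out] = r[i];         /* different node or a gap: keep separate */
    }
    return out + 1;
}
```

Applied to the first five bindings in the log from the description (nodes 0, 1, 1, 0, 0), this model leaves three ranges: 758200000 to 760c2b000 on node 0, 760c2b000 to 764a28000 on node 1, and 764a28000 to 765df7000 on node 0. Without merging, every mbind call with a new boundary would leave another range behind.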
Comment 3 RHEL Product and Program Management 2011-10-07 11:43:32 EDT
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 4 Steve Best 2012-01-05 05:44:07 EST
*** Bug 752867 has been marked as a duplicate of this bug. ***
Comment 5 IBM Bug Proxy 2012-01-05 05:55:54 EST
Created attachment 550879
strace output of mmap, munmap, and mbind calls with OpenCL stderr prints (prefixed by "DAB") of allocation and error information.
Comment 6 IBM Bug Proxy 2012-01-06 09:40:54 EST
------- Comment From tpnoonan@us.ibm.com 2012-01-06 09:34 EDT-------
can we now request for 6.2.z?
Comment 7 Joseph Kachuck 2012-01-06 11:53:56 EST
Hello IBM,
Please provide a client impact statement for the Z request.

Thank You
Joe Kachuck
Comment 8 RHEL Product and Program Management 2012-01-13 11:00:06 EST
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 9 IBM Bug Proxy 2012-01-29 20:50:35 EST
------- Comment From tpnoonan@us.ibm.com 2012-01-29 20:45 EDT-------
hi red hat, justification for z-stream: This seems like a pretty serious and fundamental problem given the
pervasiveness of multi-core processors and multi-threaded applications.  We
won't be able to announce support on RHEL 6.x for our SW* (which is planned to
be released in Jan 2012)  if a patch is not publicly available. *It is OpenCL for Power.  OpenCL is the first open, royalty-free standard for
cross-platform, parallel programming of modern processors found in personal
computers, servers and handheld/embedded devices. OpenCL (Open Computing
Language) greatly improves speed and responsiveness for a wide spectrum of
applications in numerous market categories from gaming and entertainment to
scientific and medical software.

For more information on OpenCL, see

http://en.wikipedia.org/wiki/OpenCL
http://www.khronos.org/
Comment 10 IBM Bug Proxy 2012-01-30 11:31:36 EST
------- Comment From tpnoonan@us.ibm.com 2012-01-30 11:22 EDT-------
regression, works in 6.0 fails in 6.1/6.2
Comment 12 Aristeu Rozanski 2012-02-10 14:45:45 EST
Patch(es) available on kernel-2.6.32-229.el6
Comment 17 Joseph Kachuck 2012-03-12 10:38:25 EDT
Hello,
This bug has been copied as 6.2 z-stream (EUS) bug #802379

Thank You
Joe Kachuck
Comment 18 IBM Bug Proxy 2012-04-16 11:04:09 EDT
Comment 19 Tomas Capek 2012-04-18 08:31:16 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An anomaly in the memory map created by the mbind() function caused a segmentation fault in Hotspot Java Virtual Machines with the NUMA-aware Parallel Scavenge garbage collector. A backported upstream patch that fixes mbind() has been provided and the crashes no longer occur in the described scenario.
Comment 21 errata-xmlrpc 2012-06-20 03:46:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html
Comment 22 Red Hat Bugzilla 2013-10-03 20:28:35 EDT
Removing external tracker bug with the id 'https://access.redhat.com/site/solutions/352973' as it is not valid for this tracker
