Bug 826649 - ocaml ppc64: large entry function causes minor heap corruption
ocaml ppc64: large entry function causes minor heap corruption
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: ocaml (Show other bugs)
rawhide
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Richard W.M. Jones
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-30 13:12 EDT by Richard W.M. Jones
Modified: 2012-06-15 20:03 EDT (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-06 14:27:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
NOT minimal reproducer (32.26 KB, application/x-tar)
2012-05-30 18:26 EDT, Richard W.M. Jones
no flags Details
NOT minimal reproducer (45.10 KB, application/x-tar)
2012-05-31 12:18 EDT, Richard W.M. Jones
no flags Details
bz826649_v2.tar.gz (11.94 KB, application/x-tar)
2012-06-06 09:36 EDT, Richard W.M. Jones
no flags Details

  None (edit)
Description Richard W.M. Jones 2012-05-30 13:12:04 EDT
Description of problem:

The non-upstream ppc64 backend contains some sort of bug so
that it cannot correctly compile entry functions that are
over ~ 26 KB in size.  Such a module will compile OK, but when
the program is run it causes minor heap corruption (that is,
corruption of the minor heap, not "minor corruption"!)

This manifests itself in the ocaml-camomile package, in
the function public/charEncoding.ml.  See the ~500 lines
at the end of the code which look like:

  let () = alias "foo" "bar"

When compiled these are combined into a single large function.
If the function is > ~ 26 KB, then minor heap corruption occurs.
This manifests itself as the program tools/camomilelocaledef.opt
giving out of memory errors during the build.

Version-Release number of selected component (if applicable):

ocaml 3.12.1

Steps to Reproduce:
(to follow)
Comment 1 Richard W.M. Jones 2012-05-30 18:26:02 EDT
Created attachment 587868 [details]
NOT minimal reproducer

This is a very large reproducer.  I'm still trying to reduce
the size of it, but at least this demonstrates the bug.

Unpack the tarball (warning: it doesn't create a subdirectory) and do:

ocamlopt.opt configInt.mli avlTree.ml iMap.ml iSet.ml \
  bitsvect.ml bytesvect.ml tbl31.ml database.ml \
  unimap.mli unimap.ml oOChannel.ml uChar.ml \
  unicodeString.mli byte_labeled_dag.ml charmap.ml \
  charEncoding.mli charEncoding.ml camomileconfig.ml test.ml
strace ./a.out

Note that just before it fails with the out of memory error, it
is trying to allocate some massive random amount of heap using mmap.

What in fact happens is that it's moving ("oldifying"[1]) values from
the minor heap to the major heap.  It reads the header of one of these
values, but in fact it's reading uninitialized data.  It then tries
to allocate the required space according to the header, but because
the header was just random junk, it requests a random number from mmap.

It's also possible to run the program under valgrind, and sometimes
(but unfortunately not in this particular reproducer) valgrind will
show that it's reading uninitialized memory.

[1] For an overview of how the GC works, see:

https://rwmj.wordpress.com/2009/08/04/ocaml-internals/
https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2-strings-and-other-types/
https://rwmj.wordpress.com/2009/08/06/ocaml-internals-part-3-the-minor-heap/
https://rwmj.wordpress.com/2009/08/07/ocaml-internals-part-4-the-major-heap/
https://rwmj.wordpress.com/2009/08/08/ocaml-internals-part-5-garbage-collection/
Comment 2 Richard W.M. Jones 2012-05-30 18:29:26 EDT
Forgot to add:

If, after reproducing the bug, you open up charEncoding.ml
in your edit, and delete all the lines that have the form:

let () = alias ...

and redo the test, it'll work fine.
Comment 3 Richard W.M. Jones 2012-05-30 18:31:53 EDT
If you want to look at the asm output, add the -S option:

$ ocamlopt.opt -S configInt.mli avlTree.ml iMap.ml iSet.ml \
  bitsvect.ml bytesvect.ml tbl31.ml database.ml \
  unimap.mli unimap.ml oOChannel.ml uChar.ml \
  unicodeString.mli byte_labeled_dag.ml charmap.ml \
  charEncoding.mli charEncoding.ml camomileconfig.ml test.ml
$ ls *.s
avlTree.s           bytesvect.s       charmap.s   iSet.s       test.s
bitsvect.s          camomileconfig.s  database.s  oOChannel.s  uChar.s
byte_labeled_dag.s  charEncoding.s    iMap.s      tbl31.s      unimap.s
Comment 4 Richard W.M. Jones 2012-05-31 12:18:08 EDT
Created attachment 588150 [details]
NOT minimal reproducer

I've no idea about this bug.  However here is a possibly clearer
reproducer.

Checkout and 'fedpkg prep' the current Fedora ocaml source.

Unpack the tarball.

Edit "./make" and set the path to point to the ocaml source.

./make
./a.out

You will see warnings similar to:

*** oldifying suspicious object of size 17590817533954 ***
*** oldifying suspicious object of size 269035240 ***
[etc]

If you run this under gdb, you can put a breakpoint in
minor_gc.c:caml_oldify_one at the line that prints that
warning.

Here is a typical stack trace:

(gdb) break minor_gc.c:131
Breakpoint 1 at 0x10055a2c: file minor_gc.c, line 131.
(gdb) run
Starting program: /tmp/pp/a.out 

Breakpoint 1, caml_oldify_one (v=17590977892344, p=0x100aaf74)
    at minor_gc.c:131
131               fprintf (stderr, "*** oldifying suspicious object of size %zu ***\n", sz);
(gdb) bt
#0  caml_oldify_one (v=17590977892344, p=0x100aaf74) at minor_gc.c:131
#1  0x00000000100574c4 in .caml_oldify_local_roots ()
#2  0x0000000010056000 in caml_empty_minor_heap () at minor_gc.c:233
#3  0x0000000010056308 in caml_minor_collection () at minor_gc.c:276
#4  0x0000000010066fb8 in .caml_gc_minor ()
#5  0x000000001006c5f4 in .caml_c_call ()
#6  0x000000001001a84c in .camlCharEncoding__entry ()
#7  0x00000000100028a0 in caml_startup__code_begin ()
#8  0x000000001006c778 in .caml_start_program ()
#9  0x0000000010056be8 in .caml_main ()
#10 0x0000000010002064 in .main ()

Unfortunately I cannot add debug to caml_oldify_local_roots so it's
rather hard to see exactly what triggers the warning.
Comment 5 Richard W.M. Jones 2012-06-06 09:36:18 EDT
Created attachment 589894 [details]
bz826649_v2.tar.gz

Smaller reproducer, derived from libguestfs.

Unpack the tarball.

Type:

ocamlopt.opt guestfs.ml test.ml dummy.c -o test

When run, the program immediately segfaults:

$ gdb ./test
(gdb) run
Starting program: /tmp/p/test 

Program received signal SIGSEGV, Segmentation fault.
0x00000000100367dc in .caml_oldify_mopup ()
(gdb) bt
#0  0x00000000100367dc in .caml_oldify_mopup ()
#1  0x0000000010036920 in .caml_empty_minor_heap ()
#2  0x0000000010044218 in .caml_gc_compaction ()
#3  0x0000000010049198 in .caml_c_call ()
#4  0x00000000100043a8 in .camlTest__entry ()
#5  0x0000000010002515 in caml_startup__code_begin ()
#6  0x000000001004931d in .caml_start_program ()
#7  0x0000000010049a10 in .caml_main ()
#8  0x00000000100020f4 in .main ()

Note that no 'guestfs' code is actually run, except for the
long entry function in the Guestfs module (which does nothing
except register some external functions).

To see the assembler output, add -S option to ocamlopt.opt.
Comment 6 Steven Munroe 2012-06-06 11:37:31 EDT
You did not give us much to work with.

Would like to see the GP registers, especially R2 and R13, which have different usage between PPC32/PPC64. This is a common source of trouble.

R13 is the thread pointer. Do not change R13, ever! (only the kernel and clone syscall should modify R13).

R2 is the TOC pointer an needs to preserved across calls. Also PPC64 uses function descriptors so function calls and the PLT are different.

While you can ignore some parts of the ABI within you JIT code, you must follow the ABI when calling out to libraries.
Comment 7 Richard W.M. Jones 2012-06-06 13:55:44 EDT
I believe the problem is that r31 (which is used to point
to the next free space in the minor heap) was not kept 8-byte
aligned.  This resulted in subtle minor heap corruption.

This is the patch I am testing:

diff --git a/asmcomp/power64/emit.mlp b/asmcomp/power64/emit.mlp
index ba54e99..42f585d 100644
--- a/asmcomp/power64/emit.mlp
+++ b/asmcomp/power64/emit.mlp
@@ -607,7 +607,7 @@ let rec emit_instr i dslot =
         `      bge     {emit_label lbl}\n`;
         record_frame i.live;
         `      bl      {emit_label !call_gc_label}\n`; (* Must be 4 insns to restart *)
-        `{emit_label lbl}:     addi    {emit_reg i.res.(0)}, {emit_gpr 31}, 4\n`
+        `{emit_label lbl}:     addi    {emit_reg i.res.(0)}, {emit_gpr 31}, {emit_int size_addr}\n`
     | Lop(Iintop Isub) ->               (* subfc has swapped arguments *)
         `      subfc   {emit_reg i.res.(0)}, {emit_reg i.arg.(1)}, {emit_reg i.arg.(0)}\n`
     | Lop(Iintop Imod) ->
Comment 8 Richard W.M. Jones 2012-06-06 14:27:13 EDT
The patch above fixes all observed instances of this
bug.  I have added it to OCaml 3.12.1-12.
Comment 9 Fedora Update System 2012-06-06 15:25:25 EDT
ocaml-3.12.1-12.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/ocaml-3.12.1-12.fc17
Comment 10 Fedora Update System 2012-06-15 20:03:39 EDT
ocaml-3.12.1-12.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.