Description of problem: The non-upstream ppc64 backend contains some sort of bug so that it cannot correctly compile entry functions that are over ~ 26 KB in size. Such a module will compile OK, but when the program is run it causes minor heap corruption (that is, corruption of the minor heap, not "minor corruption"!) This manifests itself in the ocaml-camomile package, in the function public/charEncoding.ml. See the ~500 lines at the end of the code which look like: let () = alias "foo" "bar" When compiled these are combined into a single large function. If the function is > ~ 26 KB, then minor heap corruption occurs. This manifests itself as the program tools/camomilelocaledef.opt giving out of memory errors during the build. Version-Release number of selected component (if applicable): ocaml 3.12.1 Steps to Reproduce: (to follow)
Created attachment 587868 [details] NOT minimal reproducer This is a very large reproducer. I'm still trying to reduce the size of it, but at least this demonstrates the bug. Unpack the tarball (warning: it doesn't create a subdirectory) and do: ocamlopt.opt configInt.mli avlTree.ml iMap.ml iSet.ml \ bitsvect.ml bytesvect.ml tbl31.ml database.ml \ unimap.mli unimap.ml oOChannel.ml uChar.ml \ unicodeString.mli byte_labeled_dag.ml charmap.ml \ charEncoding.mli charEncoding.ml camomileconfig.ml test.ml strace ./a.out Note that just before it fails with the out of memory error, it is trying to allocate some massive random amount of heap using mmap. What in fact happens is that it's moving ("oldifying"[1]) values from the minor heap to the major heap. It reads the header of one of these values, but in fact it's reading uninitialized data. It then tries to allocate the required space according to the header, but because the header was just random junk, it requests a random number from mmap. It's also possible to run the program under valgrind, and sometimes (but unfortunately not in this particular reproducer) valgrind will show that it's reading uninitialized memory. [1] For an overview of how the GC works, see: https://rwmj.wordpress.com/2009/08/04/ocaml-internals/ https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2-strings-and-other-types/ https://rwmj.wordpress.com/2009/08/06/ocaml-internals-part-3-the-minor-heap/ https://rwmj.wordpress.com/2009/08/07/ocaml-internals-part-4-the-major-heap/ https://rwmj.wordpress.com/2009/08/08/ocaml-internals-part-5-garbage-collection/
Forgot to add: If, after reproducing the bug, you open up charEncoding.ml in your edit, and delete all the lines that have the form: let () = alias ... and redo the test, it'll work fine.
If you want to look at the asm output, add the -S option: $ ocamlopt.opt -S configInt.mli avlTree.ml iMap.ml iSet.ml \ bitsvect.ml bytesvect.ml tbl31.ml database.ml \ unimap.mli unimap.ml oOChannel.ml uChar.ml \ unicodeString.mli byte_labeled_dag.ml charmap.ml \ charEncoding.mli charEncoding.ml camomileconfig.ml test.ml $ ls *.s avlTree.s bytesvect.s charmap.s iSet.s test.s bitsvect.s camomileconfig.s database.s oOChannel.s uChar.s byte_labeled_dag.s charEncoding.s iMap.s tbl31.s unimap.s
Created attachment 588150 [details] NOT minimal reproducer I've no idea about this bug. However here is a possibly clearer reproducer. Checkout and 'fedpkg prep' the current Fedora ocaml source. Unpack the tarball. Edit "./make" and set the path to point to the ocaml source. ./make ./a.out You will see warnings similar to: *** oldifying suspicious object of size 17590817533954 *** *** oldifying suspicious object of size 269035240 *** [etc] If you run this under gdb, you can put a breakpoint in minor_gc.c:caml_oldify_one at the line that prints that warning. Here is a typical stack trace: (gdb) break minor_gc.c:131 Breakpoint 1 at 0x10055a2c: file minor_gc.c, line 131. (gdb) run Starting program: /tmp/pp/a.out Breakpoint 1, caml_oldify_one (v=17590977892344, p=0x100aaf74) at minor_gc.c:131 131 fprintf (stderr, "*** oldifying suspicious object of size %zu ***\n", sz); (gdb) bt #0 caml_oldify_one (v=17590977892344, p=0x100aaf74) at minor_gc.c:131 #1 0x00000000100574c4 in .caml_oldify_local_roots () #2 0x0000000010056000 in caml_empty_minor_heap () at minor_gc.c:233 #3 0x0000000010056308 in caml_minor_collection () at minor_gc.c:276 #4 0x0000000010066fb8 in .caml_gc_minor () #5 0x000000001006c5f4 in .caml_c_call () #6 0x000000001001a84c in .camlCharEncoding__entry () #7 0x00000000100028a0 in caml_startup__code_begin () #8 0x000000001006c778 in .caml_start_program () #9 0x0000000010056be8 in .caml_main () #10 0x0000000010002064 in .main () Unfortunately I cannot add debug to caml_oldify_local_roots so it's rather hard to see exactly what triggers the warning.
Created attachment 589894 [details] bz826649_v2.tar.gz Smaller reproducer, derived from libguestfs. Unpack the tarball. Type: ocamlopt.opt guestfs.ml test.ml dummy.c -o test When run, the program immediately segfaults: $ gdb ./test (gdb) run Starting program: /tmp/p/test Program received signal SIGSEGV, Segmentation fault. 0x00000000100367dc in .caml_oldify_mopup () (gdb) bt #0 0x00000000100367dc in .caml_oldify_mopup () #1 0x0000000010036920 in .caml_empty_minor_heap () #2 0x0000000010044218 in .caml_gc_compaction () #3 0x0000000010049198 in .caml_c_call () #4 0x00000000100043a8 in .camlTest__entry () #5 0x0000000010002515 in caml_startup__code_begin () #6 0x000000001004931d in .caml_start_program () #7 0x0000000010049a10 in .caml_main () #8 0x00000000100020f4 in .main () Note that no 'guestfs' code is actually run, except for the long entry function in the Guestfs module (which does nothing except register some external functions). To see the assembler output, add -S option to ocamlopt.opt.
You did not give us much to work with. Would like to see the GP registers, especially R2 and R13, which have different usage between PPC32/PPC64. This is a common source of trouble. R13 is the thread pointer. Do not change R13, ever! (only the kernel and clone syscall should modify R13). R2 is the TOC pointer an needs to preserved across calls. Also PPC64 uses function descriptors so function calls and the PLT are different. While you can ignore some parts of the ABI within you JIT code, you must follow the ABI when calling out to libraries.
I believe the problem is that r31 (which is used to point to the next free space in the minor heap) was not kept 8-byte aligned. This resulted in subtle minor heap corruption. This is the patch I am testing: diff --git a/asmcomp/power64/emit.mlp b/asmcomp/power64/emit.mlp index ba54e99..42f585d 100644 --- a/asmcomp/power64/emit.mlp +++ b/asmcomp/power64/emit.mlp @@ -607,7 +607,7 @@ let rec emit_instr i dslot = ` bge {emit_label lbl}\n`; record_frame i.live; ` bl {emit_label !call_gc_label}\n`; (* Must be 4 insns to restart *) - `{emit_label lbl}: addi {emit_reg i.res.(0)}, {emit_gpr 31}, 4\n` + `{emit_label lbl}: addi {emit_reg i.res.(0)}, {emit_gpr 31}, {emit_int size_addr}\n` | Lop(Iintop Isub) -> (* subfc has swapped arguments *) ` subfc {emit_reg i.res.(0)}, {emit_reg i.arg.(1)}, {emit_reg i.arg.(0)}\n` | Lop(Iintop Imod) ->
The patch above fixes all observed instances of this bug. I have added it to OCaml 3.12.1-12.
ocaml-3.12.1-12.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/ocaml-3.12.1-12.fc17
ocaml-3.12.1-12.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.