Bug 1874249 - Illegal instruction when running coqc on s390x
Summary: Illegal instruction when running coqc on s390x
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: ocaml
Version: rawhide
Hardware: s390x
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ZedoraTracker
TreeView+ depends on / blocked
 
Reported: 2020-08-31 19:51 UTC by Jerry James
Modified: 2021-10-06 09:00 UTC (History)
3 users (show)

Fixed In Version: ocaml-4.13.1-1.fc36
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-06 09:00:59 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Jerry James 2020-08-31 19:51:24 UTC
Description of problem:
You mentioned the "coq pit" in bug 1870368.  Well, here we go...

I just attempted to update coq to version 8.12.0.  The build succeeded on all architectures except s390x:

https://koji.fedoraproject.org/koji/taskinfo?taskID=50502314

Two of the tests failed with "illegal instruction" errors:

/bin/sh: line 17: 3375524 Illegal instruction     (core dumped) coqc -q -R prerequisite TestSuite "bugs/opened/bug_3395.v" 2>&1
FAILED    bugs/opened/bug_3395.v.log

TEST      success/bteauto.v  (-async-proofs off)
/bin/sh: line 12: 3398081 Illegal instruction     (core dumped) coqc -q -R prerequisite TestSuite "success/bteauto.v" "-async-proofs" "off" $opts 2>&1
FAILED    success/bteauto.v.log

This may prove to be a coq bug, but it seems likely to be an ocaml bug.

Version-Release number of selected component (if applicable):
ocaml-4.11.0-1.fc34.s390x

How reproducible:
Always

Steps to Reproduce:
1. fedpkg clone coq
2. cd coq
3. fedpkg build

Actual results:
Build failure due to illegal instructions.

Expected results:
Successful build.

Additional info:

Comment 1 Richard W.M. Jones 2020-09-15 12:35:09 UTC
Danny: is it possible we have a Rawhide s390x machine we
can use to test-build coq?

----

We've had a couple of coq/s390x bugs.

The first one was reported upstream here:

https://github.com/coq/coq/issues/11939

I believe at some point we decided that this was something in the
OCaml compiler itself, and not specific to s390x at all
(https://github.com/ocaml/ocaml/issues/9391), and it's also
unclear what its status is upstream now.

Assuming this bug isn't the same (and I guess we won't really
know that without a stack trace), then it's probably this one
which was filed upstream twice:

https://github.com/coq/coq/issues/11395
https://github.com/coq/coq/issues/12967

I don't think we can do much further diagnosis without being
able to build on an s390x machine.

Comment 2 Jerry James 2020-09-15 22:59:44 UTC
I've got an account on marist.  I meant to try to debug this sooner.  Sorry for the delay.

Anyway, it's not going well.  I can easily repeat the crashes, but gdb isn't much help:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000003ffb0af6fa0 in ?? ()
(gdb) bt
Python Exception <class 'gdb.error'> PC not saved: 
#0  0x000003ffb0af6fa0 in ?? ()

That looks suspiciously like bug 1850710.  I'll see if I can coax gdb to give me any more information.

Comment 3 Jerry James 2020-09-15 23:13:41 UTC
Comment 2 referred to running the bugs/opened/bug_3395.v test.  More partial backtraces below.

With bugs/closed/bug_5127.v:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000002aa3afdc6b4 in camlConstr__map_with_binders_2386 ()
    at kernel/constr.ml:752
752	let map_with_binders g f l c0 = match kind c0 with
(gdb) bt
#0  0x000002aa3afdc6b4 in camlConstr__map_with_binders_2386 ()
    at kernel/constr.ml:752
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

With complexity/pattern.v:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000002aa37aca5c2 in camlUState__add_universe_constraints_1404 ()
    at engine/uState.ml:308
308	  let univs, local = ctx.uctx_local in
(gdb) bt
#0  0x000002aa37aca5c2 in camlUState__add_universe_constraints_1404 ()
    at engine/uState.ml:308
#1  0x000002aa37ad5eaa in camlEvd__add_universe_constraints_2288 ()
    at engine/evd.ml:652
#2  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

With success/Typeclasses.v:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000002aa2e4b4e90 in camlNotation_ops__77 ()
(gdb) bt
#0  0x000002aa2e4b4e90 in camlNotation_ops__77 ()
#1  0x000002aa2e1d4c38 in caml_raise ()
#2  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

With success/auto.v:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000002 in ?? ()
(gdb) bt
#0  0x0000000000000002 in ?? ()
#1  0x000002aa12d54c38 in caml_raise ()
#2  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

With bteauto.v:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x0000000000000005 in ?? ()
(gdb) bt
#0  0x0000000000000005 in ?? ()
#1  0x000002aa27dd4c38 in caml_raise ()
#2  0x000003ffb43d72a8 in ?? ()
PC not saved

Comment 4 Jerry James 2020-09-15 23:28:49 UTC
Since the same address in caml_raise shows up 3 times in comment 3, here is a disassembly of that function:

   0x000002aa27dd4bd0 <+0>:	stmg	%r11,%r15,88(%r15)
   0x000002aa27dd4bd6 <+6>:	lgrl	%r1,0x2aa27e71f18
   0x000002aa27dd4bdc <+12>:	lay	%r15,-160(%r15)
   0x000002aa27dd4be2 <+18>:	lgr	%r11,%r2
   0x000002aa27dd4be6 <+22>:	ltg	%r1,0(%r1)
   0x000002aa27dd4bec <+28>:	je	0x2aa27dd4bf2 <caml_raise+34>
   0x000002aa27dd4bf0 <+32>:	basr	%r14,%r1
   0x000002aa27dd4bf2 <+34>:	lgrl	%r1,0x2aa27e6a780
   0x000002aa27dd4bf8 <+40>:	lg	%r2,0(%r1)
   0x000002aa27dd4bfe <+46>:	ltg	%r4,16(%r2)
   0x000002aa27dd4c04 <+52>:	je	0x2aa27dd4c38 <caml_raise+104>
   0x000002aa27dd4c08 <+56>:	ltg	%r1,280(%r2)
   0x000002aa27dd4c0e <+62>:	jne	0x2aa27dd4c28 <caml_raise+88>
   0x000002aa27dd4c12 <+66>:	j	0x2aa27dd4c2e <caml_raise+94>
   0x000002aa27dd4c16 <+70>:	lg	%r1,0(%r1)
   0x000002aa27dd4c1c <+76>:	stg	%r1,280(%r2)
   0x000002aa27dd4c22 <+82>:	cgije	%r1,0,0x2aa27dd4c2e <caml_raise+94>
   0x000002aa27dd4c28 <+88>:	clgrjh	%r4,%r1,0x2aa27dd4c16 <caml_raise+70>
   0x000002aa27dd4c2e <+94>:	lgr	%r3,%r11
   0x000002aa27dd4c32 <+98>:	brasl	%r14,0x2aa27e0e900 <caml_raise_exception>
   0x000002aa27dd4c38 <+104>:	lgr	%r2,%r11
   0x000002aa27dd4c3c <+108>:	brasl	%r14,0x2aa27dfee60 <caml_fatal_uncaught_exception>

which seems to mean that inside caml_raise_exception we are jumping to addresses like 2 and 5.  Here is disassembly of that function, for comparison against the handwritten assembly in runtime/s390x.S:

   0x000002aa27e0e900 <+0>:	lgr	%r10,%r2
   0x000002aa27e0e904 <+4>:	lgr	%r2,%r3
   0x000002aa27e0e908 <+8>:	lg	%r0,224(%r10)
   0x000002aa27e0e90e <+14>:	cgfi	%r0,0
   0x000002aa27e0e914 <+20>:	jne	0x2aa27e0e938 <caml_raise_exception+56>
   0x000002aa27e0e918 <+24>:	lg	%r15,16(%r10)
   0x000002aa27e0e91e <+30>:	lg	%r11,0(%r10)
   0x000002aa27e0e924 <+36>:	lg	%r1,0(%r15)
   0x000002aa27e0e92a <+42>:	lg	%r13,8(%r15)
   0x000002aa27e0e930 <+48>:	agfi	%r15,16
   0x000002aa27e0e936 <+54>:	br	%r1
   0x000002aa27e0e938 <+56>:	ldgr	%f15,%r2
   0x000002aa27e0e93c <+60>:	lg	%r3,208(%r10)
   0x000002aa27e0e942 <+66>:	lg	%r4,200(%r10)
   0x000002aa27e0e948 <+72>:	lg	%r5,16(%r10)
   0x000002aa27e0e94e <+78>:	lay	%r15,-160(%r15)
   0x000002aa27e0e954 <+84>:	brasl	%r14,0x2aa27e0ec30 <caml_stash_backtrace>
   0x000002aa27e0e95a <+90>:	lay	%r15,160(%r15)
   0x000002aa27e0e960 <+96>:	lgdr	%r2,%f15
   0x000002aa27e0e964 <+100>:	j	0x2aa27e0e918 <caml_raise_exception+24>

Comment 5 Jerry James 2020-09-15 23:44:40 UTC
It looks like that can only mean that the stack was already corrupt when caml_raise_exception was called, which is what gdb claims.  Running "ulimit -s unlimited" first makes bugs/closed/bug_5127.v, complexity/pattern.v, and success/Typeclasses.v pass.  The other 3 still fail.  Running that last test under valgrind produces a bunch of test output, followed by a lot of valgrind chatter.  Here's the first part:

==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x9C0458: ??? (class_tactics.ml:349)
==3368006==    by 0x9C031D: ??? (class_tactics.ml:339)
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x9BFB72: ??? (class_tactics.ml:297)
==3368006==    by 0x9C0517: ??? (class_tactics.ml:406)
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x9C0524: ??? (class_tactics.ml:410)
==3368006==    by 0x9C0517: ??? (class_tactics.ml:406)
==3368006== 
==3368006== Invalid write of size 8
==3368006==    at 0xC42330: ??? (names.ml:628)
==3368006==    by 0xC42335: ??? (names.ml:628)
==3368006==  Address 0x156e30c8 is 268,439,688 bytes inside a block of size 268,443,552 in arena "client"
==3368006== 
==3368006== Invalid write of size 8
==3368006==    at 0xF1688E: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3368006==    by 0xC42335: ??? (names.ml:628)
==3368006==  Address 0x156e30d0 is 268,439,696 bytes inside a block of size 268,443,552 in arena "client"
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x99F65E: ??? (hints.ml:613)
==3368006==    by 0x99F62D: ??? (hints.ml:612)
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x79C6F0: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3368006==    by 0x79DC0D: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x99ADCE: ??? (btermdn.ml:177)
==3368006==    by 0x79DC1F: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3368006== 
==3368006== Conditional jump or move depends on uninitialised value(s)
==3368006==    at 0x79C6F0: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3368006==    by 0x79DC31: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)

Comment 6 Jerry James 2020-09-16 22:22:02 UTC
The above information is slightly wrong.  The bugs/opened/bug_3395.v test can segfault without the test getting marked as failed.  I've done a few experiments.

- Default: 6 tests fail with segfaults or illegal instruction errors due to jumping to bad addresses:
  1. bugs/opened/bug_3395.v
  2. bugs/closed/bug_5127.v
  3. complexity/pattern.v
  4. success/Typeclasses.v
  5. success/auto.v
  6. success/bteauto.v

- Add "ulimit -s unlimited" to the top of %build: tests 2 and 3 pass, 1 and 4-6 still crash.  All tests below keep "ulimit -s unlimited" in place.

- Rebuild ocaml with LTO disabled: tests 2 and 3 pass, 1 and 4-6 still crash

- Rebuild ocaml without -fstack-protector-strong and -fstack-clash-protection: tests 2 and 3 pass, 1 and 4-6 still crash

- Rebuild ocaml with -O1 instead of -O2: tests 1-3 and 5-6 pass, test 4 still crashes

- Rebuild ocaml with -O1 and -fsanitize=address: tests 2-4 pass, tests 1 and 5-6 crash with unhelpful address sanitizer output:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==3792239==ERROR: AddressSanitizer: SEGV on unknown address 0x020015329000 (pc 0x020015329ff8 bp 0x02001b7f10a8 sp 0x03ffdc0f7858 T0)
==3792239==The signal is caused by a UNKNOWN memory access.
==3792239==Hint: PC is at a non-executable region. Maybe a wild jump?
    #0 0x20015329ff8  (<unknown module>)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>) 
==3792239==ABORTING

- Rebuild ocaml with -O1 and -fsanitize=undefined: tests 1-3 and 5-6 pass, test 4 still crashes with no output from the sanitizer

I ran each experiment once, so I don't know if there is any nondeterminism involved in which tests pass and which fail.  Whatever is going wrong appears to be happening in ocaml-generated code, rather than gcc-generated code, since the sanitizers aren't seeing it.

Comment 7 Jerry James 2020-09-16 23:24:49 UTC
I'm in a Rawhide mock chroot with "ulimit -s unlimited" in effect.  I built OCaml with -O1 -g3 in hopes of getting better debuginfo.  Running valgrind on test 5 with some extra options:

[mockbuild@lfedora1 test-suite]$ valgrind --leak-check=no --num-callers=8 --read-var-info=yes --track-origins=yes ../topbin/coqc_bin.exe -coqlib /builddir/build/BUILD/coq-8.12.0/_build/install/default/lib/coq -R prerequisite TestSuite success/auto.v
==3952449== Memcheck, a memory error detector
==3952449== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3952449== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==3952449== Command: ../topbin/coqc_bin.exe -coqlib /builddir/build/BUILD/coq-8.12.0/_build/install/default/lib/coq -R prerequisite TestSuite success/auto.v
==3952449== 
==3952449== Warning: set address range perms: large range [0x56e2040, 0x156e3040) (undefined)
File "./success/auto.v", line 14, characters 0-15:
Warning: Adding and removing hints in the core database implicitly is
deprecated. Please specify a hint database.
[implicit-core-hint-db,deprecated]
File "./success/auto.v", line 24, characters 0-20:
Warning:
Command Undo. is not recommended in batch mode. In particular, going back in the document is not efficient in batch mode due to Coq not caching previous states for memory optimization reasons. If your use is intentional, you may want to disable this warning and pass the "-async-proofs-cache force" option to Coq.
[undo-batch-mode,non-interactive]
File "./success/auto.v", line 34, characters 0-33:
Warning: Interpreting this declaration as if a global declaration prefixed by
"Local", i.e. as a global declaration which shall not be available without
qualification when imported. [local-declaration,scope]
File "./success/auto.v", line 36, characters 0-26:
Warning: Interpreting this declaration as if a global declaration prefixed by
"Local", i.e. as a global declaration which shall not be available without
qualification when imported. [local-declaration,scope]
(* debug eauto: *)
Debug: 1 depth=5 
B_instance_0 : B nat
     : B nat
B_instance_0 : B nat
     : B nat
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xBDB3D4: ??? (evd.ml:291)
==3952449==    by 0xBDB3C7: ??? (evd.ml:291)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xACF714: ??? (reductionops.ml:1800)
==3952449==    by 0xACF707: ??? (reductionops.ml:1799)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xACAEE4: ??? (reductionops.ml:1382)
==3952449==    by 0xACB00F: ??? (reductionops.ml:1383)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xC58390: ??? (esubst.ml:78)
==3952449==    by 0xCA5151: ??? (cClosure.ml:958)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xC9F0DA: ??? (cClosure.ml:505)
==3952449==    by 0xC9ED07: ??? (cClosure.ml:495)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Conditional jump or move depends on uninitialised value(s)
==3952449==    at 0xC9F132: ??? (cClosure.ml:505)
==3952449==    by 0xC9F125: ??? (cClosure.ml:309)
==3952449==  Uninitialised value was created by a heap allocation
==3952449==    at 0x4836010: malloc (vg_replace_malloc.c:307)
==3952449==    by 0xEE1C51: caml_stat_alloc_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1C7D: caml_stat_alloc_aligned_noexc (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEE1343: caml_set_minor_heap_size (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEF1F13: caml_gc_set (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xEFEFF9: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449== 
==3952449== Invalid write of size 8
==3952449==    at 0xE99EE8: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xE99EED: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==  Address 0x156e30c8 is 268,439,688 bytes inside a block of size 268,443,552 in arena "client"
==3952449== 
==3952449== Invalid write of size 8
==3952449==    at 0xEFEFE6: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0xE99EED: ??? (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==  Address 0x156e30d0 is 268,439,696 bytes inside a block of size 268,443,552 in arena "client"
==3952449== 
==3952449== Jump to the invalid address stated on the next line
==3952449==    at 0x2: ???
==3952449==    by 0xEDA75B: caml_raise (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0x0: ???
==3952449==  Address 0x2 is not stack'd, malloc'd or (recently) free'd
==3952449== 
==3952449== 
==3952449== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3952449==  Bad permissions for mapped region at address 0x2
==3952449==    at 0x2: ???
==3952449==    by 0xEDA75B: caml_raise (in /builddir/build/BUILD/coq-8.12.0/_build/default/topbin/coqc_bin.exe)
==3952449==    by 0x0: ???
==3952449== 
==3952449== HEAP SUMMARY:
==3952449==     in use at exit: 486,441,602 bytes in 846 blocks
==3952449==   total heap usage: 9,193 allocs, 8,347 frees, 501,160,368 bytes allocated
==3952449== 
==3952449== For a detailed leak analysis, rerun with: --leak-check=full
==3952449== 
==3952449== For lists of detected and suppressed errors, rerun with: -s
==3952449== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 0 from 0)

And just for fun, here is GDB's disassembly of the function around evd.ml line 291:
(gdb) disassemble camlEvd__mk_freelisted_1532
Dump of assembler code for function camlEvd__mk_freelisted_1532:
   0x000002aa00ad33b0 <+0>:	lay	%r15,-16(%r15)
   0x000002aa00ad33b6 <+6>:	stg	%r14,8(%r15)
   0x000002aa00ad33bc <+12>:	stg	%r2,0(%r15)
   0x000002aa00ad33c2 <+18>:	brasl	%r14,0x2aa00ad32e0 <camlEvd__metavars_of_1455>
   0x000002aa00ad33c8 <+24>:	lay	%r11,-24(%r11)
   0x000002aa00ad33ce <+30>:	clg	%r11,8(%r10)
   0x000002aa00ad33d4 <+36>:	jgl	0x2aa00ad340a <camlEvd__mk_freelisted_1532+90>
   0x000002aa00ad33da <+42>:	la	%r4,8(%r11)
   0x000002aa00ad33de <+46>:	lg	%r14,8(%r15)
   0x000002aa00ad33e4 <+52>:	lg	%r6,0(%r15)
   0x000002aa00ad33ea <+58>:	lghi	%r5,2048
   0x000002aa00ad33ee <+62>:	stg	%r5,-8(%r4)
   0x000002aa00ad33f4 <+68>:	stg	%r6,0(%r4)
   0x000002aa00ad33fa <+74>:	stg	%r2,8(%r4)
   0x000002aa00ad3400 <+80>:	lgr	%r2,%r4
   0x000002aa00ad3404 <+84>:	la	%r15,16(%r15)
   0x000002aa00ad3408 <+88>:	br	%r14
   0x000002aa00ad340a <+90>:	brasl	%r14,0x2aa00df6ee8 <caml_system__code_begin>
   0x000002aa00ad3410 <+96>:	jg	0x2aa00ad33da <camlEvd__mk_freelisted_1532+42>
   0x000002aa00ad3416 <+102>:	nopr	%r7
End of assembler dump.

Comment 8 Richard W.M. Jones 2020-09-21 08:00:29 UTC
One thing to try if you have the time is running the OCaml compiler
test suite.  After compiling OCaml from source you have to do:

make -C testsuite -j1 all

The test suite consists of many small programs, so if any fail,
especially ones related to exception handling that appear to crash
in the same way as above, it could be a good way to get a minimal reproducer.

We don't actually run this for s390x because IIRC it made the Koji
build take forever.  (In fact in our current build we don't expect the test
suite to pass on any architecture at all, although it's pretty close
to passing on x86-64.)

Comment 9 Dan Horák 2020-09-21 14:03:28 UTC
(In reply to Richard W.M. Jones from comment #1)
> Danny: is it possible we have a Rawhide s390x machine we
> can use to test-build coq?

you could use the same lfedora1 machine Jerry is using (I see you as user rjones there), or we can find something internally with sufficient disk space

Comment 10 Jerry James 2020-09-22 21:45:44 UTC
(In reply to Richard W.M. Jones from comment #8)
> One thing to try if you have the time is running the OCaml compiler
> test suite.  After compiling OCaml from source you have to do:
> 
> make -C testsuite -j1 all
> 
> The test suite consists of many small programs, so if any fail,
> especially ones related to exception handling that appear to crash
> in the same way as above, it could be a good way to get a minimal reproducer.
> 
> We don't actually run this for s390x because IIRC it made the Koji
> build take forever.  (In fact in our current build we don't expect the test
> suite to pass on any architecture at all, although it's pretty close
> to passing on x86-64.)

Okay, I'll try to do that.  Over the weekend, I built a bunch of packages that are missing on Fedora 31 so I could try building there.  With ocaml-4.08.1-0.rc2.1 on Fedora 31, coq 8.12.0 passes all of the tests identified as crashing above.

I haven't been able to repeat this process with Fedora 32, due to the Heisenbug that plagues coq builds there.  I think we fingered an ephemeron bug for that.  Anyway, I have not yet succeeded in getting a coq build to complete on Fedora 32.

Comment 11 Ben Cotton 2021-02-09 16:16:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 34 development cycle.
Changing version to 34.

Comment 12 Fedora Update System 2021-10-06 09:00:59 UTC
FEDORA-2021-37d56e6b48 has been pushed to the Fedora 36 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.