1526848 – SIGSEGV during game shutdown with hedgewars 0.9.23 (hwengine)

Bug 1526848 - SIGSEGV during game shutdown with hedgewars 0.9.23 (hwengine)

Summary: SIGSEGV during game shutdown with hedgewars 0.9.23 (hwengine)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	fpc
Sub Component:
Version:	27
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Joost van der Sluis
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1517404
TreeView+	depends on / blocked

Reported:	2017-12-18 02:04 UTC by Richard Shaw
Modified:	2018-04-04 17:07 UTC (History)
CC List:	14 users (show)
See Also:	https://issues.hedgewars.org/show_bug.cgi?id=570 https://bugs.freepascal.org/view.php?id=15582
Fixed In Version:	fpc-3.0.2-4.fc27
Clone Of:
Environment:
Last Closed:	2018-04-04 17:07:40 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Minimal test demo for hwengine (1.34 KB, application/octet-stream) 2017-12-18 02:05 UTC, Richard Shaw	no flags	Details
gdb backtrace from f27 w/ intel graphics (3.71 KB, text/plain) 2017-12-18 02:06 UTC, Richard Shaw	no flags	Details
gdb backtrace from f26 w/ nvidia graphics (2.90 KB, text/plain) 2017-12-18 02:07 UTC, Richard Shaw	no flags	Details
SIGSEGV report w/ registers (2.03 KB, text/plain) 2017-12-18 02:08 UTC, Richard Shaw	no flags	Details
gdb sharedlibrary (6.45 KB, application/mbox) 2017-12-18 02:15 UTC, Richard Shaw	no flags	Details
View All

Description Richard Shaw 2017-12-18 02:04:30 UTC

Description of problem:
During game shutdown (freeing resources) a segmentation fault occurs.

The problem appears to be the compiler not keeping %rsp 16-byte aligned.

Version-Release number of selected component (if applicable):
hedgewars-0.9.23-1.fc27 (in updates-testing) also reproducible on the F25 update.


Steps to Reproduce:
1. run hedgewars with minimal test demo (test.53.hwd)
2. Game exits

Actual results:
SIGSEGV occurs


Expected results:
game exits normally

Comment 1 Richard Shaw 2017-12-18 02:05:33 UTC

Created attachment 1369242 [details]
Minimal test demo for hwengine

Debug command used:

$ gdb --args hwengine test.53.hwd

Comment 2 Richard Shaw 2017-12-18 02:06:24 UTC

Created attachment 1369243 [details]
gdb backtrace from f27 w/ intel graphics

Comment 3 Richard Shaw 2017-12-18 02:07:12 UTC

Created attachment 1369244 [details]
gdb backtrace from f26 w/ nvidia graphics

Comment 4 Richard Shaw 2017-12-18 02:08:01 UTC

Created attachment 1369245 [details]
SIGSEGV report w/ registers

Comment 5 Richard Shaw 2017-12-18 02:12:22 UTC

Packages where I think the fault is occurring:

$ rpm -qa | grep libglvnd | grep x86_64
libglvnd-debuginfo-1.0.0-1.fc26.x86_64
libglvnd-1.0.0-1.fc26.x86_64
libglvnd-devel-1.0.0-1.fc26.x86_64
libglvnd-gles-1.0.0-1.fc26.x86_64
libglvnd-glx-1.0.0-1.fc26.x86_64
libglvnd-opengl-1.0.0-1.fc26.x86_64
libglvnd-egl-1.0.0-1.fc26.x86_64
libglvnd-core-devel-1.0.0-1.fc26.x86_64

Comment 6 Richard Shaw 2017-12-18 02:15:33 UTC

Created attachment 1369248 [details]
gdb sharedlibrary

Comment 7 John Reiser 2017-12-18 06:19:42 UTC

The bug is in fpc-3.0.2-1.fc26.src.rpm
===== line 104, fpcbuild-3.0.2/fpcsrc/rtl/linux/x86_64/cprt0.as
/* fake main routine which will be run from libc */
        .globl main_stub
        .type main_stub,@function
main_stub:
        /* save return address */
        popq    %rax

        // stack alignment
        pushq   %rax

        movq    ___fpc_ret_rbp@GOTPCREL(%rip),%rcx
        movq    %rbp,(%rcx)
        movq    ___fpc_ret@GOTPCREL(%rip),%rcx
        movq    %rax,(%rcx)

  <<snip>>

        .globl _haltproc
        .type _haltproc,@function
_haltproc:
        movq    operatingsystem_result@GOTPCREL(%rip),%rax
        movzwl  (%rax),%eax

        /* return to libc */
        movq    ___fpc_ret_rbp@GOTPCREL(%rip),%rcx
        movq    (%rcx),%rbp
        movq    ___fpc_ret@GOTPCREL(%rip),%rcx
        movq    (%rcx),%rdx
        pushq    %rdx
        ret
=====
The 'ret' goes to the saved __fpc_ret with %rsp of (8 mod 16) because _haltproc was call()ed, thus %rsp is (8 mod 16) at entry, and _haltproc leaves with the same value in %rsp: the 'ret' cancels the 'pushq'.  But the value of %rsp immediately after a 'ret' must be (0 mod 16), just like immediately before a 'call'.

Incidentally, there is a missing ".balign 8" at the beginning of the .data section.  The address of __fpc_ret is (4 mod 8) but should be (0 mod 8).

The Submitter should change the Component from 'gcc' to whoever owns fpc-3.0.2.  (I am not allowed to change the Component.)

The Submitter owes me a beer for this one.

Comment 8 Florian Weimer 2017-12-18 07:39:03 UTC

%rsp isn't restored before returning from main_stub, either.  If any of the (indirect) callers of the function use local variables, but not a frame pointer, that will cause problems as well.  fpc should use setjmp/longjmp from libc or a correct reimplementation of that.

Comment 9 John Reiser 2017-12-18 17:18:54 UTC

> fpc should use setjmp/longjmp from libc or a correct reimplementation of that.

I am irked at the bug, but the developer/maintainer deserves some sympathy.

1. setjmp/longjmp has a history of poor engineering. There are multiple standards, and sizeof "the save buffer" has not been constant.  It is prudent to reserve 2*sizeof(void *) more space than otherwise apparent.

2. More than once I have had to debug a random project where longjmp() has discarded the most-valuable clues: all the active stack frames from the call to longjmp() back to the returning-for-the-second-time setjmp().  No traceback, no local variables.  You get no hint about how _haltproc() was called.  And if a long-dormant error condition causes longjmp() to attempt a return when the _caller_ of the corresponding setjmp() already has returned, then you are really in trouble.  setjmp/longjmp almost demands the rigor of a formal state machine model.

3. What is really happening is ... co-routines!  _libc_start_main is one, main_stub is another, and _haltproc is merely a co_resume().  Yes, the main_stub co-routine is left dangling when _libc_start_main calls exit().  But hedgewars is a game, so the dangling co-routine is merely a continuation that is not saved.


Suggestion to maintainer: explicitly comment why you are (or are not) calling setjmp/longjmp.

Comment 10 Gwyn Ciesla 2018-02-07 20:26:01 UTC

Fascinatingly, after being forced to reinstall f27 on the same hardware (don't ask, it's embarrasing) I no longer see this issue. I'm not sure if something was fixed or if it was the reinstall, but the system was only a day or so behind fully updated before I reinstalled, and I got the issue at that time.

Comment 11 John Reiser 2018-02-08 04:46:53 UTC

(In reply to Gwyn Ciesla from comment #10)
> after ... reinstall f27 ... I no longer see this issue.

The reported SIGSEGV occurred when glibc was calling destructors (on_exit, atexit, cxx_exit) because gcc used a SSEx instruction which requires 16-byte alignment to store a struct of two 8-byte pointers onto the stack.  The generated code was "too fancy", not even close to optimal; storing each pointer separately is faster and smaller.  So if either glibc or gcc was changed, then that could have changed the code to avoid the SSEx instruction, and the SIGSEGV would not occur, even though the bug in fpc-3.0.2-1 has not been fixed.

Comment 12 Richard Shaw 2018-02-08 13:35:58 UTC

I'll test on my end and see if it's fixed. If so at least I can go ahead and push the updates to stable.

Comment 13 Gwyn Ciesla 2018-02-13 21:08:01 UTC

(In reply to Gwyn Ciesla from comment #10)
> Fascinatingly, after being forced to reinstall f27 on the same hardware
> (don't ask, it's embarrasing) I no longer see this issue. I'm not sure if
> something was fixed or if it was the reinstall, but the system was only a
> day or so behind fully updated before I reinstalled, and I got the issue at
> that time.

Please ignore this comment. I forgot to update to 0.9.23 before testing, and the issue is still there.

Comment 14 Richard Shaw 2018-02-13 21:10:47 UTC

Whoops...

Comment 15 Artur Frenszek-Iwicki 2018-02-17 12:35:01 UTC

Has anyone tried reporting this issue upstream?

Comment 16 Richard Shaw 2018-02-17 13:02:38 UTC

Adding the bug I reported upstream to the see also link.

However, the problem appears to be in Fedora fpc, not hedgewars.

Comment 17 Mattia Verga 2018-02-17 13:37:49 UTC

The same version compiled with the same fpc version does not produce any SIGSEV on Rawhide.

Comment 18 Richard Shaw 2018-02-17 13:56:41 UTC

I've started a scratch build for f27. If it solves the problem I'll do an official rebuild:

https://koji.fedoraproject.org/koji/taskinfo?taskID=25121334

Comment 19 John Reiser 2018-02-17 16:07:04 UTC

Rebuilding hedgewars might "accidentally" evade the problem on some x86_64 systems, but only if fpc-3.0.2 has been improved, and/or if gcc happens to compile dl-error-skeleton.c differently.  

The SIGSEGV occurs in /lib64/libc.so.6 when the instruction:
  => 0x7ffff6354e9c <_dl_catch_error+108>:	movaps %xmm0,0x50(%rsp)
is executed with %rsp not 16-byte aligned., because the opcode 'movaps' requires 16-byte alignment.  fpc-3.0.2 is responsible for creating the misalignment, glibc is responsible for the source code in _dl_catch_error, and gcc is responsible for generating the 'movaps' opcode when compiling glibc.

If fpc has not changed (see Comment #7 above; portions of fpc are statically linked into hedgewars) then the SIGSEGV depends only on the code that gcc generates for glibc.  So a new hedgewars won't make any difference on those systems where libc.so.6 uses 'movaps'.

Low-level details: the source code is:
===== dl-error-skeleton.c:187
187	  c.objname = objname;
188	  c.errstring = errstring;
=====
where .objname and .errstring are two pointers that are the first two members of a struct.  The generated code is:
=====
   0x7ffff6354e30 <_dl_catch_error>:	push   %rbx
   0x7ffff6354e31 <_dl_catch_error+1>:	sub    $0x140,%rsp
   0x7ffff6354e38 <_dl_catch_error+8>:	lea    0x4c(%rsp),%rax
   0x7ffff6354e3d <_dl_catch_error+13>:	mov    %rdi,0x10(%rsp)
   0x7ffff6354e42 <_dl_catch_error+18>:	mov    %rsi,0x8(%rsp)
   0x7ffff6354e47 <_dl_catch_error+23>:	mov    %rdi,0x20(%rsp)
   0x7ffff6354e4c <_dl_catch_error+28>:	lea    0x50(%rsp),%rdi
   0x7ffff6354e51 <_dl_catch_error+33>:	mov    %rcx,0x30(%rsp)
   0x7ffff6354e56 <_dl_catch_error+38>:	mov    %rax,0x68(%rsp)
   0x7ffff6354e5b <_dl_catch_error+43>:	mov    0x279f26(%rip),%rax        # 0x7ffff65ced88
   0x7ffff6354e62 <_dl_catch_error+50>:	movq   0x10(%rsp),%xmm0   ## objname in low 64 bits of %xmm0
   0x7ffff6354e68 <_dl_catch_error+56>:	mov    %r8,0x38(%rsp)
   0x7ffff6354e6d <_dl_catch_error+61>:	movhps 0x8(%rsp),%xmm0   ## errstring in high 64 bits of %xmm0
   0x7ffff6354e72 <_dl_catch_error+66>:	mov    %rdx,0x28(%rsp)
   0x7ffff6354e77 <_dl_catch_error+71>:	mov    %rdx,0x60(%rsp)
   0x7ffff6354e7c <_dl_catch_error+76>:	mov    %fs:(%rax),%rsi
   0x7ffff6354e80 <_dl_catch_error+80>:	mov    %rdi,%fs:(%rax)
   0x7ffff6354e84 <_dl_catch_error+84>:	lea    0x70(%rsp),%rdi
   0x7ffff6354e89 <_dl_catch_error+89>:	mov    %fs:0x28,%rbx
   0x7ffff6354e92 <_dl_catch_error+98>:	mov    %rbx,0x138(%rsp)
   0x7ffff6354e9a <_dl_catch_error+106>:	xor    %ebx,%ebx
=> 0x7ffff6354e9c <_dl_catch_error+108>:	movaps %xmm0,0x50(%rsp)  ## store objname and errstring
=====

Comment 20 John Reiser 2018-02-17 16:27:00 UTC

fpc-3.0.4-1.fc28 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1038094) still has the bug in Comment #7.  So now it's entirely up to how gcc compiles glibc.

Comment 21 Richard Shaw 2018-03-03 14:11:07 UTC

Scratch build with fpc 3.0.4 plus whatever other dependencies have been updated, hopefully it's "accidentally" fixed until it can be truly fixed.

https://koji.fedoraproject.org/koji/taskinfo?taskID=25443192

Comment 22 John Reiser 2018-03-03 16:50:44 UTC

Changing the fpc compiler (fpc bug 15582) will have no effect on this hedgewars SIGSEGV.  It's good to know that fpc is aware of the compiler issue, and hedgewars might suffer from that compiler issue in the future, but the immediate hedgewars problem is not with the fpc compiler.

fpc bug 15582 identifies the issue that the fpc compiler for i686 and x86_64 should align the stack pointer of all/most/some subroutines to (0 mod 16), if those subroutines might call any code that was compiled by gcc.  However, the current hedgewars SIGSEGV is caused by some hand-written assembly-language code in rtl/linux/x86_64/cprt0.as, so changing the fpc compiler will have no effect.

Comment 23 John Reiser 2018-03-03 17:01:59 UTC

The quickest, dirtiest, ugliest fix is to force the stack alignment in _haltproc:

        movq    ___fpc_ret@GOTPCREL(%rip),%rcx
        movq    (%rcx),%rdx
+       andq $~0xf,%rsp  ## force stack alignment until fpc rtl/linux/x86_64/cprt0.as is fixed
        pushq    %rdx
        ret

Like I said, this is VILE.  But it will work.

Comment 24 Richard Shaw 2018-03-03 17:53:07 UTC

fpc upstream shows the following as their fix:

$ svn diff --git -r38399:38400
Index: rtl/linux/x86_64/si_c.inc
===================================================================
diff --git a/rtl/linux/x86_64/si_c.inc b/rtl/linux/x86_64/si_c.inc
--- a/rtl/linux/x86_64/si_c.inc	(revision 38399)
+++ b/rtl/linux/x86_64/si_c.inc	(revision 38400)
@@ -172,6 +172,7 @@
 procedure _FPC_libc_haltproc(e:longint); assembler; nostackframe; public name '_haltproc';
   asm
     movl      %edi,%eax
+    popq    %rdx                    { keep stack aligned }
     movq fpc_ret(%rip),%rdx         { return to libc }
     movq fpc_ret_rbp(%rip),%rbp
     pushq %rdx
Index: rtl/linux/x86_64/cprt0.as
===================================================================
diff --git a/rtl/linux/x86_64/cprt0.as b/rtl/linux/x86_64/cprt0.as
--- a/rtl/linux/x86_64/cprt0.as	(revision 38399)
+++ b/rtl/linux/x86_64/cprt0.as	(revision 38400)
@@ -134,6 +134,7 @@
 	movq    ___fpc_ret_rbp@GOTPCREL(%rip),%rcx
         movq    (%rcx),%rbp
 	movq    ___fpc_ret@GOTPCREL(%rip),%rcx
+        popq    %rdx
         movq    (%rcx),%rdx
         pushq    %rdx
 	ret

Comment 25 John Reiser 2018-03-03 19:18:55 UTC

The 'popq' will work if and only if %rsp is (8 mod 16) at entry to the *_haltproc subroutine.  This is true when all subroutines keep the stack 16-byte aligned.  But if some subroutine in the call chain mis-aligns the stack, so that %rsp at entry is not (8 mod 16), then the stack is mis-aligned upon exit, and the same SIGSEGV may occur.

The 'andq' always works to align the stack, regardless of how mis-aligned the stack is at entry.  So the 'andq' is safer.  But there still is risk if main_stub uses on-stack local variables after _haltproc returns (see Florian's comment #8).  The true fix REQUIRES using setjmp+longjmp.

Comment 26 Florian.Klaempfl 2018-03-04 09:32:58 UTC

> The 'popq' will work if and only if %rsp is (8 mod 16) at entry to the *_haltproc subroutine.

This is a requirement of the ABI (see section 3.2.2 in AMD64 ABI Draft 0.99.8 – April 25, 2016 – 10:07).

About main_stub: it is never supposed to return, this is why it contains a even a hlt (if this hlt is ever reached, the OS returned from an exit call).

Neverthless, I agree, that using setjmp/longjmp would be usefull, especially to get rid of this assembler mess.

Comment 27 Richard Shaw 2018-03-04 13:29:10 UTC

For now I have submitted a pull request to fpc using commit r38400 and have verified it fixes the crash in hedgewars. If an improved fix (via setjmp/longjmp or otherwise) is submitted I'll update the pull request.

https://src.fedoraproject.org/rpms/fpc/pull-request/1

Comment 28 Richard Shaw 2018-03-04 13:30:10 UTC

Gwyn, if you would like to independently verify the fix works I setup a copr for testing:

https://copr.fedorainfracloud.org/coprs/hobbes1069/hedgewars/

Comment 29 Gwyn Ciesla 2018-03-05 15:05:46 UTC

Confirmed! Thank you!

Comment 30 Fedora Update System 2018-03-17 21:55:05 UTC

fpc-3.0.2-4.fc27 lazarus-1.8.0-4.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-b6e484dff7

Comment 31 Fedora Update System 2018-03-18 18:59:39 UTC

fpc-3.0.2-4.fc27, lazarus-1.8.0-4.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-b6e484dff7

Comment 32 Fedora Update System 2018-04-04 17:07:40 UTC

fpc-3.0.2-4.fc27, lazarus-1.8.0-4.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.