Description of problem: During game shutdown (freeing resources) a segmentation fault occurs. The problem appears to be the compiler not keeping %rsp 16-byte aligned. Version-Release number of selected component (if applicable): hedgewars-0.9.23-1.fc27 (in updates-testing) also reproducible on the F25 update. Steps to Reproduce: 1. run hedgewars with minimal test demo (test.53.hwd) 2. Game exits Actual results: SIGSEGV occurs Expected results: game exits normally
Created attachment 1369242 [details] Minimal test demo for hwengine Debug command used: $ gdb --args hwengine test.53.hwd
Created attachment 1369243 [details] gdb backtrace from f27 w/ intel graphics
Created attachment 1369244 [details] gdb backtrace from f26 w/ nvidia graphics
Created attachment 1369245 [details] SIGSEGV report w/ registers
Packages where I think the fault is occurring: $ rpm -qa | grep libglvnd | grep x86_64 libglvnd-debuginfo-1.0.0-1.fc26.x86_64 libglvnd-1.0.0-1.fc26.x86_64 libglvnd-devel-1.0.0-1.fc26.x86_64 libglvnd-gles-1.0.0-1.fc26.x86_64 libglvnd-glx-1.0.0-1.fc26.x86_64 libglvnd-opengl-1.0.0-1.fc26.x86_64 libglvnd-egl-1.0.0-1.fc26.x86_64 libglvnd-core-devel-1.0.0-1.fc26.x86_64
Created attachment 1369248 [details] gdb sharedlibrary
The bug is in fpc-3.0.2-1.fc26.src.rpm ===== line 104, fpcbuild-3.0.2/fpcsrc/rtl/linux/x86_64/cprt0.as /* fake main routine which will be run from libc */ .globl main_stub .type main_stub,@function main_stub: /* save return address */ popq %rax // stack alignment pushq %rax movq ___fpc_ret_rbp@GOTPCREL(%rip),%rcx movq %rbp,(%rcx) movq ___fpc_ret@GOTPCREL(%rip),%rcx movq %rax,(%rcx) <<snip>> .globl _haltproc .type _haltproc,@function _haltproc: movq operatingsystem_result@GOTPCREL(%rip),%rax movzwl (%rax),%eax /* return to libc */ movq ___fpc_ret_rbp@GOTPCREL(%rip),%rcx movq (%rcx),%rbp movq ___fpc_ret@GOTPCREL(%rip),%rcx movq (%rcx),%rdx pushq %rdx ret ===== The 'ret' goes to the saved __fpc_ret with %rsp of (8 mod 16) because _haltproc was call()ed, thus %rsp is (8 mod 16) at entry, and _haltproc leaves with the same value in %rsp: the 'ret' cancels the 'pushq'. But the value of %rsp immediately after a 'ret' must be (0 mod 16), just like immediately before a 'call'. Incidentally, there is a missing ".balign 8" at the beginning of the .data section. The address of __fpc_ret is (4 mod 8) but should be (0 mod 8). The Submitter should change the Component from 'gcc' to whoever owns fpc-3.0.2. (I am not allowed to change the Component.) The Submitter owes me a beer for this one.
%rsp isn't restored before returning from main_stub, either. If any of the (indirect) callers of the function use local variables, but not a frame pointer, that will cause problems as well. fpc should use setjmp/longjmp from libc or a correct reimplementation of that.
> fpc should use setjmp/longjmp from libc or a correct reimplementation of that. I am irked at the bug, but the developer/maintainer deserves some sympathy. 1. setjmp/longjmp has a history of poor engineering. There are multiple standards, and sizeof "the save buffer" has not been constant. It is prudent to reserve 2*sizeof(void *) more space than otherwise apparent. 2. More than once I have had to debug a random project where longjmp() has discarded the most-valuable clues: all the active stack frames from the call to longjmp() back to the returning-for-the-second-time setjmp(). No traceback, no local variables. You get no hint about how _haltproc() was called. And if a long-dormant error condition causes longjmp() to attempt a return when the _caller_ of the corresponding setjmp() already has returned, then you are really in trouble. setjmp/longjmp almost demands the rigor of a formal state machine model. 3. What is really happening is ... co-routines! _libc_start_main is one, main_stub is another, and _haltproc is merely a co_resume(). Yes, the main_stub co-routine is left dangling when _libc_start_main calls exit(). But hedgewars is a game, so the dangling co-routine is merely a continuation that is not saved. Suggestion to maintainer: explicitly comment why you are (or are not) calling setjmp/longjmp.
Fascinatingly, after being forced to reinstall f27 on the same hardware (don't ask, it's embarrasing) I no longer see this issue. I'm not sure if something was fixed or if it was the reinstall, but the system was only a day or so behind fully updated before I reinstalled, and I got the issue at that time.
(In reply to Gwyn Ciesla from comment #10) > after ... reinstall f27 ... I no longer see this issue. The reported SIGSEGV occurred when glibc was calling destructors (on_exit, atexit, cxx_exit) because gcc used a SSEx instruction which requires 16-byte alignment to store a struct of two 8-byte pointers onto the stack. The generated code was "too fancy", not even close to optimal; storing each pointer separately is faster and smaller. So if either glibc or gcc was changed, then that could have changed the code to avoid the SSEx instruction, and the SIGSEGV would not occur, even though the bug in fpc-3.0.2-1 has not been fixed.
I'll test on my end and see if it's fixed. If so at least I can go ahead and push the updates to stable.
(In reply to Gwyn Ciesla from comment #10) > Fascinatingly, after being forced to reinstall f27 on the same hardware > (don't ask, it's embarrasing) I no longer see this issue. I'm not sure if > something was fixed or if it was the reinstall, but the system was only a > day or so behind fully updated before I reinstalled, and I got the issue at > that time. Please ignore this comment. I forgot to update to 0.9.23 before testing, and the issue is still there.
Whoops...
Has anyone tried reporting this issue upstream?
Adding the bug I reported upstream to the see also link. However, the problem appears to be in Fedora fpc, not hedgewars.
The same version compiled with the same fpc version does not produce any SIGSEV on Rawhide.
I've started a scratch build for f27. If it solves the problem I'll do an official rebuild: https://koji.fedoraproject.org/koji/taskinfo?taskID=25121334
Rebuilding hedgewars might "accidentally" evade the problem on some x86_64 systems, but only if fpc-3.0.2 has been improved, and/or if gcc happens to compile dl-error-skeleton.c differently. The SIGSEGV occurs in /lib64/libc.so.6 when the instruction: => 0x7ffff6354e9c <_dl_catch_error+108>: movaps %xmm0,0x50(%rsp) is executed with %rsp not 16-byte aligned., because the opcode 'movaps' requires 16-byte alignment. fpc-3.0.2 is responsible for creating the misalignment, glibc is responsible for the source code in _dl_catch_error, and gcc is responsible for generating the 'movaps' opcode when compiling glibc. If fpc has not changed (see Comment #7 above; portions of fpc are statically linked into hedgewars) then the SIGSEGV depends only on the code that gcc generates for glibc. So a new hedgewars won't make any difference on those systems where libc.so.6 uses 'movaps'. Low-level details: the source code is: ===== dl-error-skeleton.c:187 187 c.objname = objname; 188 c.errstring = errstring; ===== where .objname and .errstring are two pointers that are the first two members of a struct. The generated code is: ===== 0x7ffff6354e30 <_dl_catch_error>: push %rbx 0x7ffff6354e31 <_dl_catch_error+1>: sub $0x140,%rsp 0x7ffff6354e38 <_dl_catch_error+8>: lea 0x4c(%rsp),%rax 0x7ffff6354e3d <_dl_catch_error+13>: mov %rdi,0x10(%rsp) 0x7ffff6354e42 <_dl_catch_error+18>: mov %rsi,0x8(%rsp) 0x7ffff6354e47 <_dl_catch_error+23>: mov %rdi,0x20(%rsp) 0x7ffff6354e4c <_dl_catch_error+28>: lea 0x50(%rsp),%rdi 0x7ffff6354e51 <_dl_catch_error+33>: mov %rcx,0x30(%rsp) 0x7ffff6354e56 <_dl_catch_error+38>: mov %rax,0x68(%rsp) 0x7ffff6354e5b <_dl_catch_error+43>: mov 0x279f26(%rip),%rax # 0x7ffff65ced88 0x7ffff6354e62 <_dl_catch_error+50>: movq 0x10(%rsp),%xmm0 ## objname in low 64 bits of %xmm0 0x7ffff6354e68 <_dl_catch_error+56>: mov %r8,0x38(%rsp) 0x7ffff6354e6d <_dl_catch_error+61>: movhps 0x8(%rsp),%xmm0 ## errstring in high 64 bits of %xmm0 0x7ffff6354e72 <_dl_catch_error+66>: mov %rdx,0x28(%rsp) 0x7ffff6354e77 <_dl_catch_error+71>: mov %rdx,0x60(%rsp) 0x7ffff6354e7c <_dl_catch_error+76>: mov %fs:(%rax),%rsi 0x7ffff6354e80 <_dl_catch_error+80>: mov %rdi,%fs:(%rax) 0x7ffff6354e84 <_dl_catch_error+84>: lea 0x70(%rsp),%rdi 0x7ffff6354e89 <_dl_catch_error+89>: mov %fs:0x28,%rbx 0x7ffff6354e92 <_dl_catch_error+98>: mov %rbx,0x138(%rsp) 0x7ffff6354e9a <_dl_catch_error+106>: xor %ebx,%ebx => 0x7ffff6354e9c <_dl_catch_error+108>: movaps %xmm0,0x50(%rsp) ## store objname and errstring =====
fpc-3.0.4-1.fc28 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1038094) still has the bug in Comment #7. So now it's entirely up to how gcc compiles glibc.
Scratch build with fpc 3.0.4 plus whatever other dependencies have been updated, hopefully it's "accidentally" fixed until it can be truly fixed. https://koji.fedoraproject.org/koji/taskinfo?taskID=25443192
Changing the fpc compiler (fpc bug 15582) will have no effect on this hedgewars SIGSEGV. It's good to know that fpc is aware of the compiler issue, and hedgewars might suffer from that compiler issue in the future, but the immediate hedgewars problem is not with the fpc compiler. fpc bug 15582 identifies the issue that the fpc compiler for i686 and x86_64 should align the stack pointer of all/most/some subroutines to (0 mod 16), if those subroutines might call any code that was compiled by gcc. However, the current hedgewars SIGSEGV is caused by some hand-written assembly-language code in rtl/linux/x86_64/cprt0.as, so changing the fpc compiler will have no effect.
The quickest, dirtiest, ugliest fix is to force the stack alignment in _haltproc: movq ___fpc_ret@GOTPCREL(%rip),%rcx movq (%rcx),%rdx + andq $~0xf,%rsp ## force stack alignment until fpc rtl/linux/x86_64/cprt0.as is fixed pushq %rdx ret Like I said, this is VILE. But it will work.
fpc upstream shows the following as their fix: $ svn diff --git -r38399:38400 Index: rtl/linux/x86_64/si_c.inc =================================================================== diff --git a/rtl/linux/x86_64/si_c.inc b/rtl/linux/x86_64/si_c.inc --- a/rtl/linux/x86_64/si_c.inc (revision 38399) +++ b/rtl/linux/x86_64/si_c.inc (revision 38400) @@ -172,6 +172,7 @@ procedure _FPC_libc_haltproc(e:longint); assembler; nostackframe; public name '_haltproc'; asm movl %edi,%eax + popq %rdx { keep stack aligned } movq fpc_ret(%rip),%rdx { return to libc } movq fpc_ret_rbp(%rip),%rbp pushq %rdx Index: rtl/linux/x86_64/cprt0.as =================================================================== diff --git a/rtl/linux/x86_64/cprt0.as b/rtl/linux/x86_64/cprt0.as --- a/rtl/linux/x86_64/cprt0.as (revision 38399) +++ b/rtl/linux/x86_64/cprt0.as (revision 38400) @@ -134,6 +134,7 @@ movq ___fpc_ret_rbp@GOTPCREL(%rip),%rcx movq (%rcx),%rbp movq ___fpc_ret@GOTPCREL(%rip),%rcx + popq %rdx movq (%rcx),%rdx pushq %rdx ret
The 'popq' will work if and only if %rsp is (8 mod 16) at entry to the *_haltproc subroutine. This is true when all subroutines keep the stack 16-byte aligned. But if some subroutine in the call chain mis-aligns the stack, so that %rsp at entry is not (8 mod 16), then the stack is mis-aligned upon exit, and the same SIGSEGV may occur. The 'andq' always works to align the stack, regardless of how mis-aligned the stack is at entry. So the 'andq' is safer. But there still is risk if main_stub uses on-stack local variables after _haltproc returns (see Florian's comment #8). The true fix REQUIRES using setjmp+longjmp.
> The 'popq' will work if and only if %rsp is (8 mod 16) at entry to the *_haltproc subroutine. This is a requirement of the ABI (see section 3.2.2 in AMD64 ABI Draft 0.99.8 – April 25, 2016 – 10:07). About main_stub: it is never supposed to return, this is why it contains a even a hlt (if this hlt is ever reached, the OS returned from an exit call). Neverthless, I agree, that using setjmp/longjmp would be usefull, especially to get rid of this assembler mess.
For now I have submitted a pull request to fpc using commit r38400 and have verified it fixes the crash in hedgewars. If an improved fix (via setjmp/longjmp or otherwise) is submitted I'll update the pull request. https://src.fedoraproject.org/rpms/fpc/pull-request/1
Gwyn, if you would like to independently verify the fix works I setup a copr for testing: https://copr.fedorainfracloud.org/coprs/hobbes1069/hedgewars/
Confirmed! Thank you!
fpc-3.0.2-4.fc27 lazarus-1.8.0-4.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-b6e484dff7
fpc-3.0.2-4.fc27, lazarus-1.8.0-4.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-b6e484dff7
fpc-3.0.2-4.fc27, lazarus-1.8.0-4.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.