Bug 1240487 - erl segfault on fedora-23-i686 (autoconf testsuite)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: erlang
Version: 23
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Peter Lemenkov
QA Contact: Fedora Extras Quality Assurance
URL: http://bugs.erlang.org/browse/ERL-80
Whiteboard:
Depends On:
Blocks: 1236072
Reported: 2015-07-07 05:52 UTC by Pavel Raiskup
Modified: 2016-02-21 16:27 UTC (History)

Fixed In Version: erlang-17.4-6.fc22 erlang-17.4-6.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-21 02:21:33 UTC
Type: Bug
Embargoed:


Attachments
Reproducer (787 bytes, application/x-gzip)
2015-07-07 06:10 UTC, Pavel Raiskup

Description Pavel Raiskup 2015-07-07 05:52:27 UTC
We observe autoconf FTBFS on rawhide (testsuite failures).  One of the
testsuite failures is related to Erlang & autoconf, but it appears only on
i686.  I tried to cut related testcase out into segfault-i686.tar.gz
reproducer:

  $ tar -xf segfault-i686.tar.gz
  $ cd segfault-i686
  $ make && make run
  erlc -b beam my_testsuite.erl
  cd lib && ./compile
  erl -pa ./lib -s my_testsuite test
  Erlang/OTP 17 [erts-6.3] [source] [smp:4:4] [async-threads:10] [hipe]
  [kernel-poll:false]

  Eshell V6.3  (abort with ^G)
  1>   All 3 tests passed.
  Makefile:6: recipe for target 'run' failed
  make: *** [run] Segmentation fault (core dumped)

The segfault ^^ breaks autoconf testsuite, but I'm not able to diagnose
properly.  Any help appreciated, let me know if you need some other info.

FTBFS:
https://kojipkgs.fedoraproject.org//work/tasks/7404/10217404/build.log

Pavel

Comment 1 Pavel Raiskup 2015-07-07 06:10:24 UTC
Created attachment 1049095
Reproducer

Comment 2 Jan Kurik 2015-07-15 13:21:08 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could also affect pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 3 Randy Barlow 2016-01-05 06:48:23 UTC
I have also been experiencing this bug, and am unfortunately unable to run the test suites on my packages for i686. I see this issue on Rawhide.

Comment 4 Peter Lemenkov 2016-01-18 12:43:33 UTC
Ok, now I've got the same issues both in F-23 and in Rawhide. The latest failure is here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=12592117

Surprisingly, I can see them only in Koji. If I run the build manually (with rpmbuild), everything is fine.

Comment 5 Peter Lemenkov 2016-01-18 12:46:16 UTC
Unfortunately this reproducer doesn't reproduce the issue on my machine. Everything is fine:

[petro@fedora32i686 segfault-i686]$ make run
erl -pa ./lib -s my_testsuite test
Erlang/OTP 18 [erts-7.2.1] [source] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2.1  (abort with ^G)
1>   All 3 tests passed.
[petro@fedora32i686 segfault-i686]$

Comment 6 Peter Lemenkov 2016-01-18 13:13:24 UTC
(In reply to Pavel Raiskup from comment #0)
> We observe autoconf FTBFS on rawhide (testsuite failures).  One of the
> testsuite failures is related to Erlang & autoconf, but it appears only on
> i686.  I tried to cut related testcase out into segfault-i686.tar.gz
> reproducer:
> 
>   $ tar -xf segfault-i686.tar.gz
>   $ cd segfault-i686
>   $ make && make run
>   erlc -b beam my_testsuite.erl
>   cd lib && ./compile
>   erl -pa ./lib -s my_testsuite test
>   Erlang/OTP 17 [erts-6.3] [source] [smp:4:4] [async-threads:10] [hipe]
>   [kernel-poll:false]
> 
>   Eshell V6.3  (abort with ^G)
>   1>   All 3 tests passed.
>   Makefile:6: recipe for target 'run' failed
>   make: *** [run] Segmentation fault (core dumped)
> 
> The segfault ^^ breaks autoconf testsuite, but I'm not able to diagnose
> properly.  Any help appreciated, let me know if you need some other info.
> 
> FTBFS:
> https://kojipkgs.fedoraproject.org//work/tasks/7404/10217404/build.log
> 
> Pavel

Pavel, I've just checked - the issue is still there. Unfortunately, I can't reproduce it on my hardware (a KVM VM), only in Fedora Koji. Do you have access to a machine where the issue can be reproduced?

I really don't have a clue what's going on there.

Comment 7 Pavel Raiskup 2016-01-18 13:55:46 UTC
I'm not able to reproduce this now.

Comment 8 Randy Barlow 2016-01-18 19:39:03 UTC
Hello Peter!

Is it possible that the recent update to Erlang 18 fixed this issue?

P.S. Now we really have to get ejabberd updated, as it doesn't seem to work with Erlang 18 ☺ If you have some time, jcline and I have a few package review requests waiting. We CAN review each other's if necessary, but we'd rather that someone with more Erlang experience than we have review them if you or anyone else has the time. Oh, if we only had more time, right?

Comment 9 Peter Lemenkov 2016-01-18 20:18:38 UTC
(In reply to Randy Barlow from comment #8)
> Hello Peter!
> 
> Is it possible that the recent update to Erlang 18 fixed this issue?

Randy, it's certainly not fixed yet. And I'm afraid this issue has something to do with the Koji build system itself (hardware + software + configuration) rather than with Erlang.

I'm still trying to find an Erlang-related cause, but I have failed to reproduce the crash on any machine available to me (Erlang on a native i686 Rawhide, mock builds for Rawhide on RHEL6/RHEL7).

The only place where I can reproduce this issue reliably (100% of the time) is the Fedora Koji build system. This makes me very suspicious.

Comment 10 Randy Barlow 2016-01-18 20:39:17 UTC
Hi Peter!

Interesting. Working on problems that are hard to reproduce is tricky. I am sad to say that I am out of ideas. If you can think of a way I can assist, I am happy to!

Comment 11 Peter Lemenkov 2016-01-19 12:46:13 UTC
Filip, you mentioned in bug 1221824#c20 that you have a reproducer. Could you please run it again with strace or gdb attached? We really need your help here. :)

Comment 12 Filip Andres 2016-01-19 12:56:38 UTC
Hi Peter,
sure, will do in the evening, when I get to my fedora box.

f.

Comment 13 Filip Andres 2016-01-20 17:48:02 UTC
Hi,
I have been commenting on the other issue (https://bugzilla.redhat.com/show_bug.cgi?id=1221824), sorry :-) Copying the most important parts here:

* strace -- useless, the VM crashes in userspace (https://bugzilla.redhat.com/attachment.cgi?id=1116279)

* gdb stacktrace

(gdb) bt
#0  0x56798d70 in ethr_dw_atomic_cmpxchg () at ../include/internal/i386/atomic.h:177
#1  0x566103ce in ethr_dw_atomic_cmpxchg_nob (xchg=0xf4e0609c, new=0xf4e060a4, var=0x568688f0 <erts_proc+48>)
    at beam/erl_threads.h:1456
#2  erts_atomic64_inc_read_nob (var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1646
#3  step_interval_nob (icp=0x568688f0 <erts_proc+48>) at beam/utils.c:4954
#4  erts_smp_step_interval_nob (icp=icp@entry=0x568688f0 <erts_proc+48>) at beam/utils.c:5004
#5  0x5671572b in ptab_list_bif_engine (c_p=c_p@entry=0xf6dc0218, res_accp=res_accp@entry=0xf4e06178, 
    mbp=mbp@entry=0xf1f80a88) at beam/erl_ptab.c:927
#6  0x56716a5d in erts_ptab_list (c_p=c_p@entry=0xf6dc0218, ptab=0x568688c0 <erts_proc>) at beam/erl_ptab.c:766
#7  0x5661be76 in processes_0 (A__p=0xf6dc0218, BIF__ARGS=0xf7483100) at beam/bif.c:3841
#8  0x5659978b in process_main () at beam/beam_emu.c:3690
#9  0x56638784 in sched_thread_func (vesdp=0xf6087dc0) at beam/erl_process.c:8021
#10 0x567a19cc in thr_wrapper (vtwd=0xffffd1b4) at pthread/ethread.c:114
#11 0xf7f164be in start_thread (arg=0xf4e06b40) at pthread_create.c:333
#12 0xf7e2a3fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

* the problem seems to be triggered by the i686 build using the -mtune=atom flag. I tried the following change, and the resulting binary doesn't have the same problem:

%ifarch %{ix86}
%global optflags -mtune=generic
%endif

Build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=12621253

Now the erlang:processes() command executes successfully:

$ mock -r fedora-rawhide-i386 --no-clean --shell
INFO: mock.py version 1.2.14 starting (python version = 3.4.2)...
Start: init plugins
INFO: selinux enabled
Finish: init plugins
Start: run
Start: chroot init
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled dnf cache
Start: cleaning dnf metadata
Finish: cleaning dnf metadata
INFO: enabled ccache
Finish: chroot init
Start: shell
<mock-chroot>sh-4.3# erl
Erlang/OTP 18 [erts-7.2.1] [source] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2.1  (abort with ^G)
1> erlang:processes().
[<0.0.0>,<0.3.0>,<0.6.0>,<0.7.0>,<0.9.0>,<0.10.0>,<0.11.0>,
 <0.12.0>,<0.14.0>,<0.15.0>,<0.16.0>,<0.17.0>,<0.18.0>,
 <0.20.0>,<0.21.0>,<0.22.0>,<0.23.0>,<0.24.0>,<0.25.0>,
 <0.26.0>,<0.27.0>,<0.28.0>,<0.29.0>,<0.30.0>,<0.34.0>]
2> 

Summary:
There seems to be an error in the fallback implementation of ethr_dw_atomic_cmpxchg. I'm not sure whether these binaries would run on an Atom processor, though (and I don't have the means to test it).
I could ask on the erlang-bugs mailing list, but I will leave it to you to decide whether building for a generic processor (instead of Atom) is a viable workaround.
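
For readers unfamiliar with frames #0-#2 of the backtrace: erts_atomic64_inc_read_nob appears to increment a 64-bit counter, and on 32-bit x86 that goes through a double-word compare-and-exchange (ethr_dw_atomic_cmpxchg). The sketch below shows the general "increment via CAS retry loop" pattern using C11 <stdatomic.h>. It is not the ERTS code: the names interval_counter and counter_inc_read are hypothetical, and the relaxed memory ordering is an assumption based on reading the _nob suffix as "no barrier".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit counter such as the one the
 * backtrace shows at erts_proc+48. Not the real ERTS type. */
typedef struct {
    _Atomic uint64_t value;
} interval_counter;

/* Atomically increment the counter and return the new value.
 * Assumption: relaxed ordering, mirroring a "no barrier" variant. */
uint64_t counter_inc_read(interval_counter *c)
{
    assert(c != NULL);

    uint64_t expected = atomic_load_explicit(&c->value, memory_order_relaxed);
    uint64_t desired;

    do {
        /* Recomputed on every retry: a failed compare-exchange stores
         * the value it actually found into `expected`. */
        desired = expected + 1;
    } while (!atomic_compare_exchange_weak_explicit(
                 &c->value, &expected, desired,
                 memory_order_relaxed, memory_order_relaxed));

    return desired;
}
```

On a 32-bit x86 target the compiler may lower the 64-bit compare-exchange to a cmpxchg8b instruction or to a libatomic call, depending on the target flags. ERTS uses its own inline assembly for that step instead, which is where the crash in frame #0 occurs.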

Comment 14 Peter Lemenkov 2016-01-23 17:25:59 UTC
Found a way to see the actual stacktrace.

Run erl in GDB as shown above. When you get a SIGSEGV, the stack will be corrupted. First we need to recover it by adding values to (or subtracting them from) the $esp register (stack pointer). I believe those who know Intel assembly already know which values to try first. I tried stepping by 4 in each direction until I realized that I had to add 32. So, please, do:

(gdb) set $pc = *(void **)$esp
(gdb) set $esp = $esp + 32
(gdb) bt
#0  0x568688f0 in erts_proc ()
#1  0x566103ce in ethr_dw_atomic_cmpxchg_nob (xchg=0xf461609c, new=0xf46160a4, var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1456
#2  erts_atomic64_inc_read_nob (var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1646
#3  step_interval_nob (icp=0x568688f0 <erts_proc+48>) at beam/utils.c:4954
#4  erts_smp_step_interval_nob (icp=icp@entry=0x568688f0 <erts_proc+48>) at beam/utils.c:5004
#5  0x5671572b in ptab_list_bif_engine (c_p=c_p@entry=0xf6d80218, res_accp=res_accp@entry=0xf4616178, mbp=mbp@entry=0xf1f816a0) at beam/erl_ptab.c:927
#6  0x56716a5d in erts_ptab_list (c_p=c_p@entry=0xf6d80218, ptab=0x568688c0 <erts_proc>) at beam/erl_ptab.c:766
#7  0x5661be76 in processes_0 (A__p=0xf6d80218, BIF__ARGS=0xf74861c0) at beam/bif.c:3841
#8  0x5659978b in process_main () at beam/beam_emu.c:3690
#9  0x56638784 in sched_thread_func (vesdp=0xf608e000) at beam/erl_process.c:8021
#10 0x567a19cc in thr_wrapper (vtwd=0xffffd184) at pthread/ethread.c:114
#11 0xf7f184be in start_thread (arg=0xf4616b40) at pthread_create.c:333
#12 0xf7e2c3fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114
(gdb) 

See - a nice, clean stacktrace!
The erts_proc frame (#0) is a bogus value: the stack gets corrupted after calling ethr_dw_atomic_cmpxchg_nob.

That's all I've got for today.

Comment 15 Peter Lemenkov 2016-02-10 10:40:25 UTC
Possible workaround:

https://github.com/erlang/otp/commit/fd7fa46

Comment 16 Peter Lemenkov 2016-02-10 13:13:53 UTC
Fixed in Rawhide already. Will do builds for (both affected) F22 and F23 shortly.

Comment 17 Fedora Update System 2016-02-10 13:50:56 UTC
erlang-17.4-6.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-a79a47efb0

Comment 18 Fedora Update System 2016-02-10 14:53:25 UTC
erlang-17.4-6.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-18e2827992

Comment 19 Randy Barlow 2016-02-10 21:42:46 UTC
Peter, thanks so much for looking into this difficult issue. You are the man!

Comment 20 Fedora Update System 2016-02-11 14:53:13 UTC
erlang-17.4-6.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-18e2827992

Comment 21 Fedora Update System 2016-02-11 15:21:41 UTC
erlang-17.4-6.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a79a47efb0

Comment 22 Fedora Update System 2016-02-21 02:21:28 UTC
erlang-17.4-6.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

Comment 23 Fedora Update System 2016-02-21 16:27:01 UTC
erlang-17.4-6.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

