Created attachment 1521300 [details]
perl_segfault.txt

Many perl segfaults in openqa-worker on ppc64le fc29.

They seem to have appeared since the update to fc29. We have this problem on two openqa fc29 machines in our lab. Is there the same problem for the ppc64le workers used by https://openqa.stg.fedoraproject.org/ ?

The last available backtrace, extracted from the attached investigation file:

===
$coredumpctl info | tee /tmp/coredumpinfo.log
           PID: 8096 (/usr/bin/isotov)
           UID: 990 (_openqa-worker)
           GID: 989 (_openqa-worker)
        Signal: 11 (SEGV)
     Timestamp: Thu 2019-01-17 14:35:14 CET (1h 12min ago)
  Command Line: /usr/bin/isotovideo: backend
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker
          Unit: openqa-worker
         Slice: openqa-worker.slice
       Boot ID: b40fba0570024e7b9010791bb51b498f
    Machine ID: 085bb1198e8d4ff996c6a02a1d71366e
      Hostname: abanc.test.toulouse-stg.fr.ibm.com
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.990.b40fba0570024e7b9010791bb51b498f.8096.1547732114000000.lz4 (inaccessible)
       Message: Process 8096 (/usr/bin/isotov) of user 990 dumped core.

                Stack trace of thread 8209:
                #0  0x00007fff95c4420c Perl_csighandler (libperl.so.5.28)
                #1  0x00007fff95f004d8 __kernel_sigtramp_rt64 (linux-vdso64.so.1)
                #2  0x00007fff957bd4d0 syscall (libc.so.6)
                #3  0x00007fff86abd074 _ZN3tbb8internal3rml14private_worker3runEv (libtbb.so.2)
                #4  0x00007fff86abd1c8 _ZN3tbb8internal3rml14private_worker14thread_routineEPv (libtbb.so.2)
                #5  0x00007fff95af8e14 start_thread (libpthread.so.0)
                #6  0x00007fff957c6b08 __clone (libc.so.6)
===
Yeah, I'm seeing some like this. At a quick glance they all look to be the same, and there have been 20 since 2019-01-14, exactly 5 per day.
Sorry, to clarify, that was a reply to "Is there the same problem for ppc64le workers used by https://openqa.stg.fedoraproject.org/"
There were many changes between Fedora 28 and 29, e.g. a completely new Perl, glibc, and kernel. Good luck with finding the offending change. The only details in this bug report are that you use TBB from a thread that calls a syscall, and that this triggers some signal caught by a perl process. That's terribly insufficient. We had threads-tbb in Fedora, but we removed it because it was unreliable (read: broken, bug #1099397). This one looks like you are spawning a thread using TBB from some library linked into the perl process, without diverting thread-specific signals inherited from a thread that is running the perl interpreter and has registered signal handlers from perl.
This is still happening all the way up to F30. I'm attaching a backtrace with all the debuginfo installed, though it still doesn't mean a lot to me.
Created attachment 1621302 [details] better backtrace of the crash as of F30
so, dug into this a *bit* more at least. the tbb culprit here is likely opencv. os-autoinst uses opencv via a perl library it ships called 'tinycv': https://github.com/os-autoinst/os-autoinst/tree/master/ppmclibs and opencv requires libtbb, so that's where the tbb dep comes in. So is the problem here that os-autoinst and/or tinycv should be doing some special handling of signals?
Just a generic remark: when a process-level signal is sent to a multithreaded process, it is not deterministic which thread receives the signal. E.g. if one thread sets a signal handler, and the other one just defaults to process termination, and the signal is delivered to the other thread, you are doomed. You must coordinate signal masks among all the threads to prevent accidental killing. Alternatively, Linux provides a way of sending a signal to a specific thread, but that's not supported by the Perl interpreter. Also, don't confuse Perl thread signals, as implemented in the "threads" Perl module, with POSIX signals. Perl thread signals do not use POSIX signals at all.
Looking at it some more, this happens on x86_64 and aarch64 too, at least I'm seeing lots of crashes of isotovideo with tracebacks that run through libtbb on both. Also, upstream is apparently aware of it and even tried to fix it, but they also still see it: https://github.com/os-autoinst/os-autoinst/pull/1032
In the gdb.txt file you attached, Adam, I see this:

Thread 1 (Thread 0x7fff9f4bf180 (LWP 5975)):
#0  0x00007fffaf0120ac in Perl_csighandler (sig=<optimized out>, sip=<optimized out>, uap=<optimized out>) at mg.c:1510
        my_perl = 0x0

So I believe your theory is correct. TBB spawns a bunch of threads, none of which call PERL_SET_CONTEXT. Signal handlers are registered that assume that they are running inside a perl interpreter. When a signal arrives, the handler tries to invoke perl functionality and dies horribly because it is actually running in a perl-unaware TBB thread.

We can't (and shouldn't) make the TBB threads call PERL_SET_CONTEXT, so the only solution is to ensure that the TBB threads cannot receive the signals in question. Upstream attempted to do that with the pull request you noted. I will try to carve out some time to see if I can tell why that didn't work.
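For illustration only, the general technique of keeping library-spawned threads from receiving certain signals can be sketched as below. This is not the upstream patch and not os-autoinst code. It is a minimal Python sketch of POSIX signal mask inheritance: block the signals in the creating thread, spawn the worker (which inherits that mask), then restore the creator's mask. The names `BLOCKED`, `start_worker_with_signals_blocked`, `current_mask` and `demo` are hypothetical, and the choice of SIGTERM and SIGINT is an assumption made for the example. Note that Python always runs its own signal handlers in the main thread, so this shows only the mask mechanism; in C the equivalent call is pthread_sigmask() around the code that creates the threads.

```python
import signal
import threading

# Hypothetical set of signals to keep away from worker threads.
BLOCKED = {signal.SIGTERM, signal.SIGINT}


def current_mask():
    """Return the calling thread's signal mask without changing it."""
    # SIG_BLOCK with an empty set modifies nothing and returns the old mask.
    return signal.pthread_sigmask(signal.SIG_BLOCK, set())


def start_worker_with_signals_blocked(target, blocked=BLOCKED):
    """Start a thread whose inherited mask has `blocked` signals blocked."""
    # Block in the creating thread: a new thread inherits its creator's mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
    try:
        worker = threading.Thread(target=target)
        worker.start()
    finally:
        # Restore the creator's mask so it can still receive the signals.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return worker


def demo():
    """Return (mask before, mask seen inside the worker, mask after)."""
    seen = {}

    def record_mask():
        seen["worker"] = current_mask()

    before = current_mask()
    worker = start_worker_with_signals_blocked(record_mask)
    worker.join()
    return before, seen["worker"], current_mask()


if __name__ == "__main__":
    before, worker_mask, after = demo()
    print("creator before:", sorted(before))
    print("worker:        ", sorted(worker_mask))
    print("creator after: ", sorted(after))
```

The point of restoring the mask in the creating thread is that exactly one thread, the one that owns the interpreter, remains eligible for delivery. For this to help in the situation described above, the blocking has to be in effect at the moment the library creates its threads, which may be earlier than one expects if the library spawns them lazily.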
Can someone who knows how to reproduce the problem give step by step instructions? I know nothing about either os-autoinst or openqa, so some hand holding may be necessary. If that is too hard, can someone who is able to reproduce the problem install tbb-debuginfo and opencv-core-debuginfo, then contrive to run os-autoinst under gdb control with a breakpoint on tbb::internal::rml::private_server::adjust_job_count_estimate? I would like to see a backtrace from when that is first invoked.
Oh, one thing that's noted upstream but not here - it appears this crash happens when the process is exiting anyway, so it isn't actually causing any terrible problems. I was reminded of it while debugging problems with the new ppc64le worker hosts, but after looking into it more, the problems we're having there aren't to do with this crash. Reproducing is a bit tricky because you need to have enough of an openQA setup deployed to run and complete a test, which isn't really trivial. I'll try and do what you requested if I get a bit of time, Jerry.
Okay, wait, don't bother. If it is when the process is exiting, then I'm looking at the wrong end of things, and that backtrace won't help. Let me dig through the source code a little first.
No, I take that back. That backtrace still might be useful. Uh ... how did I change the priority and severity? I just added a comment. I'll put them back again. Sorry!
This message is a reminder that Fedora 30 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '30'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 30 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Upstream has just pointed me to: https://github.com/os-autoinst/os-autoinst/pull/1640 which may actually fix this! I'll try and do a build with it soon.
FEDORA-2021-aa39748257 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257
FEDORA-2021-186bca5b58 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58
FEDORA-2021-aa39748257 has been pushed to the Fedora 34 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-aa39748257` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-186bca5b58` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-aa39748257 has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report.