Bug 1667163
| Summary: | perl segfault in openqa worker process isotovideo (seems to be related to opencv threading) | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Michel Normand <normand> |
| Component: | os-autoinst | Assignee: | Adam Williamson <awilliam> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 34 | CC: | awilliam, loganjerry, ppisar |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | os-autoinst-4.6-35.20210326git24ec8f9.fc33 os-autoinst-4.6-35.20210326git24ec8f9.fc34 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-04-21 21:41:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michel Normand
2019-01-17 15:26:36 UTC
Yeah, I'm seeing some like this. At a quick glance they all look to be the same, and there have been 20 since 2019-01-14, exactly 5 per day.

Sorry, to clarify, that was a reply to "Is there the same problem for ppc64le workers used by https://openqa.stg.fedoraproject.org/"

There were many changes between Fedora 28 and 29. E.g. a completely new Perl, glibc, and kernel. Good luck with finding the offending change. The only details in this bug report are that you use TBB from a thread that calls a syscall and that triggers some signal caught by a perl process. That's terribly insufficient.

We had threads-tbb in Fedora, but we removed it because it was unreliable (read broken, bug #1099397). This one seems like you are spawning a thread using TBB from some library linked into the perl process, without diverting thread-specific signals inherited from the thread that is running the perl interpreter and has registered signal handlers from perl.

This is still happening all the way up to F30. I'm attaching a backtrace with all the debuginfo installed, though it still doesn't mean a lot to me.

Created attachment 1621302 [details]
better backtrace of the crash as of F30
so, dug into this a *bit* more at least. the tbb culprit here is likely opencv. os-autoinst uses opencv via a perl library it ships called 'tinycv': https://github.com/os-autoinst/os-autoinst/tree/master/ppmclibs and opencv requires libtbb, so that's where the tbb dep comes in.

So is the problem here that os-autoinst and/or tinycv should be doing some special handling of signals?

Just a generic remark: When sending a process-level signal to a multithreaded process, it is not deterministic which thread receives the signal. E.g. if one thread sets a signal handler, and the other one just defaults to process termination, and the signal is delivered to the other thread, you are doomed. You must coordinate signal masks among all the threads to prevent accidental killing. Alternatively, Linux provides a way of sending a signal to a specific thread, but that's not supported by the Perl interpreter.

Also don't confuse Perl thread signals, as implemented in the "threads" Perl module, with POSIX signals. Perl thread signals do not use POSIX signals at all.

Looking at it some more, this happens on x86_64 and aarch64 too, at least I'm seeing lots of crashes of isotovideo with tracebacks that run through libtbb on both. Also, upstream is apparently aware of it and even tried to fix it, but they also still see it: https://github.com/os-autoinst/os-autoinst/pull/1032

In the gdb.txt file you attached, Adam, I see this:
Thread 1 (Thread 0x7fff9f4bf180 (LWP 5975)):
#0 0x00007fffaf0120ac in Perl_csighandler (sig=<optimized out>, sip=<optimized out>, uap=<optimized out>) at mg.c:1510
my_perl = 0x0
So I believe your theory is correct. TBB spawns a bunch of threads, none of which call PERL_SET_CONTEXT. Signal handlers are registered that assume that they are running inside a perl interpreter. When a signal arrives, the handler tries to invoke perl functionality and dies horribly because it is actually running in a perl-unaware TBB thread.
We can't (and shouldn't) make the TBB threads call PERL_SET_CONTEXT, so the only solution is to ensure that the TBB threads cannot receive the signals in question. Upstream attempted to do that with the pull request you noted. I will try to carve out some time to see if I can tell why that didn't work.
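To make the "threads cannot receive the signals" idea concrete, here is a minimal sketch of the usual POSIX pattern: block the signals in the spawning thread, start the worker threads (which inherit the creator's mask), then restore the original mask in the spawning thread. This is only an illustration and is not taken from either upstream pull request. It is written in Python for brevity; the set `BLOCKED` and the helper names `spawn_with_signals_blocked` and `demo` are hypothetical. Python always runs its own signal handlers in the main thread, so this does not reproduce the crash. It only shows the mask inheritance that a C or C++ fix would rely on when calling `pthread_sigmask` around the code that causes TBB to spawn threads.

```python
import signal
import threading

# Hypothetical choice of signals to keep away from worker threads.
BLOCKED = {signal.SIGTERM, signal.SIGINT}


def spawn_with_signals_blocked(target, count):
    """Start `count` threads that inherit a mask with BLOCKED blocked."""
    # Block the signals in the calling thread first: a new thread
    # starts with a copy of its creator's signal mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, BLOCKED)
    try:
        threads = [threading.Thread(target=target) for _ in range(count)]
        for t in threads:
            t.start()
    finally:
        # Restore the caller's mask so it can still receive the signals.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return threads


def demo(count=4):
    """Return (masks seen by the workers, mask of the calling thread)."""
    masks = []
    lock = threading.Lock()

    def worker():
        # SIG_BLOCK with an empty set changes nothing and returns
        # the current mask of this thread.
        mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])
        with lock:
            masks.append(mask)

    for t in spawn_with_signals_blocked(worker, count):
        t.join()
    return masks, signal.pthread_sigmask(signal.SIG_BLOCK, [])


if __name__ == "__main__":
    worker_masks, main_mask = demo()
    print("worker masks:", worker_masks)
    print("main mask:", main_mask)
```

Each worker should report a mask containing the blocked signals, while the calling thread ends up with the mask it started with, so a process-directed signal can only be delivered to a thread that is prepared to handle it.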
Can someone who knows how to reproduce the problem give step by step instructions? I know nothing about either os-autoinst or openqa, so some hand holding may be necessary. If that is too hard, can someone who is able to reproduce the problem install tbb-debuginfo and opencv-core-debuginfo, then contrive to run os-autoinst under gdb control with a breakpoint on tbb::internal::rml::private_server::adjust_job_count_estimate? I would like to see a backtrace from when that is first invoked.

Oh, one thing that's noted upstream but not here - it appears this crash happens when the process is exiting anyway, so it isn't actually causing any terrible problems. I was reminded of it while debugging problems with the new ppc64le worker hosts, but after looking into it more, the problems we're having there aren't to do with this crash.

Reproducing is a bit tricky because you need to have enough of an openQA setup deployed to run and complete a test, which isn't really trivial. I'll try and do what you requested if I get a bit of time, Jerry.

Okay, wait, don't bother. If it is when the process is exiting, then I'm looking at the wrong end of things, and that backtrace won't help. Let me dig through the source code a little first.

No, I take that back. That backtrace still might be useful.

Uh ... how did I change the priority and severity? I just added a comment. I'll put them back again. Sorry!

This message is a reminder that Fedora 30 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 30 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Upstream has just pointed me to: https://github.com/os-autoinst/os-autoinst/pull/1640 which may actually fix this! I'll try and do a build with it soon.

FEDORA-2021-aa39748257 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257

FEDORA-2021-186bca5b58 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58

FEDORA-2021-aa39748257 has been pushed to the Fedora 34 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-aa39748257` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-186bca5b58` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 stable repository. If problem still persists, please make note of it in this bug report.

FEDORA-2021-aa39748257 has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report.