Bug 1667163 - perl segfault in openqa worker process isotovideo (seems to be related to opencv threading)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: os-autoinst
Version: 34
Hardware: All
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Adam Williamson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-17 15:26 UTC by Michel Normand
Modified: 2021-04-24 20:11 UTC
CC List: 3 users

Fixed In Version: os-autoinst-4.6-35.20210326git24ec8f9.fc33 os-autoinst-4.6-35.20210326git24ec8f9.fc34
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-21 21:41:16 UTC
Type: Bug
Embargoed:


Attachments
perl_segfault.txt (25.90 KB, text/plain)
2019-01-17 15:26 UTC, Michel Normand
better backtrace of the crash as of F30 (66.35 KB, text/plain)
2019-09-30 23:18 UTC, Adam Williamson


Links
GitHub os-autoinst/os-autoinst pull 1640 (closed): signalblocker: Also block SIGCHLD (last updated 2021-04-13 15:27:22 UTC)

Description Michel Normand 2019-01-17 15:26:36 UTC
Created attachment 1521300 [details]
perl_segfault.txt

Many perl segfaults in openqa-worker on ppc64le fc29.

They seem to have appeared since the update to fc29.

We have this problem on two openqa fc29 machines in our lab.

Is there the same problem for the ppc64le workers used by https://openqa.stg.fedoraproject.org/?

The last available backtrace, extracted from the attached investigation file:
===
$ coredumpctl info | tee /tmp/coredumpinfo.log
           PID: 8096 (/usr/bin/isotov)
           UID: 990 (_openqa-worker)
           GID: 989 (_openqa-worker)                                                           
        Signal: 11 (SEGV)                                                                      
     Timestamp: Thu 2019-01-17 14:35:14 CET (1h 12min ago)                                     
  Command Line: /usr/bin/isotovideo: backend
    Executable: /usr/bin/perl                                                                  
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker                      
          Unit: openqa-worker
         Slice: openqa-worker.slice                                                            
       Boot ID: b40fba0570024e7b9010791bb51b498f                                               
    Machine ID: 085bb1198e8d4ff996c6a02a1d71366e                                               
      Hostname: abanc.test.toulouse-stg.fr.ibm.com                                             
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.990.b40fba0570024e7b9010791bb51b498f.8096.1547732114000000.lz4 (inaccessible)
       Message: Process 8096 (/usr/bin/isotov) of user 990 dumped core.                        
                
                Stack trace of thread 8209:
                #0  0x00007fff95c4420c Perl_csighandler (libperl.so.5.28)                      
                #1  0x00007fff95f004d8 __kernel_sigtramp_rt64 (linux-vdso64.so.1)              
                #2  0x00007fff957bd4d0 syscall (libc.so.6)                                     
                #3  0x00007fff86abd074 _ZN3tbb8internal3rml14private_worker3runEv (libtbb.so.2)
                #4  0x00007fff86abd1c8 _ZN3tbb8internal3rml14private_worker14thread_routineEPv (libtbb.so.2)    
                #5  0x00007fff95af8e14 start_thread (libpthread.so.0)
                #6  0x00007fff957c6b08 __clone (libc.so.6)
===

Comment 1 Adam Williamson 2019-01-17 22:02:01 UTC
Yeah, I'm seeing some crashes like this. At a quick glance they all look the same, and there have been 20 since 2019-01-14, exactly 5 per day.

Comment 2 Adam Williamson 2019-01-17 22:02:24 UTC
Sorry, to clarify, that was a reply to "Is there the same problem for the ppc64le workers used by https://openqa.stg.fedoraproject.org/?"

Comment 3 Petr Pisar 2019-01-18 09:11:48 UTC
There were many changes between Fedora 28 and 29, e.g. a completely new Perl, glibc, and kernel. Good luck finding the offending change.

The only details in this bug report are that you use TBB from a thread that makes a syscall, and that this triggers some signal caught by a perl process. That's terribly insufficient.

We had threads-tbb in Fedora, but we removed it because it was unreliable (read: broken, bug #1099397). This one looks like some library linked into the perl process is spawning a thread via TBB without diverting the thread-specific signals that thread inherits from the thread running the perl interpreter, which has signal handlers registered from perl.
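
To illustrate the mechanism (a generic sketch, not os-autoinst code; the handler name is made up for the demo): signal dispositions installed with sigaction() are process-wide, so any thread created later, including TBB workers, runs whatever handler perl installed, even though it has no perl interpreter context.
===
// Generic sketch: dispositions are process-wide, so a handler installed
// before a library spawns threads also runs inside those threads.
// Compile with: g++ -pthread demo.cpp
#include <signal.h>
#include <unistd.h>
#include <thread>

static void fake_perl_handler(int) {
    // In the real crash this is Perl_csighandler(), which dereferences a
    // per-thread interpreter pointer that is NULL in non-Perl threads.
    const char msg[] = "handler ran in a thread that never registered it\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);  // async-signal-safe
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = fake_perl_handler;  // stands in for Perl's handler
    sigaction(SIGUSR1, &sa, nullptr);   // installed once, in the main thread

    std::thread worker([] {
        raise(SIGUSR1);  // delivered in this thread; the handler still runs
    });
    worker.join();
    return 0;
}
===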

Comment 4 Adam Williamson 2019-09-30 23:17:40 UTC
This is still happening all the way up to F30. I'm attaching a backtrace with all the debuginfo installed, though it still doesn't mean a lot to me.

Comment 5 Adam Williamson 2019-09-30 23:18:49 UTC
Created attachment 1621302 [details]
better backtrace of the crash as of F30

Comment 6 Adam Williamson 2019-09-30 23:25:15 UTC
So, I dug into this a *bit* more at least. The tbb culprit here is likely opencv. os-autoinst uses opencv via a perl library it ships called 'tinycv':

https://github.com/os-autoinst/os-autoinst/tree/master/ppmclibs

and opencv requires libtbb, so that's where the tbb dep comes in. So is the problem here that os-autoinst and/or tinycv should be doing some special handling of signals?
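
One quick way to test that theory (a hypothetical diagnostic, not a proposed fix; the helper name below is made up, though cv::setNumThreads() is a real OpenCV call) would be to switch off OpenCV's internal parallelism before tinycv does any image work, so no TBB worker threads ever exist to catch a stray signal:
===
// Hypothetical experiment: with no OpenCV worker threads, there is nothing
// for a process-directed signal to land in except perl's own threads.
#include <opencv2/core/utility.hpp>

void disable_opencv_threading() {  // made-up helper name
    cv::setNumThreads(0);  // 0 disables OpenCV's threading optimizations
}
===
That would obviously cost performance, so it would only be useful to confirm the diagnosis.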

Comment 7 Petr Pisar 2019-10-01 09:04:10 UTC
Just a generic remark:

When sending a process-level signal to a multithreaded process, it is not deterministic which thread receives the signal. E.g. if one thread sets a signal handler and another one just defaults to process termination, and the signal is delivered to that other thread, you are doomed. You must coordinate signal masks among all the threads to prevent accidental kills. Linux also provides a way of sending a signal to a specific thread, but that is not supported by the Perl interpreter. Also, don't confuse Perl thread signals, as implemented in the "threads" Perl module, with POSIX signals. Perl thread signals do not use POSIX signals at all.
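
A generic sketch of that coordination (not os-autoinst code; SIGTERM is just an example): block the signals of interest before any thread exists, so every later thread inherits the blocked mask, then consume them synchronously in one dedicated thread.
===
// Generic sketch: canonical signal handling for a multithreaded process.
// Compile with: g++ -pthread demo.cpp
#include <signal.h>
#include <unistd.h>
#include <cstdio>
#include <thread>

int main() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    // Block before any thread is created; new threads inherit the mask.
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    std::thread signal_thread([set] {
        int sig = 0;
        sigwait(&set, &sig);  // only this thread ever consumes SIGTERM
        std::printf("got signal %d, shutting down cleanly\n", sig);
    });

    std::thread worker([] {
        // Threads a library spawns from here also inherit the blocked mask.
    });
    worker.join();

    kill(getpid(), SIGTERM);  // process-directed, yet delivery is now deterministic
    signal_thread.join();
    return 0;
}
===
(pthread_kill() is the thread-directed delivery mentioned above, which the Perl interpreter does not use.)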

Comment 8 Adam Williamson 2019-10-01 15:37:09 UTC
Looking at it some more, this happens on x86_64 and aarch64 too; at least, I'm seeing lots of crashes of isotovideo with backtraces that run through libtbb on both. Also, upstream is apparently aware of it and has even tried to fix it, but they still see it too:

https://github.com/os-autoinst/os-autoinst/pull/1032

Comment 9 Jerry James 2019-10-01 16:20:31 UTC
In the gdb.txt file you attached, Adam, I see this:

Thread 1 (Thread 0x7fff9f4bf180 (LWP 5975)):
#0  0x00007fffaf0120ac in Perl_csighandler (sig=<optimized out>, sip=<optimized out>, uap=<optimized out>) at mg.c:1510
        my_perl = 0x0

So I believe your theory is correct.  TBB spawns a bunch of threads, none of which call PERL_SET_CONTEXT.  Signal handlers are registered that assume that they are running inside a perl interpreter.  When a signal arrives, the handler tries to invoke perl functionality and dies horribly because it is actually running in a perl-unaware TBB thread.

We can't (and shouldn't) make the TBB threads call PERL_SET_CONTEXT, so the only solution is to ensure that the TBB threads cannot receive the signals in question.  Upstream attempted to do that with the pull request you noted.  I will try to carve out some time to see if I can tell why that didn't work.
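
In outline, that solution looks something like the following sketch (the class name is made up, and this is not the actual os-autoinst code, though the upstream "signalblocker" eventually worked along these lines): block signals in the calling thread around the first call into OpenCV, so the TBB pool it lazily creates inherits a fully blocked mask, then restore the mask so perl's own signal handling keeps working.
===
// Sketch only: RAII guard that hides signals from threads spawned in scope.
#include <signal.h>
#include <pthread.h>

class ScopedSignalBlock {  // made-up name, not the upstream class
    sigset_t old_;
public:
    ScopedSignalBlock() {
        sigset_t all;
        sigfillset(&all);  // block everything blockable in this thread
        pthread_sigmask(SIG_BLOCK, &all, &old_);
    }
    ~ScopedSignalBlock() {
        pthread_sigmask(SIG_SETMASK, &old_, nullptr);  // restore for perl
    }
};

void first_call_into_opencv() {  // made-up stand-in for tinycv's first call
    ScopedSignalBlock guard;
    // ... the OpenCV call that lazily creates the TBB worker pool goes here;
    // the workers inherit the fully blocked mask and keep it forever.
}   // guard restores the caller's mask; perl signal handling resumes
===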

Comment 10 Jerry James 2019-10-02 21:53:22 UTC
Can someone who knows how to reproduce the problem give step-by-step instructions? I know nothing about either os-autoinst or openqa, so some hand-holding may be necessary.

If that is too hard, can someone who is able to reproduce the problem install tbb-debuginfo and opencv-core-debuginfo, then contrive to run os-autoinst under gdb control with a breakpoint on tbb::internal::rml::private_server::adjust_job_count_estimate?  I would like to see a backtrace from when that is first invoked.

Comment 11 Adam Williamson 2019-10-02 22:05:11 UTC
Oh, one thing that's noted upstream but not here - it appears this crash happens when the process is exiting anyway, so it isn't actually causing any terrible problems. I was reminded of it while debugging problems with the new ppc64le worker hosts, but after looking into it more, the problems we're having there aren't to do with this crash.

Reproducing is a bit tricky because you need to have enough of an openQA setup deployed to run and complete a test, which isn't really trivial. I'll try and do what you requested if I get a bit of time, Jerry.

Comment 12 Jerry James 2019-10-02 22:21:16 UTC
Okay, wait, don't bother. If it happens when the process is exiting, then I'm looking at the wrong end of things, and that backtrace won't help. Let me dig through the source code a little first.

Comment 13 Jerry James 2019-10-02 22:24:16 UTC
No, I take that back.  That backtrace still might be useful.  Uh ... how did I change the priority and severity?  I just added a comment.  I'll put them back again.  Sorry!

Comment 14 Ben Cotton 2020-04-30 21:01:18 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 30 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 15 Adam Williamson 2021-04-13 15:27:23 UTC
Upstream has just pointed me to:
https://github.com/os-autoinst/os-autoinst/pull/1640
which may actually fix this! I'll try and do a build with it soon.

Comment 16 Fedora Update System 2021-04-13 19:50:04 UTC
FEDORA-2021-aa39748257 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257

Comment 17 Fedora Update System 2021-04-13 19:50:04 UTC
FEDORA-2021-186bca5b58 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58

Comment 18 Fedora Update System 2021-04-13 20:48:07 UTC
FEDORA-2021-aa39748257 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-aa39748257`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-aa39748257

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 19 Fedora Update System 2021-04-14 15:14:30 UTC
FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-186bca5b58`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-186bca5b58

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 20 Fedora Update System 2021-04-21 21:41:16 UTC
FEDORA-2021-186bca5b58 has been pushed to the Fedora 33 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 21 Fedora Update System 2021-04-24 20:11:28 UTC
FEDORA-2021-aa39748257 has been pushed to the Fedora 34 stable repository.
If the problem still persists, please make a note of it in this bug report.

