Bug 919101
| Summary: | strace -f : straced process killed by SIGTRAP (caused by BZ#837183) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Julien Thomas <julthomas> | ||||||||
| Component: | strace | Assignee: | Jeff Law <law> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Michael Petlan <mpetlan> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 6.4 | CC: | david.west, julthomas, mcermak, mnewsome, mpetlan, ohudlick | ||||||||
| Target Milestone: | rc | Keywords: | Rebase | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | strace-4.8-9.el6 | Doc Type: | Rebase: Bug Fixes and Enhancements | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2015-07-22 06:26:51 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1153397 | ||||||||||
| Attachments: |
|
||||||||||
|
Description
Julien Thomas
2013-03-07 15:28:40 UTC
4.5.19-1.11.el6_2.1 => OK 4.5.19-1.11.el6_3.2 => killed by SIGTRAP 4.5.19-1.17.el6 => killed by SIGTRAP In 4.5.19-1.11.el6_3.2, the patch set causing the problem is actually : strace-rh849052.patch strace-rh849052-2.patch strace-rh849052-3.patch strace-rh849052-4.patch strace-rh849052-5.patch strace-rh849052-6.patch * Fri Aug 17 2012 Jeff Law <law> - 4.5.19-1.11.el6_3.1 - Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone syscalls rather than buggy magic breakpoint methods (#849052). In 4.5.19-1.17.el6, the patch set was renamed (?) strace-rh837183.patch strace-rh837183-2.patch strace-rh837183-3.patch strace-rh837183-4.patch strace-rh837183-5.patch strace-rh837183-6.patch * Tue Sep 11 2012 Jeff Law <law> - 4.5.19-1.15 - Additional patches related to process startup. (#837183) [...] * Fri Aug 03 2012 Jeff Law <law> - 4.5.19-1.12 - Fix semtimedop decoding for S390/S390x (#759569) - Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone syscalls rather than buggy magic breakpoint methods (#837183). * Thu Dec 22 2011 Jeff Law <law> - 4.5.19-1.11 - Track personality more carefully (#769118) Can you attach the output of rpm -qa so I know what nagios plugins you've got installed and the contents of /etc/nagios so I have your nagios configuration. If you have concerns about attaching the nagios configuration bits to this BZ, you can send them to me privately via law. The patchfile renaming is strictly an internal process thing. The effective content of the patchfiles should be the same. Hi Jeff, yes I diff'ed the two series, they are the same. Created attachment 706758 [details]
rpm -qa rhel 6.4 + nagios epel with nagios-common dep
Created attachment 706759 [details]
nagios minimal configuration
Created attachment 706760 [details]
mk-livestatus nagios broker source rpm
Hi Jeff,
I prepared a test machine with a RHEL 6.4 up-to-date and a simple Nagios test configuration in order to reproduce the problem. I've attached the files to the bug report. We normally do not use the nagios package from EPEL because this version has some blocking bugs for our usage. However I've installed the one from EPEL on the test machine and the problem remains.
Here is how my directory structure looks like :
/tmp/etc-nagios/
/tmp/etc-nagios/nagios.cfg
/tmp/etc-nagios/objects.cfg
/tmp/var-nagios/
Here is how I start Nagios, then strace :
rm -fv /tmp/var-nagios/*
nagios -d /tmp/etc-nagios/nagios.cfg
strace -f -s 256 -p $(</tmp/var-nagios/nagios.pid)
Here is what I do in another terminal to get bug :
ts=$(date +%s); \
echo "[$ts] SCHEDULE_FORCED_SVC_CHECK;pom;DUMMY;$ts" \
> /tmp/var-nagios/nagios.pipe
I caught something interesting while preparing the minimal test configuration... The problem does not occur if I disable the mk-livestatus broker module. I understand we could say something is doggy in that version of livestatus. However I can't reproduce the problem with strace-4.5.19-1.11.el6_2.1 from the 6.3 ISO, and this is why I started reporting this bug.
Hope this helps,
Julien
Julien, yes, that will definitely help. Thanks for the time and effort you've put into getting this into a reproducible state. I'm having a bear of a time reproducing this problem. I got it to trigger once, but subsequent attempts to trigger have been futile -- including doing things like putting the signaling sequence into a tight loop. Based on that, the actual failure point and the set of patches that Julien identified as causing problems, I can only conclude that the new code to gain control of the process is racy in some way or another. However, with a consistent, reliable way to reproduce, it's going to be virtually impossible to debug what's happening or for our QE team to verify that any fix actually fixes the problem. Julien, could you perhaps try the version of strace in DTS (developer toolset) 2.0/2.1 and see if it exhibits this problem. One of the things I've been pondering is try to get RHEL 6's strace updated from the strace-4.5.19 codebase to something more modern such as strace-4.7 or strace-4.8. Knowing if either of those versions exhibit this problem would certainly be helpful in making that decision. We recently ran into this issue when debugging our application with stock RHEL
6.4. I've isolated a consistent repro:
(0) Environment:
----
[root@host ~]# uname -a
Linux host 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@host ~]# rpm -qa redhat-release-server strace
strace-4.5.19-1.17.el6.x86_64
redhat-release-server-6Server-6.4.0.4.el6.x86_64
----
(1) In one shell execute the following one-liner:
----
[root@host ~]# python -c 'import base64,zlib;eval(compile(zlib.decompress(base64.decodestring("eJxdjc0KgzAQhO95iiEgJJD6cxW89dJbH6Ag/qwaqokkaW3fvhEvbQ+77A4z8+lltS4gTI6aXgW9kLJe+XccPZpmZmx12gTw6+VcIulvhiOB9elIYdW9kIz1NGCwVsiSAdukZ0Kxn8Bel/qZaBV5WkTrgUl9aFyoDW31IYgYVxAyOr7y/2lADxA7e7DuLiSqCjnkgYoqvah7Cp612mRt4yeufh5+6uIK7kE8gj5j8klm")),"th.py","exec"))'
PID: 25862
----
(2) In another shell, on the same stock RHEL 6.4 server, execute the following
strace command:
----
[root@host ~]# strace -f -tt -v -o /tmp/c1 -p 25862
Process 25862 attached with 2 threads - interrupt to quit
Process 26097 attached
Process 26097 detached
Process 25863 detached
[root@host ~]#
----
The python process from the first step will then immediately exit as follows:
----
Trace/breakpoint trap (core dumped)
[root@host ~]#
----
Below is some interesting content from the strace output:
----
[root@host ~]# grep SIGTRAP /tmp/c1
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++
[root@host ~]# grep ^25863 /tmp/c1 | tail
25863 19:52:21.109777 select(0, NULL, NULL, NULL, {0, 28580} <unfinished ...>
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
[root@host ~]# grep ^25862 /tmp/c1 | tail
25862 19:52:21.109921 select(0, NULL, NULL, NULL, {0, 3195}) = 0 (Timeout)
25862 19:52:21.113356 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe99601f9d0) = 26097
25862 19:52:21.113925 select(0, NULL, NULL, NULL, {0, 100000} <unfinished ...>
25862 19:52:21.123330 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
25862 19:52:21.123377 --- SIGCHLD (Child exited) @ 0 (0) ---
25862 19:52:21.123407 select(0, NULL, NULL, NULL, {0, 90645} <unfinished ...>
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++
----
Some observations from working on this repro:
o It seems that the 6.4 version of strace fails to follow child processes of
the non-main thread, unlike strace on other RHEL versions when attaching
with "-p".
o This error only occurs when attching with "-p".
o This error did not occur for forks from non-main threads.
o It seems to have something to do with attempting to follow child processes
of the main thread that manipulate the SIGTRAP handler.
David, thanks for the consistent reproducer. We're early in the RHEL 6.7 planning process right now and this bug is one we're definitely looking at. Having a consistent reproducer (or even a semi-consistent one) will help significantly. Note that strace-4.8 which comes with the developer toolset V3 does not exhibit this problem behaviour. Ideally we'll be upgrading to that version for RHEL 6.7. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1308.html |