Bug 919101 - strace -f : straced process killed by SIGTRAP (caused by BZ#837183)
Summary: strace -f : straced process killed by SIGTRAP (caused by BZ#837183)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: strace
Version: 6.4
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: rc
: ---
Assignee: Jeff Law
QA Contact: Michael Petlan
URL:
Whiteboard:
Depends On:
Blocks: 1153397
TreeView+ depends on / blocked
 
Reported: 2013-03-07 15:28 UTC by Julien Thomas
Modified: 2015-07-22 06:26 UTC (History)
6 users (show)

Fixed In Version: strace-4.8-9.el6
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-22 06:26:51 UTC


Attachments (Terms of Use)
rpm -qa rhel 6.4 + nagios epel with nagios-common dep (8.35 KB, text/plain)
2013-03-07 18:33 UTC, Julien Thomas
no flags Details
nagios minimal configuration (1.93 KB, application/x-gzip)
2013-03-07 18:34 UTC, Julien Thomas
no flags Details
mk-livestatus nagios broker source rpm (337.77 KB, application/x-rpm)
2013-03-07 18:36 UTC, Julien Thomas
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1308 normal SHIPPED_LIVE strace bug fix and enhancement update 2015-07-20 17:53:42 UTC
Red Hat Bugzilla 837183 None None None 2019-06-18 05:02:26 UTC

Description Julien Thomas 2013-03-07 15:28:40 UTC
Description of problem:

Running strace -f using package from RHEL 6.4 strace-4.5.19-1.17.el6.src.rpm can make the straced program to be "killed by SIGTRAP".

Version-Release number of selected component (if applicable):

strace-4.5.19-1.17.el6

How reproducible:

I can reproduce the problem unconditionally strac'ing my nagios process when it forks in order to execute its checks.

Steps to Reproduce:

Sorry I do not have a simple test case.

1. start nagios process
2. strace -f -o /tmp/strace.nagios  -s 256 -p $(</var/nagios/nagios.pid)
  
Actual results:

1 to 5 seconds after starting the strace :

1926  --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
1936  +++ killed by SIGTRAP +++
1935  +++ killed by SIGTRAP +++
1934  +++ killed by SIGTRAP +++
1933  +++ killed by SIGTRAP +++
1932  +++ killed by SIGTRAP +++
1931  +++ killed by SIGTRAP +++
1930  +++ killed by SIGTRAP +++
1929  +++ killed by SIGTRAP +++
1928  +++ killed by SIGTRAP +++
1927  +++ killed by SIGTRAP +++
1925  +++ killed by SIGTRAP +++
1923  +++ killed by SIGTRAP +++

Expected results:

The nagios process should not be killed.

Additional info:

The problem does not occurs with strace-4.5.19-1.11.el6_2.1 from RHEL 6.3 ISO. I had a look at the changes between that version and the strace-4.5.19-1.17.el6 from 6.4. I believe the problem is caused by the BZ#837183 patch serie. I built an RPM without those patches and the issue goes away.

strace-rh837183.patch
strace-rh837183-2.patch
strace-rh837183-3.patch
strace-rh837183-4.patch
strace-rh837183-5.patch
strace-rh837183-6.patch

Comment 1 Julien Thomas 2013-03-07 16:32:28 UTC
4.5.19-1.11.el6_2.1 => OK
4.5.19-1.11.el6_3.2 => killed by SIGTRAP
4.5.19-1.17.el6     => killed by SIGTRAP

In 4.5.19-1.11.el6_3.2, the patch set causing the problem is actually :

strace-rh849052.patch
strace-rh849052-2.patch
strace-rh849052-3.patch
strace-rh849052-4.patch
strace-rh849052-5.patch
strace-rh849052-6.patch

* Fri Aug 17 2012 Jeff Law <law@redhat.com> - 4.5.19-1.11.el6_3.1
- Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone
  syscalls rather than buggy magic breakpoint methods (#849052).

In 4.5.19-1.17.el6, the patch set was renamed (?)

strace-rh837183.patch
strace-rh837183-2.patch
strace-rh837183-3.patch
strace-rh837183-4.patch
strace-rh837183-5.patch
strace-rh837183-6.patch

* Tue Sep 11 2012 Jeff Law <law@redhat.com> - 4.5.19-1.15
- Additional patches related to process startup. (#837183)

[...]

* Fri Aug 03 2012 Jeff Law <law@redhat.com> - 4.5.19-1.12
- Fix semtimedop decoding for S390/S390x (#759569)
- Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone
  syscalls rather than buggy magic breakpoint methods (#837183).

* Thu Dec 22 2011 Jeff Law <law@redhat.com> - 4.5.19-1.11
- Track personality more carefully (#769118)

Comment 2 Jeff Law 2013-03-07 16:33:07 UTC
Can you attach the output of rpm -qa so I know what nagios plugins you've got installed and the contents of /etc/nagios so I have your nagios configuration.

If you have concerns about attaching the nagios configuration bits to this BZ, you can send them to me privately via law@redhat.com.

Comment 3 Jeff Law 2013-03-07 16:34:24 UTC
The patchfile renaming is strictly an internal process thing.  The effective content of the patchfiles should be the same.

Comment 4 Julien Thomas 2013-03-07 16:36:27 UTC
Hi Jeff, yes I diff'ed the two series, they are the same.

Comment 5 Julien Thomas 2013-03-07 18:33:11 UTC
Created attachment 706758 [details]
rpm -qa rhel 6.4 + nagios epel with nagios-common dep

Comment 6 Julien Thomas 2013-03-07 18:34:47 UTC
Created attachment 706759 [details]
nagios minimal configuration

Comment 7 Julien Thomas 2013-03-07 18:36:01 UTC
Created attachment 706760 [details]
mk-livestatus nagios broker source rpm

Comment 8 Julien Thomas 2013-03-07 18:38:16 UTC
Hi Jeff,

I prepared a test machine with a RHEL 6.4 up-to-date and a simple Nagios test configuration in order to reproduce the problem. I've attached the files to the bug report. We normally do not use the nagios package from EPEL because this version has some blocking bugs for our usage. However I've installed the one from EPEL on the test machine and the problem remains.

Here is how my directory structure looks like :

/tmp/etc-nagios/
/tmp/etc-nagios/nagios.cfg
/tmp/etc-nagios/objects.cfg
/tmp/var-nagios/

Here is how I start Nagios, then strace :

rm -fv /tmp/var-nagios/*
nagios -d /tmp/etc-nagios/nagios.cfg
strace -f -s 256 -p $(</tmp/var-nagios/nagios.pid)

Here is what I do in another terminal to get bug :

ts=$(date +%s); \
echo "[$ts] SCHEDULE_FORCED_SVC_CHECK;pom;DUMMY;$ts" \
    > /tmp/var-nagios/nagios.pipe

I caught something interesting while preparing the minimal test configuration... The problem does not occur if I disable the mk-livestatus broker module. I understand we could say something is doggy in that version of livestatus. However I can't reproduce the problem with strace-4.5.19-1.11.el6_2.1 from the 6.3 ISO, and this is why I started reporting this bug.

Hope this helps,
Julien

Comment 9 Jeff Law 2013-03-07 19:45:21 UTC
Julien, yes, that will definitely help.  Thanks for the time and effort you've put into getting this into a reproducible state.

Comment 11 Jeff Law 2014-06-18 17:35:10 UTC
I'm having a bear of a time reproducing this problem.  I got it to trigger once, but subsequent attempts to trigger have been futile -- including doing things like putting the signaling sequence into a tight loop.

Based on that, the actual failure point and the set of patches that Julien identified as causing problems, I can only conclude that the new code to gain control of the process is racy in some way or another.

However, with a consistent, reliable way to reproduce, it's going to be virtually impossible to debug what's happening or for our QE team to verify that any fix actually fixes the problem.

Julien, could you perhaps try the version of strace in DTS (developer toolset) 2.0/2.1 and see if it exhibits this problem.  One of the things I've been pondering is try to get RHEL 6's strace updated from the strace-4.5.19 codebase to something more modern such as strace-4.7 or strace-4.8.  Knowing if either of those versions exhibit this problem would certainly be helpful in making that decision.

Comment 12 david.west 2014-11-07 21:16:05 UTC
We recently ran into this issue  when debugging our application with stock RHEL
6.4.  I've isolated a consistent repro:

(0) Environment:

----
[root@host ~]# uname -a
Linux host 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@host ~]# rpm -qa redhat-release-server strace
strace-4.5.19-1.17.el6.x86_64
redhat-release-server-6Server-6.4.0.4.el6.x86_64
----

(1) In one shell execute the following one-liner:

----
[root@host ~]# python -c 'import base64,zlib;eval(compile(zlib.decompress(base64.decodestring("eJxdjc0KgzAQhO95iiEgJJD6cxW89dJbH6Ag/qwaqokkaW3fvhEvbQ+77A4z8+lltS4gTI6aXgW9kLJe+XccPZpmZmx12gTw6+VcIulvhiOB9elIYdW9kIz1NGCwVsiSAdukZ0Kxn8Bel/qZaBV5WkTrgUl9aFyoDW31IYgYVxAyOr7y/2lADxA7e7DuLiSqCjnkgYoqvah7Cp612mRt4yeufh5+6uIK7kE8gj5j8klm")),"th.py","exec"))'
PID: 25862
----

(2) In another shell, on the same  stock RHEL 6.4 server, execute the following
    strace command:

----
[root@host ~]# strace -f -tt -v -o /tmp/c1 -p 25862
Process 25862 attached with 2 threads - interrupt to quit
Process 26097 attached
Process 26097 detached
Process 25863 detached
[root@host ~]#
----

The python process from the first step will then immediately exit as follows:

----
Trace/breakpoint trap (core dumped)
[root@host ~]#
----

Below is some interesting content from the strace output:

----
[root@host ~]# grep SIGTRAP /tmp/c1
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++

[root@host ~]# grep ^25863 /tmp/c1 | tail
25863 19:52:21.109777 select(0, NULL, NULL, NULL, {0, 28580} <unfinished ...>
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---

[root@host ~]# grep ^25862 /tmp/c1 | tail
25862 19:52:21.109921 select(0, NULL, NULL, NULL, {0, 3195}) = 0 (Timeout)
25862 19:52:21.113356 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe99601f9d0) = 26097
25862 19:52:21.113925 select(0, NULL, NULL, NULL, {0, 100000} <unfinished ...>
25862 19:52:21.123330 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
25862 19:52:21.123377 --- SIGCHLD (Child exited) @ 0 (0) ---
25862 19:52:21.123407 select(0, NULL, NULL, NULL, {0, 90645} <unfinished ...>
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++
----

Some observations from working on this repro:

  o It seems that the 6.4 version  of strace fails to follow child processes of
    the non-main  thread, unlike strace  on other RHEL versions  when attaching
    with "-p".

  o This error only occurs when attching with "-p".

  o This error did not occur for forks from non-main threads.

  o It seems to have something  to do with attempting to follow child processes
    of the main thread that manipulate the SIGTRAP handler.

Comment 13 Jeff Law 2014-11-14 17:20:56 UTC
David, thanks for the consistent reproducer.  We're early in the RHEL 6.7 planning process right now and this bug is one we're definitely looking at.  Having a consistent reproducer (or even a semi-consistent one) will help significantly.

Comment 14 Jeff Law 2014-11-18 18:55:41 UTC
Note that strace-4.8 which comes with the developer toolset V3 does not exhibit this problem behaviour.  Ideally we'll be upgrading to that version for RHEL 6.7.

Comment 21 errata-xmlrpc 2015-07-22 06:26:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1308.html


Note You need to log in before you can comment on or make changes to this bug.