919101 – strace -f : straced process killed by SIGTRAP (caused by BZ#837183)

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 919101 - strace -f : straced process killed by SIGTRAP (caused by BZ#837183)

Summary: strace -f : straced process killed by SIGTRAP (caused by BZ#837183)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	strace
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Law
QA Contact:	Michael Petlan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1153397
TreeView+	depends on / blocked

Reported:	2013-03-07 15:28 UTC by Julien Thomas
Modified:	2015-07-22 06:26 UTC (History)
CC List:	6 users (show)
Fixed In Version:	strace-4.8-9.el6
Doc Type:	Rebase: Bug Fixes and Enhancements
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-07-22 06:26:51 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
rpm -qa rhel 6.4 + nagios epel with nagios-common dep (8.35 KB, text/plain) 2013-03-07 18:33 UTC, Julien Thomas	no flags	Details
nagios minimal configuration (1.93 KB, application/x-gzip) 2013-03-07 18:34 UTC, Julien Thomas	no flags	Details
mk-livestatus nagios broker source rpm (337.77 KB, application/x-rpm) 2013-03-07 18:36 UTC, Julien Thomas	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	837183	1	None	None	None	2021-01-20 06:05:38 UTC
Red Hat Product Errata	RHBA-2015:1308	0	normal	SHIPPED_LIVE	strace bug fix and enhancement update	2015-07-20 17:53:42 UTC

Description Julien Thomas 2013-03-07 15:28:40 UTC

Description of problem:

Running strace -f using package from RHEL 6.4 strace-4.5.19-1.17.el6.src.rpm can make the straced program to be "killed by SIGTRAP".

Version-Release number of selected component (if applicable):

strace-4.5.19-1.17.el6

How reproducible:

I can reproduce the problem unconditionally strac'ing my nagios process when it forks in order to execute its checks.

Steps to Reproduce:

Sorry I do not have a simple test case.

1. start nagios process
2. strace -f -o /tmp/strace.nagios  -s 256 -p $(</var/nagios/nagios.pid)
  
Actual results:

1 to 5 seconds after starting the strace :

1926  --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
1936  +++ killed by SIGTRAP +++
1935  +++ killed by SIGTRAP +++
1934  +++ killed by SIGTRAP +++
1933  +++ killed by SIGTRAP +++
1932  +++ killed by SIGTRAP +++
1931  +++ killed by SIGTRAP +++
1930  +++ killed by SIGTRAP +++
1929  +++ killed by SIGTRAP +++
1928  +++ killed by SIGTRAP +++
1927  +++ killed by SIGTRAP +++
1925  +++ killed by SIGTRAP +++
1923  +++ killed by SIGTRAP +++

Expected results:

The nagios process should not be killed.

Additional info:

The problem does not occurs with strace-4.5.19-1.11.el6_2.1 from RHEL 6.3 ISO. I had a look at the changes between that version and the strace-4.5.19-1.17.el6 from 6.4. I believe the problem is caused by the BZ#837183 patch serie. I built an RPM without those patches and the issue goes away.

strace-rh837183.patch
strace-rh837183-2.patch
strace-rh837183-3.patch
strace-rh837183-4.patch
strace-rh837183-5.patch
strace-rh837183-6.patch

Comment 1 Julien Thomas 2013-03-07 16:32:28 UTC

4.5.19-1.11.el6_2.1 => OK
4.5.19-1.11.el6_3.2 => killed by SIGTRAP
4.5.19-1.17.el6     => killed by SIGTRAP

In 4.5.19-1.11.el6_3.2, the patch set causing the problem is actually :

strace-rh849052.patch
strace-rh849052-2.patch
strace-rh849052-3.patch
strace-rh849052-4.patch
strace-rh849052-5.patch
strace-rh849052-6.patch

* Fri Aug 17 2012 Jeff Law <law> - 4.5.19-1.11.el6_3.1
- Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone
  syscalls rather than buggy magic breakpoint methods (#849052).

In 4.5.19-1.17.el6, the patch set was renamed (?)

strace-rh837183.patch
strace-rh837183-2.patch
strace-rh837183-3.patch
strace-rh837183-4.patch
strace-rh837183-5.patch
strace-rh837183-6.patch

* Tue Sep 11 2012 Jeff Law <law> - 4.5.19-1.15
- Additional patches related to process startup. (#837183)

[...]

* Fri Aug 03 2012 Jeff Law <law> - 4.5.19-1.12
- Fix semtimedop decoding for S390/S390x (#759569)
- Use PTRACE_SETOPTIONS to set behaviour of ptrace at fork/vfork/clone
  syscalls rather than buggy magic breakpoint methods (#837183).

* Thu Dec 22 2011 Jeff Law <law> - 4.5.19-1.11
- Track personality more carefully (#769118)

Comment 2 Jeff Law 2013-03-07 16:33:07 UTC

Can you attach the output of rpm -qa so I know what nagios plugins you've got installed and the contents of /etc/nagios so I have your nagios configuration.

If you have concerns about attaching the nagios configuration bits to this BZ, you can send them to me privately via law.

Comment 3 Jeff Law 2013-03-07 16:34:24 UTC

The patchfile renaming is strictly an internal process thing.  The effective content of the patchfiles should be the same.

Comment 4 Julien Thomas 2013-03-07 16:36:27 UTC

Hi Jeff, yes I diff'ed the two series, they are the same.

Comment 5 Julien Thomas 2013-03-07 18:33:11 UTC

Created attachment 706758 [details]
rpm -qa rhel 6.4 + nagios epel with nagios-common dep

Comment 6 Julien Thomas 2013-03-07 18:34:47 UTC

Created attachment 706759 [details]
nagios minimal configuration

Comment 7 Julien Thomas 2013-03-07 18:36:01 UTC

Created attachment 706760 [details]
mk-livestatus nagios broker source rpm

Comment 8 Julien Thomas 2013-03-07 18:38:16 UTC

Hi Jeff,

I prepared a test machine with a RHEL 6.4 up-to-date and a simple Nagios test configuration in order to reproduce the problem. I've attached the files to the bug report. We normally do not use the nagios package from EPEL because this version has some blocking bugs for our usage. However I've installed the one from EPEL on the test machine and the problem remains.

Here is how my directory structure looks like :

/tmp/etc-nagios/
/tmp/etc-nagios/nagios.cfg
/tmp/etc-nagios/objects.cfg
/tmp/var-nagios/

Here is how I start Nagios, then strace :

rm -fv /tmp/var-nagios/*
nagios -d /tmp/etc-nagios/nagios.cfg
strace -f -s 256 -p $(</tmp/var-nagios/nagios.pid)

Here is what I do in another terminal to get bug :

ts=$(date +%s); \
echo "[$ts] SCHEDULE_FORCED_SVC_CHECK;pom;DUMMY;$ts" \
    > /tmp/var-nagios/nagios.pipe

I caught something interesting while preparing the minimal test configuration... The problem does not occur if I disable the mk-livestatus broker module. I understand we could say something is doggy in that version of livestatus. However I can't reproduce the problem with strace-4.5.19-1.11.el6_2.1 from the 6.3 ISO, and this is why I started reporting this bug.

Hope this helps,
Julien

Comment 9 Jeff Law 2013-03-07 19:45:21 UTC

Julien, yes, that will definitely help.  Thanks for the time and effort you've put into getting this into a reproducible state.

Comment 11 Jeff Law 2014-06-18 17:35:10 UTC

I'm having a bear of a time reproducing this problem.  I got it to trigger once, but subsequent attempts to trigger have been futile -- including doing things like putting the signaling sequence into a tight loop.

Based on that, the actual failure point and the set of patches that Julien identified as causing problems, I can only conclude that the new code to gain control of the process is racy in some way or another.

However, with a consistent, reliable way to reproduce, it's going to be virtually impossible to debug what's happening or for our QE team to verify that any fix actually fixes the problem.

Julien, could you perhaps try the version of strace in DTS (developer toolset) 2.0/2.1 and see if it exhibits this problem.  One of the things I've been pondering is try to get RHEL 6's strace updated from the strace-4.5.19 codebase to something more modern such as strace-4.7 or strace-4.8.  Knowing if either of those versions exhibit this problem would certainly be helpful in making that decision.

Comment 12 david.west 2014-11-07 21:16:05 UTC

We recently ran into this issue  when debugging our application with stock RHEL
6.4.  I've isolated a consistent repro:

(0) Environment:

----
[root@host ~]# uname -a
Linux host 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@host ~]# rpm -qa redhat-release-server strace
strace-4.5.19-1.17.el6.x86_64
redhat-release-server-6Server-6.4.0.4.el6.x86_64
----

(1) In one shell execute the following one-liner:

----
[root@host ~]# python -c 'import base64,zlib;eval(compile(zlib.decompress(base64.decodestring("eJxdjc0KgzAQhO95iiEgJJD6cxW89dJbH6Ag/qwaqokkaW3fvhEvbQ+77A4z8+lltS4gTI6aXgW9kLJe+XccPZpmZmx12gTw6+VcIulvhiOB9elIYdW9kIz1NGCwVsiSAdukZ0Kxn8Bel/qZaBV5WkTrgUl9aFyoDW31IYgYVxAyOr7y/2lADxA7e7DuLiSqCjnkgYoqvah7Cp612mRt4yeufh5+6uIK7kE8gj5j8klm")),"th.py","exec"))'
PID: 25862
----

(2) In another shell, on the same  stock RHEL 6.4 server, execute the following
    strace command:

----
[root@host ~]# strace -f -tt -v -o /tmp/c1 -p 25862
Process 25862 attached with 2 threads - interrupt to quit
Process 26097 attached
Process 26097 detached
Process 25863 detached
[root@host ~]#
----

The python process from the first step will then immediately exit as follows:

----
Trace/breakpoint trap (core dumped)
[root@host ~]#
----

Below is some interesting content from the strace output:

----
[root@host ~]# grep SIGTRAP /tmp/c1
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++

[root@host ~]# grep ^25863 /tmp/c1 | tail
25863 19:52:21.109777 select(0, NULL, NULL, NULL, {0, 28580} <unfinished ...>
25863 19:52:21.138652 --- SIGTRAP (Trace/breakpoint trap) @ 0 (0) ---

[root@host ~]# grep ^25862 /tmp/c1 | tail
25862 19:52:21.109921 select(0, NULL, NULL, NULL, {0, 3195}) = 0 (Timeout)
25862 19:52:21.113356 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe99601f9d0) = 26097
25862 19:52:21.113925 select(0, NULL, NULL, NULL, {0, 100000} <unfinished ...>
25862 19:52:21.123330 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
25862 19:52:21.123377 --- SIGCHLD (Child exited) @ 0 (0) ---
25862 19:52:21.123407 select(0, NULL, NULL, NULL, {0, 90645} <unfinished ...>
25862 19:52:21.198565 +++ killed by SIGTRAP (core dumped) +++
----

Some observations from working on this repro:

  o It seems that the 6.4 version  of strace fails to follow child processes of
    the non-main  thread, unlike strace  on other RHEL versions  when attaching
    with "-p".

  o This error only occurs when attching with "-p".

  o This error did not occur for forks from non-main threads.

  o It seems to have something  to do with attempting to follow child processes
    of the main thread that manipulate the SIGTRAP handler.

Comment 13 Jeff Law 2014-11-14 17:20:56 UTC

David, thanks for the consistent reproducer.  We're early in the RHEL 6.7 planning process right now and this bug is one we're definitely looking at.  Having a consistent reproducer (or even a semi-consistent one) will help significantly.

Comment 14 Jeff Law 2014-11-18 18:55:41 UTC

Note that strace-4.8 which comes with the developer toolset V3 does not exhibit this problem behaviour.  Ideally we'll be upgrading to that version for RHEL 6.7.

Comment 21 errata-xmlrpc 2015-07-22 06:26:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1308.html

Note You need to log in before you can comment on or make changes to this bug.