244162 – utrace: Failing PTRACE_DETACH after ATTACH+CONT(SIGSTOP)

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Roland McGrath
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	243555

Reported:	2007-06-14 09:09 UTC by Roland McGrath
Modified:	2008-08-02 23:40 UTC
CC List:	3 users
Fixed In Version:	kernel-2.6.23-0.204.rc8.fc8.x86_64
Clone Of:
Environment:
Last Closed:	2007-10-11 00:25:03 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
test case (3.13 KB, text/plain) 2007-06-14 09:09 UTC, Roland McGrath
Testcase for the Comment 3. (3.20 KB, text/plain) 2007-06-14 09:37 UTC, Jan Kratochvil
Updated to loop instead of abort()ing in the main code. (3.25 KB, text/plain) 2007-06-14 10:04 UTC, Jan Kratochvil
Testing machine /proc/config.gz (14.29 KB, application/octet-stream) 2007-06-18 15:17 UTC, Jan Kratochvil
Testcase only catching/resubmitting SIGALRM. (2.54 KB, text/plain) 2007-10-01 21:35 UTC, Jan Kratochvil
Standalone raise(SIGALRM)-looping non-forking process. (1.36 KB, text/plain) 2007-10-01 21:37 UTC, Jan Kratochvil
Parent/child testcase reproducing the current Rawhide GDB. (2.54 KB, text/plain) 2007-10-01 21:40 UTC, Jan Kratochvil

Description Roland McGrath 2007-06-14 09:09:12 UTC

+++ This bug was initially created as a clone of Bug #243555 +++

Description of problem:
I developed a minimal race-free PTRACE_ATTACH method on the upstream kernel.
Unfortunately it fails on utrace kernels; I hope it is not a race.

Version-Release number of selected component (if applicable):
kernel-2.6.18-8.1.1.el5.x86_64
kernel-2.6.21-1.3194.fc7
kernel-2.6.20-1.2948.fc6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o cont-sigstop-detach cont-sigstop-detach.c -ggdb2 -Wall
2. ./cont-sigstop-detach 

Actual results:
utrace bug hit
Aborted

Expected results:
[endless run on the upstream linux-2.6.20.4.x86_64]

Additional info:
The code does:
PTRACE_ATTACH
PTRACE_CONT(SIGSTOP)
waitpid() -> SIGSTOP
PTRACE_DETACH -> upstream: 0 vs. utrace: ESRCH
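
A minimal sketch of the sequence listed above, for readers without the
attachment at hand. It is not the attached cont-sigstop-detach.c: the helper
name attach_cont_sigstop_detach() is hypothetical, and unlike the bare list it
waits for the attach stop before calling PTRACE_CONT, which is an assumption
made here to keep the sketch free of races.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs ATTACH, CONT(SIGSTOP), waitpid, DETACH against a
   freshly forked child.  Returns 0 if PTRACE_DETACH succeeded, the errno of
   PTRACE_DETACH (for example ESRCH) if it failed, or -1 on any other error. */
int attach_cont_sigstop_detach(void)
{
    int status, result = -1;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause();                    /* idle until the parent kills us */
    }

    if (ptrace(PTRACE_ATTACH, child, NULL, NULL) != 0)
        goto out;
    /* Assumption of this sketch: consume the attach stop first. */
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        goto out;
    /* Resume the child and inject SIGSTOP through the 4th argument. */
    if (ptrace(PTRACE_CONT, child, NULL, (void *)(long)SIGSTOP) != 0)
        goto out;
    /* The child should stop again, reporting SIGSTOP. */
    if (waitpid(child, &status, 0) != child
        || !WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        goto out;
    /* The step this bug is about: expected 0, a failing kernel gave ESRCH. */
    errno = 0;
    result = ptrace(PTRACE_DETACH, child, NULL, NULL) == 0 ? 0 : errno;

out:
    kill(child, SIGKILL);               /* clean up the helper child */
    waitpid(child, &status, 0);
    return result;
}
```

A caller would treat a return value of 0 as the expected result and ESRCH as
the failure described in this report.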

Comment 1 Roland McGrath 2007-06-14 09:09:13 UTC

Created attachment 156968
test case

Comment 2 Roland McGrath 2007-06-14 09:28:10 UTC

In fact, the test case is racy.  The parent swallows the child's SIGALRM
signals, and so the child can get to its abort() call and make the parent see
WIFEXITED.
Jan, can you amend the test case so that possibility is avoided?

Comment 3 Jan Kratochvil 2007-06-14 09:35:40 UTC

From what you have told me about ptrace(2), this cannot happen.
If the parent swallows the SIGALRM, it "redelivers" it through the 4th
PTRACE_CONT parameter, "SIG".
You can see this by adding something like:
  if (sig != SIGSTOP) printf("sig=%d\n",(int)sig);
as it prints:
  sig=14
  ...
and never fails the assertions on linux-2.6.20.4.x86_64.
Therefore I hope NOTABUG. ;-)

Comment 4 Jan Kratochvil 2007-06-14 09:37:23 UTC

Created attachment 156970
Testcase for the Comment 3.

Comment 5 Roland McGrath 2007-06-14 09:48:26 UTC

Hmm, I misread the code.  Still, I am seeing the child sometimes get through to
its abort call.  I see this on a vanilla upstream kernel built with PREEMPT too.

Comment 7 Jan Kratochvil 2007-06-14 10:04:59 UTC

Created attachment 156975
Updated to loop instead of abort()ing in the main code.

The tested 2.6.20.4.x86_64 is PREEMPT=n and even non-SMP.
Good hint: I should also test my code/testcases on PREEMPT/SMP kernels.

The testcase no longer prints a message on each attach, because the
ptrace(2)-caught SIGALRM occurs only about once per second (now indicated),
even in this silent loop.

Comment 8 Jan Kratochvil 2007-06-18 15:17:24 UTC

Created attachment 157288
Testing machine /proc/config.gz

FYI, I am unable to reach the abort() point on 2.6.22-rc4-git7.x86_64 on a
dual Opteron 1000MHz(?)+2000MHz machine.
Could the behavior have changed between upstream kernel versions?

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_DEBUG_PREEMPT=y

Comment 10 Jan Kratochvil 2007-06-26 19:42:34 UTC

Comment on attachment 157288
Testing machine /proc/config.gz

Reproduced on a different machine/config; its config is available upon request.

Comment 11 Jan Kratochvil 2007-06-26 19:46:34 UTC

I reproduced it: the child process does abort() while being ptrace(2)d,
whereas the isolated child process runs forever on the same host.

Roland,
do you agree it is an upstream ptrace(2) bug if it cannot attach/detach a
process without affecting its behavior?  IMO this behavior change does not fall
into the race category.

In that case I would try to verify whether it is an upstream regression,
isolate it, etc.

Comment 12 Roland McGrath 2007-07-11 09:23:22 UTC

That behavior does seem wrong to me and I want to understand how it can happen
upstream.  The SIGALRM should get delivered every time and not swallowed, and
when it's delivered the handler generates another SIGALRM.  So upon returning
from the handler (i.e. returning from the sigreturn/rt_sigreturn syscall), there
should be a pending SIGALRM that was just unblocked by sigreturn and gets
delivered to ptrace.  Unless the ptrace'ing parent goes away between a wait call
and a ptrace call, it should not be possible for the child ever to actually
return from raise.

Comment 13 Christopher Brown 2007-09-16 21:52:35 UTC

Hello Roland,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

There hasn't been much activity on this bug for a while - do you want it left
open or can it be closed?

Cheers
Chris

Comment 14 Jan Kratochvil 2007-10-01 21:33:55 UTC

The original utrace Bug is now fixed, verified as fixed on:
  kernel-2.6.23-0.204.rc8.fc8.x86_64
  kernel-2.6.22.9-91.fc7.x86_64
  kernel-2.6.22.7-57.fc6.x86_64

What remained questionable was Roland's Comment #5: the testcase from
Comment #1 could reach the abort().
I found this is due only to the parent's handling: it does PTRACE_CONT(SIGSTOP)
on the child process before checking its state, and sometimes a SIGALRM is
delivered at that moment and gets lost as a result.

Comment 15 Jan Kratochvil 2007-10-01 21:35:54 UTC

Created attachment 212881
Testcase only catching/resubmitting SIGALRM.

This testcase has no race: it never tries to PTRACE_CONT(sig) without first
catching the `sig' signal using waitpid().

Comment 16 Jan Kratochvil 2007-10-01 21:37:30 UTC

Created attachment 212891
Standalone raise(SIGALRM)-looping non-forking process.

This testcase is useful as a target for attaching an external GDB.
Unfortunately the race is so rare that, given the external GDB startup/shutdown
overhead, one never hits it in a reasonable time.

Comment 17 Jan Kratochvil 2007-10-01 21:40:22 UTC

Created attachment 212901
Parent/child testcase reproducing the current Rawhide GDB.

This testcase emulates the current Rawhide GDB's behavior as implemented since
gdb-6.6-27.fc8:

* Mon Sep 17 2007 Jan Kratochvil <jan.kratochvil> - 6.6-27
- Fix attaching to stopped processes and/or pending signals.

Unfortunately there is a race for already Stopped processes; this
Comment/Testcase could be submitted as a GDB bug instead.  The Stopped process
being attached to by GDB may nevertheless "run a bit" during the attach
operation.

signalgdb: /tmp/signalgdb.c:93: main: Assertion `((((__extension__ ({ union {
__typeof(status) __in; int __i; } __u; __u.__in = (status); __u.__i; }))) &
0xff) == 0x7f)' failed.
Aborted
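
The failed expression above matches the glibc expansion of
assert(WIFSTOPPED(status)), i.e. a check that the low byte of the wait status
is 0x7f. The sketch below shows that encoding; the helper names are
hypothetical, and building a status value by hand relies on the Linux/glibc
layout, which is an assumption that does not hold for every platform.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: builds the wait status the Linux/glibc layout uses
   for "stopped by signal sig" (stop signal in bits 8..15, low byte 0x7f). */
int make_stopped_status(int sig)
{
    return (sig << 8) | 0x7f;
}

/* Hypothetical helper: returns the stop signal if the status describes a
   stopped process, or 0 if it describes an exit or a fatal signal. */
int stop_signal_of(int status)
{
    return WIFSTOPPED(status) ? WSTOPSIG(status) : 0;
}
```

A status of 0 (a normal exit with code 0) is not a stop, so an assertion like
the one above fails whenever waitpid() reports anything other than a stop.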

Comment 18 Jan Kratochvil 2007-10-04 15:21:28 UTC

It does not appear to be fixed.
And it is probably not even a kernel/utrace bug; the testcase is just too racy.
To be rechecked.

Comment 19 Jan Kratochvil 2007-10-11 00:25:03 UTC

It was a real kernel utrace bug, which Roland had already fixed.
The testcase is now fixed to be reliable and free of races.
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/ptrace-cont-sigstop-detach.c?cvsroot=systemtap
