+++ This bug was initially created as a clone of Bug #243555 +++

Description of problem:
I developed a minimal race-free PTRACE_ATTACH method on the upstream kernel. Unfortunately it fails on utrace; I hope the failure is not a race.

Version-Release number of selected component (if applicable):
kernel-2.6.18-8.1.1.el5.x86_64
kernel-2.6.21-1.3194.fc7
kernel-2.6.20-1.2948.fc6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o cont-sigstop-detach cont-sigstop-detach.c -ggdb2 -Wall
2. ./cont-sigstop-detach

Actual results:
utrace bug hit
Aborted

Expected results:
[endless run on the upstream linux-2.6.20.4.x86_64]

Additional info:
The code does:
PTRACE_ATTACH
PTRACE_CONT(SIGSTOP)
waitpid() -> SIGSTOP
PTRACE_DETACH -> upstream: 0 vs. utrace: ESRCH
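For readers without the attachment, the four-step sequence listed above can be sketched roughly as follows. This is a minimal illustration, not the attached test case: the helper names (wait_for_sigstop, attach_cont_sigstop_detach_once) are made up here, the child simply idles in pause(), and the sketch adds a waitpid() for the attach stop before the PTRACE_CONT, which the short listing above does not spell out.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for `pid` and return 0 only if it is stopped by SIGSTOP. */
static int wait_for_sigstop(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP) ? 0 : -1;
}

/*
 * Hypothetical helper, not taken from the attachment.
 * Returns 0 if PTRACE_DETACH succeeded, a positive errno value if a
 * ptrace(2) call failed, or -1 if fork/waitpid behaved unexpectedly.
 */
int attach_cont_sigstop_detach_once(void)
{
    int rc = 0;
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause();            /* idle until the parent kills us */
    }

    if (ptrace(PTRACE_ATTACH, child, NULL, NULL) != 0)
        rc = errno;
    else if (wait_for_sigstop(child) != 0)      /* the attach stop */
        rc = -1;
    else if (ptrace(PTRACE_CONT, child, NULL, (void *)(long)SIGSTOP) != 0)
        rc = errno;
    else if (wait_for_sigstop(child) != 0)      /* the injected SIGSTOP */
        rc = -1;
    else if (ptrace(PTRACE_DETACH, child, NULL, NULL) != 0)
        rc = errno;             /* ESRCH here was the utrace symptom */

    /* Clean up: kill the child and reap it, skipping any stop reports. */
    kill(child, SIGKILL);
    while (waitpid(child, &status, 0) == child &&
           !WIFEXITED(status) && !WIFSIGNALED(status))
        ;
    return rc;
}
```

Calling attach_cont_sigstop_detach_once() in a loop approximates the "endless run" expectation; a return value of ESRCH would correspond to the utrace failure reported above.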
Created attachment 156968 [details] test case
In fact, the test case is racy. The parent swallows the child's SIGALRM signals, and so the child can get to its abort() call and make the parent see WIFEXITED. Jan, can you amend the test case so that possibility is avoided?
From what you have told me about ptrace(2), this cannot happen. If the parent swallows the SIGALRM, it "redelivers" it through the 4th parameter ("SIG") of PTRACE_CONT. You can see it there by adding something like:
if (sig != SIGSTOP) printf("sig=%d\n",(int)sig);
which prints:
sig=14
...
and the assertions never fail on linux-2.6.20.4.x86_64. Therefore I hope NOTABUG. ;-)
Created attachment 156970 [details] Testcase for the Comment 3.
Hmm, I misread the code. Still, I am seeing the child sometimes get through to its abort call. I see this on a vanilla upstream kernel built with PREEMPT too.
Created attachment 156975 [details] Updated to loop instead of abort()ing in the main code.
The tested 2.6.20.4.x86_64 kernel is PREEMPT=n and not even SMP. Good hint: I should test my code and testcases on PREEMPT/SMP kernels as well. The testcase no longer prints a message on each attachment; the ptrace(2)-caught SIGALRM (which is now indicated) occurs only about once per second, even in this silent loop.
Created attachment 157288 [details] Testing machine /proc/config.gz
FYI, I am unable to reach the abort() point on 2.6.22-rc4-git7.x86_64 on a dual Opteron 1000MHz(?)+2000MHz. Could the behavior differ between upstream kernel versions?
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_DEBUG_PREEMPT=y
Comment on attachment 157288 [details] Testing machine /proc/config.gz
Reproduced on a different machine/config; its config is available upon request.
Reproduced: the child process does reach abort() while being ptrace(2)d. On the other hand, the isolated child process runs forever on the same host. Roland, do you agree it is an upstream ptrace(2) bug if a process cannot be attached to and detached from without affecting its behavior? IMO this behavior change does not fall into the race category. If you agree, I would try to verify whether it is an upstream regression, isolate it, etc.
That behavior does seem wrong to me and I want to understand how it can happen upstream. The SIGALRM should get delivered every time and not swallowed, and when it's delivered the handler generates another SIGALRM. So upon returning from the handler (i.e. returning from the sigreturn/rt_sigreturn syscall), there should be a pending SIGALRM that was just unblocked by sigreturn and gets delivered to ptrace. Unless the ptrace'ing parent goes away between a wait call and a ptrace call, it should not be possible for the child ever to actually return from raise.
Hello Roland, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the fedora kernel. http://fedoraproject.org/wiki/KernelBugTriage There hasn't been much activity on this bug for a while - do you want it left open or can it be closed? Cheers Chris
The original utrace Bug is now fixed, verified as fixed on:
kernel-2.6.23-0.204.rc8.fc8.x86_64
kernel-2.6.22.9-91.fc7.x86_64
kernel-2.6.22.7-57.fc6.x86_64
The open question was Roland's Comment #5: the testcase from Comment #1 could reach the abort(). I found this is just due to the parent's handling: it does PTRACE_CONT(SIGSTOP) on the child process before checking its state, and sometimes a SIGALRM is delivered at that moment and gets lost as a result.
Created attachment 212881 [details] Testcase only catching/resubmitting SIGALRM.
This testcase has no race: it never does PTRACE_CONT(sig) without first catching the `sig' signal using waitpid().
Created attachment 212891 [details] Standalone raise(SIGALRM)-looping non-forking process.
Testcase useful for being attached to by an external GDB. Unfortunately the race is so rare that one never hits it in a reasonable time, given the startup/shutdown overhead of the external GDB.
Created attachment 212901 [details] Parent/child testcase reproducing the current Rawhide GDB.
This testcase emulates the current Rawhide GDB's behavior as implemented since gdb-6.6-27.fc8:
* Mon Sep 17 2007 Jan Kratochvil <jan.kratochvil> - 6.6-27
- Fix attaching to stopped processes and/or pending signals.
Unfortunately there is a race for already Stopped processes; this Comment/Testcase could be submitted as a GDB bug instead. A Stopped process being attached to by GDB may still "run a bit" during the attachment operation.
signalgdb: /tmp/signalgdb.c:93: main: Assertion `((((__extension__ ({ union { __typeof(status) __in; int __i; } __u; __u.__in = (status); __u.__i; }))) & 0xff) == 0x7f)' failed.
Aborted
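The expression in that assertion message is what glibc's WIFSTOPPED (status) macro expands to, so the failure means waitpid() reported something other than a stop. A small sketch of how such a status word decodes follows; the helper name classify_wait_status is made up for illustration, and the bit layout mentioned in the comment is the glibc/Linux encoding, not something POSIX guarantees.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: classify a waitpid() status word.
 * Returns 'T' (stopped), 'X' (exited), 'S' (killed by a signal) or '?'.
 */
char classify_wait_status(int status)
{
    if (WIFSTOPPED(status))     /* on glibc: (status & 0xff) == 0x7f */
        return 'T';
    if (WIFEXITED(status))
        return 'X';
    if (WIFSIGNALED(status))
        return 'S';
    return '?';
}
```

For example, a stop by SIGSTOP is encoded as (SIGSTOP << 8) | 0x7f on Linux and classifies as 'T', while a child that ran on and terminated would show up as 'X' or 'S' and trip the assertion above.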
It does not appear to be fixed. It is probably not even a kernel/utrace bug; the testcase is just too racy. To be rechecked.
It was a real kernel utrace bug, which Roland had already fixed. The testcase is now fixed to be reliable and free of races. http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/ptrace-cont-sigstop-detach.c?cvsroot=systemtap