Bug 470249
Summary: ptrace: PTRACE_DETACH,SIGALRM kills the tracee
Product: [Fedora] Fedora
Component: kernel
Version: 9
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Jan Kratochvil <jan.kratochvil>
Assignee: Roland McGrath <roland>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dvlasenk, kernel-maint, quintela
Doc Type: Bug Fix
Last Closed: 2009-07-14 14:36:09 UTC
Description
Jan Kratochvil
2008-11-06 13:42:50 UTC
ptrace(PTRACE_ATTACH, PID, 0, 0) = 0
wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], 0, NULL) = PID
ptrace(PTRACE_GETREGS, PID, 0, HEX) = 0
ptrace(0x4200 /* PTRACE_??? */, PID, 0, 0x2) = 0
ptrace(0x4200 /* PTRACE_??? */, PID, 0, 0x3e) = 0
ioctl(0, TIOCSPGRP, [PID]) = -1 EPERM (Operation not permitted)
ptrace(PTRACE_DETACH, PID, 0x1, SIG_0) = 0
ptrace(PTRACE_ATTACH, PID, 0, 0) = 0
wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGALRM}], 0, NULL) = PID

why we got SIGALRM here? What tracee was doing so that we got SIGALRM not SIGSTOP?

From now on it is messed up:

ptrace(PTRACE_CONT, PID, 0x1, SIG_0) = 0
wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], 0, NULL) = PID

Whoa, we got our SIGSTOP - too late

ptrace(PTRACE_DETACH, PID, 0x1, SIGALRM) = 0
ptrace(PTRACE_ATTACH, PID, 0, 0) = -1 ESRCH (No such process)

More confusing things.

I tried to model this in a testcase but so far I do not know what to do to get that first SIGALRM.

Actually, the problem here may be that

ptrace(PTRACE_DETACH, PID, 0x1, SIGALRM) = 0

is acting on a non-stopped process. In my testcase I see it too. Will test on a non-utrace kernel.

I looked into attach-into-signal source. Now I see. tracee signals itself in an endless loop while we attach/detach.

Working on creating a testcase. I am on 2.6.25.10-86.fc9.x86_64 and even on it I see somewhat strange behavior.

Basically, on PTRACE_ATTACH we are usually getting tracee SIGSTOPed. If tracee did this:

sa.sa_flags = SA_RESTART;
sa.sa_handler = fire_again;
sigaction (SIGQUIT, &sa, NULL);
raise (SIGQUIT);

at this moment it is getting SIGQUITs all the time. Logically, in order to not disrupt its execution, kernel must remember about SIGQUIT and on next ptrace(PTRACE_CONT,..., 0) - note zero! - kernel must nevertheless deliver SIGQUIT, not let tracee continue without signal.

And this indeed happens. Just not always. Sometimes PTRACE_ATTACH stops tracee *and sees SIGQUIT*, not SIGSTOP! In this case, PTRACE_CONT must be issued with SIGQUIT too, or it will be lost. This does not feel exactly right, but maybe it's a known and old glitch.

More glitches:

- sometimes SIGSTOP got delivered out-of-order, after "strange SIGQUIT after attach" and ptrace(PTRACE_CONT,..., SIGQUIT). Yes, CONT with SIGQUIT results in SIGSTOP! this is clearly wrong.

- after final detach with SIGPIPE (meant to kill tracee) I am getting EPERM (!) on subsequent attach attempt. (I also sometimes get ESRCH when it dies too fast, that is ok)

Anyway, I am attaching the testcase, and will try it on other kernels, looking for more ways it (mis)behaves.

Created attachment 322897
Testcase
(In reply to comment #1)
> ptrace(PTRACE_ATTACH, PID, 0, 0) = 0
> wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], 0, NULL) = PID
> ptrace(PTRACE_GETREGS, PID, 0, HEX) = 0
> ptrace(0x4200 /* PTRACE_??? */, PID, 0, 0x2) = 0
> ptrace(0x4200 /* PTRACE_??? */, PID, 0, 0x3e) = 0
> ioctl(0, TIOCSPGRP, [PID]) = -1 EPERM (Operation not permitted)
> ptrace(PTRACE_DETACH, PID, 0x1, SIG_0) = 0
> ptrace(PTRACE_ATTACH, PID, 0, 0) = 0
> wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGALRM}], 0, NULL) = PID
>
> why we got SIGALRM here? What tracee was doing so that we got SIGALRM not
> SIGSTOP?

During PTRACE_ATTACH SIGSTOP is generated (activated). But on waitpid() we may get some other pending signal sooner than this SIGSTOP. Still if we got some other signal we may waitpid() again and sooner or later we will receive that SIGSTOP from our PTRACE_ATTACH. The received non-SIGSTOP signals need to be redelivered or they will otherwise get lost (and in this case the testcase stops looping).

> From now on it is messed up:
...
> Whoa, we got our SIGSTOP - too late
>
> ptrace(PTRACE_CONT, PID, 0x1, SIG_0) = 0
> wait4(PID, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], 0, NULL) = PID

This is that SIGSTOP from PTRACE_ATTACH we did not waitpid() on before.

> ptrace(PTRACE_DETACH, PID, 0x1, SIGALRM) = 0
> ptrace(PTRACE_ATTACH, PID, 0, 0) = -1 ESRCH (No such process)
>
> More confusing things.
>
> I tried to model this in a testcase but so far I do not know what to do to get
> that first SIGALRM.

It is a bit random, if you want to get SIGALRM before SIGSTOP after PTRACE_ATTACH and it does not happen - just PTRACE_DETACH it and try it again.

(In reply to comment #2)
> Actually, the problem here may be that
>
> ptrace(PTRACE_DETACH, PID, 0x1, SIGALRM) = 0
>
> is acting on a non-stopped process. In my testcase I see it too.

PTRACE_DETACH should act only on a stopped process. After PTRACE_ATTACH one needs to be sure the process is stopped before using other ptrace(2) calls. I would rather waitpid() in a loop in the testcase to be sure to get SIGSTOP before acting with ptrace(2). But in fact it may be enough to get a first signal - possibly SIGALRM - which may guarantee the process is stopped already. Not sure. Some such (correct) assumption was used by Daniel Jacobowitz in http://sourceware.org/ml/gdb-patches/2008-05/msg00022.html in linux_nat_post_attach_wait there.

(In reply to comment #3)
> Working on creating a testcase. I am on 2.6.25.10-86.fc9.x86_64 and even on it
> I see somewhat strange behavior.

I do not say 2.6.25.10-86.fc9.x86_64 is correct (I find it just good enough for GDB work). One should check the upstream kernels for the "right" (compatible) behavior.

> Basically, on PTRACE_ATTACH we are usually getting tracee SIGSTOPed. If tracee
> did this:
>
> sa.sa_flags = SA_RESTART;
> sa.sa_handler = fire_again;
> sigaction (SIGQUIT, &sa, NULL);
> raise (SIGQUIT);
>
> at this moment it is getting SIGQUITs all the time. Logically, in order to not
> disrupt its execution, kernel must remember about SIGQUIT and on next
> ptrace(PTRACE_CONT,..., 0) - note zero! - kernel must nevertheless deliver
> SIGQUIT, not let tracee continue without signal.

The signal is never lost. Just sometimes it is delivered back to the inferior and sometimes it gets "stolen" by the tracer's waitpid(). If the tracer "steals" it the tracer must also "return" it back such as by using PTRACE_DETACH,SIGALRM.

> And this indeed happens. Just not always. Sometimes PTRACE_ATTACH stops tracee
> *and sees SIGQUIT*, not SIGSTOP! In this case, PTRACE_CONT must be issued with
> SIGQUIT too, or it will be lost.

Right.

> More glitches:
>
> - sometimes SIGSTOP got delivered out-of-order, after "strange SIGQUIT after
> attach" and ptrace(PTRACE_CONT,..., SIGQUIT).

I would not do any ptrace(2) calls before being sure I got SIGSTOP at a waitpid() loop. This may not be required but it starts to depend also on the signals order (signal number specifies some signal priority). It would be nice to make an affirmation why one does not need to receive SIGSTOP (only SIGALRM for example) after PTRACE_ATTACH and why one can be sure the tracee is already stopped to perform the other ptrace() syscalls.

> Yes, CONT with SIGQUIT results in SIGSTOP! this is clearly wrong.

If SIGSTOP is pending out there the situation gets complicated. If you PTRACE_ATTACH and PTRACE_DETACH without ever receiving SIGSTOP by waitpid() the tracee gets `T (Stopped)' after the detachment.

> - after final detach with SIGPIPE (meant to kill tracee) I am getting EPERM (!)
> on subsequent attach attempt. (I also sometimes get ESRCH when it dies too
> fast, that is ok)

In your testcase you do PTRACE_DETACH possibly with still pending SIGSTOP which IMO-needlessly complicates it all a lot.

> Anyway, I am attaching the testcase, and will try it on other kernels, looking
> for more ways it (mis)behaves.

The kernel version string there `2.6.25.10-86' is ambiguous. Expecting it should have been `2.6.25.10-86.fc9.x86_64'.

> ptrace (PTRACE_DETACH, child, (void *) 1, (void *) SIGPIPE);

ADDR value 1 is useless on Linux, some other OSes probably require it.

FYI it is still missing on http://sourceware.org/systemtap/wiki/utrace/tests .

Created attachment 322985
Updated testcase
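The advice above (keep calling waitpid() until the SIGSTOP from PTRACE_ATTACH shows up, and hand every other "stolen" signal back to the tracee) can be sketched as below. This is a minimal illustration under stated assumptions, not code from the attached testcase or from GDB: `attach_and_wait_for_sigstop` and `demo_attach_idle_child` are hypothetical names, and the sketch treats any reported SIGSTOP as the one generated by the attach.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Attach to pid and wait until a SIGSTOP stop is reported. Any other signal
 * seen first is returned to the tracee with PTRACE_CONT so it is not lost.
 * Returns the number of re-injected signals, or -1 on error. */
int attach_and_wait_for_sigstop(pid_t pid)
{
    int reinjected = 0;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;

    for (;;) {
        int status;

        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (!WIFSTOPPED(status))
            return -1;              /* tracee exited or was killed */
        if (WSTOPSIG(status) == SIGSTOP)
            return reinjected;      /* assumed to be our attach stop */

        /* "Stolen" signal: give it back to the tracee and wait again. */
        if (ptrace(PTRACE_CONT, pid, NULL,
                   (void *) (long) WSTOPSIG(status)) == -1)
            return -1;
        reinjected++;
    }
}

/* Hypothetical demo: attach to an idle child, detach without a signal,
 * then clean up. Returns 0 if the attach stop was observed, -1 otherwise. */
int demo_attach_idle_child(void)
{
    int status, rc;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0)
        for (;;)
            pause();                /* idle tracee, never returns */

    rc = attach_and_wait_for_sigstop(child);
    if (rc >= 0)
        ptrace(PTRACE_DETACH, child, NULL, NULL);
    kill(child, SIGKILL);
    waitpid(child, &status, 0);     /* reap the child */
    return rc < 0 ? -1 : 0;
}
```

With an idle child the loop normally sees SIGSTOP on the first waitpid(); with a tracee that keeps signalling itself, the return value of `attach_and_wait_for_sigstop` shows how many signals had to be handed back first.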
>> at this moment it is getting SIGQUITs all the time. Logically, in order to not
>> disrupt its execution, kernel must remember about SIGQUIT and on next
>> ptrace(PTRACE_CONT,..., 0) - note zero! - kernel must nevertheless deliver
>> SIGQUIT, not let tracee continue without signal.
>
>The signal is never lost. Just sometimes it is delivered back to the inferior
>and sometimes it gets "stolen" by the tracer's waitpid(). If the tracer
>"steals" it the tracer must also "return" it back such as by using
>PTRACE_DETACH,SIGALRM.

The new testcase checks for WSTOPSIG (status) == SIGSTOP and if it is not, we PTRACE_CONT tracee and wait again. Seems to work.

>> - after final detach with SIGPIPE (meant to kill tracee) I am getting EPERM (!)
>> on subsequent attach attempt. (I also sometimes get ESRCH when it dies too
>> fast, that is ok)
>
>In your testcase you do PTRACE_DETACH possibly with still pending SIGSTOP which
>IMO-needlessly complicates it all a lot.

New testcase doesn't have this complication but still gets EPERM sometimes. I run it like this:

# while sleep 0.05; do ./a.out || exit; done
.<3>!....<3>!............<3>!.....[3]!.<3>!.<3>!..<3>!..EPERM!.<3>!.....
<3>!...............<3>!..<3>!.....<3>!....<3>!............<3>!........<3>!
.<3>!..........<3>!...EPERM!............<3>!.....<3>!....<3>!.<3>!........
......<3>!.<3>!........<3>!..............<3>!.<3>!..<3>!......<3>!.<3>!.<3>!
....<3>!..........<3>!........................<3>!...<3>!...............
.........<3>!..<3>!......<3>!...<3>!......<3>!

See that "EPERM!"? Can you take a look at the testcase code and guess why it might be?

The above is on vanilla 2.6.26.

> FYI it is still missing on http://sourceware.org/systemtap/wiki/utrace/tests

Yes, it is. The testcase is not ready yet.

Created attachment 323104
Testcase is updated again
Kernel 2.6.26.6-79.fc9 exhibits the following differences relative to vanilla:

(1) After the first PTRACE_DETACH + PTRACE_ATTACH + PTRACE_CONT cycle, it loses SIGQUIT, allowing tracee to exit. Apparently PTRACE_ATTACH fails to notice and save pending signals. Passing SIGQUIT to PTRACE_CONT works around this problem. (In real-world use this workaround won't work, we would not know what signal, if any, we happened to collide with on PTRACE_ATTACH)

(2) With the above workaround, on the second cycle of PTRACE_DETACH(SIGPIPE) + PTRACE_ATTACH + PTRACE_CONT, tracee exits after PTRACE_CONT. (It is expected to be killed by SIGPIPE instead.) Unlike case (1), passing SIGPIPE to PTRACE_CONT does not help.

In short, this testcase seems to successfully catch the bug described in comment #1 - it's case (1) above.

added testcase to utrace tests: http://sourceware.org/systemtap/wiki/utrace/tests

Created attachment 323370
Crude, not completely working fix
Working on 2.6.27.5-32.fc9.x86_64 as a base for fixing
We need to fix the problem of PTRACE_DETACH not inducing signals in tracee even if last parameter is not 0. In the testcase we do ptrace(PTRACE_DETACH, ..., SIGPIPE) expecting tracee to be killed. tracehook_get_signal() is used to do signal injection by calling utrace_get_signal(), but it does not call it if process is not traced, and after PTRACE_DETACH it indeed is not traced anymore!
My crude fix pushes the "if (unlikely(task_utrace_flags(task)))" test into utrace_get_signal(). If it is not true, but task->exit_code != 0, then we are in a PTRACE_DETACHed process with a signal pending. So it creates a signal. The fix is crude (for one, this signal's siginfo would not take into account PTRACE_SETSIGINFO data), but for testing it is good enough.
Unfortunately, testcase still sometimes fails. Good run looks like this in dmesg:
[3081] ptrace_attach
[3081] signal_pending:0 utrace_control(action:2) [RESUME:5 INTERRUPT:2]
[3081] utrace_control returned zero
[3081] ptrace_resumed: task->exit_code:3
[3081] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[3081] utrace_control returned zero
[3081] signal_pending:0 utrace_control(action:2) [RESUME:5 INTERRUPT:2]
[3081] utrace_control returned zero
[3081] ptrace_resumed: task->exit_code:3
[3081] ptrace_detach(3)
[3081] utrace_get_signal: not traced but exit_code:3
creating sig 3
[3081] ptrace_attach
[3081] signal_pending:0 utrace_control(action:2) [RESUME:5 INTERRUPT:2]
[3081] utrace_control returned zero
[3081] ptrace_resumed: task->exit_code:3
[3081] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[3081] utrace_control returned zero
[3081] ptrace_detach(13)
[3081] utrace_get_signal: not traced but exit_code:13
creating sig 13
and bad one is:
[3082] ptrace_attach
[3082] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[3082] utrace_control returned zero
[3082] signal_pending:0 utrace_control(action:2) [RESUME:5 INTERRUPT:2]
[3082] utrace_control returned zero
[3082] ptrace_resumed: task->exit_code:3
[3082] ptrace_detach(3)
[3082] utrace_get_signal: not traced but exit_code:3
creating sig 3
[3082] ptrace_attach
[3082] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[3082] utrace_control returned zero
[3082] ptrace_detach(13)
[3082] ptrace_attach
[3082] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[3082] utrace_control returned zero
The difference is that even though PTRACE_DETACH set up needed things to trigger utrace_get_signal() run, namely:
printk("[%d] ptrace_detach(%u)\n", child->pid, data);
child->exit_code = data;
if (data) {
        set_tsk_thread_flag(child, TIF_SIGPENDING);
}
child managed to not notice it, not call utrace_get_signal(), and thus to lose signal. Need to figure out how it happens.
NB: it is not limited to "killing" signals, SIGQUIT which tracee catches is also sometimes lost:

[2599] ptrace_attach
[2599] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[2599] utrace_control returned zero
[2599] signal_pending:0 utrace_control(action:2) [RESUME:5 INTERRUPT:2]
[2599] utrace_control returned zero
[2599] ptrace_resumed: task->exit_code:3
[2599] ptrace_detach(3)
[2599] ptrace_attach
[2599] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[2599] utrace_control returned zero

This is the fragment of the testcase where we see wrong behavior, and which I will be referring to below while looking at dmesg:

ptrace (PTRACE_CONT, child, (void *) 1, (void *) SIGQUIT);
assert_perror (errno);
pid = waitpid (child, &status, 0);
assert (pid == child);
assert (WIFSTOPPED (status));
assert (WSTOPSIG (status) == SIGQUIT);
ptrace (PTRACE_DETACH, child, (void *) 1, (void *) SIGPIPE);

Correct behavior (instrumented kernel's dmesg):

tracee stops:
[4434] utrace_stop: going to sleep, TIF_SIGPENDING:0
tracer detaches it:
[4434] ptrace_detach(13)
[4434] ptrace_detach: TIF_SIGPENDING:1
tracee generates signal and dies:
[4434] utrace_stop: returning 0, TIF_SIGPENDING:-1
[4434] utrace_get_signal: not traced but exit_code:13
creating sig 1

Bad behavior where signal generated by PTRACE_DETACH is lost:

tracee stops:
[4445] utrace_stop: going to sleep, TIF_SIGPENDING:0
tracer detaches it:
[4445] ptrace_detach(13)
[4445] ptrace_detach: TIF_SIGPENDING:1
tracer attaches before tracee has a chance to wake up:
[4445] ptrace_attach
tracee wakes up and finds itself being stopped by attach:
[4445] utrace_stop: returning 0, TIF_SIGPENDING:-1
[4445] utrace_stop: going to sleep, TIF_SIGPENDING:0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
this is it. we are going to stop again (ATTACH-generated stop), and we lose TIF_SIGPENDING!
[4445] signal_pending:0 utrace_control(action:5) [RESUME:5 INTERRUPT:2]
[4445] utrace_control returned zero
[4445] utrace_stop: returning 0, TIF_SIGPENDING:0

Inserting a usleep(20*1000) between detach and attach in the testcase ensures that this does not happen.

Ah, forgot two last lines of C code in the above comment, it should be:

...
ptrace (PTRACE_DETACH, child, (void *) 1, (void *) SIGPIPE);
assert_perror (errno);
ptrace (PTRACE_ATTACH, child, (void *) 0, (void *) 0);

It looks like the bug is just that PTRACE_DETACH with a signal is broken. I thought we already had a test case for that, but we do not seem to in fact. A simpler test should suffice for that bug.

NB: bugs 456333 (RHEL 5.2) and 454404 (F9) are "PTRACE_DETACH(..., SIGSTOP) does not stop". Seems to be the same underlying problem.

This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.