Bug 674640
| Summary: | ptrace: PTRACE_CONT for PTRACE_ATTACH may fail | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jan Kratochvil <jan.kratochvil> |
| Component: | kernel | Assignee: | Oleg Nesterov <onestero> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | low | Priority: | low |
| Version: | 6.0 | CC: | roland |
| Target Milestone: | rc | Doc Type: | Bug Fix |
| Hardware: | x86_64 | OS: | Linux |
| URL: | http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/detach-stopped-then-cont.c?cvsroot=systemtap | | |
| : | 674764 | | |
| Last Closed: | 2011-02-04 18:16:51 UTC | Bug Blocks: | 456333, 674764 |
Description
Jan Kratochvil
2011-02-02 19:28:42 UTC
It is part of the investigation into why gdb.threads/attachstop-mt.exp occasionally FAILs when this patch (its linux-nat.c part) is not applied:
http://pkgs.fedoraproject.org/gitweb/?p=gdb.git;a=blob_plain;f=gdb-rhel5-compat.patch;hb=master

(In reply to comment #0)
> Created attachment 476640
> Testcase.
>
> That PTRACE_CONT sometimes fails if executed very quickly after a previous PTRACE_DETACH(SIGSTOP).

Jan, the test-case looks wrong. It does ptrace (PTRACE_DETACH, child, NULL, (void *) SIGSTOP); this resumes the tracee. It should stop again, but it can still be running after the subsequent PTRACE_ATTACH/tkill by the time you call ptrace (PTRACE_CONT). You need wait() somewhere in between to ensure it is stopped.

(In reply to comment #3)
> You need wait() somewhere in between to ensure it is stopped.

In this case it is the child process so the parent could wait on it. But in the real world case it is a foreign PTRACE_ATTACHed process. Debugger cannot wait on a foreign non-ptraced process.

Do you suggest the wait be done by the debugger while detaching or while attaching? It cannot wait while PTRACE_DETACHing as it is a foreign process. While PTRACE_ATTACHing one cannot wait earlier than after PTRACE_CONT as otherwise the wait could hang - if the process was already T(job)-stopped and the stop notification was already eaten before PTRACE_ATTACH. This is described in the GDB CVS comments in the function linux_nat_post_attach_wait. PTRACE_CONT is there exactly to turn the foreign process into a wait()able one.

Personally I think PTRACE_DETACH should be synchronous for the userland - all the operations should complete before the syscall returns. Is it possible?

(In reply to comment #4)
> (In reply to comment #3)
> > You need wait() somewhere in between to ensure it is stopped.
>
> In this case it is the child process so the parent could wait on it. But in the real world case it is a foreign PTRACE_ATTACHed process. Debugger cannot wait on a foreign non-ptraced process.

Sure, I only meant this particular test-case.

> While PTRACE_ATTACHing one cannot wait earlier than after PTRACE_CONT as otherwise the wait could hang - if the process was already T(job)-stopped and the stop notification was already eaten before PTRACE_ATTACH.

Yes, I understand.

> Personally I think PTRACE_DETACH should be synchronous for the userland - all the operations should complete before the syscall returns. Is it possible?

Oh. It is sooooo hard to change the existing behaviour. Of course technically this is possible, but I don't think we should do this. Up to Roland ;)

Yes, ptrace is ugly. Perhaps we can do something in the kernel to help gdb. But this is not a bug, although I agree very much that this known problem with ->exit_code is very annoying. IIRC, I already tried to discuss this on lkml long ago.

Perhaps ptrace_attach() could set ->exit_code = SIGSTOP if it attaches to a TASK_STOPPED thread. I'll try to think more.

As a confirmed Bug, checked it into the testsuite: detach-stopped-then-cont.c
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/detach-stopped-then-cont.c?cvsroot=systemtap

(In reply to comment #6)
> As a confirmed Bug, checked it into the testsuite:
> detach-stopped-then-cont.c

I disagree. Yes, this is the known problem, it is not trivial to attach to the may-be-it-was-stopped thread, and I am going to send the (upstream) patch which hopefully can help.

But we can't assume that PTRACE_CONT (or any other request) can work right after PTRACE_ATTACH; this needs wait() to ensure the tracee has already stopped. I do not think we can change PTRACE_ATTACH semantics in this respect.

Btw, the test-case (and I assume gdb?) does tkill(SIGSTOP) right after PTRACE_ATTACH. Why? This looks absolutely pointless, PTRACE_ATTACH does this if it succeeds.

IOW, whatever we do, I think the test-case should be fixed anyway.

Created attachment 476852
wait-after-PTRACE_ATTACH hang

Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
FAIL kernel-2.6.9-89.33.1.EL.x86_64
PASS kernel-2.6.35.10-74.fc14.x86_64
It does not hang on F14.

(In reply to comment #7)
> IOW, whatever we do, I think the test-case should be fixed anyway.

The detach-stopped-then-cont.c testcase is just a copy of what current FSF GDB has been doing since 2008-05-01.

> Yes, this is the known problem, it is not trivial to attach to
> the may-be-it-was-stopped thread,

So could you advise how GDB can attach to a foreign process which can be in any state (stopped/unstopped)? Without PTRACE_CONT it hangs on older (RHEL-4) kernels. Or does GDB need to version-check the kernel?

> this needs wait() to ensure the tracee has already stopped.

One cannot call wait() as in some cases it would hang indefinitely as shown by this new testcase.

> Btw, the test-case (and I assume gdb?) does tkill(SIGSTOP) right
> after PTRACE_ATTACH. Why? This looks absolutely pointless,
> PTRACE_ATTACH does this if it succeeds.

To prevent the hang reproduced by this new testcase (on older=RHEL-4 kernels).

> IOW, whatever we do, I think the test-case should be fixed anyway.

We talk here about fixing GDB and it must remain backward compatible. It can be done but you could suggest how.

(In reply to comment #8)
> > this needs wait() to ensure the tracee has already stopped.
>
> One cannot call wait() as in some cases it would hang indefinitely as shown by this new testcase.

Yes, and this is what I am trying to fix. Please wait a bit, I'll send the patch to lkml for discussion.

> > IOW, whatever we do, I think the test-case should be fixed anyway.
>
> We talk here about fixing GDB and it must remain backward compatible.

Oh, yes, I see.

But, once again, otoh I see no possibility to change the kernel so that PTRACE_ATTACH does an implicit wait. Imho, too radical a change.

But. Let's discuss this on lkml.

(In reply to comment #8)
> Created attachment 476852
> wait-after-PTRACE_ATTACH hang
>
> Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
> FAIL kernel-2.6.9-89.33.1.EL.x86_64
> PASS kernel-2.6.35.10-74.fc14.x86_64
> It does not hang on F14.

Oh, too many test-cases, I am totally confused ;)

But if I read it correctly this was fixed by 90bc8d8b1a38f1ab131a2399a202e1889db95de8

(In reply to comment #10)
> (In reply to comment #8)
> > Created attachment 476852
> > wait-after-PTRACE_ATTACH hang
> >
> > Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
> > FAIL kernel-2.6.9-89.33.1.EL.x86_64
> > PASS kernel-2.6.35.10-74.fc14.x86_64
> > It does not hang on F14.
>
> Oh, too many test-cases, I am totally confused ;)
>
> But if I read it correctly this was fixed by
> 90bc8d8b1a38f1ab131a2399a202e1889db95de8

Argh! forgot to mention...

This particular case was fixed, but do_wait() still can hang after PTRACE_ATTACH, of course.

So. I sent the patch upstream, it should ensure that it is
always "safe" to use wait() after ptrace(PTRACE_ATTACH).
If we are going to change the kernel, I do not see a better
solution, but I am open to any suggestion ;)

However. This is a very old problem, and I still can't understand
why gdb can't do something like

    ptrace(ATTACH);

    // look into /fs/proc/tid/status
    if (this_thread_is_stopped()) {
        // OK, it is already stopped, we don't need wait()
        // but we need another SIGSTOP in case the previous
        // one was the reason of this stop
        tkill(SIGSTOP);
        ptrace(PTRACE_CONT);
    }

    wait(&status);
    // now it should be stopped or we raced with SIGCONT
    // from somewhere, check status.
    ...

what do you think?
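
The sequence above could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not GDB's or the kernel's actual code: the names status_text_is_stopped(), tid_is_stopped() and attach_and_wait() are hypothetical, the path /proc/<tid>/status and the "T (stopped)" match assume the usual State: line format, and tkill is invoked through syscall() on the assumption that the C library provides no wrapper for it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if the given /proc/<tid>/status text
   has a "State:" line reporting "T (stopped)", 0 otherwise.  */
int status_text_is_stopped(const char *text)
{
    const char *p = text;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, "State:", 6) == 0) {
            p += 6;
            while (*p == ' ' || *p == '\t')
                p++;
            return strncmp(p, "T (stopped)", 11) == 0;
        }
        p = strchr(p, '\n');      /* advance to the next line */
        if (p != NULL)
            p++;
    }
    return 0;
}

/* Hypothetical helper: read /proc/<tid>/status and test it.  */
int tid_is_stopped(pid_t tid)
{
    char path[64], buf[4096];
    size_t n;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", (int) tid);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_text_is_stopped(buf);
}

/* Sketch of the attach sequence proposed above.  Returns 0 and fills
   *status when waitpid() reports the tracee, -1 on error.  */
int attach_and_wait(pid_t tid, int *status)
{
    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) != 0)
        return -1;

    if (tid_is_stopped(tid)) {
        /* Already stopped: queue another SIGSTOP and resume so that
           a fresh stop gets reported.  PTRACE_CONT may fail if the
           tracee was resumed meanwhile; the result is ignored here
           and the wait status is checked by the caller instead.  */
        syscall(SYS_tkill, tid, SIGSTOP);
        ptrace(PTRACE_CONT, tid, NULL, NULL);
    }

    return waitpid(tid, status, __WALL) == tid ? 0 : -1;
}
```

As discussed in this bug, on kernels where the stop notification can still be eaten, the final waitpid() in such a sketch may block.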

(In reply to comment #12)
> So. I sent the patch upstream, it should ensure that it is
> always "safe" to use wait() after ptrace(PTRACE_ATTACH).

OK, thanks.

> However. This is a very old problem, and I still can't understand
> why gdb can't do something like
>
> ptrace(ATTACH);
>
> // look into /fs/proc/tid/status
> if (this_thread_is_stopped()) {

My fault, GDB does so (see linux_nat_post_attach_wait).

There is still a problem that it is racy this way - /proc/PID/status is changing very shortly after PTRACE_DETACH. Unfortunately it stays racy if the code wants to be backward compatible with older kernels even when it runs on a new/fixed kernel.

The testcase detach-stopped-then-cont.c is invalid, removing it now.

(In reply to comment #13)
> There is still a problem that it is racy this way - /proc/PID/status is
> changing very shortly after PTRACE_DETACH.

Hmm... it shouldn't afaics, can you tell more?

To clarify. Yes, ptrace(PTRACE_CONT) in the code above can fail, SIGCONT or SIGKILL can resume the TASK_STOPPED tracee, but this is always true and the tracer should check status after do_wait().

But if you meant that the tracee can stop _after_ this_thread_is_stopped(), this case should be fine.

> Unfortunately it stays racy if the
> code wants to be backward compatible with older kernels even when it runs on
> new/fixed kernel.

Yes, yes, I understand.

> The testcase detach-stopped-then-cont.c is invalid, removing it now.

OK, thanks.

Created attachment 477055
Cannot reliably attach on old kernels.

(In reply to comment #14)
> (In reply to comment #13)
> > There is still a problem that it is racy this way - /proc/PID/status is
> > changing very shortly after PTRACE_DETACH.
>
> Hmm... it shouldn't afaics, can you tell more?

Attached a testcase. If it prints `1' it Aborts - on a timeout.
FAIL kernel-2.6.9-89.33.1.EL.x86_64 = RHEL-4 = x86-64-4as-8z-v1.ss.eng.bos.redhat.com
PASS kernel-2.6.35.10-74.fc14.x86_64
PASS kernel-vanilla-2.6.38-0.rc3.git3.1.fc15.x86_64

But the FAIL takes about a minute to reproduce, so the PASS results can also mean it just did not get reproduced. So in fact it may be an already fixed kernel bug present only in RHEL-4.

It does:
    [debugger] PTRACE_ATTACH
    [some other process] tkill (SIGSTOP)
    [debugger] pid_is_stopped? No: R (running)
    [debuggee's parent] waitpid -> SIGSTOP
    [debugger] waitpid -> HANG!

I was not able to reproduce it with:
    [debugger] PTRACE_DETACH(SIGSTOP)
    [debugger] PTRACE_ATTACH
    [debugger] pid_is_stopped? [...] [...]

and also not with:
    [debugger] tkill (SIGSTOP)
    [debugger] PTRACE_DETACH(0)
    [debugger] PTRACE_ATTACH
    [debugger] pid_is_stopped? [...] [...]

> To clarify. Yes, ptrace(PTRACE_CONT) in the code above can fail,
> SIGCONT or SIGKILL can resume the TASK_STOPPED tracee, but this
> is always true and the tracer should check status after do_wait().

OK, so I can ignore the PTRACE_CONT failure (as GDB also does). (In such a case I may file a different bug occurring later.)

> But if you meant that the tracee can stop _after_ this_thread_is_stopped(),
> this case should be fine.

The problem is if it stops and also its notification gets eaten - like in this new testcase.

(In reply to comment #15)
> Created attachment 477055
> Cannot reliably attach on old kernels.
>
> FAIL kernel-2.6.9-89.33.1.EL.x86_64
> = RHEL-4 = x86-64-4as-8z-v1.ss.eng.bos.redhat.com
> PASS kernel-2.6.35.10-74.fc14.x86_64
> PASS kernel-vanilla-2.6.38-0.rc3.git3.1.fc15.x86_64
>
> But the FAIL reproducibility is in about a minute so the PASS results can also
> mean it just did not get reproduced. So in fact it may be an already fixed
> kernel bug present only in RHEL-4.

Yes, I hope this was already fixed...

> It does:
> [debugger] PTRACE_ATTACH
> [some other process] tkill (SIGSTOP)
> [debugger] pid_is_stopped? No: R (running)
> [debuggee's parent] waitpid -> SIGSTOP

This is already wrong. Hmm. it seems, this should be fixed by f5b40e363ad6041a96e3da32281d8faa191597b9

But, this patch was backported to rhel4? Strange...

> [debugger] waitpid -> HANG!

This is clear.

> > But if you meant that the tracee can stop _after_ this_thread_is_stopped(),
> > this case should be fine.
>
> The problem is if it stops and also its notification gets eaten - like in this
> new testcase.

Yes. But let me clarify just in case. Starting from 90bc8d8b1a38f1ab131a2399a202e1889db95de8 (v2.6.29-7169-g90bc8d8) debugger and parent do not share the exit code, so the parent can't "steal" it from the debugger. (but this can happen in rhel4/5)

However, it can be cleared by the previous debugger. See the test-case in the patch I sent upstream.

So far there is no RHEL-6 kernel Bug discussed here so closing it, thanks for all the info.

=> PTRACE_CONT can fail during the GDB-style-attach and it is OK.

(I guess I will file a different Bug as there is a RHEL5->RHEL6 regression in the GDB testcase gdb.threads/attachstop-mt.exp.)
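
For the status check after wait() that the comments above call for, a minimal sketch might look like the following. The enum and the function name classify_attach_status() are hypothetical and only illustrate the idea: since PTRACE_CONT may have failed and SIGCONT or SIGKILL may have raced with the attach, the tracer decides based on what waitpid() actually reported.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of the status returned by waitpid()
   after a GDB-style attach.  */
enum attach_result {
    ATTACH_STOPPED,     /* tracee stopped with SIGSTOP, as expected */
    ATTACH_OTHER_STOP,  /* stopped, but by a different signal */
    ATTACH_GONE         /* tracee exited or was killed, e.g. SIGKILL */
};

enum attach_result classify_attach_status(int status)
{
    if (WIFSTOPPED(status))
        return WSTOPSIG(status) == SIGSTOP ? ATTACH_STOPPED
                                           : ATTACH_OTHER_STOP;

    /* WIFEXITED or WIFSIGNALED: nothing left to trace.  */
    return ATTACH_GONE;
}
```

With a check like this, a failed PTRACE_CONT needs no special handling of its own; only the wait status matters.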