Bug 674640

Summary:

ptrace: PTRACE_CONT for PTRACE_ATTACH may fail

Product:

Red Hat Enterprise Linux 6

Reporter:

Jan Kratochvil <jan.kratochvil>

Component:

kernel

Assignee:

Oleg Nesterov <onestero>

Status:

CLOSED NOTABUG

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

low

Docs Contact:

Priority:

low

Version:

6.0

CC:

roland

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

URL:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/detach-stopped-then-cont.c?cvsroot=systemtap

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

674764

Environment:

Last Closed:

2011-02-04 18:16:51 UTC

Bug Depends On:

Bug Blocks:

456333, 674764

Attachments:

Description	Flags
Testcase.	none
wait-after-PTRACE_ATTACH hang	none
Cannot reliably attach on old kernels.	none

Description Jan Kratochvil 2011-02-02 19:28:42 UTC

Created attachment 476640 [details]
Testcase.

Description of problem:
GDB is using PTRACE_ATTACH+kill(SIGSTOP)+PTRACE_CONT to safely attach to both SIGSTOPped+unstopped processes.
That PTRACE_CONT sometimes fails if executed very quickly after a previous PTRACE_DETACH(SIGSTOP).
Is this PTRACE_CONT ESRCH a problem?  GDB does not check the call.
(If ESRCH is OK I will file a different bug on later ops.)

GDB code using it in linux_nat_post_attach_wait:
http://sourceware.org/cgi-bin/cvsweb.cgi/src/gdb/linux-nat.c.diff?r1=1.80&r2=1.81&cvsroot=src
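
For illustration, the PTRACE_ATTACH+kill(SIGSTOP)+PTRACE_CONT sequence described above might look roughly like the following C sketch. This is not the GDB source; the function names `attach_sigstop_cont` and `cont_error_is_tolerable` are hypothetical, and treating ESRCH from PTRACE_CONT as harmless is an assumption that matches the question asked in this report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: ESRCH from PTRACE_CONT is assumed to mean only that
   the tracee was not in a ptrace stop at that moment, so it is tolerated. */
int cont_error_is_tolerable(int err)
{
    return err == ESRCH;
}

/* Attach to PID, which may or may not already be stopped.
   On success store the wait status in *STATUS and return 0. */
int attach_sigstop_cont(pid_t pid, int *status)
{
    assert(status != NULL);

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) != 0)
        return -1;

    /* Extra SIGSTOP so that a later wait has a stop to report even if
       an earlier stop notification was already consumed. */
    if (kill(pid, SIGSTOP) != 0)
        return -1;

    /* Resume an already-stopped process so it becomes wait()able. */
    errno = 0;
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) != 0
        && !cont_error_is_tolerable(errno))
        return -1;

    return waitpid(pid, status, __WALL) == pid ? 0 : -1;
}
```

The caller is still expected to inspect the returned status, since the tracee may have been resumed or killed by someone else in the meantime.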

Version-Release number of selected component (if applicable):
FAIL kernel-2.6.32-71.14.1.el6.x86_64
FAIL kernel-2.6.35.10-74.fc14.x86_64
FAIL kernel-2.6.18-164.el5.x86_64

How reproducible:
Always (in several internal loops).

Steps to Reproduce:
./detach-stopped-then-cont;echo $?

Actual results:
1

Expected results:
0

Additional info:

Comment 2 Jan Kratochvil 2011-02-02 20:19:57 UTC

This is part of the investigation into why
  gdb.threads/attachstop-mt.exp
occasionally FAILs when this patch (its linux-nat.c part) is not applied:
http://pkgs.fedoraproject.org/gitweb/?p=gdb.git;a=blob_plain;f=gdb-rhel5-compat.patch;hb=master

Comment 3 Oleg Nesterov 2011-02-02 21:23:41 UTC

(In reply to comment #0)
>
> Created attachment 476640 [details]
> Testcase.
> 
> That PTRACE_CONT sometimes fails if executed very quickly after a previous
> PTRACE_DETACH(SIGSTOP).

Jan, the test-case looks wrong. It does

   ptrace (PTRACE_DETACH, child, NULL, (void *) SIGSTOP);

this resumes the tracee, it should stop again but it can be
running after the subsequent PTRACE_ATTACH/tkill by the time
you call ptrace (PTRACE_CONT).

You need wait() somewhere in between to ensure it is stopped.
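
A minimal sketch of such a wait, assuming (as in this test-case) that the tracee is the tester's own child: after `ptrace (PTRACE_DETACH, child, NULL, (void *) SIGSTOP)` the parent can block in waitpid() with WUNTRACED until the stop is reported, and only then attach again. The helper name `wait_for_job_stop` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block until CHILD (our own child) reports a job-control stop.
   Returns the stop signal, or -1 if the child went away or waitpid failed. */
int wait_for_job_stop(pid_t child)
{
    int status;
    pid_t got;

    assert(child > 0);

    /* Retry if the call is interrupted by a signal handler. */
    do
        got = waitpid(child, &status, WUNTRACED);
    while (got == -1 && errno == EINTR);

    if (got != child || !WIFSTOPPED(status))
        return -1;
    return WSTOPSIG(status);
}
```

This only works for a child of the caller; it does not help with a foreign process, which cannot be waited on before it is attached.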

Comment 4 Jan Kratochvil 2011-02-02 21:38:53 UTC

(In reply to comment #3)
> You need wait() somewhere in between to ensure it is stopped.

In this case it is the child process so the parent could wait on it.  But in the real world case it is a foreign PTRACE_ATTACHed process.  Debugger cannot wait on a foreign non-ptraced process.

Do you suggest that the wait be done by the debugger that is detaching it or by the one attaching to it?

It cannot wait while PTRACE_DETACHing as it is a foreign process.

While PTRACE_ATTACHing one cannot wait earlier than after PTRACE_CONT as otherwise the wait could hang - if the process was already T(job)-stopped and the stop notification was already eaten before PTRACE_ATTACH.  This is described in the GDB CVS comments in that function linux_nat_post_attach_wait.  PTRACE_CONT is there exactly to turn the foreign process into a wait()able one.


Personally I think PTRACE_DETACH should be synchronous for the userland - all the operations should complete before the syscall returns.  Is it possible?

Comment 5 Oleg Nesterov 2011-02-02 22:19:39 UTC

(In reply to comment #4)
>
> (In reply to comment #3)
> > You need wait() somewhere in between to ensure it is stopped.
> 
> In this case it is the child process so the parent could wait on it.  But in
> the real world case it is a foreign PTRACE_ATTACHed process.  Debugger cannot
> wait on a foreign non-ptraced process.

Sure, I only meant this particular test-case.

> While PTRACE_ATTACHing one cannot wait earlier than after PTRACE_CONT as
> otherwise the wait could hang - if the process was already T(job)-stopped and
> the stop notification was already eaten before PTRACE_ATTACH.

Yes, I understand.

> Personally I think PTRACE_DETACH should be synchronous for the userland - all
> the operations should complete before the syscall returns.  Is it possible?

Oh. It is sooooo hard to change the existing behaviour. Of course technically
this is possible, but I don't think we should do this. Up to Roland ;)

Yes, ptrace is ugly. Perhaps we can do something in kernel to help gdb.
But this is not a bug, although I agree very much this known problem with
->exit_code is very annoying. IIRC, I already tried to discuss this on
lkml long ago.

Perhaps ptrace_attach() could set ->exit_code = SIGSTOP if it attaches
to TASK_STOPPED thread. I'll try to think more.

Comment 6 Jan Kratochvil 2011-02-03 08:41:27 UTC

As this is a confirmed bug, I have checked the testcase into the testsuite:
detach-stopped-then-cont.c

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/detach-stopped-then-cont.c?cvsroot=systemtap

Comment 7 Oleg Nesterov 2011-02-03 19:04:18 UTC

(In reply to comment #6)
>
> As this is a confirmed bug, I have checked the testcase into the testsuite:
> detach-stopped-then-cont.c
 
I disagree.

Yes, this is a known problem, it is not trivial to attach to
the may-be-it-was-stopped thread, and I am going to send the
(upstream) patch which hopefully can help. But we can't assume
that PTRACE_CONT (or any other request) can work right after
PTRACE_ATTACH, this needs wait() to ensure the tracee has already
stopped. I do not think we can change PTRACE_ATTACH semantics in
this respect.

Btw, the test-case (and I assume gdb?) does tkill(SIGSTOP) right
after PTRACE_ATTACH. Why? This looks absolutely pointless,
PTRACE_ATTACH does this if it succeeds.

IOW, whatever we do, I think the test-case should be fixed anyway.

Comment 8 Jan Kratochvil 2011-02-03 19:25:06 UTC

Created attachment 476852 [details]
wait-after-PTRACE_ATTACH hang

Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
FAIL kernel-2.6.9-89.33.1.EL.x86_64
PASS kernel-2.6.35.10-74.fc14.x86_64
It does not hang on F14.

(In reply to comment #7)
> IOW, whatever we do, I think the test-case should be fixed anyway.

The detach-stopped-then-cont.c testcase is just a copy of current FSF GDB since 2008-05-01.


> Yes, this is a known problem, it is not trivial to attach to
> the may-be-it-was-stopped thread,

So could you advise how GDB can attach to a foreign process which can be in any state (stopped/unstopped)?  Without PTRACE_CONT it hangs on older (RHEL-4) kernels.  Or does GDB need to version-check the kernel?


> this needs wait() to ensure the tracee has already stopped.

One cannot call wait() as in some cases it would hang indefinitely as shown by this new testcase.


> Btw, the test-case (and I assume gdb?) does tkill(SIGSTOP) right
> after PTRACE_ATTACH. Why? This looks absolutely pointless,
> PTRACE_ATTACH does this if it succeeds.

To prevent the hang reproduced by this new testcase (on older=RHEL-4 kernels).


> IOW, whatever we do, I think the test-case should be fixed anyway.

We are talking here about fixing GDB, and it must remain backward compatible.  It can be done, but could you suggest how?

Comment 9 Oleg Nesterov 2011-02-03 19:38:39 UTC

(In reply to comment #8)
>
> > this needs wait() to ensure the tracee has already stopped.
> 
> One cannot call wait() as in some cases it would hang indefinitely as shown by
> this new testcase.

Yes, and this is what I am trying to fix. Please wait a bit,
I'll send the patch to lkml for discussion.


> > IOW, whatever we do, I think the test-case should be fixed anyway.
> 
> We talk here about fixing GDB and it must remain backward compatible.

Oh, yes, I see.

But, once again, otoh I see no possibility to change the kernel
so that PTRACE_ATTACH does implicit wait. Imho, too radical change.
But let's discuss this on lkml.

Comment 10 Oleg Nesterov 2011-02-03 19:43:34 UTC

(In reply to comment #8)
>
> Created attachment 476852 [details]
> wait-after-PTRACE_ATTACH hang
> 
> Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
> FAIL kernel-2.6.9-89.33.1.EL.x86_64
> PASS kernel-2.6.35.10-74.fc14.x86_64
> It does not hang on F14.

Oh, too many test-cases, I am totally confused ;)

But if I read it correctly this was fixed by
90bc8d8b1a38f1ab131a2399a202e1889db95de8

Comment 11 Oleg Nesterov 2011-02-03 19:50:48 UTC

(In reply to comment #10)
> (In reply to comment #8)
> >
> > Created attachment 476852 [details]
> > wait-after-PTRACE_ATTACH hang
> > 
> > Attached different testcase hangs on RHEL-4 as it waits after PTRACE_ATTACH.
> > FAIL kernel-2.6.9-89.33.1.EL.x86_64
> > PASS kernel-2.6.35.10-74.fc14.x86_64
> > It does not hang on F14.
> 
> Oh, too many test-cases, I am totally confused ;)
> 
> But if I read it correctly this was fixed by
> 90bc8d8b1a38f1ab131a2399a202e1889db95de8

Argh! forgot to mention... This particular case was fixed,
but do_wait() still can hang after PTRACE_ATTACH, of course.

Comment 12 Oleg Nesterov 2011-02-03 21:02:31 UTC

So. I sent the patch upstream, it should ensure that it is
always "safe" to use wait() after ptrace(PTRACE_ATTACH).
If we are going to change the kernel, I do not see a better
solution but I am open to any suggestion ;)

However. This is a very old problem, and I still can't understand
why gdb can't do something like

   ptrace(PTRACE_ATTACH);

   // look into /proc/<tid>/status
   if (this_thread_is_stopped()) {
       // OK, it is already stopped, we don't need wait()
       // but we need another SIGSTOP in case the previous
       // one was the reason for this stop
       tkill(SIGSTOP);
       ptrace(PTRACE_CONT);
   }

   wait(&status);

   // now it should be stopped or we raced with SIGCONT
   // from somewhere, check status.

   ...

what do you think?
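
The pseudocode above could be fleshed out along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, a job-control stop is assumed to show up as "T (stopped)" on the "State:" line of /proc/<tid>/status, and a failing PTRACE_CONT is deliberately ignored. It is not the GDB implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if LINE is a "State:" line reporting a job-control stop.
   Assumption: such a stop is printed as "T (stopped)". */
int state_line_is_stopped(const char *line)
{
    return strncmp(line, "State:", 6) == 0
           && strstr(line, "T (stopped)") != NULL;
}

/* Return 1 if TID is job-control stopped, 0 if it is not,
   -1 if /proc/<tid>/status cannot be opened. */
int tid_is_stopped(pid_t tid)
{
    char path[64], line[256];
    int stopped = 0;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", (int) tid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (state_line_is_stopped(line)) {
            stopped = 1;
            break;
        }
    }
    fclose(f);
    return stopped;
}

/* Attach to TID in any state; on success store the wait status and return 0. */
int attach_any_state(pid_t tid, int *status)
{
    assert(status != NULL);

    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) != 0)
        return -1;

    if (tid_is_stopped(tid) == 1) {
        /* Already stopped: queue another SIGSTOP and resume, so that the
           wait below has a fresh stop to report.  PTRACE_CONT may fail if
           the tracee was resumed meanwhile; that is ignored here. */
        syscall(SYS_tkill, tid, SIGSTOP);
        ptrace(PTRACE_CONT, tid, NULL, NULL);
    }

    return waitpid(tid, status, __WALL) == tid ? 0 : -1;
}
```

The status check and the attach are still separate steps, so the sketch inherits whatever races exist between reading /proc and the tracee changing state.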

Comment 13 Jan Kratochvil 2011-02-03 21:15:47 UTC

(In reply to comment #12)
> So. I sent the patch upstream, it should ensure that it is
> always "safe" to use wait() after ptrace(PTRACE_ATTACH).

OK, thanks.


> However. This is a very old problem, and I still can't understand
> why gdb can't do something like
> 
>    ptrace(PTRACE_ATTACH);
> 
>    // look into /proc/<tid>/status
>    if (this_thread_is_stopped()) {

My fault, GDB does so (see linux_nat_post_attach_wait).

There is still a problem that it is racy this way - /proc/PID/status is changing very shortly after PTRACE_DETACH.  Unfortunately it stays racy if the code wants to be backward compatible with older kernels even when it runs on new/fixed kernel.

The testcase detach-stopped-then-cont.c is invalid, removing it now.

Comment 14 Oleg Nesterov 2011-02-03 21:35:34 UTC

(In reply to comment #13)
> 
> There is still a problem that it is racy this way - /proc/PID/status is
> changing very shortly after PTRACE_DETACH.

Hmm... it shouldn't afaics, can you tell more?

To clarify. Yes, ptrace(PTRACE_CONT) in the code above can fail,
SIGCONT or SIGKILL can resume the TASK_STOPPED tracee, but this
is always true and the tracer should check status after do_wait().

But if you meant that the tracee can stop _after_ this_thread_is_stopped(),
this case should be fine.
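
As a rough illustration of "the tracer should check status", the value returned by wait could be classified as below. The enum and function names are hypothetical; the sketch only distinguishes the expected SIGSTOP stop from a stop by another signal (for example after a racing SIGCONT) and from a tracee that has exited or been killed.

```c
#include <signal.h>
#include <sys/wait.h>

enum attach_result {
    ATTACH_STOPPED_SIGSTOP,  /* the stop the tracer was waiting for */
    ATTACH_STOPPED_OTHER,    /* stopped, but reporting a different signal */
    ATTACH_TRACEE_GONE,      /* exited or was killed, e.g. by SIGKILL */
    ATTACH_UNKNOWN           /* anything else */
};

/* Classify a status value obtained from wait()/waitpid() after attaching. */
enum attach_result classify_attach_status(int status)
{
    if (WIFSTOPPED(status))
        return WSTOPSIG(status) == SIGSTOP ? ATTACH_STOPPED_SIGSTOP
                                           : ATTACH_STOPPED_OTHER;
    if (WIFEXITED(status) || WIFSIGNALED(status))
        return ATTACH_TRACEE_GONE;
    return ATTACH_UNKNOWN;
}
```

A tracer that sees ATTACH_STOPPED_OTHER would typically have to decide whether to pass the signal on and wait again; that policy is outside this sketch.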

> Unfortunately it stays racy if the
> code wants to be backward compatible with older kernels even when it runs on
> new/fixed kernel.

Yes, yes, I understand.

> The testcase detach-stopped-then-cont.c is invalid, removing it now.

OK, thanks.

Comment 15 Jan Kratochvil 2011-02-04 16:17:43 UTC

Created attachment 477055 [details]
Cannot reliably attach on old kernels.

(In reply to comment #14)
> (In reply to comment #13)
> > There is still a problem that it is racy this way - /proc/PID/status is
> > changing very shortly after PTRACE_DETACH.
> 
> Hmm... it shouldn't afaics, can you tell more?

Attached a testcase.

If it prints `1' it aborts on a timeout.

FAIL kernel-2.6.9-89.33.1.EL.x86_64
 = RHEL-4 = x86-64-4as-8z-v1.ss.eng.bos.redhat.com
PASS kernel-2.6.35.10-74.fc14.x86_64
PASS kernel-vanilla-2.6.38-0.rc3.git3.1.fc15.x86_64

But the FAIL takes about a minute to reproduce, so the PASS results may also mean it just did not get reproduced.  So in fact it may be an already fixed kernel bug present only in RHEL-4.

It does:
[debugger]           PTRACE_ATTACH
[some other process] tkill (SIGSTOP)
[debugger]           pid_is_stopped? No: R (running)
[debuggee's parent]  waitpid -> SIGSTOP
[debugger]           waitpid -> HANG!

I was not able to reproduce it with:
[debugger]           PTRACE_DETACH(SIGSTOP)
[debugger]           PTRACE_ATTACH
[debugger]           pid_is_stopped? [...]
[...]

and also not with:
[debugger]           tkill (SIGSTOP)
[debugger]           PTRACE_DETACH(0)
[debugger]           PTRACE_ATTACH
[debugger]           pid_is_stopped? [...]
[...]


> To clarify. Yes, ptrace(PTRACE_CONT) in the code above can fail,
> SIGCONT or SIGKILL can resume the TASK_STOPPED tracee, but this
> is always true and the tracer should check status after do_wait().

OK, so I can ignore PTRACE_CONT failure (as GDB also does).
(In such a case I may file a different bug occurring later.)


> But if you meant that the tracee can stop _after_ this_thread_is_stopped(),
> this case should be fine.

The problem is if it stops and also its notification gets eaten - like in this new testcase.

Comment 16 Oleg Nesterov 2011-02-04 18:07:59 UTC

(In reply to comment #15)
>
> Created attachment 477055 [details]
> Cannot reliably attach on old kernels.
>
> FAIL kernel-2.6.9-89.33.1.EL.x86_64
>  = RHEL-4 = x86-64-4as-8z-v1.ss.eng.bos.redhat.com
> PASS kernel-2.6.35.10-74.fc14.x86_64
> PASS kernel-vanilla-2.6.38-0.rc3.git3.1.fc15.x86_64
> 
> But the FAIL takes about a minute to reproduce, so the PASS results may also
> mean it just did not get reproduced.  So in fact it may be an already fixed
> kernel bug present only in RHEL-4.

Yes, I hope this was already fixed...

> It does:
> [debugger]           PTRACE_ATTACH
> [some other process] tkill (SIGSTOP)
> [debugger]           pid_is_stopped? No: R (running)
> [debuggee's parent]  waitpid -> SIGSTOP

This is already wrong. Hmm. it seems, this should be fixed by
f5b40e363ad6041a96e3da32281d8faa191597b9

But, this patch was backported to rhel4? Strange...

> [debugger]           waitpid -> HANG!

This is clear.

> > But if you meant that the tracee can stop _after_ this_thread_is_stopped(),
> > this case should be fine.
> 
> The problem is if it stops and also its notification gets eaten - like in this
> new testcase.

Yes. But let me clarify just in case. Starting from
90bc8d8b1a38f1ab131a2399a202e1889db95de8 (v2.6.29-7169-g90bc8d8)
debugger and parent do not share the exit code, so the parent can't
"steal" it from the debugger. (but this can happen in rhel4/5)

However, it can be cleared by the previous debugger. See the test-case
in the patch I sent upstream.

Comment 17 Jan Kratochvil 2011-02-04 18:16:51 UTC

So far there is no RHEL-6 kernel Bug discussed here so closing it, thanks for all the info.

=> PTRACE_CONT can fail during the GDB-style-attach and it is OK.

(I guess I will file a different Bug, as there is a RHEL5->RHEL6 regression in the GDB testcase gdb.threads/attachstop-mt.exp.)