Description of problem:
When using ptrace() SINGLESTEP to go through a SIGTRAP handler of a child process, the child's signal handler is reset to SIG_DFL.

Version-Release number of selected component (if applicable):
2.6.19-1.2895.fc6

How reproducible:
Always

Steps to Reproduce:
1. gcc -Werror -Wall -g -O -o ptrace_step_sig ptrace_step_sig.c
2. ./ptrace_step_sig

Actual results:
SIGTRAP handler reset (2)! child not properly exited

Expected results:
Child exits properly and the SIGTRAP handler is not reset.

Additional info:
This comes from the Frysk project. http://sourceware.org/bugzilla/show_bug.cgi?id=3997
This does not happen with some older kernels like 2.6.17-1.2174_FC5.
This does not happen to other handlers like SIGHUP.
This does not happen when using ptrace() CONT into the signal handler (as the attached testcase shows).
Created attachment 147568 [details] testcase
Happens on vanilla 2.6.18.6 from kernel.org, too:
$ gcc -o ptrace_step_sig ptrace_step_sig.c
$ ./ptrace_step_sig
SIGTRAP handler reset (2)! child not properly exited
$ uname -a
Linux ac.ebbert.com 2.6.18.6-32smp #6 SMP Thu Dec 21 20:46:58 EST 2006 i686 athlon i386 GNU/Linux
Does not happen on 2.6.16.35
Just as a bit of a blog, and as notes to myself, here's what's happening so far:

Presumably (I haven't checked yet, so it's "presumably") as a result of the
    ptrace (PTRACE_SINGLESTEP, pid, 0, SIGTRAP);
in the testcase, kernel/utrace.c:utrace_signal_handler_singlestep() is called. Something in there (again, I haven't followed that path yet) results in a call to arch/i386/kernel/traps.c:do_debug(), which calls arch/i386/kernel/ptrace.c:send_sigtrap(SIGTRAP,...), which calls kernel/signal.c:force_sig_info(), which then sets
    action->sa.sa_handler = SIG_DFL;
if the signal is currently blocked. Up to that point the handler was correctly pointing at the testcase handler.

A comment in kernel/signal.c reads:
    /*
     * Force a signal that the process can't ignore: if necessary
     * we unblock the signal and change any SIG_IGN to SIG_DFL.
     *
     * Note: If we unblock the signal, we always reset it to SIG_DFL,
     * since we do not want to have a signal handler that was blocked
     * be invoked when user space had explicitly blocked it.
     *
     * We don't want to have recursive SIGSEGV's etc, for example.
     */
so I guess the behaviour is deliberate.

It will take me more poking to figure out what, if anything, should be done about this. I'm going to guess, though, that since PTRACE_SINGLESTEP results in the child looking like it's been stopped by a SIGTRAP, and in the testcase the child sets a non-SIG_DFL handler on SIGTRAP, there's a bit of confusion.
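To make the rule in that kernel comment concrete, here is a toy user-space model of it. This is not kernel code: struct task_model, enum handler_kind and model_force_sig() are hypothetical names invented for illustration, and the model only captures what the comment states (a forced signal that is blocked or ignored gets its disposition reset to the default and is unblocked).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for SIG_DFL, SIG_IGN and a user handler. */
enum handler_kind { H_DEFAULT, H_IGNORE, H_USER };

/* Toy per-signal state: the disposition and whether the signal is blocked. */
struct task_model {
    enum handler_kind handler;
    bool blocked;
};

/*
 * Models the rule quoted from kernel/signal.c: if the forced signal is
 * blocked or ignored, reset the disposition to the default and unblock it.
 * Returns true when the disposition had to be reset.
 */
bool model_force_sig(struct task_model *t)
{
    if (t->blocked || t->handler == H_IGNORE) {
        t->handler = H_DEFAULT;
        t->blocked = false;
        return true;
    }
    return false;   /* deliverable as-is: user handler stays in place */
}
```

In this model, a task with a user handler installed and the signal blocked loses the handler when the signal is forced, while the same task with the signal unblocked keeps it.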
Nice research, yes, very much sounds like the change was deliberate. What's the history of that change? Was the testcase ever posted to lkml with a heads-up this is broken?
(In reply to comment #4)
> Nice research, yes, very much sounds like the change was deliberate. What's the
> history of that change?
>
> Was the testcase ever posted to lkml with a heads-up this is broken?
I don't believe anybody did yet.
Sorry, I did not look at this test case before now.

ptrace (PTRACE_SINGLESTEP, pid, 0, SIGTRAP) is never going to work with a SIGTRAP signal handler unless it uses the SA_NODEFER bit in sa_flags. Entering a signal handler blocks the signal being handled (unless you use SA_NODEFER). Blocked signals do not get reported to ptrace, or dealt with in any other way, until they are unblocked.

For this reason, machine traps that generate signals use the force_sig* calls in the kernel. If the trap's signal won't be handled by a debugger or signal handler because it's blocked, the kernel doesn't just resume executing the machine code that caused the trap. Instead it resets the handler and unblocks the signal to ensure the process crashes (unless the debugger swallows the signal later).

This is the only safe thing to do. Even when there is a debugger that might well swallow the signal, the kernel cannot be sure that it won't end up in a spinning loop that includes the debugger instead of just a spinning loop with the signal handler. So even when ptrace would in fact eat the signal later, the kernel has already reset the handler (and unblocked the signal) before it could take that risk.

This will never change with the ptrace interface, because the signal is the only way that debugger-requested stepping is delivered. Even in the current utrace world, with its different interface, the signal is used the same way and the same issues apply. In future utrace refinements, debugger-requested single-stepping will be reported without using a signal, so things using the new style will not be affected by this (or by some other problems involved with debugger-induced signals). This still won't affect ptrace, which will use the signal as it always has.
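The point about handler entry blocking the signal can be checked without ptrace at all. The following is a minimal sketch, assuming a single-threaded process where SIGTRAP is not already blocked; on_trap() and sigtrap_blocked_during_handler() are hypothetical names, not part of the attached testcase.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler: 1 if SIGTRAP was blocked while it ran, 0 if not. */
static volatile sig_atomic_t trap_blocked_in_handler = -1;

static void on_trap(int sig)
{
    sigset_t cur;

    /* A NULL "set" argument only queries the current signal mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
        trap_blocked_in_handler = sigismember(&cur, sig);
}

/*
 * Installs a SIGTRAP handler with the given sa_flags, raises SIGTRAP once,
 * and reports whether SIGTRAP was blocked inside the handler.
 * Returns 1 (blocked), 0 (not blocked) or -1 (handler did not run / error).
 */
int sigtrap_blocked_during_handler(int sa_flags)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_trap;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = sa_flags;

    trap_blocked_in_handler = -1;
    if (sigaction(SIGTRAP, &sa, &old) != 0)
        return -1;
    raise(SIGTRAP);                  /* handler runs before raise() returns */
    sigaction(SIGTRAP, &old, NULL);  /* restore the previous disposition */
    return trap_blocked_in_handler;
}
```

Calling sigtrap_blocked_during_handler(0) should report that SIGTRAP is blocked inside its own handler, and sigtrap_blocked_during_handler(SA_NODEFER) should report that it is not. The first case is the state in which a forced SIGTRAP from a single-step trap finds the signal blocked.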
If you still see any issues, e.g. using SA_NODEFER, then you might be seeing bug 205659. But the behavior of your test case as described is not a bug.
Indeed setting action.sa_flags = SA_NODEFER makes the stepping into the sig trap handler work as expected. So the fact that this worked on older kernels was a bug?
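For reference, the working variant can be sketched roughly as follows. This is not the attached testcase: it is a minimal reconstruction under the assumption of a Linux system where a process may trace its own child, and child_body(), wait_for_stop() and step_into_sigtrap_handler() are hypothetical names.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran = 0;

static void on_sigtrap(int sig) { (void)sig; handler_ran = 1; }

/* Child: install a SIGTRAP handler with SA_NODEFER, then ask to be traced. */
static void child_body(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NODEFER;        /* do not block SIGTRAP in its handler */
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        _exit(2);
    if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
        _exit(3);
    raise(SIGSTOP);                  /* stop so the parent can take over */
    _exit(handler_ran ? 0 : 4);      /* 0 only if the handler really ran */
}

/* Wait until the child stops with the given signal; 0 on success. */
static int wait_for_stop(pid_t pid, int sig)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFSTOPPED(status) && WSTOPSIG(status) == sig) ? 0 : -1;
}

/* Returns 0 if the child was stepped into its handler and exited cleanly. */
int step_into_sigtrap_handler(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        child_body();                /* never returns */

    if (wait_for_stop(pid, SIGSTOP) == 0 &&
        /* Replace the pending SIGSTOP with SIGTRAP and single-step. */
        ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGTRAP) == 0 &&
        wait_for_stop(pid, SIGTRAP) == 0 &&
        /* Let the handler run to completion without injecting a signal. */
        ptrace(PTRACE_CONT, pid, NULL, NULL) == 0 &&
        waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;

    kill(pid, SIGKILL);              /* clean up if anything went wrong */
    waitpid(pid, &status, 0);
    return -1;
}
```

The sketch does not verify that the stop after PTRACE_SINGLESTEP is at the first instruction of the handler; doing so would need an architecture-specific register read.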
Yes, older kernels would unblock a signal in force_sig_info and then run the user handler, which is wrong as the user handler should never be called when the user blocked the signal.
That does make sense. So I guess the only way around this is figuring out the trap signal handler we want to step into and setting a breakpoint there. But that might then also not work because we are then already in the user handler, so the trap event generated by it will be blocked. Would it make sense to treat the sig trap handler as if it was registered with SA_NODEFER if the process is traced?
That doesn't work. Signal delivery using ptrace needs to be atomic. That is why ptrace was modified so that ptrace(step,signal) would stop at the first instruction of the handler - previously it behaved like ptrace(continue,signal). The atomic requirement is to ensure that the handler state isn't modified by another thread while the debugger is attempting to query it.

For this specific case, I gather that the testcase didn't set up the handler correctly. With that fixed the behavior is as expected and no further changes are required.

(In reply to comment #10)
> That does make sense.
>
> So I guess the only way around this is figuring out the trap signal handler we
> want to step into and setting a breakpoint there. But that might then also not
> work because we are then already in the user handler, so the trap event
> generated by it will be blocked.
>
> Would it make sense to treat the sig trap handler as if it was registered with
> SA_NODEFER if the process is traced?
(In reply to comment #11)
> For this specific case, I gather that the testcase didn't set up the handler
> correctly. With that fixed the behavior is as expected and no further changes
> are required.
The testcase did set up the handler correctly. It just didn't specify SA_NODEFER (which a user program being traced would most likely also not do). So the real question now is how do we simulate stepping into the user trap handler using ptrace. As described in comment #10:
> So I guess the only way around this is figuring out the trap signal handler we
> want to step into and setting a breakpoint there. But that might then also not
> work because we are then already in the user handler, so the trap event
> generated by it will be blocked.
>
> Would it make sense to treat the sig trap handler as if it was registered with
> SA_NODEFER if the process is traced?
And that last sentence should probably be amended by: "and if a ptrace cont, step or signal is done into the trap handler"