126089 – Kernel doesn't let GDB stepi out of a signal trampoline

Summary: Kernel doesn't let GDB stepi out of a signal trampoline

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Peter Martuccelli
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	85326 85327 85328
Depends On:	126095 126699 126911 126913 127384
Blocks:	116894 117972 127692 127693

Reported:	2004-06-15 21:33 UTC by Andrew Cagney
Modified:	2007-11-30 22:07 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:24:09 UTC
Target Upstream Version:
Embargoed:

Attachments
New GDB tests sigbpt.exp that illustrate the problems (9.93 KB, patch) 2004-06-21 01:59 UTC, Andrew Cagney
upstream 2.6 patch for i386 sigreturn to trap on return when single-stepped into (1.76 KB, patch) 2004-06-21 23:09 UTC, Roland McGrath
Second example of instruction being skipped (85 bytes, text/plain) 2004-06-21 23:27 UTC, Andrew Cagney

Description Andrew Cagney 2004-06-15 21:33:23 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; NetBSD macppc; en-GB; rv:1.4.1)
Gecko/20040217

Description of problem:
On amd64, when single-stepping the signal trampoline syscall
instruction the inferior runs away:

<signal handler called>
1: x/i $pc  0x2a95822667 <__restore_rt+7>:      syscall 
(gdb) stepi

Program exited normally.
(gdb) KFAIL: gdb.base/sigstep.exp: stepi out of signal trampoline
(program exited) (PRMS: gdb/1639)

There are two problems related to the kernel restoring the trapped
function's registers:

1. with ptrace(PTRACE_SINGLESTEP), after the registers have been restored,
the inferior is allowed to run free instead of being stopped

2. with ptrace(PTRACE_CONT), the restored registers include an enabled
h/w single-step bit, causing the inferior to stop after one instruction

Linux, as the above illustrates, has at least the first problem.


Version-Release number of selected component (if applicable):
kernel-2.4.21-15.EL

How reproducible:
Always

Steps to Reproduce:
1. run gdb on a program throwing a signal
2. breakpoint the signal handler
3. try to stepi back to main
    

Actual Results:  <signal handler called>
1: x/i $pc  0x2a95822667 <__restore_rt+7>:      syscall 
(gdb) stepi

Program exited normally.
(gdb) KFAIL: gdb.base/sigstep.exp: stepi out of signal trampoline
(program exited) (PRMS: gdb/1639)



Expected Results:  <signal handler called>
1: x/i $pc  0x2a95822667 <__restore_rt+7>:      syscall 
(gdb) stepi
.... thrower ...
(gdb)

Additional info:

Probably applies to all architectures

Comment 5 Roland McGrath 2004-06-17 08:19:15 UTC

I have determined that this is not a kernel bug at all; the x86-64
kernel behaves consistently with the i386 kernel here.
The way this works on i386 is that gdb does a ptrace POKEDATA on the
part of the signal handler frame that contains the saved EFLAGS
register, and sets the trace flag (0x100 bit) in the word that was
already saved there.  On x86-64, gdb is not doing this.
The code in gdb/i386-linux-nat.c:child_resume can be copied and
tweaked only slightly to treat x86-64 the same way.
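
The workaround described above amounts to setting one bit in a saved
flags word.  A minimal sketch in C, assuming the saved EFLAGS word has
already been read out of the signal handler frame (for instance with
PTRACE_PEEKDATA).  The names X86_EFLAGS_TF and with_trace_flag are
illustrative only and are not taken from gdb:

```c
#include <stdint.h>

/* Trace flag (TF) in EFLAGS: the 0x100 bit mentioned above. */
#define X86_EFLAGS_TF 0x100u

/* Return the saved EFLAGS word with TF set and every other bit untouched.
   Hypothetical helper: a debugger would write the result back into the
   signal frame (e.g. with PTRACE_POKEDATA) before resuming the inferior. */
uint32_t with_trace_flag(uint32_t saved_eflags)
{
    return saved_eflags | X86_EFLAGS_TF;
}
```

For a saved word of 0x202 this yields 0x302; a word that already has
TF set is returned unchanged.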

Comment 6 Andrew Cagney 2004-06-17 18:02:29 UTC

Hack is a polite term for that code.

It assumes a specific trap mechanism, and the whole point of vsyscall
is that we get away from that assumption - let the kernel determine an
arbitrary trap mechanism.

Given the kernel is (well I _hope_ it is ...) already masking out
other bits in that restored register state, there's nothing stopping
it also setting information.

Comment 7 Roland McGrath 2004-06-20 01:39:34 UTC

Both the hardware architecture and the kernel behavior in this area
are exactly the same on x86-64 and on i386.  Any change in kernel
behavior should be done consistently on both platforms.  Given that
gdb has always coped with the situation on i386 before, upstream
acceptance of changes in this area may not be forthcoming.  

PTRACE_SINGLESTEP works by setting the TF bit in the tracee's flags
register and that is all.  The very same effect is achieved if a
thread sets its own TF bit--if someone is tracing it, it sees a
SIGTRAP with all other details identical to having used PTRACE_SINGLESTEP.

The proposal, then, is that when the TF bit is set in the flags
register when the system call instruction executes, the TF bit should
be set again after the sigreturn system call, which changes all the
registers including the flags, has taken place.  Thus, the
restored thread will immediately stop with a single-step trap at the
PC restored by the system call.  This might be restricted to the case
when the thread is traced, which is the only one gdb is concerned with.

I wonder what other platforms do about this issue; I am not familiar
off hand with the hardware mechanisms used for single-stepping on
other processors, so the issue may be different.  Does gdb either fail
to handle the problem, as on x86-64, or have to address it specially,
as on i386, on any other Linux platform?

I need the whole picture about how this is and should be handled
across processors before I can pursue a change in behavior upstream.

Comment 8 Andrew Cagney 2004-06-21 01:25:13 UTC

Unfortunately, as more careful testing has revealed, that i386 hack
didn't actually fix the problems - it just lessened the impact :-(

There look to be two underlying problems here, and they both appear to
be present on all architectures (at least the ones I've looked at -
ia64, amd64, i386, PPC64) (and for that matter many OSs):

- When single stepping a system-call, an extra instruction after the
system call is executed

This occurs because the kernel/isa does not realise that the trapped
system-call instruction should also be counted as a single-stepped
instruction, and consequently resumes the process allowing a further
instruction to be executed.

The easiest way to fix this is to add code checking for single-step
mode in the system-call return path.  Something like the pseudo code:

    /* Single stepped the system-call, stop
       immediately.  */
    if (frame->srr1 & PSL_SE) {
        frame->srr1 &= ~PSL_SE;
        trapsignal(p, SIGTRAP, EXC_TRC);
    }

For the i386, that probably means modifying entry.S:system_call.  Ouch!


- When single-stepping sigreturn, the single-step bit isn't
propagated.  The easiest way to do this is to modify sys_sigreturn so
that it either sets/propagates the single-step bit or updates
single-step according to ptrace-sstep.  Something like the pseudo code:

    sstep = (tf->srr1 & PSL_SE);
    *tf = sc.sc_frame;
    /* Propagate the single-step bit.  */
    tf->srr1 = (tf->srr1 & ~PSL_SE) | sstep;

The i386 would probably need to add a check of ptrace-sstep to
restore_sigcontext.  I'd note that the function already contains:
  regs->eflags = (regs->eflags & ~0x40DD5) | (tmpflags & 0x40DD5);
which I think should propagate the TF flag, but it doesn't.  My
reading of the ISA manual is that the TF bit gets cleared during the
system-call trap?
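
One way to read the quoted restore_sigcontext line is to work through
the arithmetic.  The sketch below is a stand-alone model, not kernel
code: only the mask 0x40DD5 and the shape of the expression come from
the line above, while the names EFLAGS_TF, RESTORE_MASK and
restore_eflags are made up for illustration.  Since 0x40DD5 includes
the 0x100 (TF) bit, the expression takes TF from the saved word
(tmpflags) rather than carrying over the live value:

```c
#include <stdint.h>

#define EFLAGS_TF    0x100u    /* trace (single-step) flag */
#define RESTORE_MASK 0x40DD5u  /* mask quoted from restore_sigcontext */

/* Model of the quoted line: bits inside the mask come from the saved
   sigcontext word (tmpflags); bits outside it keep their live value. */
uint32_t restore_eflags(uint32_t live, uint32_t tmpflags)
{
    return (live & ~RESTORE_MASK) | (tmpflags & RESTORE_MASK);
}
```

With a live value of 0x302 (TF set) and a saved word of 0x202 (TF
clear) the result is 0x202, so a single-step bit present only in the
live flags does not survive.  With the roles swapped the result is
0x302, which is consistent with the gdb approach of poking TF into the
saved word.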

Can you please CC me on any upstream discussion?

Comment 9 Andrew Cagney 2004-06-21 01:59:33 UTC

Created attachment 101282
New GDB tests sigbpt.exp that illustrate the problems

The test case tries to single-step out of a signal handler back to the
instruction that caused the segv.  Because, for sigreturn, two instructions are
executed (sigreturn and fault) the fault recurs :-(

Comment 10 Andrew Cagney 2004-06-21 21:31:27 UTC

*** Bug 85327 has been marked as a duplicate of this bug. ***

Comment 11 Andrew Cagney 2004-06-21 21:33:05 UTC

*** Bug 85328 has been marked as a duplicate of this bug. ***

Comment 12 Andrew Cagney 2004-06-21 21:34:51 UTC

*** Bug 85326 has been marked as a duplicate of this bug. ***

Comment 13 Roland McGrath 2004-06-21 23:06:15 UTC

You said there are two problems and one of them is that
single-stepping the syscall instruction in general executes the
following instruction.
I cannot reproduce this for an arbitrary system call.  Unless you have
an example where the system call is not sigreturn, then I don't see
any reason to think that this independent problem exists at all.
Please report exactly what evidence you saw on each architecture you
tested.  

From the evidence I know of, the only issue of concern is what happens
on single-stepping into the sigreturn/rt_sigreturn system call.  On
the x86 & x86-64, setting the TF bit indeed means to execute the
following one instruction before trapping; I don't know other
architectures but probably their single-step flags are similar.
This means that what a single-stepped sigreturn wants to do is not
modify the restored state at all, but in fact just restore the given
state as the saved trap state for an immediate single-step stop.
This may be easy to implement in the same way for all machines, i.e.
the sigreturn syscall code just takes the SIGTRAP directly before
returning to user mode at all.  Though the plan is the same across
machines, a change in a machine-dependent function using that
machine's appropriate single-step flag check is required for each one.
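
The behaviour proposed above can be sketched as a small model.  This is
not the attached kernel patch; struct model_regs, EFLAGS_TF and
model_sigreturn are hypothetical names, and the register set is reduced
to a pc and a flags word.  The point it illustrates is that the saved
state is restored unmodified, and the decision to stop is taken from
the flags that were live when the system call was entered:

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_TF 0x100u  /* x86 trace flag */

/* Hypothetical, simplified register set for the model. */
struct model_regs {
    uint32_t pc;
    uint32_t eflags;
};

/* If sigreturn was entered with TF set, restore the saved state as-is
   and report that a SIGTRAP should be taken before any user instruction
   runs at the restored pc. */
bool model_sigreturn(struct model_regs *live, const struct model_regs *saved)
{
    bool stepped_into = (live->eflags & EFLAGS_TF) != 0;

    *live = *saved;        /* restored state is not altered */
    return stepped_into;   /* true: stop now, at the restored pc */
}
```

Each architecture would need its own single-step flag check in place of
the EFLAGS_TF test used here.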

Comment 15 Roland McGrath 2004-06-21 23:09:21 UTC

Created attachment 101317
upstream 2.6 patch for i386 sigreturn to trap on return when single-stepped into

I have tested this i386 patch with the sigbpt.c program and it avoids the
second SIGSEGV being taken on stepi through the signal return.

Comment 16 Andrew Cagney 2004-06-21 23:27:36 UTC

Created attachment 101319
Second example of instruction being skipped

As the previous test case and the example below illustrate, "ret" is skipped.

Compile with: gcc -static -g -o tomago nothing.c

Red Hat Enterprise Linux AS release 3 (Taroon Update 1)

(gdb) 
0x0804d870 in getpid ()
1: x/i $pc  0x804d870 <getpid>: mov    $0x14,%eax
(gdb) disassemble 
Dump of assembler code for function getpid:
0x0804d870 <getpid+0>:	mov    $0x14,%eax
0x0804d875 <getpid+5>:	int    $0x80
0x0804d877 <getpid+7>:	ret    
End of assembler dump.
(gdb) stepi
0x0804d875 in getpid ()
1: x/i $pc  0x804d875 <getpid+5>:	int    $0x80
(gdb) 
0x0804820d in main () at nothing.c:7
7	      kill (getpid (), 0);
1: x/i $pc  0x804820d <main+29>:	add    $0x4,%esp
(gdb) 

Can you please CC me on any upstream e-mail?

Comment 17 Roland McGrath 2004-06-22 00:10:25 UTC

Previously you mentioned mostly the x86-64 case.  This last comment
shows the stepped-after-syscall problem on i386, where I see it in
upstream 2.6 as well.  I do not see any such problem on x86-64.  As I
asked before, please give info about all architectures you can test.

Comment 18 Andrew Cagney 2004-06-23 20:37:37 UTC

Let's concentrate on i386 then.  Once it's all working for that
architecture, arguing for similar changes in the others will hopefully be
easier (I'll find out about the other architectures shortly).

Comment 19 Roland McGrath 2004-06-24 22:33:16 UTC

I have created bug 126699 for the i386 issue with single-stepping
system calls generally.  We still need to know what the issues are on
other architectures, and if any are questionable then those should
have their own bugs for each specific architecture's problems.

Comment 21 Ernie Petrides 2004-09-20 22:45:39 UTC

A fix for this problem was committed to the RHEL3 U4
patch pool yesterday (in kernel version 2.4.21-20.8.EL).

Comment 22 Andrew Cagney 2004-09-21 14:45:47 UTC

Each architecture requires a separate fix - the i386 fix being tracked
by BZ 126699 (which this bug depends on).   Until that is done, this
bug isn't fixed.

It's in NEEDINFO since that is as close as we can get to
need-other-bugs-fixed.

Comment 23 Elena Zannoni 2004-09-24 15:45:16 UTC

i386 is fixed.
ppc64 is fixed.
x86-64 is fixed.
ia64 is not fixed.
s390 is not fixed.

What Andrew meant is that until they are all fixed, this bug cannot be
closed.

Comment 24 Jim Paradis 2004-09-25 00:39:07 UTC

There was some discussion back and forth.  The patch that Ernie refers
to in Comment #21 took care of all x86_64 syscalls except sigreturn;
thus single-stepping out of a signal handler was still problematic.  I
posted a patch to rhkernel-list that achieved the desired result, but
there was some question as to whether it was the *correct* thing to
do. Upon closer study, I determined that it was and posted a
clarification to rhkernel-list, along with a reiteration of my patch.

Comment 25 Ernie Petrides 2004-09-28 08:34:31 UTC

A fix for the final part of the problem has just been committed to the
RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.12.EL).

Comment 26 Ernie Petrides 2004-09-28 08:42:19 UTC

Hang on ... it looks like Jim incorrectly listed this bug as being
fixed by his x86_64 patch.  I think he should have listed bug 126911.

So, I'm reverting this one back to ASSIGNED and will modify the other.

Sorry about that.  -ernie

Comment 27 Ernie Petrides 2004-10-01 11:29:26 UTC

Okay, it looks like fixes for all archs have now been committed
as of the U4 beta-candidate kernel (version 2.4.21-21.EL).

Comment 28 David Woodhouse 2004-10-01 11:32:40 UTC

We should double-check that sigreturn is working as expected on all
architectures we care about.

Comment 29 Elena Zannoni 2004-10-14 17:51:28 UTC

Right, and it's not. ia64 is still buggered. I am reopening this bug,
since it's a catchall. It should be closed/modified only when all the
arches are working.

Comment 30 Elena Zannoni 2004-10-14 17:52:34 UTC

Please look at the list of bugs that this bug depends on; they should
be fixed before this can be closed.

Comment 31 Ernie Petrides 2004-10-14 18:00:26 UTC

Elena, I *did* just what you said in comment #30.  All 5 bugs
are in MODIFIED state.  Should the ia64 bug (126913) also be
reverted to ASSIGNED state now?

Comment 32 Elena Zannoni 2004-10-14 18:09:51 UTC

Sorry, bug 126913 depends on 126095 which was postponed. I added the
dependency here explicitly, instead of relying on nested dependencies.

Comment 36 Red Hat Bugzilla 2007-03-18 22:21:30 UTC

User jparadis's account has been closed

Comment 37 RHEL Program Management 2007-10-19 19:24:09 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
