Bug 468089 - utrace: PTRACE_O_TRACEVFORKDONE does not work
utrace: PTRACE_O_TRACEVFORKDONE does not work
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
x86_64 Linux
high Severity high
: ---
: ---
Assigned To: Roland McGrath
Fedora Extras Quality Assurance
: Reopened
Depends On:
Blocks: 468395
  Show dependency treegraph
Reported: 2008-10-22 14:17 EDT by Jan Kratochvil
Modified: 2009-07-14 12:44 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-07-14 12:44:46 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
Fix (relative (1.17 KB, patch)
2008-11-04 09:01 EST, Denys Vlasenko
no flags Details | Diff

  None (edit)
Description Jan Kratochvil 2008-10-22 14:17:02 EDT
Description of problem:
Unable to debug GDB with GDB.  Probably also some other programs but I hit just this case.

Version-Release number of selected component (if applicable):
kernel-2.6.27-17.fc10.x86_64 BAD
kernel- BAD
kernel- BAD
kernel- OK
kernel- OK

How reproducible:

Steps to Reproduce:
wget -q -O o_tracevforkdone.c http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/o_tracevforkdone.c?cvsroot=systemtap; gcc -o o_tracevforkdone o_tracevforkdone.c -Wall -ggdb2 -D_GNU_SOURCE; ./o_tracevforkdone; echo $?

Actual results:

Expected results:

Additional info:
Upstream GDB CVS HEAD severly crashes on such kernels:
$ ./gdb -nx -ex r --args ./gdb -nx -ex r true
warning: Unexpected waitpid result 00117f when waiting for vfork-done
linux-nat.c:1945: internal-error: unknown ptrace event 5
Comment 1 Chuck Ebbert 2008-10-23 01:09:10 EDT
Should be fixed in, already in updates-testing. And same as bug #464520 in F8 AFAICT...
Comment 2 Chuck Ebbert 2008-10-23 01:11:23 EDT
Actually this is a new bug introduced by the latest utrace update.
Comment 3 Lennart Poettering 2008-10-23 15:38:50 EDT
Please please please with cream on top, give us a working debugger! How are we supposed to fix bugs for F10 if gdb is completely broken?
Comment 4 Jan Kratochvil 2008-10-24 11:11:23 EDT
Verified as fixed also in Rawhide - kernel-
Comment 5 Jan Kratochvil 2008-10-25 03:51:25 EDT
The problem is still present in both F9 and F10.
It is just x86_64 specific:
kernel- BAD
kernel- BAD
Comment 6 Denys Vlasenko 2008-10-27 10:55:20 EDT
Hmmmm.... it works for me on
Comment 9 Denys Vlasenko 2008-10-30 14:10:02 EDT
On affected machines, when I run o_tracevforkdone.c, syscall_trace_leave() is not called. Thus after vfork parent does not get notified.

I built instrumented kernel and am comparing how this testcase runs on my laptop and on affected machine.

Good dmesg output:
[3473] utrace_set_events: new:0 old:0
[3473] utrace_set_events: new:0 old:0
[3473] ptrace_report_clone 1
[3473] utrace_set_events: new:80 old:0
[3473] utrace_set_events 2: new:80 old:0
[3473] thread_info->flags:889 TIF_SYSCALL_TRACE:0
[3473] ptrace_report_clone 2: ret:0 engine->flags:1f8d
[3473] A thread_info->flags:889
[3473] B thread_info->flags:80000889
 [3474] sys_exit: thread_info->flags:80
[3473] syscall_trace_leave: 31th bit is set
[3473] syscall_trace_leave: TIF_SYSCALL_TRACE is set
[3473] tracehook_report_syscall_exit 1
[3473] tracehook_report_syscall_exit 2
[3473] utrace_set_events: new:0 old:80
[3473] ptrace_report_syscall_exit 1
[3473] ptrace_report_syscall_exit 2
[3473] utrace_reset: clear TIF_SYSCALL_TRACE
[3473] sys_exit: thread_info->flags:80000080

The interesting things are after we see vfork child exiting. Parent calls syscall_trace_leave() and all is well. Then it exits.

[2393] utrace_set_events: new:0 old:0
[2393] utrace_set_events: new:0 old:0
[2393] ptrace_report_clone 1
[2393] utrace_set_events: new:80 old:0
[2393] utrace_set_events 2: new:80 old:0
[2393] thread_info->flags:1 TIF_SYSCALL_TRACE:0
[2393] ptrace_report_clone 2: ret:0 engine->flags:1f8d
[2393] A thread_info->flags:1
[2393] B thread_info->flags:80000001
 [2394] sys_exit: thread_info->flags:0
[2393] sys_exit: thread_info->flags:80000001

syscall_trace_leave was not called. Look at bit 0 (TIF_SYSCALL_TRACE)
still being set in thread_info->flags when parent exits after vfork!
 How entry_64.S managed to not notice it last time it was exiting a syscall (vfork)?!
NB: adding more syscalls still does not make it stop.
Need to look into entry_64.S - how does it miss TIF_SYSCALL_TRACE bit?
Comment 10 Denys Vlasenko 2008-10-31 13:15:04 EDT
More instrumentation revealed the following. On x86_64 in entry_64.S there are two paths out of syscalls - "sysret" path and "iretd" (or "int") path.
"sysret" path fails to check TIF_SYSCALL_TRACE bit in thread_info->flags, but this isn't a bug, because "sysret" path should not be taken in TIF_SYSCALL_TRACE case, we should be in "int" path. Please compare these, and notice that "good" trace has "tracesys" instead of "system_call_after_swapgs" in stack dump. That means that somehow this part of entry_64.S code prior to entry to vfork syscall:

        testl $_TIF_WORK_SYSCALL_ENTRY,threadinfo_flags(%rcx)
        jnz tracesys

did not trigger in "bad" case and we erroneously did not go to tracesys label but went into "sysret" fastpath which then have lost TIF_SYSCALL_TRACE bit.

Good output:
[3482] utrace_set_events: new:0 old:0
[3482] utrace_set_events: new:0 old:0
[3482] ptrace_report_clone 1         
[3482] utrace_set_events: new:80 old:0
[3482] utrace_set_events 2: new:80 old:0
[3482] thread_info->flags:81 TIF_SYSCALL_TRACE:0
[3482] ptrace_report_clone 2: ret:0 engine->flags:1f8d
[3482] A thread_info->flags:81                        
[3482] B thread_info->flags:80000081                  
Pid: 3482, comm: o_tracevforkdon Not tainted #14
Call Trace:                                                  
 [<ffffffff8103f15d>] ptrace_report_clone+0x1c0/0x21f        
 [<ffffffff81069db3>] utrace_report_clone+0xa3/0x115         
 [<ffffffff81036644>] do_fork+0x16b/0x254                    
 [<ffffffff8100c377>] ? tracesys+0xd0/0xd5                   
 [<ffffffff8100a5c4>] sys_vfork+0x20/0x22                    
 [<ffffffff8100c5a7>] ptregscall_common+0x67/0xb0            
[3483] sys_exit: thread_info->flags:80                       
In int_very_careful!                                         
[3482] syscall_trace_leave: 31th bit is set                  
[3482] syscall_trace_leave: TIF_SYSCALL_TRACE is set         
[3482] tracehook_report_syscall_exit 1                       
[3482] tracehook_report_syscall_exit 2                       
[3482] utrace_set_events: new:0 old:80                       
[3482] ptrace_report_syscall_exit 1                          
[3482] ptrace_report_syscall_exit 2                          
In int_very_careful!                                         
[3482] utrace_reset: clear TIF_SYSCALL_TRACE                 
[3482] sys_exit: thread_info->flags:80000080                 

[2403] utrace_set_events: new:0 old:0
[2403] utrace_set_events: new:0 old:0
[2403] ptrace_report_clone 1         
[2403] utrace_set_events: new:80 old:0
[2403] utrace_set_events 2: new:80 old:0
[2403] thread_info->flags:809 TIF_SYSCALL_TRACE:0
[2403] ptrace_report_clone 2: ret:0 engine->flags:1f8d
[2403] A thread_info->flags:809                       
[2403] B thread_info->flags:80000809                  
Pid: 2403, comm: o_tracevforkdon Not tainted #13
Call Trace:                                                  
 [<ffffffff8103f15d>] ptrace_report_clone+0x1c0/0x21f        
 [<ffffffff81069db3>] utrace_report_clone+0xa3/0x115         
 [<ffffffff81036644>] do_fork+0x16b/0x254                    
 [<ffffffff8100c11a>] ? system_call_after_swapgs+0x8a/0x8f
 [<ffffffff8100a5c4>] sys_vfork+0x20/0x22
 [<ffffffff8100c5a7>] ptregscall_common+0x67/0xb0
[2404] sys_exit: thread_info->flags:0
In sysret_check!
In sysret_signal!
[2403] sys_exit: thread_info->flags:80000001

NB: 31th bit of thread_info->flags is set by instrumentation code. Bit 0 is TIF_SYSCALL_TRACE. Other bits vary from run to run and we generally are not interested in those. First sys_exit in the trace (with different PID) belongs to exiting vfork child.
Comment 11 Denys Vlasenko 2008-11-03 13:17:56 EST
in fork.c in do_fork() lies pre-utrace point where PTRACE_O_TRACEVFORKDONE was handled:

                if (clone_flags & CLONE_VFORK) {
==>                     tracehook_report_vfork_done(p, nr);

But it's a nop for utrace kernels. utrace seems to attempt handling it on syscall exit, but fails as described above.

I managed to reproduce the bug on my machine by switching off selinux and syscall auditing.

Attempted another "route of attack" on this bug by making _TIF_SYSCALL_TRACE bit be present in _TIF_WORK_MASK:

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK                                                  \
        (0x0000FFFF & ~(/*_TIF_SYSCALL_TRACE |*/ _TIF_SYSCALL_AUDIT |   \
                        _TIF_SINGLESTEP | _TIF_SECCOMP | _TIF_SYSCALL_EMU))

This makes "return from syscall" code path notice _TIF_SYSCALL_TRACE bit which was just set by vfork() syscall we return from.

Unfortunately, in this particular case vfork() returns because [recall when vfork() returns...] child has died (not execed), and this makes parent receive SIGCHLD, which collides with, and overrides, PTRACE_O_TRACEVFORKDONE notification - we get SIGCHLD instead of SIGTRAP.

We also race with clearing of PT_VFORKING and if ptrace_report_signal wins, sometimes this WARN_ON triggers:

static u32 ptrace_report_syscall_exit(...)
        if (!(engine->flags & UTRACE_EVENT(SYSCALL_ENTRY))) {
                int ret = utrace_set_events(
                        task, engine,
                        engine->flags & ~UTRACE_EVENT(SYSCALL_EXIT));
===>            WARN_ON(!(task->ptrace & PT_VFORKING));
                task->ptrace &= ~PT_VFORKING;
Comment 12 Denys Vlasenko 2008-11-04 09:01:04 EST
Created attachment 322425 [details]
Fix (relative

syscall entry/return code path has two major branches, "syscall/sysret" for the most usual case and "int" branch which is slower but allows for more manipulations with CPU state.

syscall tracing uses "int" branch. "syscall/sysret" branch does not handle syscall entry/exit tracing because it is never entered with syscall tracing on. This includes vfork syscall too. This appears to be by design, because 32-bit version does the same.

But in the case of vfork + PTRACE_O_TRACEVFORKDONE, vfork syscall turns ON syscall tracing bit (_TIF_SYSCALL_TRACE) when it is unblocked.

The bug is that even though bit is set, "syscall/sysret" branch does not handle it, and PTRACE_O_TRACEVFORKDONE is not delivered to tracer.

This patch fixes the issue by checking for _TIF_SYSCALL_TRACE bit in thread flags in sysret_signal part of entry_64.S, and switching to "int" syscall return code path.

Since vfork can be unblocked by child execing, *or exiting*, in the second case we (parent-which-is-being-traced) are getting usual SIGCHLD, and since we are traced, we need to report *both* PTRACE_O_TRACEVFORKDONE and the fact that we got SIGCHLD. It seems that the intention is to report PTRACE_O_TRACEVFORKDONE first, and then (after tracer PTRACE_CONTs or PTRACE_SYSCALLs us) report SIGCHLD.

This second notification is sometimes lost. I ran o_tracevforkdone test in the loop:

while sleep 0.05; do echo -n .; ./o_tracevforkdone || { echo BAD!; exit; }; done

and it loses SIGCHLD in ~1% of cases. It happens here:

      l = ptrace (PTRACE_CONT, child, NULL, NULL);
      assert_perror (errno);
      assert (l == 0);
      errno = 0;
      pid = waitpid (child, &status, 0);
      assert_perror (errno);
      assert (pid == child);
      assert (WIFSTOPPED (status)); <====

I checked status and it indicates that process has exited with exit code 42. Definitely the case of SIGCHLD being lost.
Comment 13 Denys Vlasenko 2008-11-04 10:22:57 EST
Might be related to bug 469684 "signals not delivered when tracing system calls"
Comment 14 Roland McGrath 2008-11-16 16:51:04 EST
I modified the o_tracevforkdone test case (in ptrace-tests cvs) so it also tests using PTRACE_SYSCALL on the child to enter the vfork system call.  This means the parent task will already have TIF_SYSCALL_TRACE set.  Both variants of the test fail, which would seem to contradict the whole comment#10 theory.

I don't think there is any evidence of a problem in the system call return path.

The problem I see is that the traced thread doing clone does not stop in as many distinct places as ptrace wants it to.  In utrace, there is no stopping point between the vfork-parent block and the quiesce/signal callback.  In non-vfork clones, there is no stopping point after the clone callback before the quiesce/signal callbacks either.  So ptrace gets the multiple callbacks but does not actually stop (enter TASK_TRACED) until report_signal or report_quiesce returns UTRACE_STOP, so the intermediate values of ->exit_code are missed.
This also explains bug 469684, I think (no stop between syscall-exit and quiesce).

I'll need to rework the ptrace code to handle the multiple stops differently.
Comment 15 Denys Vlasenko 2008-11-17 13:31:21 EST
> I don't think there is any evidence of a problem in the system call return

I observed that bug doesn't manifest itself when syscall audit is on (normal state of Fedora 9 with selinux enabled). And the difference is that with syscall audit we always go through "int" path in entry_64.S

This is how we fail to catch TIF_SYSCALL_TRACE on exit from vfork:

        movq    %rsp,%gs:pda_oldrsp
        movq    %gs:pda_kernelstack,%rsp
        SAVE_ARGS 8,1
        movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
        movq  %rcx,RIP-ARGOFFSET(%rsp)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%rcx)
        jnz tracesys

and if SYSCALL_AUDIT is on, we jump to tracesys, and eventually go thru "int" path, where vforkdone notification will be caught. (I do not know why same does not work with SYSCALL_TRACE, this is not what I am talking about here). Let's see what will happen if we did not jump:

        cmpq $__NR_syscall_max,%rax
        ja badsys
        movq %r10,%rcx
        call *sys_call_table(,%rax,8)  # XXX:    rip relative

Here we do vfork. When we return, SYSCALL_TRACE is set by vfork. Now let's see whether we will notice this:

        movq %rax,RAX-ARGOFFSET(%rsp)
        movl $_TIF_ALLWORK_MASK,%edi
        movl TI_flags(%rcx),%edx
        andl %edi,%edx
        jnz  sysret_careful

TIF_ALLWORK_MASK contain all bits including SYSCALL_TRACE, we jump here

        bt $TIF_NEED_RESCHED,%edx
        jnc sysret_signal

Let's assume NEED_RESCHED is not set and we jump, skipping these instructions:

        pushq %rdi
        call schedule
        popq  %rdi
        jmp sysret_check

        bt $TIF_SYSCALL_AUDIT,%edx
        jc sysret_audit

We do not jump here

        leaq do_notify_resume(%rip),%rax
        leaq -ARGOFFSET(%rsp),%rdi # &pt_regs -> arg1
        xorl %esi,%esi # oldset -> arg2
        call ptregscall_common
        movl $_TIF_WORK_MASK,%edi   <=========== NB!
        jmp int_with_check

Obviously we jump here. NB: %edi == TIF_WORK_MASK, which is:
#define _TIF_WORK_MASK                                                  \
        (0x0000FFFF &                                                   \
         ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|                       \
It does not contain SYSCALL_TRACE bit.
Ok, now we are here:

        movl TI_flags(%rcx),%edx
        andl %edi,%edx
        jnz   int_careful

Here we test thread flags against %edi, *and %edi does not have SYSCALL_TRACE bit*. So we do not jump to int_careful (and that is the place where we eventually would "testl $_TIF_WORK_SYSCALL_EXIT,%edx" and catch SYSCALL_TRACE). Instead, we continue:

        andl    $~TS_COMPAT,TI_status(%rcx)
        jmp   retint_swapgs
retint_swapgs:          /* return to user-space */
        jmp restore_args
        RESTORE_ARGS 0,8,0

Bingo. SYSCALL_TRACE was not caught.

The patch in comment #12 prevents this, and I did run-test it.

However, I do not insist on my fix being correct. Maybe it's better to fix it another way (say, make vfork not abuse SYSCALL_TRACE?)

I merely insist that in current code SYSCALL_TRACE is indeed not caught. :)
Comment 16 Denys Vlasenko 2009-01-14 08:48:22 EST
kernel-2.6.29-0.28.rc1.fc11.x86_64 appears to have it fixed. I ran a looping test of o_tracevforkdone testcase and it does not fail.
Comment 17 Bug Zapper 2009-06-09 23:03:12 EDT
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
Comment 18 Bug Zapper 2009-07-14 12:44:46 EDT
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.