Description of problem:

The OpenVZ Linux kernel team has found a deadlock between the ptrace and coredump code. No root privileges are required to trigger it. 2.6.18-128.1.10.el5 is affected; the exploit is in the attachment.

---

SysRq-T trace:

expl_zap3     R ffff81003fa6cd70     0  8645  10409  8646 (NOTLB)
 0000000000000068 ffffffff8006358b ffff81007e13f980 0000000000000000
 ffff81003fa6cd70 ffff81003fa6cd70 0000013e867d5d52 0000027bcf133f75
 ffff81003fa6cf80 ffffffff8046b780 ffffffff802fa680 ffff81003d0f1ec8
Call Trace:
 [<ffffffff8006358b>] __sched_text_start+0x11b/0xfcb
 [<ffffffff800863c2>] task_rq_lock+0x26/0x45
 [<ffffffff80088be2>] sys_sched_yield+0xb1/0xb8
 [<ffffffff800c8fa8>] ptrace_start+0x3bd/0x465
 [<ffffffff80029223>] do_wait+0xafd/0xb99
 [<ffffffff800c9cfb>] sys_ptrace+0x48/0x1f7
 [<ffffffff80060477>] ptregscall_common+0x67/0xac
 [<ffffffff80060166>] system_call+0x7e/0x83

expl_zap3     D ffff81007ff395c0     0  8647  8645  8646 (NOTLB)
 ffff81007f6abc58 0000000000000086 0000000000000006 ffff81007b944c30
 ffff81007ff395c0 ffff81007ff33500 0000004ed2555339 0000009d55ffa665
 ffff81007ff397c8 ffffffff8046b780 0000004ed2554f5e ffff81007ff33500
Call Trace:
 [<ffffffff80086900>] __activate_task+0x92/0x157
 [<ffffffff800492ac>] try_to_wake_up+0x3ce/0x3e0
 [<ffffffff80064691>] wait_for_completion+0x79/0xa2
 [<ffffffff80088417>] default_wake_function+0x0/0xe
 [<ffffffff800ecbb5>] do_coredump+0x341/0x8a1
 [<ffffffff800a32e5>] ub_slab_uncharge+0xd0/0xdb
 [<ffffffff800a3493>] do_ub_siginfo_uncharge+0x42/0x55
 [<ffffffff80096391>] recalc_sigpending+0xe/0x25
 [<ffffffff8002c00a>] get_signal_to_deliver+0x434/0x46b
 [<ffffffff8005d8f6>] do_notify_resume+0xd0/0x7e3
 [<ffffffff80096f63>] __group_send_sig_info+0x89/0x94
 [<ffffffff8005d26a>] group_send_sig_info+0x76/0x83
 [<ffffffff80088a0d>] vcpu_put+0x8e/0x16e
 [<ffffffff8004ea1d>] sys_kill+0x19f/0x1b2
 [<ffffffff80088a0d>] vcpu_put+0x8e/0x16e
 [<ffffffff800601ef>] sysret_signal+0x1c/0x27
 [<ffffffff80060477>] ptregscall_common+0x67/0xac
Created attachment 346615
Proposed patch from OpenVZ
Created attachment 346742
fix do_coredump() vs ptrace_start() deadlock

This is not as simple as I thought...

I suspect the patch from OpenVZ is not exactly right. PF_SIGNALED is always set when the thread is killed by a fatal signal; if we check this flag in ptrace_start(), I'm afraid we can get a false positive when the exiting tracee calls tracehook_report_exit().

Also, there is no guarantee that the tracee has PF_SIGNALED set when we are about to deadlock. Suppose the tracee just exits and sleeps in TASK_TRACED because of PTRACE_EVENT_EXIT. The tracer calls ptrace_start(). After that, another thread which shares the same ->mm starts the coredump. zap_process() wakes up the tracee, which calls exit_mm()->wait_for_completion() and sleeps in D state, but without PF_SIGNALED.

We could add the SIGNAL_GROUP_EXIT check:

	--- a/kernel/ptrace.c
	+++ b/kernel/ptrace.c
	@@ -933,7 +933,8 @@ ptrace_start(long pid, long request,
	 	 */
	 	wait_task_inactive(child);
	 	while (child->state != TASK_TRACED && child->state != TASK_STOPPED) {
	-		if (child->exit_state) {
	+		if (child->exit_state ||
	+		    (current->signal->flags & SIGNAL_GROUP_EXIT)) {
	 			__ptrace_state_free(state);
	 			goto out_tsk;
	 		}

If we race with the coredumping thread which shares the same ->mm, the tracer should be killed by SIGKILL too. In that case we can just return; the error code does not matter because we will never return to user space.

(This patch could also help if the rt tracer preempts the tracee; we can spin forever in that case. At least, with this check we can kill the tracer. However, without fixing wait_task_inactive() this doesn't really help.)

The patch above should fix this deadlock, but unfortunately it does not solve all problems. Note that with this test case the tracer, the tracee, and the coredumping thread share the same ->mm. This is because it was originally written to exploit another problem, fixed by 5ecfbae093f0c37311e89b29bfc0c9d586eace87.

But what if the tracer does not participate in coredumping?
In that case the tracer is not killed, and we still have problems: the tracer will spin until the coredump completes.

So I'd suggest this patch. Untested, not even compiled. Just for review/discussion.

Also, upstream checks mm->core_state in may_ptrace_stop() to prevent another deadlock (and afaics it is still needed, despite the fact that schedule() now checks signal_pending_state()). I wonder if RHEL needs something like this check. utrace_quiescent() checks sigkill_pending(), but if SIGKILL was already dequeued it can return false. This means that if the tracer, the tracee, and the coredumping thread share the same ->mm, we can deadlock. Fortunately, sigkill_pending() returning false also means we can send a private SIGKILL to the tracee and wake it up, but I guess this won't be obvious to an admin.
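
For readers who want to reason about the wait loop without a kernel tree, here is a rough user-space model of it. It is only a sketch under stated assumptions, not the kernel code: `struct model_task`, `model_ptrace_wait()`, the `MODEL_*` states, the `max_spins` bound (standing in for "spins forever"), and the choice of `-ESRCH`/`-EAGAIN` as return values are all made up for illustration. The model only shows which conditions let the tracer leave the loop, with and without the SIGNAL_GROUP_EXIT check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
enum model_state {
    MODEL_RUNNING,
    MODEL_TRACED,
    MODEL_STOPPED,
    MODEL_UNINTERRUPTIBLE   /* "D" state, e.g. waiting for the coredump */
};

struct model_task {
    enum model_state state;
    bool exited;        /* models child->exit_state != 0 */
    bool group_exit;    /* models signal->flags & SIGNAL_GROUP_EXIT */
};

/*
 * Models the tracer's wait loop. Returns 0 once the tracee is stopped,
 * -ESRCH when the loop bails out, and -EAGAIN when max_spins is exhausted
 * (the model's stand-in for a loop that would never terminate).
 */
static int model_ptrace_wait(const struct model_task *tracer,
                             const struct model_task *child,
                             bool check_group_exit, int max_spins)
{
    assert(tracer != NULL && child != NULL);

    for (int spins = 0; spins < max_spins; spins++) {
        if (child->state == MODEL_TRACED || child->state == MODEL_STOPPED)
            return 0;
        if (child->exited)
            return -ESRCH;
        /* The extra condition from the proposed check. */
        if (check_group_exit && tracer->group_exit)
            return -ESRCH;
        /* The real loop yields the CPU here; the model just retries. */
    }
    return -EAGAIN;
}
```

In this model, a tracee stuck in D state with the tracer's group-exit flag set only leaves the loop when `check_group_exit` is true. With the flag clear (the tracer does not take part in the coredump) the loop still runs out of spins, which matches the remaining problem described above.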
This issue has been addressed in the following products:

  Red Hat Enterprise Linux 5

Via RHSA-2009:1193 https://rhn.redhat.com/errata/RHSA-2009-1193.html