Description of problem: While running the kerneltier1 tests against 2.16.18-119.el5. The scrashme test induced a BUG_ON in the kernel. Version-Release number of selected component (if applicable): 2.16.18-119.el5 How reproducible: Unknown Steps to Reproduce: 1. This happened while running the scrashme test in RHTS Actual results: ------------[ cut here ]------------ kernel BUG at kernel/exit.c:1129! invalid opcode: 0000 [#1] SMP last sysfs file: /devices/pci0000:00/0000:00:10.0/0000:05:02.1/irq Modules linked in: testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport e1000 k8_edac sg i2c_nforce2 ide_cd edac_mc k8temp pcspkr cdrom hwmon serio_raw i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd CPU: 7 EIP: 0060:[<c04266cb>] Not tainted VLI EFLAGS: 00010246 (2.6.18-119.el5PAE #1) EIP is at do_wait+0x42b/0x97c eax: 00000010 ebx: f47c7550 ecx: 00000000 edx: 00000001 esi: bfff7714 edi: f25e0f88 ebp: 00000000 esp: f25e0f5c ds: 007b es: 007b ss: 0068 Process scrashme (pid: 18296, ti=f25e0000 task=ede31550 task.ti=f25e0000) Stack: 00000004 ffffffff ede31550 00000000 f47c760c 00000007 00000000 ede31550 c041e89d f7bf7c8c f7bf7c8c ffffffff bfff7714 bfff7688 f25e0000 c0426c43 00000000 00000000 c0426c59 ffffffff 00000000 00000000 00000000 c0404f17 Call Trace: [<c041e89d>] default_wake_function+0x0/0xc [<c0426c43>] sys_wait4+0x27/0x2a [<c0426c59>] sys_waitpid+0x13/0x17 [<c0404f17>] syscall_call+0x7/0xb ======================= Code: 00 87 83 90 00 00 00 83 f8 10 74 16 83 f8 20 0f 84 a0 03 00 00 0f 0b 66 04 e3 36 63 c0 e9 93 03 00 00 83 bb 98 00 00 00 ff 75 08 <0f> 0b 69 04 e3 36 63 c0 83 bb 4c 04 00 00 00 0f 84 f1 00 00 00 EIP: [<c04266cb>] do_wait+0x42b/0x97c SS:ESP 0068:f25e0f5c Additional info:
I know nothing about utrace, most probably I missed something... But I suspect check_dead_utrace() can race with do_wait(). do_wait()->eligible_child() ensures that p->exit_signal != -1, then wait_task_zombie() hits BUG_ON(p->exit_signal == -1). Isn't it possible that check_dead_utrace() does do_notify_parent(tsk) in between? It can set ->exit_signal = -1 if the parent ignores SIGCHLD. do_wait() checks tracehook_inhibit_wait_zombie(), but note that check_dead_utrace() does "tsk->utrace_flags = flags" first, this clears UTRACE_ACTION_NOREAP. And both do_wait() and check_dead_utrace() take tasklist for reading. Oleg.
Oh, I have no idea what the scrashme test does, but it looks like my wild guess was correct, the race do exists. I think it is trivial to write the "real" exploit, but since I am lazy... This script #!/usr/bin/stap global CDU probe kernel.function("check_dead_utrace") { CDU[pid()] = 0; } probe kernel.function("check_dead_utrace").return { delete CDU[pid()]; } probe kernel.function("do_notify_parent") { if (!(pid() in CDU)) next; tout = gettimeofday_ms() + 50; while (tout > gettimeofday_ms()) ; } inserts the 50ms delay into do_notify_parent() if it is called from check_dead_utrace(). Now, this program void *tfunc(void *arg) { for (;;) wait(0); return 0; } int main(void) { signal(SIGCHLD, SIG_IGN); pthread_t thr; if (pthread_create(&thr, NULL, tfunc, NULL)) perror("pthread_create"); if (!fork()) { if (ptrace(PTRACE_TRACEME, 0, NULL, NULL)) perror("ptrace"); return 0; } sleep(1); return 0; } crashes the kernel reliably after 3-4 invocations (triggers BUG_ON at kernel/utrace.c:345). Perhaps something like the patch below can help, but a) I am not sure write_lock(tasklist) under utrace->lock is safe, and b) I can't test it because I still can't understand how can I install the patched kernel on the machine powered by overcomplicated RedHat software ;) Will try tomorrow. Oleg. --- a/kernel/utrace.c +++ b/kernel/utrace.c @@ -269,13 +269,13 @@ check_dead_utrace(struct task_struct *ts * parent can't see it in EXIT_ZOMBIE momentarily and reap * it. If tsk was the group_leader, an exec by another * thread can release_task it despite our NOREAP. Holding - * tasklist_lock for reading excludes de_thread until we + * tasklist_lock for writing excludes de_thread until we * decide what to do. */ - read_lock(&tasklist_lock); + write_lock_irq(&tasklist_lock); if (tsk->exit_signal == -1) { /* Self-reaping thread. */ exit_state = xchg(&tsk->exit_state, EXIT_DEAD); - read_unlock(&tasklist_lock); + write_unlock_irq(&tasklist_lock); BUG_ON(exit_state != EXIT_ZOMBIE); exit_state = EXIT_DEAD; /* Reap it below. */ @@ -306,7 +306,7 @@ check_dead_utrace(struct task_struct *ts * group leader in an exec by another thread, * which will call release_task itself. */ - read_unlock(&tasklist_lock); + write_unlock_irq(&tasklist_lock); } /* @@ -345,7 +345,7 @@ check_dead_utrace(struct task_struct *ts BUG_ON(exit_state != EXIT_ZOMBIE); exit_state = EXIT_DEAD; /* Reap it below. */ } - read_unlock(&tasklist_lock); /* See comment above. */ + write_unlock_irq(&tasklist_lock); /* See comment above. */ } if (exit_state == EXIT_DEAD) /*
Oleg, scrashme - is a system call fuzzer. Written by Dave Jones. http://www.codemonkey.org.uk/projects/ But you mentioned something interesting in Comment #5 The original issue This was opened for was kernel BUG at kernel/exit.c:1129! not kernel BUG at kernel/utrace.c:345! But I do have another bug: https://bugzilla.redhat.com/show_bug.cgi?id=428693 That is for "kernel BUG at kernel/utrace.c:345!" Do you think these two bugs could be really the same thing?
Yes, I think these two bugs (and BUG 455025 btw) are the same thing, see Comment #2. Suppose we have two threads, A and B. A forks the child C and ptraces it. SIGCHLD is ignored. C exist, does check_dead_utrace(), clears ->utrace_flags B calls sys_wait4(), sees C with ->exit_signal == SIGCHLD, checks tracehook_inhibit_wait_zombie() == F, and calls wait_task_zombie. C continues, calls do_notify_parent(), it sets ->exit_signal = -1 B continues, xchg(->exit_state, EXIT_DEAD) == EXIT_ZOMBIE, and hits BUG_ON(->exit_signal == -1) at line 1129. I think the patch I sent is correct, but it introduces the locking order: before this patch if was possible to take utrace->lock under read_lock(tasklist_lock) (nobody does so afaics, otherwise the patch is wrong), after this patch utrace->lock under tasklist is forbidden.
OK, finally I tested the patch, seems to work. At least, I can't crash the kernel any longer with my test-case.
*** Bug 455025 has been marked as a duplicate of this bug. ***
I think the lock order change is probably OK, but I would like to be conservative and avoid opening any such potential can of worms if we can find a different fix. I take it from comment #7 that the only case is when ptrace_report_death returns UTRACE_ACTION_DETACH, and the subsequent check_dead_utrace gets to the /* Normal solo zombie. */ path. Simultaneously, wait_task_zombie is trying to reap the ptrace'd zombie natural child. But since it's a natural child and the tracing natural parent ignores SIGCHLD, it should never see this child for reaping. One possibility is to have the /* Normal solo zombie. */ path also xchg exit_state to EXIT_DEAD, before changing utrace_flags. This prevents any wait_task_zombie from examining the task. Then, if after do_notify_parent it should remain a zombie, we can xchg it back to EXIT_ZOMBIE.
Created attachment 323278 [details] untested patch
> I think the lock order change is probably OK, but I would like to be > conservative and avoid opening any such potential can of worms if we can find a > different fix. Me too ;) My first intent was just to remove the "bogus" BUG_ON()s, then I tried to "ping" the exiting thread with EXIT_DEAD, but failed. And, I can't prove this right now, but I think that upstream kernel relies on the fact that ->exit_signal is stable under tasklist. Hmm, see below. > One possibility is to have the /* Normal solo zombie. */ path also xchg > exit_state to EXIT_DEAD, before changing utrace_flags. This prevents any > wait_task_zombie from examining the task. Then, if after do_notify_parent it > should remain a zombie, we can xchg it back to EXIT_ZOMBIE. From the patch: if (exit_state == EXIT_ZOMBIE) { + read_lock(&tasklist_lock); do_notify_parent(tsk, tsk->exit_signal); - /* * If SIGCHLD was ignored, that set tsk->exit_signal = -1 * to tell us to reap it immediately. + * Otherwise, leave it as a zombie. */ - if (tsk->exit_signal == -1) { - exit_state = xchg(&tsk->exit_state, EXIT_DEAD); - BUG_ON(exit_state != EXIT_ZOMBIE); - exit_state = EXIT_DEAD; /* Reap it below. */ + if (tsk->exit_signal != -1) { + exit_state = xchg(&tsk->exit_state, EXIT_ZOMBIE); + BUG_ON(exit_state != EXIT_DEAD); + exit_state = EXIT_ZOMBIE; } - read_unlock(&tasklist_lock); /* See comment above. */ + read_unlock(&tasklist_lock); I suspect this adds another race. We send the notification under read_lock(tasklist) and tsk->state = EXIT_DEASD. What if the parent does not ignore SIGCHLD ? it recieves the signal and does do_wait() before we restore EXIT_ZOMBIE. In that case do_wait() returns ECHILD, not good. So, in some sense the patch I sent is more conservative... But yes! I agree very much with you, I hate this write_lock() too.
In the upstream kernel, do_wait_thread (do_wait) is the only place that examines ->exit_signal without write_lock_irq(&tasklist_lock). So all the others are actually excluded just by read_lock(&tasklist_lock) already. The same is true in the RHEL5 code. But the do_wait ones are what's biting us, so I think it is indeed a problem to be calling do_notify_parent with only read_lock in my patch. I too thought about just nixing the BUG_ON in wait_task_zombie. But when it hits (and some windows where it doesn't), it means that wait_task_zombie->release_task is racing with check_dead_utrace. So we'd then hit BUG_ON(exit_state != EXIT_ZOMBIE); in check_dead_utrace instead. If we kludged that up to call release_task only if it didn't lose the race, then we'd be safe from crashes, but still wind up with the wrong behavior because do_wait would return a PID that should have self-reaped because of SIGCHLD=SIG_IGN. What we really want is that tracehook_inhibit_wait_zombie remains true (no reap, no ECHILD) until do_notify_parent is called with write_lock_irq(&tasklist_lock) held (so that any wait is both woken up and ordered after the change). I'm trying to come up with a way to leave tsk->utrace_flags=UTRACE_ACTION_NOREAP when we unlock/rcu_utrace_free, and then use write_lock_irq in the exit_state==EXIT_ZOMBIE case at the end of check_dead_utrace. But clearing UTRACE_ACTION_NOREAP inside there has the same locking order issues, since we have to worry about a new utrace_attach to the zombie setting utrace_flags.
Well. Given that this bug is the 5.3 blocker, what should we do now? Should I send the "s/read_lock/write_lock_irq/" patch for now?
If you can't think of another approach satisfying the requirements described in comment #13, then we should try what you have.
It is not clear to me how can we preserve tracehook_inhibit_wait_zombie() until we finish with the zombie. and we need write_lock() anyway afaics. And I can't see how can we drop utrace->lock before we take tasklist. And if we take tasklist, we must take it for writing, unless I missed something we can't do read_lock() + read_unlock() + write_lock() + do_notify_parent() + ... I sent the patch which does s/read_lock/write_lock/: http://post-office.corp.redhat.com/archives/rhkernel-list/2008-November/msg00679.html
in kernel-2.6.18-125.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html