Bug 466774 - [RHEL5.3] kernel kernel BUG at kernel/exit.c:1129!
[RHEL5.3] kernel kernel BUG at kernel/exit.c:1129!
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
All Linux
high Severity high
: rc
: ---
Assigned To: Oleg Nesterov
Mike Gahagan
: 455025 (view as bug list)
Depends On:
  Show dependency treegraph
Reported: 2008-10-13 11:25 EDT by Jeff Burke
Modified: 2009-01-20 15:03 EST (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-01-20 15:03:05 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
untested patch (3.15 KB, patch)
2008-11-11 19:13 EST, Roland McGrath
no flags Details | Diff

  None (edit)
Description Jeff Burke 2008-10-13 11:25:15 EDT
Description of problem:
 While running the kerneltier1 tests against 2.16.18-119.el5. The scrashme test induced a BUG_ON in the kernel.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. This happened while running the scrashme test in RHTS
Actual results:

------------[ cut here ]------------

kernel BUG at kernel/exit.c:1129!

invalid opcode: 0000 [#1]


last sysfs file: /devices/pci0000:00/0000:00:10.0/0000:05:02.1/irq

Modules linked in: testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport e1000 k8_edac sg i2c_nforce2 ide_cd edac_mc k8temp pcspkr cdrom hwmon serio_raw i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

CPU:    7

EIP:    0060:[<c04266cb>]    Not tainted VLI

EFLAGS: 00010246   (2.6.18-119.el5PAE #1) 

EIP is at do_wait+0x42b/0x97c

eax: 00000010   ebx: f47c7550   ecx: 00000000   edx: 00000001

esi: bfff7714   edi: f25e0f88   ebp: 00000000   esp: f25e0f5c

ds: 007b   es: 007b   ss: 0068

Process scrashme (pid: 18296, ti=f25e0000 task=ede31550 task.ti=f25e0000)

Stack: 00000004 ffffffff ede31550 00000000 f47c760c 00000007 00000000 ede31550 

       c041e89d f7bf7c8c f7bf7c8c ffffffff bfff7714 bfff7688 f25e0000 c0426c43 

       00000000 00000000 c0426c59 ffffffff 00000000 00000000 00000000 c0404f17 

Call Trace:

 [<c041e89d>] default_wake_function+0x0/0xc

 [<c0426c43>] sys_wait4+0x27/0x2a

 [<c0426c59>] sys_waitpid+0x13/0x17

 [<c0404f17>] syscall_call+0x7/0xb


Code: 00 87 83 90 00 00 00 83 f8 10 74 16 83 f8 20 0f 84 a0 03 00 00 0f 0b 66 04 e3 36 63 c0 e9 93 03 00 00 83 bb 98 00 00 00 ff 75 08 <0f> 0b 69 04 e3 36 63 c0 83 bb 4c 04 00 00 00 0f 84 f1 00 00 00 

EIP: [<c04266cb>] do_wait+0x42b/0x97c SS:ESP 0068:f25e0f5c

Additional info:
Comment 2 Oleg Nesterov 2008-10-24 08:56:19 EDT
I know nothing about utrace, most probably I missed something...

But I suspect check_dead_utrace() can race with do_wait().

do_wait()->eligible_child() ensures that p->exit_signal != -1, then
wait_task_zombie() hits BUG_ON(p->exit_signal == -1).

Isn't it possible that check_dead_utrace() does do_notify_parent(tsk)
in between? It can set ->exit_signal = -1 if the parent ignores SIGCHLD.

do_wait() checks tracehook_inhibit_wait_zombie(), but note that
check_dead_utrace() does "tsk->utrace_flags = flags" first, this clears

And both do_wait() and check_dead_utrace() take tasklist for reading.

Comment 5 Oleg Nesterov 2008-10-29 15:58:35 EDT
Oh, I have no idea what the scrashme test does, but it looks like my
wild guess was correct, the race do exists. I think it is trivial to
write the "real" exploit, but since I am lazy...

This script


        global CDU

        probe kernel.function("check_dead_utrace")
                CDU[pid()] = 0;

        probe kernel.function("check_dead_utrace").return
                delete CDU[pid()];

        probe kernel.function("do_notify_parent")
                if (!(pid() in CDU)) next;

                tout = gettimeofday_ms() + 50;
                while (tout > gettimeofday_ms())

inserts the 50ms delay into do_notify_parent() if it is called from

Now, this program

        void *tfunc(void *arg)
                for (;;)
                return 0;

        int main(void)
                signal(SIGCHLD, SIG_IGN);

                pthread_t thr;
                if (pthread_create(&thr, NULL, tfunc, NULL))

                if (!fork()) {
                        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
                        return 0;


                return 0;

crashes the kernel reliably after 3-4 invocations (triggers BUG_ON at

Perhaps something like the patch below can help, but a) I am not
sure write_lock(tasklist) under utrace->lock is safe, and b) I can't
test it because I still can't understand how can I install the patched
kernel on the machine powered by overcomplicated RedHat software ;)

Will try tomorrow.


--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -269,13 +269,13 @@ check_dead_utrace(struct task_struct *ts
                 * parent can't see it in EXIT_ZOMBIE momentarily and reap
                 * it.  If tsk was the group_leader, an exec by another
                 * thread can release_task it despite our NOREAP.  Holding
-                * tasklist_lock for reading excludes de_thread until we
+                * tasklist_lock for writing excludes de_thread until we
                 * decide what to do.
-               read_lock(&tasklist_lock);
+               write_lock_irq(&tasklist_lock);
                if (tsk->exit_signal == -1) { /* Self-reaping thread.  */
                        exit_state = xchg(&tsk->exit_state, EXIT_DEAD);
-                       read_unlock(&tasklist_lock);
+                       write_unlock_irq(&tasklist_lock);
                        BUG_ON(exit_state != EXIT_ZOMBIE);
                        exit_state = EXIT_DEAD; /* Reap it below.  */
@@ -306,7 +306,7 @@ check_dead_utrace(struct task_struct *ts
                         * group leader in an exec by another thread,
                         * which will call release_task itself.
-                       read_unlock(&tasklist_lock);
+                       write_unlock_irq(&tasklist_lock);
@@ -345,7 +345,7 @@ check_dead_utrace(struct task_struct *ts
                        BUG_ON(exit_state != EXIT_ZOMBIE);
                        exit_state = EXIT_DEAD; /* Reap it below.  */
-               read_unlock(&tasklist_lock); /* See comment above.  */
+               write_unlock_irq(&tasklist_lock); /* See comment above.  */
        if (exit_state == EXIT_DEAD)
Comment 6 Jeff Burke 2008-10-29 16:10:20 EDT
 scrashme - is a system call fuzzer. Written by Dave Jones.

 But you mentioned something interesting in Comment #5
The original issue This was opened for was 
kernel BUG at kernel/exit.c:1129!
kernel BUG at kernel/utrace.c:345!

But I do have another bug:
That is for "kernel BUG at kernel/utrace.c:345!"
Do you think these two bugs could be really the same thing?
Comment 7 Oleg Nesterov 2008-10-30 10:49:07 EDT
Yes, I think these two bugs (and BUG 455025 btw) are the same thing,
see Comment #2.

Suppose we have two threads, A and B. A forks the child C and ptraces
it. SIGCHLD is ignored.

C exist, does check_dead_utrace(), clears ->utrace_flags

B calls sys_wait4(), sees C with ->exit_signal == SIGCHLD,
checks tracehook_inhibit_wait_zombie() == F, and calls

C continues, calls do_notify_parent(), it sets ->exit_signal = -1

B continues, xchg(->exit_state, EXIT_DEAD) == EXIT_ZOMBIE,
and hits BUG_ON(->exit_signal == -1) at line 1129.

I think the patch I sent is correct, but it introduces the locking
order: before this patch if was possible to take utrace->lock under read_lock(tasklist_lock) (nobody does so afaics, otherwise the patch
is wrong), after this patch utrace->lock under tasklist is forbidden.
Comment 8 Oleg Nesterov 2008-10-30 12:59:25 EDT
OK, finally I tested the patch, seems to work.

At least, I can't crash the kernel any longer
with my test-case.
Comment 9 Anton Arapov 2008-11-05 07:53:25 EST
*** Bug 455025 has been marked as a duplicate of this bug. ***
Comment 10 Roland McGrath 2008-11-11 19:12:05 EST
I think the lock order change is probably OK, but I would like to be conservative and avoid opening any such potential can of worms if we can find a different fix.

I take it from comment #7 that the only case is when ptrace_report_death returns UTRACE_ACTION_DETACH, and the subsequent check_dead_utrace gets to the /* Normal solo zombie.  */ path.  Simultaneously, wait_task_zombie is trying to reap the ptrace'd zombie natural child.  But since it's a natural child and the tracing natural parent ignores SIGCHLD, it should never see this child for reaping.

One possibility is to have the /* Normal solo zombie. */ path also xchg exit_state to EXIT_DEAD, before changing utrace_flags.  This prevents any wait_task_zombie from examining the task.  Then, if after do_notify_parent it should remain a zombie, we can xchg it back to EXIT_ZOMBIE.
Comment 11 Roland McGrath 2008-11-11 19:13:13 EST
Created attachment 323278 [details]
untested patch
Comment 12 Oleg Nesterov 2008-11-12 09:23:25 EST
> I think the lock order change is probably OK, but I would like to be
> conservative and avoid opening any such potential can of worms if we can find a
> different fix.

Me too ;)

My first intent was just to remove the "bogus" BUG_ON()s, then I tried
to "ping" the exiting thread with EXIT_DEAD, but failed.

And, I can't prove this right now, but I think that upstream kernel relies
on the fact that ->exit_signal is stable under tasklist. Hmm, see below.

> One possibility is to have the /* Normal solo zombie. */ path also xchg
> exit_state to EXIT_DEAD, before changing utrace_flags.  This prevents any
> wait_task_zombie from examining the task.  Then, if after do_notify_parent it
> should remain a zombie, we can xchg it back to EXIT_ZOMBIE.

From the patch:

                if (exit_state == EXIT_ZOMBIE) {
        +               read_lock(&tasklist_lock);
                        do_notify_parent(tsk, tsk->exit_signal);
                         * If SIGCHLD was ignored, that set tsk->exit_signal = -1
                         * to tell us to reap it immediately.
        +                * Otherwise, leave it as a zombie.
        -               if (tsk->exit_signal == -1) {
        -                       exit_state = xchg(&tsk->exit_state, EXIT_DEAD);
        -                       BUG_ON(exit_state != EXIT_ZOMBIE);
        -                       exit_state = EXIT_DEAD; /* Reap it below.  */
        +               if (tsk->exit_signal != -1) {
        +                       exit_state = xchg(&tsk->exit_state, EXIT_ZOMBIE);
        +                       BUG_ON(exit_state != EXIT_DEAD);
        +                       exit_state = EXIT_ZOMBIE;
        -               read_unlock(&tasklist_lock); /* See comment above.  */
        +               read_unlock(&tasklist_lock);

I suspect this adds another race. We send the notification under
read_lock(tasklist) and tsk->state = EXIT_DEASD.

What if the parent does not ignore SIGCHLD ? it recieves the signal and
does do_wait() before we restore EXIT_ZOMBIE. In that case do_wait()
returns ECHILD, not good.

So, in some sense the patch I sent is more conservative...

But yes! I agree very much with you, I hate this write_lock() too.
Comment 13 Roland McGrath 2008-11-13 22:39:30 EST
In the upstream kernel, do_wait_thread (do_wait) is the only place that examines ->exit_signal without write_lock_irq(&tasklist_lock).  So all the others are actually excluded just by read_lock(&tasklist_lock) already.  The same is true in the RHEL5 code.  But the do_wait ones are what's biting us, so I think it is indeed a problem to be calling do_notify_parent with only read_lock in my patch.

I too thought about just nixing the BUG_ON in wait_task_zombie.  But when it hits (and some windows where it doesn't), it means that wait_task_zombie->release_task is racing with check_dead_utrace.  So we'd then hit BUG_ON(exit_state != EXIT_ZOMBIE); in check_dead_utrace instead.  If we kludged that up to call release_task only if it didn't lose the race, then we'd be safe from crashes, but still wind up with the wrong behavior because do_wait would return a PID that should have self-reaped because of SIGCHLD=SIG_IGN.

What we really want is that tracehook_inhibit_wait_zombie remains true (no reap, no ECHILD) until do_notify_parent is called with write_lock_irq(&tasklist_lock) held (so that any wait is both woken up and ordered after the change).  I'm trying to come up with a way to leave tsk->utrace_flags=UTRACE_ACTION_NOREAP when we unlock/rcu_utrace_free, and then use write_lock_irq in the exit_state==EXIT_ZOMBIE case at the end of check_dead_utrace.  But clearing UTRACE_ACTION_NOREAP inside there has the same locking order issues, since we have to worry about a new utrace_attach to the zombie setting utrace_flags.
Comment 14 Oleg Nesterov 2008-11-18 10:17:25 EST
Well. Given that this bug is the 5.3 blocker, what should we do now?

Should I send the "s/read_lock/write_lock_irq/" patch for now?
Comment 15 Roland McGrath 2008-11-24 02:05:04 EST
If you can't think of another approach satisfying the requirements described in comment #13, then we should try what you have.
Comment 16 Oleg Nesterov 2008-11-24 12:35:29 EST
It is not clear to me how can we preserve tracehook_inhibit_wait_zombie()
until we finish with the zombie. and we need write_lock() anyway afaics.
And I can't see how can we drop utrace->lock before we take tasklist.
And if we take tasklist, we must take it for writing, unless I missed
something we can't do read_lock() + read_unlock() + write_lock() +
do_notify_parent() + ...

I sent the patch which does s/read_lock/write_lock/:
Comment 19 Don Zickus 2008-12-02 17:19:27 EST
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 22 errata-xmlrpc 2009-01-20 15:03:05 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.