Bug 466774 - [RHEL5.3] kernel kernel BUG at kernel/exit.c:1129!
Summary: [RHEL5.3] kernel kernel BUG at kernel/exit.c:1129!
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Oleg Nesterov
QA Contact: Mike Gahagan
URL: http://rhts.redhat.com/testlogs/32199...
Whiteboard:
: 455025 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2008-10-13 15:25 UTC by Jeff Burke
Modified: 2009-01-20 20:03 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:03:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
untested patch (3.15 KB, patch)
2008-11-12 00:13 UTC, Roland McGrath
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Jeff Burke 2008-10-13 15:25:15 UTC
Description of problem:
 While running the kerneltier1 tests against 2.16.18-119.el5. The scrashme test induced a BUG_ON in the kernel.

Version-Release number of selected component (if applicable):
2.16.18-119.el5

How reproducible:
Unknown

Steps to Reproduce:
1. This happened while running the scrashme test in RHTS
  
Actual results:

------------[ cut here ]------------

kernel BUG at kernel/exit.c:1129!

invalid opcode: 0000 [#1]

SMP 

last sysfs file: /devices/pci0000:00/0000:00:10.0/0000:05:02.1/irq

Modules linked in: testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport e1000 k8_edac sg i2c_nforce2 ide_cd edac_mc k8temp pcspkr cdrom hwmon serio_raw i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

CPU:    7

EIP:    0060:[<c04266cb>]    Not tainted VLI

EFLAGS: 00010246   (2.6.18-119.el5PAE #1) 

EIP is at do_wait+0x42b/0x97c

eax: 00000010   ebx: f47c7550   ecx: 00000000   edx: 00000001

esi: bfff7714   edi: f25e0f88   ebp: 00000000   esp: f25e0f5c

ds: 007b   es: 007b   ss: 0068

Process scrashme (pid: 18296, ti=f25e0000 task=ede31550 task.ti=f25e0000)

Stack: 00000004 ffffffff ede31550 00000000 f47c760c 00000007 00000000 ede31550 

       c041e89d f7bf7c8c f7bf7c8c ffffffff bfff7714 bfff7688 f25e0000 c0426c43 

       00000000 00000000 c0426c59 ffffffff 00000000 00000000 00000000 c0404f17 

Call Trace:

 [<c041e89d>] default_wake_function+0x0/0xc

 [<c0426c43>] sys_wait4+0x27/0x2a

 [<c0426c59>] sys_waitpid+0x13/0x17

 [<c0404f17>] syscall_call+0x7/0xb

 =======================

Code: 00 87 83 90 00 00 00 83 f8 10 74 16 83 f8 20 0f 84 a0 03 00 00 0f 0b 66 04 e3 36 63 c0 e9 93 03 00 00 83 bb 98 00 00 00 ff 75 08 <0f> 0b 69 04 e3 36 63 c0 83 bb 4c 04 00 00 00 0f 84 f1 00 00 00 

EIP: [<c04266cb>] do_wait+0x42b/0x97c SS:ESP 0068:f25e0f5c


Additional info:

Comment 2 Oleg Nesterov 2008-10-24 12:56:19 UTC
I know nothing about utrace, most probably I missed something...

But I suspect check_dead_utrace() can race with do_wait().

do_wait()->eligible_child() ensures that p->exit_signal != -1, then
wait_task_zombie() hits BUG_ON(p->exit_signal == -1).

Isn't it possible that check_dead_utrace() does do_notify_parent(tsk)
in between? It can set ->exit_signal = -1 if the parent ignores SIGCHLD.

do_wait() checks tracehook_inhibit_wait_zombie(), but note that
check_dead_utrace() does "tsk->utrace_flags = flags" first, this clears
UTRACE_ACTION_NOREAP.

And both do_wait() and check_dead_utrace() take tasklist for reading.

Oleg.

Comment 5 Oleg Nesterov 2008-10-29 19:58:35 UTC
Oh, I have no idea what the scrashme test does, but it looks like my
wild guess was correct, the race do exists. I think it is trivial to
write the "real" exploit, but since I am lazy...

This script

        #!/usr/bin/stap

        global CDU

        probe kernel.function("check_dead_utrace")
        {
                CDU[pid()] = 0;
        }

        probe kernel.function("check_dead_utrace").return
        {
                delete CDU[pid()];
        }

        probe kernel.function("do_notify_parent")
        {
                if (!(pid() in CDU)) next;

                tout = gettimeofday_ms() + 50;
                while (tout > gettimeofday_ms())
                        ;
        }

inserts the 50ms delay into do_notify_parent() if it is called from
check_dead_utrace().

Now, this program

        void *tfunc(void *arg)
        {
                for (;;)
                        wait(0);
                return 0;
        }

        int main(void)
        {
                signal(SIGCHLD, SIG_IGN);

                pthread_t thr;
                if (pthread_create(&thr, NULL, tfunc, NULL))
                        perror("pthread_create");

                if (!fork()) {
                        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
                                perror("ptrace");
                        return 0;
                }

                sleep(1);

                return 0;
        }

crashes the kernel reliably after 3-4 invocations (triggers BUG_ON at
kernel/utrace.c:345).

Perhaps something like the patch below can help, but a) I am not
sure write_lock(tasklist) under utrace->lock is safe, and b) I can't
test it because I still can't understand how can I install the patched
kernel on the machine powered by overcomplicated RedHat software ;)

Will try tomorrow.

Oleg.

--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -269,13 +269,13 @@ check_dead_utrace(struct task_struct *ts
                 * parent can't see it in EXIT_ZOMBIE momentarily and reap
                 * it.  If tsk was the group_leader, an exec by another
                 * thread can release_task it despite our NOREAP.  Holding
-                * tasklist_lock for reading excludes de_thread until we
+                * tasklist_lock for writing excludes de_thread until we
                 * decide what to do.
                 */
-               read_lock(&tasklist_lock);
+               write_lock_irq(&tasklist_lock);
                if (tsk->exit_signal == -1) { /* Self-reaping thread.  */
                        exit_state = xchg(&tsk->exit_state, EXIT_DEAD);
-                       read_unlock(&tasklist_lock);
+                       write_unlock_irq(&tasklist_lock);
 
                        BUG_ON(exit_state != EXIT_ZOMBIE);
                        exit_state = EXIT_DEAD; /* Reap it below.  */
@@ -306,7 +306,7 @@ check_dead_utrace(struct task_struct *ts
                         * group leader in an exec by another thread,
                         * which will call release_task itself.
                         */
-                       read_unlock(&tasklist_lock);
+                       write_unlock_irq(&tasklist_lock);
        }
 
        /*
@@ -345,7 +345,7 @@ check_dead_utrace(struct task_struct *ts
                        BUG_ON(exit_state != EXIT_ZOMBIE);
                        exit_state = EXIT_DEAD; /* Reap it below.  */
                }
-               read_unlock(&tasklist_lock); /* See comment above.  */
+               write_unlock_irq(&tasklist_lock); /* See comment above.  */
        }
        if (exit_state == EXIT_DEAD)
                /*

Comment 6 Jeff Burke 2008-10-29 20:10:20 UTC
Oleg,
 scrashme - is a system call fuzzer. Written by Dave Jones.
http://www.codemonkey.org.uk/projects/

 But you mentioned something interesting in Comment #5
The original issue This was opened for was 
kernel BUG at kernel/exit.c:1129!
not
kernel BUG at kernel/utrace.c:345!

But I do have another bug:
https://bugzilla.redhat.com/show_bug.cgi?id=428693
That is for "kernel BUG at kernel/utrace.c:345!"
Do you think these two bugs could be really the same thing?

Comment 7 Oleg Nesterov 2008-10-30 14:49:07 UTC
Yes, I think these two bugs (and BUG 455025 btw) are the same thing,
see Comment #2.

Suppose we have two threads, A and B. A forks the child C and ptraces
it. SIGCHLD is ignored.

C exist, does check_dead_utrace(), clears ->utrace_flags

B calls sys_wait4(), sees C with ->exit_signal == SIGCHLD,
checks tracehook_inhibit_wait_zombie() == F, and calls
wait_task_zombie.

C continues, calls do_notify_parent(), it sets ->exit_signal = -1

B continues, xchg(->exit_state, EXIT_DEAD) == EXIT_ZOMBIE,
and hits BUG_ON(->exit_signal == -1) at line 1129.


I think the patch I sent is correct, but it introduces the locking
order: before this patch if was possible to take utrace->lock under read_lock(tasklist_lock) (nobody does so afaics, otherwise the patch
is wrong), after this patch utrace->lock under tasklist is forbidden.

Comment 8 Oleg Nesterov 2008-10-30 16:59:25 UTC
OK, finally I tested the patch, seems to work.

At least, I can't crash the kernel any longer
with my test-case.

Comment 9 Anton Arapov 2008-11-05 12:53:25 UTC
*** Bug 455025 has been marked as a duplicate of this bug. ***

Comment 10 Roland McGrath 2008-11-12 00:12:05 UTC
I think the lock order change is probably OK, but I would like to be conservative and avoid opening any such potential can of worms if we can find a different fix.

I take it from comment #7 that the only case is when ptrace_report_death returns UTRACE_ACTION_DETACH, and the subsequent check_dead_utrace gets to the /* Normal solo zombie.  */ path.  Simultaneously, wait_task_zombie is trying to reap the ptrace'd zombie natural child.  But since it's a natural child and the tracing natural parent ignores SIGCHLD, it should never see this child for reaping.

One possibility is to have the /* Normal solo zombie. */ path also xchg exit_state to EXIT_DEAD, before changing utrace_flags.  This prevents any wait_task_zombie from examining the task.  Then, if after do_notify_parent it should remain a zombie, we can xchg it back to EXIT_ZOMBIE.

Comment 11 Roland McGrath 2008-11-12 00:13:13 UTC
Created attachment 323278 [details]
untested patch

Comment 12 Oleg Nesterov 2008-11-12 14:23:25 UTC
> I think the lock order change is probably OK, but I would like to be
> conservative and avoid opening any such potential can of worms if we can find a
> different fix.

Me too ;)

My first intent was just to remove the "bogus" BUG_ON()s, then I tried
to "ping" the exiting thread with EXIT_DEAD, but failed.

And, I can't prove this right now, but I think that upstream kernel relies
on the fact that ->exit_signal is stable under tasklist. Hmm, see below.

> One possibility is to have the /* Normal solo zombie. */ path also xchg
> exit_state to EXIT_DEAD, before changing utrace_flags.  This prevents any
> wait_task_zombie from examining the task.  Then, if after do_notify_parent it
> should remain a zombie, we can xchg it back to EXIT_ZOMBIE.

From the patch:

                if (exit_state == EXIT_ZOMBIE) {
        +               read_lock(&tasklist_lock);
                        do_notify_parent(tsk, tsk->exit_signal);
        -
                        /*
                         * If SIGCHLD was ignored, that set tsk->exit_signal = -1
                         * to tell us to reap it immediately.
        +                * Otherwise, leave it as a zombie.
                         */
        -               if (tsk->exit_signal == -1) {
        -                       exit_state = xchg(&tsk->exit_state, EXIT_DEAD);
        -                       BUG_ON(exit_state != EXIT_ZOMBIE);
        -                       exit_state = EXIT_DEAD; /* Reap it below.  */
        +               if (tsk->exit_signal != -1) {
        +                       exit_state = xchg(&tsk->exit_state, EXIT_ZOMBIE);
        +                       BUG_ON(exit_state != EXIT_DEAD);
        +                       exit_state = EXIT_ZOMBIE;
                        }
        -               read_unlock(&tasklist_lock); /* See comment above.  */
        +               read_unlock(&tasklist_lock);

I suspect this adds another race. We send the notification under
read_lock(tasklist) and tsk->state = EXIT_DEASD.

What if the parent does not ignore SIGCHLD ? it recieves the signal and
does do_wait() before we restore EXIT_ZOMBIE. In that case do_wait()
returns ECHILD, not good.

So, in some sense the patch I sent is more conservative...

But yes! I agree very much with you, I hate this write_lock() too.

Comment 13 Roland McGrath 2008-11-14 03:39:30 UTC
In the upstream kernel, do_wait_thread (do_wait) is the only place that examines ->exit_signal without write_lock_irq(&tasklist_lock).  So all the others are actually excluded just by read_lock(&tasklist_lock) already.  The same is true in the RHEL5 code.  But the do_wait ones are what's biting us, so I think it is indeed a problem to be calling do_notify_parent with only read_lock in my patch.

I too thought about just nixing the BUG_ON in wait_task_zombie.  But when it hits (and some windows where it doesn't), it means that wait_task_zombie->release_task is racing with check_dead_utrace.  So we'd then hit BUG_ON(exit_state != EXIT_ZOMBIE); in check_dead_utrace instead.  If we kludged that up to call release_task only if it didn't lose the race, then we'd be safe from crashes, but still wind up with the wrong behavior because do_wait would return a PID that should have self-reaped because of SIGCHLD=SIG_IGN.

What we really want is that tracehook_inhibit_wait_zombie remains true (no reap, no ECHILD) until do_notify_parent is called with write_lock_irq(&tasklist_lock) held (so that any wait is both woken up and ordered after the change).  I'm trying to come up with a way to leave tsk->utrace_flags=UTRACE_ACTION_NOREAP when we unlock/rcu_utrace_free, and then use write_lock_irq in the exit_state==EXIT_ZOMBIE case at the end of check_dead_utrace.  But clearing UTRACE_ACTION_NOREAP inside there has the same locking order issues, since we have to worry about a new utrace_attach to the zombie setting utrace_flags.

Comment 14 Oleg Nesterov 2008-11-18 15:17:25 UTC
Well. Given that this bug is the 5.3 blocker, what should we do now?

Should I send the "s/read_lock/write_lock_irq/" patch for now?

Comment 15 Roland McGrath 2008-11-24 07:05:04 UTC
If you can't think of another approach satisfying the requirements described in comment #13, then we should try what you have.

Comment 16 Oleg Nesterov 2008-11-24 17:35:29 UTC
It is not clear to me how can we preserve tracehook_inhibit_wait_zombie()
until we finish with the zombie. and we need write_lock() anyway afaics.
And I can't see how can we drop utrace->lock before we take tasklist.
And if we take tasklist, we must take it for writing, unless I missed
something we can't do read_lock() + read_unlock() + write_lock() +
do_notify_parent() + ...

I sent the patch which does s/read_lock/write_lock/:
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-November/msg00679.html

Comment 19 Don Zickus 2008-12-02 22:19:27 UTC
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 22 errata-xmlrpc 2009-01-20 20:03:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html


Note You need to log in before you can comment on or make changes to this bug.