Bug 452706 - kernel BUG at kernel/signal.c:369! (attempt to free tsk->signal twice)
Summary: kernel BUG at kernel/signal.c:369! (attempt to free tsk->signal twice)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Mayatskikh
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 455179
Depends On:
Blocks: 461297 466214
Reported: 2008-06-24 15:23 UTC by Vitaly Mayatskikh
Modified: 2009-05-18 19:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:12:49 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed patch (1.04 KB, patch)
2008-07-09 15:41 UTC, Vitaly Mayatskikh


Links
System: Red Hat Product Errata
ID: RHSA-2009:1024
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 14:57:26 UTC

Description Vitaly Mayatskikh 2008-06-24 15:23:53 UTC
A new issue was found during stress testing of the kernel:

Comment #39 from  Cai Qian (qcai)  on Tue Jun 24 04:14:16 -0400 2008

A kernel panic was found on an i386 machine (veritas2.rhts.bos.redhat.com):

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3321562

The vmcore can be found at
http://porkchop.devel.redhat.com/qa/qa/qcai/vmcores/

kernel BUG at kernel/signal.c:369!
invalid operand: 0000 [#1]
SMP 
Modules linked in: nfs nfsd exportfs lockd nfs_acl netconsole netdump md5 ipv6
parport_pc lp parport autofs4 sunrpc loop joydev button battery ac uhci_hcd
ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
ata_piix libata qla2200 qla2xxx scsi_transport_fc megaraid_mbox megaraid_mm
sd_mod scsi_mod
CPU:    2
EIP:    0060:[<0212a6ab>]    Not tainted VLI
EFLAGS: 00010046   (2.6.9-67.0.20.ELhugemem) 
EIP is at __exit_signal+0x16/0x120
eax: 7bf6c0b0   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 7bf6c0b0   edi: 00000000   ebp: 00000000   esp: 81fa7ec4
ds: 007b   es: 007b   ss: 0068
Process init (pid: 1, threadinfo=81fa7000 task=04461630)
Stack: 7bf6c0b0 0000303d 00000000 00000000 0212353c 7bf6c0b0 0000303d 00000000 
       00000000 02125051 04464630 00000000 00000005 8075c1a0 00000005 00000002 
       00000005 021ab0a1 00000000 7bf6c0b0 7bf6c15c 04461630 00000005 021255f2 
Call Trace:
 [<0212353c>] release_task+0x5c/0xfa
 [<02125051>] wait_task_zombie+0x40c/0x422
 [<021ab0a1>] task_has_perm+0x20/0x24
 [<021255f2>] do_wait+0x183/0x3b8
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<0216c358>] sys_select+0x434/0x43e
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<021258ba>] sys_wait4+0x27/0x2a
 [<021258d0>] sys_waitpid+0x13/0x17
Code: 51 1a 00 89 d8 e8 c7 ff ff ff 5b b8 00 fb 38 02 e9 09 52 1a 00 55 57 56 89
c6 53 8b 98 ec 04 00 00 8b a8 f0 04 00 00 85 db 75 08 <0f> 0b 71 01 23 27 2e 02
8b 03 85 c0 75 08 0f 0b 73 01 23 27 2e 

Pid: 1, comm:                 init
EIP: 0060:[<0212a6ab>] CPU: 0
EIP is at __exit_signal+0x16/0x120
 EFLAGS: 00010046    Not tainted  (2.6.9-67.0.20.ELhugemem)
EAX: 7bf6c0b0 EBX: 00000000 ECX: 00000000 EDX: 00000001
ESI: 7bf6c0b0 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: 00975500 CR3: 003c8000 CR4: 000006f0
 [<0212353c>] release_task+0x5c/0xfa
 [<02125051>] wait_task_zombie+0x40c/0x422
 [<021ab0a1>] task_has_perm+0x20/0x24
 [<021255f2>] do_wait+0x183/0x3b8
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<0216c358>] sys_select+0x434/0x43e
 [<0211e8b0>] default_wake_function+0x0/0xc
 [<021258ba>] sys_wait4+0x27/0x2a
 [<021258d0>] sys_waitpid+0x13/0x17

                                               sibling
  task             PC      pid father child younger older

So __exit_signal() was called twice, and the second call triggered BUG(). I
think this upstream commit is related to the problem (and possibly some other
commits as well):

commit 821c7de7194e77afee1a69d50830a329a6d9af9f
Author: Oleg Nesterov <oleg>
Date:   Sun Mar 2 21:44:44 2008 +0300

    exit_notify: fix kill_orphaned_pgrp() usage with mt exit
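
To illustrate why a second __exit_signal() call on the same task trips the check, here is a minimal user-space sketch. It is not the kernel code: model_task, model_signal, model_exit_signal() and model_task_new() are hypothetical names invented for this illustration, and the sketch returns -1 at the point where the kernel would hit BUG(). It assumes, as the bug title suggests, that the first release detaches tsk->signal, so a second caller finds it already gone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct model_signal { int count; };
struct model_task   { struct model_signal *signal; };

/* Returns 0 on a normal release, -1 where the kernel would hit BUG(). */
int model_exit_signal(struct model_task *tsk)
{
    struct model_signal *sig = tsk->signal;

    if (sig == NULL)
        return -1;          /* second release: tsk->signal is already gone */

    tsk->signal = NULL;     /* detach, so a later caller sees NULL */
    if (--sig->count == 0)
        free(sig);          /* last reference dropped */
    return 0;
}

/* Builds a task holding a single reference on its signal struct. */
struct model_task model_task_new(void)
{
    struct model_task t;

    t.signal = malloc(sizeof(*t.signal));
    assert(t.signal != NULL);
    t.signal->count = 1;
    return t;
}
```

Calling model_exit_signal() twice on a task from model_task_new() returns 0 the first time and -1 the second time, which corresponds to the two release_task() callers seen in the traces.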

Comment 1 Vitaly Mayatskikh 2008-07-08 11:13:34 UTC
The kernel failed on tests/kernel/errata/4.6.z/450865. On a dual-Opteron machine it is
reproducible within several seconds. This is an already known issue with a double IRQ
free in the ohci_hcd driver. I tried it with the scratch build
http://porkchop.devel.redhat.com/brewroot/scratch/vmayatsk/task_1377493/ which
includes the patch for ohci_hcd, and I can confirm that the bug is fixed.

*** This bug has been marked as a duplicate of 443052 ***

Comment 2 Vitaly Mayatskikh 2008-07-08 13:22:00 UTC
I was wrong. The test case for the ohci_hcd driver bug only accelerates another problem.
The bug can easily be hit by running both 450865 and 449361 in
tests/kernel/errata/4.6.z.

Comment 3 Vitaly Mayatskikh 2008-07-08 15:56:13 UTC
This seems to be the same problem as
https://bugzilla.redhat.com/show_bug.cgi?id=228816

There is a race between "init" and the original parent of the zombie: both try to
release the zombie. I got a vmcore where the original parent fails in
de_thread()->release_task():

Pid: 7871, comm:                  exe
EIP: 0060:[<0212a6ab>] CPU: 0
EIP is at __exit_signal+0x16/0x120
 EFLAGS: 00010046    Not tainted  (2.6.9-67.0.20.ELhugemem)
EAX: 3dfdd0b0 EBX: 00000000 ECX: 0000004d EDX: 00000000
ESI: 3dfdd0b0 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: 0057e330 CR3: 003c8000 CR4: 000006f0
 [<0212353c>] release_task+0x5c/0xfa
 [<0216498b>] de_thread+0x511/0x63d
 [<02164245>] flush_old_exec+0x17/0x24c
 [<02164012>] kernel_read+0x31/0x3b
 [<0217fbe5>] load_elf_binary+0x56f/0xd1b
 [<02163bb1>] copy_strings+0x1ed/0x1f7
 [<0217f676>] load_elf_binary+0x0/0xd1b
 [<02164d5c>] search_binary_handler+0xcf/0x242
 [<0216503c>] do_execve+0x16d/0x1fd
 [<02104b09>] sys_execve+0x2b/0x8a

7871 tries to release process 0x3dfdd0b0, which is in state EXIT_DEAD. Roland?

Comment 4 Roland McGrath 2008-07-08 20:53:10 UTC
This is not the same as bug 228816.  RHEL4 does not have that code at all.
No RHEL4 problem ever relates usefully to such a RHEL5 bug (utrace).

There have been many fixes upstream in the MT exec area since RHEL4.
Sorry, I can't point you to exactly what you need to backport for this.

Comment 5 Vitaly Mayatskikh 2008-07-09 15:07:52 UTC
Yes, you are right. This is a race between de_thread() and do_wait().
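
One common way to close this kind of race is to let only one of the competing paths claim the zombie before releasing it. The sketch below shows that idea in user-space C11. It is not the attached patch (whose contents are not shown in this report) and not kernel code: model_child, model_child_init(), model_try_release() and the MODEL_EXIT_* values are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical state values; the real kernel constants differ. */
enum { MODEL_EXIT_ZOMBIE = 1, MODEL_EXIT_DEAD = 2 };

struct model_child {
    atomic_int exit_state;  /* MODEL_EXIT_ZOMBIE until a reaper claims it */
    int releases;           /* how many times the release path actually ran */
};

void model_child_init(struct model_child *c)
{
    atomic_init(&c->exit_state, MODEL_EXIT_ZOMBIE);
    c->releases = 0;
}

/*
 * Only the caller that moves the state from ZOMBIE to DEAD may release.
 * Returns 1 if this caller released the child, 0 if someone else had
 * already claimed it.
 */
int model_try_release(struct model_child *c)
{
    int expected = MODEL_EXIT_ZOMBIE;

    if (!atomic_compare_exchange_strong(&c->exit_state, &expected,
                                        MODEL_EXIT_DEAD)) {
        /* Lost the claim: the state must already be DEAD. */
        assert(expected == MODEL_EXIT_DEAD);
        return 0;
    }
    c->releases++;          /* stands in for release_task() */
    return 1;
}
```

With this scheme, whichever of the two paths (the do_wait() side or the de_thread() side in this model) calls model_try_release() second gets 0 back and skips the release, so the release path runs exactly once.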

Comment 6 Vitaly Mayatskikh 2008-07-09 15:41:57 UTC
Created attachment 311383
proposed patch

Comment 7 Qian Cai 2008-07-14 10:40:21 UTC
Proposing for 4.6.z and 4.7.z. This bug is much easier to trigger than I expected.
I have seen it happen randomly on almost all arches while running my Tier1
tests for 4.6.z kernel errata testing. It also blocks the remaining tests from running.

Tier1 test ID,
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=25231

X86_64,
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3610276
Kernel BUG at signal:369
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6
parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core loop dm_multipath button battery ac uhci_hcd ehci_hcd hw_random tg3
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata sd_mod scsi_mod
Pid: 5563, comm: exe Not tainted 2.6.9-67.0.22.ELlargesmp
RIP: 0010:[<ffffffff801417ce>] <ffffffff801417ce>{__exit_signal+29}
RSP: 0018:000001001ccd9c68  EFLAGS: 00010046
RAX: 0000010038dde890 RBX: 0000000000000000 RCX: 000000000000005c
RDX: 000001000000c000 RSI: ffffffff8050d380 RDI: 0000010038dde7f0
RBP: 0000010038dde7f0 R08: 0000000000000202 R09: 00000001801ad182
R10: 0000000000000000 R11: ffffffff801ad182 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 000001003da52540
FS:  0000000040a00960(005b) GS:ffffffff80506980(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a958722f9 CR3: 0000000000101000 CR4: 00000000000006e0
Process exe (pid: 5563, threadinfo 000001001ccd8000, task 000001003d0537f0)
Stack: 0000010038dde7f0 0000010038dde7f0 0000010038dde7f0 0000000000000000
       0000000000000000 ffffffff80139611 000001000000c000 0000000000000010
       0000010038dde7f0 000001002c90b880
Call Trace:<ffffffff80139611>{release_task+126}
<ffffffff80184699>{flush_old_exec+1696}
       <ffffffff8017a779>{vfs_read+248} <ffffffff801a408e>{load_elf_binary+0}
       <ffffffff801a4670>{load_elf_binary+1506}
<ffffffff8015d866>{generic_file_aio_read+48}
       <ffffffff8017a655>{do_sync_read+178} <ffffffff801a408e>{load_elf_binary+0}
       <ffffffff80185198>{search_binary_handler+210}
<ffffffff801854cd>{do_execve+398}
       <ffffffff80110276>{system_call+126} <ffffffff8010ee44>{sys_execve+52}
       <ffffffff8011069a>{stub_execve+106}

Code: 0f 0b 4b 98 32 80 ff ff ff ff 71 01 8b 03 85 c0 75 0c 0f 0b
RIP <ffffffff801417ce>{__exit_signal+29} RSP <000001001ccd9c68>


PPC64,
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3609994
kernel BUG in __exit_signal at kernel/signal.c:369!
cpu 0x2: Vector: 700 (Program Check) at [c000000002587810]
    pc: c00000000006f5cc: .__exit_signal+0x3c/0x208
    lr: c000000000062ae4: .release_task+0xc0/0x1d4
    sp: c000000002587a90
   msr: 8000000000021032
  current = 0xc000000002581120
  paca    = 0xc0000000003fb800
    pid   = 1, comm = init
enter ? for help
2:mon>


IA64,
http://rhts.redhat.com/testlogs/25231/92544/778429/3610279-test_log--kernel-errata-4.6.z-437788-EXTERNALWATCHDOG.log
kernel BUG at kernel/signal.c:369!

init[1]: bugcheck! 0 [1]
Modules linked in: nfsd exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6
parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat loop
button ohci_hcd ehci_hcd e100 mii tg3 dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod

Pid: 1, CPU 1, comm:                 init
psr : 0000101008122010 ifs : 800000000000040b ip  : [<a000000100094130>]    Not tainted
ip is at __exit_signal+0xb0/0x5e0
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : a660159222969995
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100094130 b6  : a0000001000727c0 b7  : a00000020020cbc0
f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
f8  : 1003e00000000000023dc f9  : 1003e000000000e580000
f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
r1  : a0000001009dcb20 r2  : 0000000000000002 r3  : 0000000000280000
r8  : 0000000000000023 r9  : 0000000000000001 r10 : e000000001014b2c
r11 : 0000000000000003 r12 : e000000001427dd0 r13 : e000000001420000
r14 : 0000000000004000 r15 : a00000010076fbc0 r16 : a00000010076fbc8
r17 : e000004040e97de8 r18 : e000004040e9002c r19 : e000000001014b20
r20 : e000000001014ac0 r21 : 0000000000000003 r22 : 0000000000000002
r23 : e000004040e90040 r24 : e000000001015280 r25 : e000000001015278
r26 : e000000001015258 r27 : 0000000000000074 r28 : 0000000000000074
r29 : 0000000000000065 r30 : e000004040e90050 r31 : 00000000356f424c

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e000000001427940 bsp=e000000001421298
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e000000001427b10 bsp=e000000001421250
 [<a00000010003e9b0>] die+0x150/0x240
                                sp=e000000001427b30 bsp=e000000001421210
 [<a00000010003eae0>] die_if_kernel+0x40/0x60
                                sp=e000000001427b30 bsp=e0000000014211d8
 [<a00000010003ec80>] ia64_bad_break+0x180/0x600
                                sp=e000000001427b30 bsp=e0000000014211b0
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e000000001427c00 bsp=e0000000014211b0
 [<a000000100094130>] __exit_signal+0xb0/0x5e0
                                sp=e000000001427dd0 bsp=e000000001421158
 [<a00000010007b5d0>] release_task+0x110/0x340
                                sp=e000000001427dd0 bsp=e000000001421118
 [<a0000001000817f0>] do_wait+0x1070/0x1620
                                sp=e000000001427dd0 bsp=e000000001421000
 [<a000000100081f60>] sys_wait4+0x60/0x80
                                sp=e000000001427e30 bsp=e000000001420fa0
 [<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20
                                sp=e000000001427e30 bsp=e000000001420fa0
 [<a000000000010640>] 0xa000000000010640
                                sp=e000000001428000 bsp=e000000001420fa0


Comment 10 Qian Cai 2008-08-26 08:12:20 UTC
For the record, I have seen this on the latest 4.7.z (78.0.3.EL) kernels as well:

kernel BUG in __exit_signal at kernel/signal.c:377!
cpu 0x7: Vector: 700 (Program Check) at [c00000000ff83810]
    pc: c00000000006fee8: .__exit_signal+0x3c/0x250
    lr: c000000000063284: .release_task+0xc0/0x1d4
    sp: c00000000ff83a90
   msr: 8000000000021032
  current = 0xc0000001e3f11120
  paca    = 0xc000000000408000
    pid   = 1, comm = init
enter ? for help
7:mon>


------------[ cut here ]------------
kernel BUG at kernel/signal.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: lp(U) nfsd exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6 parport_pc parport autofs4 sunrpc cpufreq_powersave loop button battery ac ohci_hcd ehci_hcd k8_edac edac_mc tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_svw libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c012b541>]    Not tainted VLI
EFLAGS: 00010046   (2.6.9-78.0.3.ELsmp)
EIP is at __exit_signal+0x16/0x14f
eax: f71b8e30   ebx: 00000000   ecx: 00000000   edx: 00000000
esi: f71b8e30   edi: 00000000   ebp: 00000000   esp: c7871edc
ds: 007b   es: 007b   ss: 0068
Process init (pid: 1, threadinfo=c7871000 task=f7e31630)
Stack: f71b8e30 f71b8e30 00000000 00000000 c0123808 00000000 f71b8e30 bfe58dbc
       00006781 c01255ca 00000002 00000000 00000000 00000000 f71b8edc f71b8e30
       f7e31630 00000000 c0125c6a bfe58dbc 00000000 00040001 00006783 00000000
Call Trace:
 [<c0123808>] release_task+0x5c/0xfa
 [<c01255ca>] wait_task_zombie+0x585/0x59b
 [<c0125c6a>] do_wait+0x185/0x449
 [<c011e7f6>] default_wake_function+0x0/0xc
 [<c011e7f6>] default_wake_function+0x0/0xc
 [<c0125fc1>] sys_wait4+0x27/0x2a
 [<c0125fd7>] sys_waitpid+0x13/0x17
 [<c02e09d7>] syscall_call+0x7/0xb

Comment 12 Vivek Goyal 2008-09-25 13:18:46 UTC
Committed in 78.11.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 14 Jerome Marchand 2008-11-26 15:27:19 UTC
*** Bug 455179 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2009-05-18 19:12:49 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

