Bug 452706
Summary: | kernel BUG at kernel/signal.c:369! (attempt to free tsk->signal twice) | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Vitaly Mayatskikh <vmayatsk> | ||||
Component: | kernel | Assignee: | Vitaly Mayatskikh <vmayatsk> | ||||
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 4.7 | CC: | anton, jan.kratochvil, jplans, lwang, qcai, roland, vgoyal | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Last Closed: | 2009-05-18 19:12:49 UTC | Type: | --- | ||||
Bug Blocks: | 461297, 466214 | ||||||
Description
Vitaly Mayatskikh
2008-06-24 15:23:53 UTC
Kernel failed on tests/kernel/errata/4.6.z/450865. On a dual Opteron machine it is reproducible within several seconds.

This is the already known issue with a double IRQ free in the ohci_hcd driver. I've tried it with a scratch build that includes the ohci_hcd patch (http://porkchop.devel.redhat.com/brewroot/scratch/vmayatsk/task_1377493/) and can confirm that the bug is fixed.

*** This bug has been marked as a duplicate of 443052 ***

I was wrong. The test case for the ohci_hcd driver bug only accelerates another problem. The bug can easily be hit by running both 450865 and 449361 in tests/kernel/errata/4.6.z. This seems to be the same problem as bug 228816 (https://bugzilla.redhat.com/show_bug.cgi?id=228816). There is a race between "init" and the original parent of the zombie: they both try to release the zombie. I got a vmcore where the original parent fails in de_thread()->release_task():

Pid: 7871, comm: exe
EIP: 0060:[<0212a6ab>] CPU: 0
EIP is at __exit_signal+0x16/0x120
EFLAGS: 00010046 Not tainted (2.6.9-67.0.20.ELhugemem)
EAX: 3dfdd0b0 EBX: 00000000 ECX: 0000004d EDX: 00000000
ESI: 3dfdd0b0 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: 0057e330 CR3: 003c8000 CR4: 000006f0
[<0212353c>] release_task+0x5c/0xfa
[<0216498b>] de_thread+0x511/0x63d
[<02164245>] flush_old_exec+0x17/0x24c
[<02164012>] kernel_read+0x31/0x3b
[<0217fbe5>] load_elf_binary+0x56f/0xd1b
[<02163bb1>] copy_strings+0x1ed/0x1f7
[<0217f676>] load_elf_binary+0x0/0xd1b
[<02164d5c>] search_binary_handler+0xcf/0x242
[<0216503c>] do_execve+0x16d/0x1fd
[<02104b09>] sys_execve+0x2b/0x8a

Process 7871 tries to release process 0x3dfdd0b0, which is already in state EXIT_DEAD. Roland?

This is not the same as bug 228816. RHEL4 does not have that code at all. No RHEL4 problem ever relates usefully to such a RHEL5 bug (utrace). There have been many fixes upstream in the multithreaded exec area since RHEL4. Sorry, I can't point you to exactly what you need to backport for this.

Yes, you are right. This is a race between de_thread() and do_wait().
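The race above, in which two code paths (de_thread() in the exec'ing process and do_wait() in init) can both try to release the same dead task, can be illustrated with a small user-space model. This is a hypothetical sketch under stated assumptions. It is not the attached patch, whose contents are not shown here, and it is not kernel code. `struct task_model`, `try_claim_zombie()`, `release_task_model()` and `simulate_reap_race()` are invented names, and POSIX threads plus C11 atomics stand in for kernel locking. The model shows one common way to make such a release happen exactly once: each reaper must atomically move the task from a zombie state to a dead state before it may free the signal structure. The loser of the race therefore backs off instead of freeing it a second time.

```c
/* Illustrative user-space model of the reaping race; NOT the RHEL4 patch.
 * All names below are hypothetical. Build with -pthread. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's EXIT_ZOMBIE / EXIT_DEAD states. */
enum { MODEL_EXIT_ZOMBIE = 1, MODEL_EXIT_DEAD = 2 };

/* Hypothetical stand-in for struct signal_struct. */
struct signal_model {
    int count;
};

/* Hypothetical stand-in for the parts of task_struct involved here. */
struct task_model {
    atomic_int exit_state;        /* models tsk->exit_state */
    struct signal_model *signal;  /* models tsk->signal */
    atomic_int releases;          /* number of completed releases */
};

/* Atomically move ZOMBIE -> DEAD. Returns nonzero only for the single
 * caller that performed the transition; every other caller must back off. */
static int try_claim_zombie(struct task_model *t)
{
    int expected = MODEL_EXIT_ZOMBIE;
    return atomic_compare_exchange_strong(&t->exit_state, &expected,
                                          MODEL_EXIT_DEAD);
}

/* Models the release path: frees the signal structure exactly once.
 * Seeing NULL here would mean a second release, i.e. the double free. */
static void release_task_model(struct task_model *t)
{
    struct signal_model *sig = t->signal;
    if (sig == NULL)
        abort();                  /* stands in for the kernel BUG() */
    t->signal = NULL;
    free(sig);
    atomic_fetch_add(&t->releases, 1);
}

/* Thread body shared by both hypothetical reapers (exec path, wait path). */
static void *reaper(void *arg)
{
    struct task_model *t = arg;
    if (try_claim_zombie(t))
        release_task_model(t);
    return NULL;
}

/* Races two reapers against one zombie `trials` times. Returns the number
 * of trials in which the task was released exactly once, or -1 on a
 * resource error. */
int simulate_reap_race(int trials)
{
    int ok = 0;
    for (int i = 0; i < trials; i++) {
        struct task_model t;
        atomic_init(&t.exit_state, MODEL_EXIT_ZOMBIE);
        atomic_init(&t.releases, 0);
        t.signal = malloc(sizeof *t.signal);
        if (t.signal == NULL)
            return -1;
        t.signal->count = 1;

        pthread_t a, b;
        if (pthread_create(&a, NULL, reaper, &t) != 0) {
            free(t.signal);
            return -1;
        }
        if (pthread_create(&b, NULL, reaper, &t) != 0) {
            pthread_join(a, NULL);  /* thread a alone claims and frees it */
            return -1;
        }
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* Whoever won, the task must end up DEAD with its signal freed. */
        assert(atomic_load(&t.exit_state) == MODEL_EXIT_DEAD);
        if (atomic_load(&t.releases) == 1 && t.signal == NULL)
            ok++;
    }
    return ok;
}
```

With the compare-and-swap in place, `simulate_reap_race(n)` should return `n`, because exactly one reaper wins each trial. Suppose the compare-and-swap were replaced by a separate read of the state followed by a write. Both threads could then observe the zombie state and both call `release_task_model()`. The second call would find `signal == NULL`, which is the model's counterpart of the double free of `tsk->signal` reported here.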
Created attachment 311383
proposed patch
Propose for 4.6.z and 4.7.z.

This bug is much easier to trigger than I expected. I have seen it happen randomly on almost all architectures while running my Tier1 tests for 4.6.z kernel errata testing, and it also blocked the remaining tests from running.

Tier1 test ID: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=25231

X86_64: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3610276

Kernel BUG at signal:369
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core loop dm_multipath button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata sd_mod scsi_mod
Pid: 5563, comm: exe Not tainted 2.6.9-67.0.22.ELlargesmp
RIP: 0010:[<ffffffff801417ce>] <ffffffff801417ce>{__exit_signal+29}
RSP: 0018:000001001ccd9c68 EFLAGS: 00010046
RAX: 0000010038dde890 RBX: 0000000000000000 RCX: 000000000000005c
RDX: 000001000000c000 RSI: ffffffff8050d380 RDI: 0000010038dde7f0
RBP: 0000010038dde7f0 R08: 0000000000000202 R09: 00000001801ad182
R10: 0000000000000000 R11: ffffffff801ad182 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 000001003da52540
FS: 0000000040a00960(005b) GS:ffffffff80506980(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a958722f9 CR3: 0000000000101000 CR4: 00000000000006e0
Process exe (pid: 5563, threadinfo 000001001ccd8000, task 000001003d0537f0)
Stack: 0000010038dde7f0 0000010038dde7f0 0000010038dde7f0 0000000000000000
0000000000000000 ffffffff80139611 000001000000c000 0000000000000010
0000010038dde7f0 000001002c90b880
Call Trace:<ffffffff80139611>{release_task+126} <ffffffff80184699>{flush_old_exec+1696}
<ffffffff8017a779>{vfs_read+248} <ffffffff801a408e>{load_elf_binary+0}
<ffffffff801a4670>{load_elf_binary+1506} <ffffffff8015d866>{generic_file_aio_read+48}
<ffffffff8017a655>{do_sync_read+178} <ffffffff801a408e>{load_elf_binary+0}
<ffffffff80185198>{search_binary_handler+210} <ffffffff801854cd>{do_execve+398}
<ffffffff80110276>{system_call+126} <ffffffff8010ee44>{sys_execve+52}
<ffffffff8011069a>{stub_execve+106}

Code: 0f 0b 4b 98 32 80 ff ff ff ff 71 01 8b 03 85 c0 75 0c 0f 0b
RIP <ffffffff801417ce>{__exit_signal+29} RSP <000001001ccd9c68>

PPC64: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3609994

kernel BUG in __exit_signal at kernel/signal.c:369!
cpu 0x2: Vector: 700 (Program Check) at [c000000002587810]
pc: c00000000006f5cc: .__exit_signal+0x3c/0x208
lr: c000000000062ae4: .release_task+0xc0/0x1d4
sp: c000000002587a90
msr: 8000000000021032
current = 0xc000000002581120
paca = 0xc0000000003fb800
pid = 1, comm = init
enter ? for help
2:mon>

IA64: http://rhts.redhat.com/testlogs/25231/92544/778429/3610279-test_log--kernel-errata-4.6.z-437788-EXTERNALWATCHDOG.log

kernel BUG at kernel/signal.c:369!
init[1]: bugcheck! 0 [1]
Modules linked in: nfsd exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat loop button ohci_hcd ehci_hcd e100 mii tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod

Pid: 1, CPU 1, comm: init
psr : 0000101008122010 ifs : 800000000000040b ip : [<a000000100094130>] Not tainted
ip is at __exit_signal+0xb0/0x5e0
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : a660159222969995
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000100094130 b6 : a0000001000727c0 b7 : a00000020020cbc0
f6 : 1003e0000000000001200 f7 : 1003e8080808080808081
f8 : 1003e00000000000023dc f9 : 1003e000000000e580000
f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
r1 : a0000001009dcb20 r2 : 0000000000000002 r3 : 0000000000280000
r8 : 0000000000000023 r9 : 0000000000000001 r10 : e000000001014b2c
r11 : 0000000000000003 r12 : e000000001427dd0 r13 : e000000001420000
r14 : 0000000000004000 r15 : a00000010076fbc0 r16 : a00000010076fbc8
r17 : e000004040e97de8 r18 : e000004040e9002c r19 : e000000001014b20
r20 : e000000001014ac0 r21 : 0000000000000003 r22 : 0000000000000002
r23 : e000004040e90040 r24 : e000000001015280 r25 : e000000001015278
r26 : e000000001015258 r27 : 0000000000000074 r28 : 0000000000000074
r29 : 0000000000000065 r30 : e000004040e90050 r31 : 00000000356f424c

Call Trace:
[<a000000100016e40>] show_stack+0x80/0xa0 sp=e000000001427940 bsp=e000000001421298
[<a000000100017750>] show_regs+0x890/0x8c0 sp=e000000001427b10 bsp=e000000001421250
[<a00000010003e9b0>] die+0x150/0x240 sp=e000000001427b30 bsp=e000000001421210
[<a00000010003eae0>] die_if_kernel+0x40/0x60 sp=e000000001427b30 bsp=e0000000014211d8
[<a00000010003ec80>] ia64_bad_break+0x180/0x600 sp=e000000001427b30 bsp=e0000000014211b0
[<a00000010000f600>] ia64_leave_kernel+0x0/0x260 sp=e000000001427c00 bsp=e0000000014211b0
[<a000000100094130>] __exit_signal+0xb0/0x5e0 sp=e000000001427dd0 bsp=e000000001421158
[<a00000010007b5d0>] release_task+0x110/0x340 sp=e000000001427dd0 bsp=e000000001421118
[<a0000001000817f0>] do_wait+0x1070/0x1620 sp=e000000001427dd0 bsp=e000000001421000
[<a000000100081f60>] sys_wait4+0x60/0x80 sp=e000000001427e30 bsp=e000000001420fa0
[<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20 sp=e000000001427e30 bsp=e000000001420fa0
[<a000000000010640>] 0xa000000000010640 sp=e000000001428000 bsp=e000000001420fa0

For the record, I have seen this on the latest 4.7.z (78.0.3.EL) kernels as well:

kernel BUG in __exit_signal at kernel/signal.c:377!
cpu 0x7: Vector: 700 (Program Check) at [c00000000ff83810]
pc: c00000000006fee8: .__exit_signal+0x3c/0x250
lr: c000000000063284: .release_task+0xc0/0x1d4
sp: c00000000ff83a90
msr: 8000000000021032
current = 0xc0000001e3f11120
paca = 0xc000000000408000
pid = 1, comm = init
enter ? for help
7:mon>

------------[ cut here ]------------
kernel BUG at kernel/signal.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: lp(U) nfsd exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6 parport_pc parport autofs4 sunrpc cpufreq_powersave loop button battery ac ohci_hcd ehci_hcd k8_edac edac_mc tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_svw libata sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c012b541>] Not tainted VLI
EFLAGS: 00010046 (2.6.9-78.0.3.ELsmp)
EIP is at __exit_signal+0x16/0x14f
eax: f71b8e30 ebx: 00000000 ecx: 00000000 edx: 00000000
esi: f71b8e30 edi: 00000000 ebp: 00000000 esp: c7871edc
ds: 007b es: 007b ss: 0068
Process init (pid: 1, threadinfo=c7871000 task=f7e31630)
Stack: f71b8e30 f71b8e30 00000000 00000000 c0123808 00000000 f71b8e30 bfe58dbc
00006781 c01255ca 00000002 00000000 00000000 00000000 f71b8edc f71b8e30
f7e31630 00000000 c0125c6a bfe58dbc 00000000 00040001 00006783 00000000
Call Trace:
[<c0123808>] release_task+0x5c/0xfa
[<c01255ca>] wait_task_zombie+0x585/0x59b
[<c0125c6a>] do_wait+0x185/0x449
[<c011e7f6>] default_wake_function+0x0/0xc
[<c011e7f6>] default_wake_function+0x0/0xc
[<c0125fc1>] sys_wait4+0x27/0x2a
[<c0125fd7>] sys_waitpid+0x13/0x17
[<c02e09d7>] syscall_call+0x7/0xb

Committed in 78.11.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

*** Bug 455179 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html