While working on bug 478419, Fujitsu stumbled across this kernel panic: [root@localhost test_ia64]# ./tstrun/run2.sh.changed starting strace 1... ...end of iteration 1 starting strace 2... ...end of iteration 2 starting strace 3... ...end of iteration 3 starting strace 4... ...end of iteration 4 starting strace 5... ...end of iteration 5 starting strace 6... kernel BUG at kernel/ptrace.c:1068! strace[12847]: bugcheck! 0 [1] Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api vfat fat dm_mirror dm_multipath scsi_dh button parport_pc lp parport joydev e100 mii sr_mod tg3 cdrom libphy sg dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage lpfc scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 12847, CPU 1, comm: strace psr : 00001010085a6010 ifs : 8000000000000918 ip : [<a000000100104e30>] Not tainted ip is at sys_ptrace+0x4d0/0x1120 unat: 0000000000000000 pfs : 0000000000000918 rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000055a559 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100104e30 b6 : a000000100011060 b7 : a0000001002de0a0 f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000 f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db r1 : a000000100c12950 r2 : a000000100a2ae38 r3 : a00000010095a1e0 r8 : 0000000000000027 r9 : a000000100a2ae68 r10 : a000000100a2ae68 r11 : 0000000000000000 r12 : e00000013914fe10 r13 : e000000139148000 r14 : a000000100a2ae38 r15 : 0000000000000000 r16 : a00000010095a1e8 r17 : e0000001254dfe18 r18 : 0000000000000000 r19 : 0000000000000000 r20 : a00000010085d280 r21 : a000000100a12fa8 r22 : a000000100a2ae40 r23 : a000000100a2ae40 r24 : a0000001007e1054 r25 : 0000000000000000 r26 : a0000001007e105c r27 : a0000001007e1040 r28 : a0000001007e0008 r29 : 0000005000000078 r30 : 0000000000000000 r31 : 0000000000000000 Call Trace: [<a000000100013ba0>] show_stack+0x40/0xa0 sp=e00000013914f9a0 bsp=e000000139149338 [<a0000001000144a0>] show_regs+0x840/0x880 sp=e00000013914fb70 bsp=e0000001391492e0 [<a000000100037c00>] die+0x1c0/0x2c0 sp=e00000013914fb70 bsp=e000000139149298 [<a000000100037d50>] die_if_kernel+0x50/0x80 sp=e00000013914fb90 bsp=e000000139149268 [<a000000100650310>] ia64_bad_break+0x270/0x4a0 sp=e00000013914fb90 bsp=e000000139149240 [<a00000010000c040>] __ia64_leave_kernel+0x0/0x280 sp=e00000013914fc40 bsp=e000000139149240 [<a000000100104e30>] sys_ptrace+0x4d0/0x1120 sp=e00000013914fe10 bsp=e000000139149178 [<a00000010000bdd0>] __ia64_trace_syscall+0xd0/0x110 sp=e00000013914fe30 bsp=e000000139149178 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e000000139150000 bsp=e000000139149178 <0>Kernel panic - not syncing: Fatal exception -------------------------------------------------------------- Reproducible on ia64, failed to reproduce on x86_64 and i386. Will attach the reproducer.
Created attachment 330763 [details] Reproducer for ia64 Recipe: on an ia64 machine: * run ./mk.sh to compile test_select.c to test_select binary * run ./run.sh * watch "starting strace N, pid P..." / "...end of iteration N" messages * after ~4000 iterations, machine freezes (tested twice on different machines) NB: tarball includes a new strace binary, not yet released in RHEL.
I saw it on: intel-s6e5132-01.rhts.bos.redhat.com hp-diablo-01.rhts.bos.redhat.com
Created attachment 331125 [details] Reproducer for ia64, this time with correct strace binary (old one had x86 one!)
Un-private-ing comment #3 for dchapman (on-site HP partner engineer). P.
Denys, what kernel rev was this seen on? P.
I was finally able to reproduce on 2.6.18-130, it took over 6000 iters, I have other systems running the test that are over 8000 iters so evidently this is a rather narrow race condition. I will investigate.
Easily reproduced on 2.6.18-128 (63 iters).
(In reply to comment #9) > Easily reproduced on 2.6.18-128 (63 iters). Interesting, which system are you using that hits this so quickly? I have not seen it in less than 6000 iters and in bug 478419 they mention it usually taking at least 4000 iters. It would be easier to experiment if I could find the magic system that hits this quickly.
Using itanium 2 with 32 cores. AFAICS, as more cpu's you have, as quicker it hits. I think it's coming form the following code: utrace.c:760 760 spin_lock(&utrace->lock); 761 if (unlikely(rcu_dereference(target->utrace) != utrace) 762 || unlikely(rcu_dereference(engine->ops) 763 == &dead_engine_ops)) { 764 /* 765 * By the time we got the utrace lock, 766 * it had been reaped or detached already. 767 */ 768 spin_unlock(&utrace->lock); 769 utrace = ERR_PTR(-ESRCH); 770 } And while getting the spinlock, something happens with the utrace. Will examine it further, if any updates appear - I'll write here. And, veeery strange, but the vmcore's bt shows that it's not exiting from ptrace:1068, but from ptrace:449...
(In reply to comment #11) > Using itanium 2 with 32 cores. AFAICS, as more cpu's you have, as quicker it > hits. That's what I expected also but when I ran on the 64 cpu system back at HP it took 35000 iterations, very odd. > > And, veeery strange, but the vmcore's bt shows that it's not exiting from > ptrace:1068, but from ptrace:449... Ok, not just me. I have had the vmcore point to other places than expected also. however each time it was pointing to a BUG_ON(ret != -ESRCH)
I added some printk's to ptrace.c to see what ret was set to (since it is obvioulsy not -ESRCH) I found that it is set to -114 or -EALREADY this comes from ptrace_update which calls utrace_set_flags. It appears -EALREADY is the "default" return value for utrace_set_flags. So... not sure why this is an error, and if it is an error I don't understand why it is something we would panic on. I see there are a few places where utrace_set_flags can do the return, adding some code to figure out which one we are hitting.
What appears to be happening is we are trying to do a PTRACE_SINGLESTEP just as the target process exits. in kernel/utrace.c in utrace_set_flags we have 981 int ret = -EALREADY; .... 1042 if (unlikely(target->exit_state)) { 1043 read_unlock(&tasklist_lock); 1044 spin_unlock(&utrace->lock); 1045 return ret; 1046 } So, if the target process has exited we return -EALREADY then in kernel/ptrace.c: 1066 ret = ptrace_update(child, state, flags, 1); 1067 if (ret) 1068 BUG_ON(ret != -ESRCH); (ptrace_update gets its return value from utrace_set_flags) So, my guess is this BUG_ON is invalid, unless there is some reason why this state really is invalid. It seems we should just change: 1067 if (ret) 1068 BUG_ON(ret != -ESRCH); with 1067 if (ret) 1068 return ret; or something along those lines.
until people who actually understand utrace comment on it, I propose replacing BUG_ON with WARN_ON(ret != ESRCH && ret != EALREADY), and making ret == EALREADY act the same as ret == ESRCH.
I have confirmed that this is not ia64 specific. I was able to reproduce this on an x86_64 system running kernel-2.6.18-128.el5 Kernel BUG at kernel/ptrace.c:1068 invalid opcode: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:00.0/class CPU 1 Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp sg floppy i2c_i801 i2c_core e1000 ide_cd parport_pc cdrom shpchp parport serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 23377, comm: strace Not tainted 2.6.18-128.el5 #1 RIP: 0010:[<ffffffff800bfa61>] [<ffffffff800bfa61>] ptrace_common+0x146/0x19c RSP: 0018:ffff810004a9bf08 EFLAGS: 00010283 RAX: 00000000ffffff8e RBX: 0000000000000018 RCX: 0000000000000000 RDX: 0000000000000410 RSI: 0000000000000028 RDI: ffffffff803c3e80 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: ffff81001ca400c0 R11: ffff81001cc76b20 R12: ffff81001a38b840 R13: ffff8100050b6fc0 R14: ffff8100035cb860 R15: 000000000e1e6920 FS: 00002b7991a49250(0000) GS:ffff81001f31a7c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002aaaf420d008 CR3: 000000001c2dc000 CR4: 00000000000006e0 Process strace (pid: 23377, threadinfo ffff810004a9a000, task ffff810003e700c0) Stack: 0000000000000001 0000000000000018 000000000e1e6910 0000000000000000 0000000000000000 ffffffff800c0570 0000000000000000 0000000000000018 000000000e1e6920 ffffffff8005d229 0000000000000000 ffff81001a38b840 Call Trace: [<ffffffff800c0570>] sys_ptrace+0xf0/0x1f7 [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 0f 0b 68 10 b6 29 80 c2 2c 04 eb 3f 48 f7 c1 80 ff ff ff bd RIP [<ffffffff800bfa61>] ptrace_common+0x146/0x19c RSP <ffff810004a9bf08> <0>Kernel panic - not syncing: Fatal exception
Also on i386, with -128.el5 kernel. http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=6730459 (please see the EXTERNALWATCHDOG.log) Last sub-test or RHTS test '/tools/strace/Regression/478419-threads-stop-when-strace--f-is-executed-on-multi-thread-process' caused that panic.
Any updates on this? This is an urgent bug, but I've not seen any patch posted to rhkernel. By the way, looking at kernel/ptrace.c, -EALREADY is handled in some places like this: if (unlikely(ret == -EALREADY)) // some are not wrapped with unlikely() ret = -ESRCH; I'm not familiar with the ptrace code, but I'd imagine we could use the same lines for this bug case as well? We might also need it for "case PTRACE_SETOPTIONS". In essence, this approach does the same thing as the Doug's idea unless there is a possibility ptrace_update() returns neither 0, -EALREADY nor -ESRCH.
(In reply to comment #11) > And, veeery strange, but the vmcore's bt shows that it's not exiting from > ptrace:1068, but from ptrace:449... Strange indeed as here -EALREADY is treated like -ESEARCH. Do you know which value was returned.
Updating PM score.
(In reply to comment #17) > I have confirmed that this is not ia64 specific. I was able to reproduce this > on an x86_64 system running kernel-2.6.18-128.el5 Did you use the same testcase, or something more specific?
(In reply to comment #24) > (In reply to comment #17) > > I have confirmed that this is not ia64 specific. I was able to reproduce this > > on an x86_64 system running kernel-2.6.18-128.el5 > > Did you use the same testcase, or something more specific? I used the test case attached here that has the x86_64 version of strace.
(In reply to comment #22) > (In reply to comment #11) > > And, veeery strange, but the vmcore's bt shows that it's not exiting from > > ptrace:1068, but from ptrace:449... > Strange indeed as here -EALREADY is treated like -ESEARCH. Do you know which > value was returned. Look at comment #15 where I explain this in detail including showing line numbers and values returned.
(In reply to comment #26) > (In reply to comment #22) > > (In reply to comment #11) > > > And, veeery strange, but the vmcore's bt shows that it's not exiting from > > > ptrace:1068, but from ptrace:449... > > Strange indeed as here -EALREADY is treated like -ESEARCH. Do you know which > > value was returned. > > Look at comment #15 where I explain this in detail including showing line > numbers and values returned. That explain the bug at ptrace:1068, not at ptrace:449 where the case -EALREADY is handled correctely.
(In reply to comment #25) > (In reply to comment #24) > > (In reply to comment #17) > > > I have confirmed that this is not ia64 specific. I was able to reproduce this > > > on an x86_64 system running kernel-2.6.18-128.el5 > > > > Did you use the same testcase, or something more specific? > > I used the test case attached here that has the x86_64 version of strace. How many cores did you use? Did you start several strace's in paralel? ...end of iteration 679334 And nothing :( on a 2.6.18-128 x86_64. And, btw, did you get SEGFAULT's on strace binary in strace? something like Feb 18 13:46:00 dell-pe1950-03 kernel: test_select[32024]: segfault at 0000000041bb4a08 rip 00000038414074f0 rsp 00007fffbfc36300 error 4 RIP is always the same, and the offset is always ending with a08 (didn't dig into it). Thank you!
(In reply to comment #28) > > How many cores did you use? Did you start several strace's in paralel? I just asked rhts to give me a random system and tried it so I don't know the details. The system was nec-em3. I don't recall if I did several test instances in parallel but I have found that reproduces the failure more quickly on ia64 so I may have. > > ...end of iteration 679334 > > And nothing :( on a 2.6.18-128 x86_64. > > And, btw, did you get SEGFAULT's on strace binary in strace? > something like > Feb 18 13:46:00 dell-pe1950-03 kernel: test_select[32024]: segfault at > 0000000041bb4a08 rip 00000038414074f0 rsp 00007fffbfc36300 error 4 > RIP is always the same, and the offset is always ending with a08 (didn't dig > into it). I don't recall seeing this.
Patch posted: http://post-office.corp.redhat.com/archives/rhkernel-list/2009-February/msg00476.html
Sorry for a long and somewhat stupid comment, but it's really hard to describe utrace workflow with my english level... The bug occurs only when the process is in EXIT_ZOMBIE exit_state. Current implementation of utrace doesn't properly handle it. Feb 20 20:13:25 dell-pe1950-03 kernel: 1) child->exit_state = 16, request = 24, data = 0 Feb 20 20:13:25 dell-pe1950-03 kernel: 2) child->exit_state = 16, request = 24, data = 0 Feb 20 20:13:25 dell-pe1950-03 kernel: 3) child->exit_state = 16, request = 24, data = 0 Feb 20 20:13:25 dell-pe1950-03 kernel: ZOMBIE detected. Feb 20 20:13:25 dell-pe1950-03 kernel: set_utrace_flags = 523624, old_utrace_flags = 40, set_utrace_flags &~ old_utrace_flags = 523584, target->exit_state = 16 Feb 20 20:13:25 dell-pe1950-03 kernel: BUG_ON 1068, ret=-114, request=24, EEE=2, exit_state before ptrace_induce_signal = 16, signal number = 0 Feb 21 20:13:25 dell-pe1950-03 kernel: Feb 21 20:13:25 dell-pe1950-03 kernel: 4) child->exit_state = 16, request = 24, data = 0 request = 24 = PTRACE_SYSCALL exit_state = 16 = EXIT_ZOMBIE (set_utrace_flags &~ old_state_flags) & _UTRACE_EVENT_DEATH == 1 where DEATH_EVENTS = (UTRACE_EVENT(DEATH) | UTRACE_EVENT(QUIESCE) = _UTRACE_EVENT_DEATH | _UTRACE_EVENT_QUIESCE Relevant parts of code: kernel/utrace.c 973 int EEE; 974 int 975 utrace_set_flags(struct task_struct *target, ... 994 if (unlikely(IS_ERR(utrace))){ 995 EEE = 3; 996 return PTR_ERR(utrace); 997 } ... 1004 if (target->exit_state 1005 && (((flags &~ old_flags) & (UTRACE_ACTION_QUIESCE 1006 | UTRACE_ACTION_NOREAP 1007 | DEATH_EVENTS)) 1008 || ((utrace->u.exit.flags & EXIT_FLAG_DEATH) 1009 && ((old_flags &~ flags) & DEATH_EVENTS)) 1010 || ((utrace->u.exit.flags & EXIT_FLAG_REAP) 1011 && ((old_flags &~ flags) & UTRACE_EVENT(REAP))))) { 1012 spin_unlock(&utrace->lock); 1013 EEE = 1; 1014 return ret; ... 1046 if (unlikely(target->exit_state)) { 1047 read_unlock(&tasklist_lock); 1048 spin_unlock(&utrace->lock); 1049 printk("set_utrace_flags = %lu, old_utrace_flags = %lu, set_utrace_flags &~ old_utrace_flags = %lu, target->exit_state = %ld\n",set_utrace_flags,old_utrace_flags,set_utrace_flags &~ old_utrace_flags,target->exit_state); 1050 EEE = 2; 1051 return ret; ... 1138 EEE = 4; 1139 return ret; 1140 } kernel/ptrace.c 995 996 extern int EEE; 997 998 static int 999 ptrace_common(long request, struct task_struct *child, ... 1050 if(pis_exit_state == EXIT_ZOMBIE){ 1051 printk("ZOMBIE detected.\n"); 1052 } ... 1071 if (ret){ 1072 printk("BUG_ON 1068, ret=%d, request=%ld, EEE=%d, exit_state before ptrace_induce_signal = %ld, signal number = %ld\n\n",ret,request,EEE,pis_exit_state,data); 1073 /* BUG_ON(ret != -ESRCH); */ 1074 } ... 1095 return ret; 1096 } 1097 1098 1099 asmlinkage long sys_ptrace(long request, long pid, long addr, long data) 1100 { 1101 struct task_struct *child = NULL; 1102 struct utrace_attached_engine *engine = NULL; 1103 struct ptrace_state *state = NULL; 1104 long ret, val; 1105 1106 pr_debug("%d sys_ptrace(%ld, %ld, %lx, %lx)\n", 1107 current->pid, request, pid, addr, data); 1108 1109 ret = ptrace_start(pid, request, &child, &engine, &state); 1110 if (ret != -EIO) 1111 goto out; 1112 if(child->exit_state){ 1113 printk("1) child->exit_state = %ld, request = %ld, data = %ld\n",child->exit_state, request, data); 1114 } 1115 val = 0; 1116 ret = arch_ptrace(&request, child, engine, addr, data, &val); 1117 if(child->exit_state){ 1118 printk("2) child->exit_state = %ld, request = %ld, data = %ld\n",child->exit_state, request, data); 1119 } ... 1128 switch (request) { 1129 default: 1130 if(child->exit_state){ 1131 printk("3) child->exit_state = %ld, request = %ld, data = %ld\n",child->exit_state, request, data); 1132 } 1133 ret = ptrace_common(request, child, engine, state, addr, data); 1134 if(child->exit_state){ 1135 printk("4) child->exit_state = %ld, request = %ld, data = %ld\n",child->exit_state, request, data); 1136 } 1137 break; There are also a lot of other small cases, when the process to which we attach is already EXIT_ZOMBIE or EXIT_DEAD, but they are handled smoothly. So, it happens only when, in sys_ptrace, the process we get is already in EXIT_ZOMBIE state and there is a PTRACE_SYSCALL request. From the ptrace_common, we already see that a bug was fixed against PTRACE_DETACH event. However, there is a possibility that we ptrace_update from PTRACE_SETOPTIONS and PTRACE_OLDSETOPTIONS, where also the bug will be triggered (ptrace_common -> ptrace_update -> utrace_update -> return -EALREADY on zombie). So, from how I understand the code, we should also change the BUG_ON from PTRACE_SETOPTIONS and PTRACE_OLDSETOPTIONS. The best option, from my point of view, would be to verify if the process is in EXIT_ZOMBIE state in sys_ptrace, and to call ptrace_update on EXIT_ZOMBIE only when there is PTRACE_KILL request. However, I don't know the code enough to say it for sure. Could somebody, who knows better the code, affirm it?
ptrace_update has no useful effect when the task is already dead (EXIT_ZOMBIE).
in kernel-2.6.18-133.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html