Description of problem: Where performing a simple gdb test on ppc, the gdb session hangs. When trying to kill the session with a kill -9 <gdb pid>, the machine panics. The reproducer is simple: #>gdb /bin/cat -x gdb-commands where gdb-commands is 'run /etc/aliases' When connecting to the hung gdb session with another gdb -p <gdb pid> command, the backtrace looks (gdb) bt #0 0x00000080b38926ec in __waitpid_nocancel () from /lib64/libc.so.6 #1 0x00000000101ed758 in .inf_ptrace_trad_target () #2 0x000000001008e4bc in .linux_nat_add_target () #3 0x0000000010161668 in .target_mourn_inferior () #4 0x000000001012f98c in .handle_inferior_event () #5 0x00000000101315dc in .wait_for_inferior () #6 0x00000000101ee244 in .startup_inferior () #7 0x00000000101ed854 in .inf_ptrace_trad_target () #8 0x00000000101ee650 in .fork_inferior () #9 0x00000000101ed7c8 in .inf_ptrace_trad_target () #10 0x000000001015d4b8 in .find_default_create_inferior () #11 0x000000001012ba8c in .get_inferior_args () #12 0x00000000100b2b50 in .scan_expression_with_cleanup () #13 0x00000000100b2e10 in .cmd_func () #14 0x000000001006c994 in .execute_command () #15 0x0000000010140db4 in .push_prompt () #16 0x0000000010142274 in .display_gdb_prompt () #17 0x00000000103a8a60 in .rl_callback_read_char () #18 0x0000000010141048 in .push_prompt () #19 0x00000000101419e4 in .stdin_event_handler () #20 0x00000000101406c4 in .gdb_do_one_event () #21 0x000000001013f534 in .delete_async_signal_handler () #22 0x00000000101401ec in .gdb_do_one_event () ---Type <return> to continue, or q <return> to quit--- #23 0x000000001013bb7c in .catch_errors () #24 0x00000000100c8a84 in ._initialize_tui_interp () #25 0x000000001013c4fc in .current_interp_command_loop () #26 0x00000000100628c8 in .gdb_main () #27 0x000000001013bb7c in .catch_errors () #28 0x0000000010063304 in .gdb_main () #29 0x000000001013bb7c in .catch_errors () #30 0x0000000010062884 in .gdb_main () #31 0x000000001006282c in .main () (gdb) quit under the mon boot shell (after the panic) the back trace looks like [root@squad9-lp1 ~]# Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6b8f Faulting instruction address: 0xc0000000001c7194 cpu 0x1: Vector: 300 (Data Access) at [c000000002d43780] pc: c0000000001c7194: ._raw_spin_lock+0x30/0x184 lr: c0000000003652e0: ._spin_lock+0x10/0x24 sp: c000000002d43a00 msr: 8000000000009032 dar: 6b6b6b6b6b6b6b8f dsisr: 40000000 current = 0xc00000000267b810 paca = 0xc0000000004caf80 pid = 3056, comm = gdb enter ? for help 1:mon> 1:mon> 1:mon> 1:mon> t [c000000002d43aa0] c0000000003652e0 ._spin_lock+0x10/0x24 [c000000002d43b10] c0000000000b39e4 .get_utrace_lock_attached+0x58/0xb4 [c000000002d43bb0] c0000000000b5140 .utrace_detach+0x34/0x128 [c000000002d43c50] c0000000000b8ea0 .ptrace_exit+0x84/0x1ac [c000000002d43cf0] c00000000007083c .do_exit+0x148/0xa50 [c000000002d43da0] c0000000000711ec .sys_exit_group+0x0/0x8 [c000000002d43e30] c0000000000086a4 syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 00000080b3893330 SP (ffffff6e590) is in userspace 1:mon> Version-Release number of selected component (if applicable): this was reproduced with the kernel-2.6.18-123.el5 kernel, but I believe kernel-2.6.18-122.el5 had the same issues. I don't know how far back this goes. How reproducible: always Steps to Reproduce: 1. gdb /bin/cat -x gdb-commands, where gdb-commands is 'run /etc/aliases' 2. (after seeing the hang), kill -9 <gdb pid> 3. box panics Actual results: hang, then panic Expected results: no hang, and no panic when killed Additional info:
Don (RH), Do you know if this is specific to POWER architecture? Have other architectures been tried?
Yes and they all succeed. This bug seems to have only popped up in the last couple of kernel releases and _only_ for ppc for some reason.
Does /bin/cat have debugging information? I can't reproduce this. It doesn't hang, nor as root or as a normal user. # cat gdb-commands run /etc/aliases # gdb /bin/cat -x gdb-commands GNU gdb Fedora (6.8-25.el5) Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "ppc64-redhat-linux-gnu"... (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) ... <Print all of /etc/aliases> ... Program exited normally. (gdb)
hmm, what kernel version are you using? I have been able to duplicate this on three different ppc boxes even with a 5.0 install and a 2.6.18-123.el5 kernel. The hang is instant, zero gdb output. I even get an audit warning complaining about an illegal SIGTRAP. Also all my testing has been done as root. -Don
This is a box loaded with this kernel: Linux 2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008 ppc64 ppc64 ppc64 GNU/Linux (RHEL 5.3 Snapshot 2) This is a POWER6 box. Could you please provide more information on what machine this is reproducing? Thanks, Luis
I think the problem is in ptrace_report_exec(). No return value is provided after sending the signal. The bug was introduced by the fix of bz455060.
Created attachment 323872 [details] proposed patch Not tested, not even compiled. Roland, should we return something special if send_sig() returns an error?
Luis (IBM), This bug was reported against squad9-lp1 which is a POWER p5+ Model# 913152A.
> Unable to handle kernel paging request for data at > address 0x6b6b6b6b6b6b6b8f POISON_FREE. > [c000000002d43aa0] c0000000003652e ._spin_lock+0x10/0x24 > [c000000002d43b10] c0000000000b39e4 .get_utrace_lock_attached+0x58/0xb4 so it must be utrace->lock (but the offset 0x8f - 0x6b = 36 is strange). Looks like target->utrace != NULL but freed. > [c000000002d43bb0] c0000000000b5140 .utrace_detach+0x34/0x128 > [c000000002d43c50] c0000000000b8ea0 .ptrace_exit+0x84/0x1ac > [c000000002d43cf0] c00000000007083c .do_exit+0x148/0xa50 grep, grep... shot in the dark. utrace_first_engine: target->utrace = utrace; if (target->exit_state == EXIT_DEAD) { target->utrace = NULL; kmem_cache_free(utrace); // !!! without RCU !!! } Otoh, get_utrace_lock_attached: utrace = target->utrace; if (target->exit_state != EXIT_DEAD) { spin_lock(&utrace->lock); // BOOM !! } What guarantees we must notice EXIT_DEAD ?
> What guarantees we must notice EXIT_DEAD ? To clarify. Yes I see, utrace_first_engine() does mb(), and get_utrace_lock_attached() does rmb(). But these barriers can't help here. When the task sets its ->exit_state, this STORE is completely "synchronous". IOW, let's suppose we have: utrace_first_engine() stores target->utrace get_utrace_lock_attached() loads ->utrace get_utrace_lock_attached() checks !EXIT_DEAD target sets ->exit_state = EXIT_DEAD utrace_first_engine() checks EXIT_DEAD and frees utrace. get_utrace_lock_attached() uses the freed utrace But again, this is just a wild guess...
> When the task sets its ->exit_state, this STORE is completely > "synchronous". Damn, sorry for noise... I meant "asynchronous". And just in case, yes I understand, this smp_mb() is for utrace_release_task().
> Damn, sorry for noise... Damn! even more sorry. please ignore my messages. Yes, it looks like POISONE_FREE'ed memory. but this doesn't mean ->utrace was freed! target was? Don't have the sources to look at, and anyway I think I should just shut up :( Oh, and it is not possible that ptrace_exit() can race with utrace_first_engine()...
I was able to recreate the hang, but not the BUG(), on a POWER 5 machine with the -123 kernel. I had only tried sending SIGKILL to the gdb process, but not attaching it from another gdb instance. I don't know if that is a factor. Jerome's patch fixed the hang. Redhat, have you tried the patch, and if so, does it solve the problem for you?
I built Jerome's patch against a -124 kernel, and it fixed the problem on a POWER 5+ machine. Again this machine never saw the kernel BUG() although gdb did hang.
I have tried the patch and it solves the panic but not the hang. When I run #> gdb /bin/cat (gdb) run /etc/aliases Starting program: /bin/cat /etc/aliases <hangs> Which is how it behaved before. However when I do a 'kill -9 <gdbpid>' instead of panicing the box, it properly kills the process and everything is normal again. So the patch fixed one problem, but I still have the other one (the hang). The system I tested on is: [root@squad9-lp1 ~]# cat /proc/cpuinfo processor : 0 cpu : POWER5+ (gs) clock : 2096.901000MHz revision : 3.0 (pvr 003b 0300) processor : 1 cpu : POWER5+ (gs) clock : 2096.901000MHz revision : 3.0 (pvr 003b 0300) processor : 2 cpu : POWER5+ (gs) clock : 2096.901000MHz revision : 3.0 (pvr 003b 0300) processor : 3 cpu : POWER5+ (gs) clock : 2096.901000MHz revision : 3.0 (pvr 003b 0300) timebase : 512052000 platform : pSeries machine : CHRP IBM,9131-52A [root@squad9-lp1 ~]# Performing a 'echo t > /proc/sysrq-trigger' during the hang produced this <snip> ... bash S 000000000fee9118 6144 2911 2909 3021 (NOTLB) Call Trace: [C000000002763920] [C0000000000340D4] .hash_preload+0x214/0x284 (unreliable) [C000000002763AF0] [C000000000010ACC] .__switch_to+0x124/0x148 [C000000002763B80] [C0000000003C5D44] .schedule+0xbfc/0xdb0 [C000000002763C90] [C00000000006FD90] .do_wait+0xd68/0xfbc [C000000002763DC0] [C00000000001642C] .compat_sys_waitpid+0x18/0x2c [C000000002763E30] [C0000000000086A4] syscall_exit+0x0/0x40 login S 000000000fee9070 11584 2988 1 2989 2904 (NOTLB) Call Trace: [C0000001EBCA3850] [C00000000057A8B0] 0xc00000000057a8b0 (unreliable) [C0000001EBCA3A20] [C000000000010ACC] .__switch_to+0x124/0x148 [C0000001EBCA3AB0] [C0000000003C5D44] .schedule+0xbfc/0xdb0 [C0000001EBCA3BC0] [C00000000006FD90] .do_wait+0xd68/0xfbc [C0000001EBCA3CF0] [C00000000009ACB4] .compat_sys_wait4+0x38/0x108 [C0000001EBCA3E30] [C0000000000086A4] syscall_exit+0x0/0x40 bash R running task 11408 2989 2988 (NOTLB) gdb S 00000080b38926ec 11248 3021 2911 (NOTLB) Call Trace: [C0000001EBB8B990] [C0000001EBB8BA80] 0xc0000001ebb8ba80 (unreliable) [C0000001EBB8BB60] [C000000000010ACC] .__switch_to+0x124/0x148 [C0000001EBB8BBF0] [C0000000003C5D44] .schedule+0xbfc/0xdb0 [C0000001EBB8BD00] [C00000000006FD90] .do_wait+0xd68/0xfbc [C0000001EBB8BE30] [C0000000000086A4] syscall_exit+0x0/0x40
I'm unable to recreate the hang with the patch applied, either to the -123 or -124 kernels. I've tried both a POWER 5 and POWER 5+ system. The 5+ looks like this: # cat /proc/cpuinfo processor : 0 cpu : POWER5+ (gs) clock : 1896.102000MHz revision : 2.0 (pvr 003b 0200) processor : 1 cpu : POWER5+ (gs) clock : 1896.102000MHz revision : 2.0 (pvr 003b 0200) processor : 2 cpu : POWER5+ (gs) clock : 1896.102000MHz revision : 2.0 (pvr 003b 0200) processor : 3 cpu : POWER5+ (gs) clock : 1896.102000MHz revision : 2.0 (pvr 003b 0200) timebase : 237330000 platform : pSeries machine : CHRP IBM,9133-55A gdb is at version: GNU gdb Fedora (6.8-23.el5) What little I found out about the hang on an unpatched system is that gdb appeared to be waiting on a child, but I wasn't able to find an existing child using ps.
I'm still not able to recreate the problem with the patch. I even tried a 5.0 box with the -123 kernel + the patch, since that was mentioned as one of the boxes seeing the problem. I'll be taking vacation next week, but we'll find somebody to keep looking at this.
(In reply to comment #8) That patch is correct. There is no need to check the send_sig() return value (original upstream behavior is to ignore it).
By 'the patch is correct' do you mean it fixes the hang+bug on your end? If so, are you including this patch in the next snapshot?
Oddly, using the -124.el5 kernel on the machines that caused problems before has made this problem go away. Not sure what happened. I tried to bisect it down to the patch but had no success. I feel inclined to close this as NOTABUG/WORKSFORME but Jerome's patch does seem useful, so posting that oneline patch is worth something??
Since testing with and without the patch by IBM folks showed that GDB behaves correctly with the patch and hangs without it, my opinion is that the patch really fixes a bug and should be included. The test team on our side also had similar issues running ltrace tests, so it shows that the ptrace interface is not behaving well as it is now.
RHKL post: http://post-office.corp.redhat.com/archives/rhkernel-list/2008-November/msg00787.html
should this bug go to fixed awaiting test?
in kernel-2.6.18-126.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
This does not recreate using the -128 kernel. Closing this bug on the IBM side.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html