471112 – gdb on ppc hangs, then panics with a kill -9

Bug 471112 - gdb on ppc hangs, then panics with a kill -9

Summary: gdb on ppc hangs, then panics with a kill -9

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	ppc64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Ameet Paranjape
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-11-11 20:49 UTC by Don Zickus
Modified:	2009-01-20 19:45 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 19:45:18 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
proposed patch (335 bytes, application/octet-stream) 2008-11-18 09:48 UTC, Jerome Marchand	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	49985	0	None	None	None	Never
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Don Zickus 2008-11-11 20:49:24 UTC

Description of problem:
Where performing a simple gdb test on ppc, the gdb session hangs.  When
trying to kill the session with a kill -9 <gdb pid>, the machine panics.

The reproducer is simple:

#>gdb /bin/cat -x gdb-commands
where gdb-commands is 'run /etc/aliases'

When connecting to the hung gdb session with another gdb -p <gdb pid>
command, the backtrace looks 

(gdb) bt                                                                              
#0  0x00000080b38926ec in __waitpid_nocancel () from /lib64/libc.so.6                 
#1  0x00000000101ed758 in .inf_ptrace_trad_target ()                                  
#2  0x000000001008e4bc in .linux_nat_add_target ()                                    
#3  0x0000000010161668 in .target_mourn_inferior ()                                   
#4  0x000000001012f98c in .handle_inferior_event ()                                   
#5  0x00000000101315dc in .wait_for_inferior ()                                       
#6  0x00000000101ee244 in .startup_inferior ()                                        
#7  0x00000000101ed854 in .inf_ptrace_trad_target ()                                  
#8  0x00000000101ee650 in .fork_inferior ()                                           
#9  0x00000000101ed7c8 in .inf_ptrace_trad_target ()                                  
#10 0x000000001015d4b8 in .find_default_create_inferior ()                            
#11 0x000000001012ba8c in .get_inferior_args ()                                       
#12 0x00000000100b2b50 in .scan_expression_with_cleanup ()                            
#13 0x00000000100b2e10 in .cmd_func ()                                                
#14 0x000000001006c994 in .execute_command ()                                         
#15 0x0000000010140db4 in .push_prompt ()                                             
#16 0x0000000010142274 in .display_gdb_prompt ()                                      
#17 0x00000000103a8a60 in .rl_callback_read_char ()                                   
#18 0x0000000010141048 in .push_prompt ()                                             
#19 0x00000000101419e4 in .stdin_event_handler ()                                     
#20 0x00000000101406c4 in .gdb_do_one_event ()                                        
#21 0x000000001013f534 in .delete_async_signal_handler ()                             
#22 0x00000000101401ec in .gdb_do_one_event ()                                        
---Type <return> to continue, or q <return> to quit---                                
#23 0x000000001013bb7c in .catch_errors ()                                            
#24 0x00000000100c8a84 in ._initialize_tui_interp ()                                  
#25 0x000000001013c4fc in .current_interp_command_loop ()                             
#26 0x00000000100628c8 in .gdb_main ()                                                
#27 0x000000001013bb7c in .catch_errors ()                                            
#28 0x0000000010063304 in .gdb_main ()                                                
#29 0x000000001013bb7c in .catch_errors ()                                            
#30 0x0000000010062884 in .gdb_main ()                                                
#31 0x000000001006282c in .main ()                                                    
(gdb) quit                              

under the mon boot shell (after the panic) the back trace looks like

[root@squad9-lp1 ~]# Unable to handle kernel paging request for data at
address 0x6b6b6b6b6b6b6b8f
Faulting instruction address: 0xc0000000001c7194                                                  
cpu 0x1: Vector: 300 (Data Access) at [c000000002d43780]                                          
    pc: c0000000001c7194: ._raw_spin_lock+0x30/0x184                                              
    lr: c0000000003652e0: ._spin_lock+0x10/0x24                                                   
    sp: c000000002d43a00                                                                          
   msr: 8000000000009032                                                                          
   dar: 6b6b6b6b6b6b6b8f                                                                          
 dsisr: 40000000                                                                                  
  current = 0xc00000000267b810                                                                    
  paca    = 0xc0000000004caf80                                                                    
    pid   = 3056, comm = gdb                                                                      
enter ? for help                                                                                  
1:mon>                                                                                            
1:mon>                                                                                            
1:mon>                                                                                            
1:mon> t                                                                                          
[c000000002d43aa0] c0000000003652e0 ._spin_lock+0x10/0x24                                         
[c000000002d43b10] c0000000000b39e4 .get_utrace_lock_attached+0x58/0xb4                           
[c000000002d43bb0] c0000000000b5140 .utrace_detach+0x34/0x128                                     
[c000000002d43c50] c0000000000b8ea0 .ptrace_exit+0x84/0x1ac                                       
[c000000002d43cf0] c00000000007083c .do_exit+0x148/0xa50                                          
[c000000002d43da0] c0000000000711ec .sys_exit_group+0x0/0x8                                       
[c000000002d43e30] c0000000000086a4 syscall_exit+0x0/0x40                                         
--- Exception: c00 (System Call) at 00000080b3893330                                              
SP (ffffff6e590) is in userspace                                                                  
1:mon>                           

Version-Release number of selected component (if applicable):
this was reproduced with the kernel-2.6.18-123.el5 kernel, but I believe
kernel-2.6.18-122.el5 had the same issues.  I don't know how far back this
goes.

How reproducible:
always

Steps to Reproduce:
1. gdb /bin/cat -x gdb-commands, where gdb-commands is 'run /etc/aliases'
2. (after seeing the hang), kill -9 <gdb pid>
3. box panics

Actual results:
hang, then panic

Expected results:
no hang, and no panic when killed

Additional info:

Comment 2 IBM Bug Proxy 2008-11-17 21:20:50 UTC

Don (RH),

Do you know if this is specific to POWER architecture?  Have other architectures been tried?

Comment 3 Don Zickus 2008-11-17 21:28:56 UTC

Yes and they all succeed.  This bug seems to have only popped up in the last couple of kernel releases and _only_ for ppc for some reason.

Comment 4 IBM Bug Proxy 2008-11-18 00:00:26 UTC

Does /bin/cat have debugging information?


I can't reproduce this. It doesn't hang, nor as root or as a normal user.

# cat gdb-commands
run /etc/aliases
# gdb /bin/cat -x gdb-commands
GNU gdb Fedora (6.8-25.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu"...
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
...
<Print all of /etc/aliases>
...
Program exited normally.
(gdb)

Comment 5 Don Zickus 2008-11-18 02:01:36 UTC

hmm, what kernel version are you using?

I have been able to duplicate this on three different ppc boxes even with a 5.0 install and a 2.6.18-123.el5 kernel.

The hang is instant, zero gdb output.  I even get an audit warning complaining about an illegal SIGTRAP.  Also all my testing has been done as root.

-Don

Comment 6 IBM Bug Proxy 2008-11-18 02:10:58 UTC

This is a box loaded with this kernel:

Linux 2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008 ppc64 ppc64 ppc64 GNU/Linux (RHEL 5.3 Snapshot 2)

This is a POWER6 box.

Could you please provide more information on what machine this is reproducing?

Thanks,
Luis

Comment 7 Jerome Marchand 2008-11-18 09:42:31 UTC

I think the problem is in ptrace_report_exec(). No return value is provided after sending the signal.
The bug was introduced by the fix of bz455060.

Comment 8 Jerome Marchand 2008-11-18 09:48:54 UTC

Created attachment 323872 [details]
proposed patch

Not tested, not even compiled.

Roland, should we return something special if send_sig() returns an error?

Comment 9 IBM Bug Proxy 2008-11-18 17:29:41 UTC

Luis (IBM),

This bug was reported against squad9-lp1 which is a POWER p5+ Model# 913152A.

Comment 10 Oleg Nesterov 2008-11-18 18:48:42 UTC

> Unable to handle kernel paging request for data at
> address 0x6b6b6b6b6b6b6b8f

POISON_FREE.

> [c000000002d43aa0] c0000000003652e ._spin_lock+0x10/0x24                       
> [c000000002d43b10] c0000000000b39e4 .get_utrace_lock_attached+0x58/0xb4

so it must be utrace->lock (but the offset 0x8f - 0x6b = 36 is
strange).

Looks like target->utrace != NULL but freed.

> [c000000002d43bb0] c0000000000b5140 .utrace_detach+0x34/0x128                   
> [c000000002d43c50] c0000000000b8ea0 .ptrace_exit+0x84/0x1ac                     
> [c000000002d43cf0] c00000000007083c .do_exit+0x148/0xa50                        

grep, grep... shot in the dark.

    utrace_first_engine:

        target->utrace = utrace;
        if (target->exit_state == EXIT_DEAD) {
              target->utrace = NULL;
              kmem_cache_free(utrace); // !!! without RCU !!!
        }

Otoh,

    get_utrace_lock_attached:

        utrace = target->utrace;
        if (target->exit_state != EXIT_DEAD) {
               spin_lock(&utrace->lock); // BOOM !!
        }

What guarantees we must notice EXIT_DEAD ?

Comment 11 Oleg Nesterov 2008-11-18 19:16:45 UTC

> What guarantees we must notice EXIT_DEAD ?

To clarify. Yes I see, utrace_first_engine() does mb(), and
get_utrace_lock_attached() does rmb(). But these barriers can't
help here.

When the task sets its ->exit_state, this STORE is completely
"synchronous".

IOW, let's suppose we have:

        utrace_first_engine() stores target->utrace

        get_utrace_lock_attached() loads ->utrace

        get_utrace_lock_attached() checks !EXIT_DEAD

        target sets ->exit_state = EXIT_DEAD

        utrace_first_engine() checks EXIT_DEAD and
        frees utrace.

        get_utrace_lock_attached() uses the freed utrace


But again, this is just a wild guess...

Comment 12 Oleg Nesterov 2008-11-18 19:31:21 UTC

> When the task sets its ->exit_state, this STORE is completely
> "synchronous".

Damn, sorry for noise... I meant "asynchronous".

And just in case, yes I understand, this smp_mb() is for
utrace_release_task().

Comment 13 Oleg Nesterov 2008-11-18 22:05:24 UTC

> Damn, sorry for noise...

Damn! even more sorry. please ignore my messages.

Yes, it looks like POISONE_FREE'ed memory. but this doesn't mean
->utrace was freed! target was? Don't have the sources to look at,
and anyway I think I should just shut up :( Oh, and it is not possible
that ptrace_exit() can race with utrace_first_engine()...

Comment 15 IBM Bug Proxy 2008-11-20 16:51:57 UTC

I was able to recreate the hang, but not the BUG(), on a POWER 5 machine with the -123 kernel.  I had only tried sending SIGKILL to the gdb process, but not attaching it from another gdb instance.  I don't know if that is a factor.

Jerome's patch fixed the hang.

Redhat, have you tried the patch, and if so, does it solve the problem for you?

Comment 16 IBM Bug Proxy 2008-11-20 20:32:33 UTC

I built Jerome's patch against a -124 kernel, and it fixed the problem on a POWER 5+ machine.  Again this machine never saw the kernel BUG() although gdb did hang.

Comment 17 Don Zickus 2008-11-20 20:33:45 UTC

I have tried the patch and it solves the panic but not the hang.

When I run

#> gdb /bin/cat
(gdb) run /etc/aliases
Starting program: /bin/cat /etc/aliases
<hangs>

Which is how it behaved before.  However when I do a 'kill -9 <gdbpid>' instead of panicing the box, it properly kills the process and everything is normal again.

So the patch fixed one problem, but I still have the other one (the hang).

The system I tested on is:

[root@squad9-lp1 ~]# cat /proc/cpuinfo                                                                                   
processor       : 0
cpu             : POWER5+ (gs)
clock           : 2096.901000MHz
revision        : 3.0 (pvr 003b 0300)

processor       : 1
cpu             : POWER5+ (gs)
clock           : 2096.901000MHz
revision        : 3.0 (pvr 003b 0300)

processor       : 2
cpu             : POWER5+ (gs)
clock           : 2096.901000MHz
revision        : 3.0 (pvr 003b 0300)

processor       : 3
cpu             : POWER5+ (gs)
clock           : 2096.901000MHz
revision        : 3.0 (pvr 003b 0300)

timebase        : 512052000
platform        : pSeries
machine         : CHRP IBM,9131-52A
[root@squad9-lp1 ~]#


Performing a 'echo t > /proc/sysrq-trigger' during the hang produced this

<snip>
...
bash          S 000000000fee9118  6144  2911   2909  3021            (NOTLB)            
Call Trace:                                                                                
[C000000002763920] [C0000000000340D4] .hash_preload+0x214/0x284 (unreliable)               
[C000000002763AF0] [C000000000010ACC] .__switch_to+0x124/0x148                             
[C000000002763B80] [C0000000003C5D44] .schedule+0xbfc/0xdb0                                
[C000000002763C90] [C00000000006FD90] .do_wait+0xd68/0xfbc                                 
[C000000002763DC0] [C00000000001642C] .compat_sys_waitpid+0x18/0x2c
[C000000002763E30] [C0000000000086A4] syscall_exit+0x0/0x40
login         S 000000000fee9070 11584  2988      1  2989          2904 (NOTLB)
Call Trace:
[C0000001EBCA3850] [C00000000057A8B0] 0xc00000000057a8b0 (unreliable)
[C0000001EBCA3A20] [C000000000010ACC] .__switch_to+0x124/0x148
[C0000001EBCA3AB0] [C0000000003C5D44] .schedule+0xbfc/0xdb0
[C0000001EBCA3BC0] [C00000000006FD90] .do_wait+0xd68/0xfbc
[C0000001EBCA3CF0] [C00000000009ACB4] .compat_sys_wait4+0x38/0x108
[C0000001EBCA3E30] [C0000000000086A4] syscall_exit+0x0/0x40
bash          R  running task   11408  2989   2988                     (NOTLB)
gdb           S 00000080b38926ec 11248  3021   2911                     (NOTLB)
Call Trace:
[C0000001EBB8B990] [C0000001EBB8BA80] 0xc0000001ebb8ba80 (unreliable)
[C0000001EBB8BB60] [C000000000010ACC] .__switch_to+0x124/0x148
[C0000001EBB8BBF0] [C0000000003C5D44] .schedule+0xbfc/0xdb0
[C0000001EBB8BD00] [C00000000006FD90] .do_wait+0xd68/0xfbc
[C0000001EBB8BE30] [C0000000000086A4] syscall_exit+0x0/0x40

Comment 18 IBM Bug Proxy 2008-11-21 18:01:49 UTC

I'm unable to recreate the hang with the patch applied, either to the -123 or -124 kernels.

I've tried both a POWER 5 and POWER 5+ system.  The 5+ looks like this:

# cat /proc/cpuinfo
processor	: 0
cpu		: POWER5+ (gs)
clock		: 1896.102000MHz
revision	: 2.0 (pvr 003b 0200)

processor	: 1
cpu		: POWER5+ (gs)
clock		: 1896.102000MHz
revision	: 2.0 (pvr 003b 0200)

processor	: 2
cpu		: POWER5+ (gs)
clock		: 1896.102000MHz
revision	: 2.0 (pvr 003b 0200)

processor	: 3
cpu		: POWER5+ (gs)
clock		: 1896.102000MHz
revision	: 2.0 (pvr 003b 0200)

timebase	: 237330000
platform	: pSeries
machine		: CHRP IBM,9133-55A

gdb is at version: GNU gdb Fedora (6.8-23.el5)

What little I found out about the hang on an unpatched system is that gdb appeared to be waiting on a child, but I wasn't able to find an existing child using ps.

Comment 19 IBM Bug Proxy 2008-11-21 23:31:14 UTC

I'm still not able to recreate the problem with the patch.  I even tried a 5.0 box with the -123 kernel + the patch, since that was mentioned as one of the boxes seeing the problem.

I'll be taking vacation next week, but we'll find somebody to keep looking at this.

Comment 20 Roland McGrath 2008-11-24 07:47:27 UTC

(In reply to comment #8)
That patch is correct.  There is no need to check the send_sig() return value (original upstream behavior is to ignore it).

Comment 21 IBM Bug Proxy 2008-11-24 17:01:26 UTC

By 'the patch is correct' do you mean it fixes the hang+bug on your end?

If so, are you including this patch in the next snapshot?

Comment 22 Don Zickus 2008-11-25 21:34:26 UTC

Oddly, using the -124.el5 kernel on the machines that caused problems before has made this problem go away.  Not sure what happened.  I tried to bisect it down to the patch but had no success.

I feel inclined to close this as NOTABUG/WORKSFORME but Jerome's patch does seem useful, so posting that oneline patch is worth something??

Comment 23 IBM Bug Proxy 2008-11-26 16:20:41 UTC

Since testing with and without the patch by IBM folks showed that GDB behaves
correctly with the patch and hangs without it, my opinion is that the patch
really fixes a bug and should be included.

The test team on our side also had similar issues running ltrace tests, so it
shows that the ptrace interface is not behaving well as it is now.

Comment 24 IBM Bug Proxy 2008-11-27 02:47:57 UTC

RHKL post:

http://post-office.corp.redhat.com/archives/rhkernel-list/2008-November/msg00787.html

Comment 26 IBM Bug Proxy 2008-12-02 21:21:21 UTC

should this bug go to fixed awaiting test?

Comment 27 Don Zickus 2008-12-09 21:04:55 UTC

in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 31 IBM Bug Proxy 2009-01-06 15:10:51 UTC

This does not recreate using the -128 kernel.  Closing this bug on the IBM side.

Comment 32 errata-xmlrpc 2009-01-20 19:45:18 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.