301791 – ptrace broke

Summary: ptrace broke

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	ppc64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Roland McGrath
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-09-22 19:56 UTC by David Woodhouse
Modified:	2007-11-30 22:12 UTC
CC List:	3 users
Fixed In Version:	2.6.23-0.202.rc8.fc8
Clone Of:
Environment:
Last Closed:	2007-09-25 07:51:37 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Debug output (23.54 KB, text/plain) 2007-09-24 11:47 UTC, David Woodhouse
standalone test case (1.42 KB, text/x-csrc) 2007-09-24 12:37 UTC, Roland McGrath

Description David Woodhouse 2007-09-22 19:56:21 UTC

Upon typing 'run' into gdb to start a program...

------------[ cut here ]------------
kernel BUG at arch/powerpc/kernel/ptrace.c:71!
Oops: Exception in kernel mode, sig: 5 [#2]
SMP NR_CPUS=128 NUMA PowerMac
Modules linked in: bnep bridge nls_utf8 loop hidp vfat fat nfs lockd nfs_acl
autofs4 rfcomm l2cap sunrpc tun ipv6 dm_mirror dm_mod snd_usb_audio
snd_aoa_codec_tas snd_usb_lib snd_rawmidi snd_hwdep snd_powermac
snd_aoa_fabric_layout snd_aoa snd_seq_dummy snd_aoa_i2sbus snd_aoa_soundbus
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device pmac_zilog snd_pcm_oss
snd_mixer_oss snd_pcm snd_page_alloc snd_timer snd soundcore usb_storage ide_cd
cdrom firewire_ohci firewire_core crc_itu_t hci_usb bluetooth sungem sungem_phy
sg shpchp sata_svw ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd
ohci_hcd ehci_hcd
NIP: c00000000000aefc LR: c0000000000d2890 CTR: c00000000000aeac
REGS: c000000112fbb890 TRAP: 0700   Tainted: G      D  (2.6.23-0.185.rc6.git7.fc8)
MSR: 9000000000029032 <EE,ME,IR,DR>  CR: 22044488  XER: 20000000
TASK = c00000011bf612c0[742] 'gdb' THREAD: c000000112fb8000 CPU: 0
GPR00: 0000000000000001 c000000112fbbb10 c000000000738ea0 fffffffffffffffb 
GPR04: c000000000669950 0000000000000000 0000000000000008 0000000000000000 
GPR08: 00000fffffcd5450 c00000002d1fbea0 0000000000000000 c00000000000b1ec 
GPR12: 0000000028044448 c000000000659180 00000000105f72c8 00000000105f5b30 
GPR16: 00000fffffcd5b38 c00000006c760000 c000000000669920 0000000000000000 
GPR20: c000000112fb8000 c000000066902960 0000000000000000 c00000000044a480 
GPR24: 0000000000000000 0000000000000000 c000000066902960 00000fffffcd5450 
GPR28: 0000000000000000 0000000000000000 0000000000000000 0000000000000008 
NIP [c00000000000aefc] .genregs_get+0x50/0x340
LR [c0000000000d2890] .ptrace_layout_access+0x2a8/0x348
Call Trace:
[c000000112fbbb10] [c0000000006e79a8] netlbl_cipsov4_genl_c_listall+0xebc0/0x1da20 (unreliable)
[c000000112fbbbc0] [c0000000000d2890] .ptrace_layout_access+0x2a8/0x348
[c000000112fbbcb0] [c000000000009db8] .arch_ptrace+0x11c/0x2f4
[c000000112fbbd70] [c0000000000d31bc] .sys_ptrace+0x88/0x2f0
[c000000112fbbe30] [c000000000008548] syscall_exit+0x0/0x40
Instruction dump:
f821ff51 7c7a1b78 7cbe2b78 7cdf3378 7cfc3b78 7d1b4378 e9230380 3860fffb 
2fa90000 419e02d4 e8090140 780007e0 <0b000000> 2fa60000 e89a0380 38000000

Comment 1 Roland McGrath 2007-09-24 02:42:34 UTC

I couldn't reproduce this using the 2.6.23-0.185.rc6.git7.fc8.ppc64 kernel on
an F7 install.  Can you verify a few details while I try to get an F8 install going?

gdb rpm version & arch
what binary did you run under gdb, i.e. 32 or 64 bit?

Comment 2 David Woodhouse 2007-09-24 09:31:59 UTC

Further investigation shows that it happens when my _gdb_ is 64-bit, not when
it's 32-bit. That was tested with 6.6-28. It happens when starting _any_ 32-bit
or 64-bit program that I've tried so far.

pmac /home/dwmw2 $ gdb foo
GNU gdb Red Hat Linux (6.6-27.fc8rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu"...
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) run
Starting program: /home/dwmw2/foo 
Trace/breakpoint trap

Comment 3 Roland McGrath 2007-09-24 11:00:43 UTC

I got rawhide installed, and 2.6.23-0.195.rc7.git3.fc8.ppc64 is the kernel I
actually have now.  With gdb-6.6-28.fc8.ppc64, I tried "gdb /bin/true" (32-bit)
and "gdb /usr/bin/gdb" (64-bit) and just "run" in each worked fine.
After updating my mirror so I could make the rawhide install work, I no longer
have the older 0.185 kernel on hand to try with this install.  But perhaps the
problem went away.  Can you try the latest rawhide kernel?

Comment 4 David Woodhouse 2007-09-24 11:03:31 UTC

file /usr/bin/gdb

Comment 5 Roland McGrath 2007-09-24 11:06:28 UTC

I really meant it when I said gdb-6.6-28.fc8.ppc64, and yes, I double-checked.

Comment 6 Roland McGrath 2007-09-24 11:11:47 UTC

If this problem does persist for you, is the oops non-total enough that you can
get sysrq-t and see the backtrace of the ptrace target task?

Comment 7 David Woodhouse 2007-09-24 11:19:23 UTC

Hm, OK -- RPM behaviour changed recently and if you have both 32-bit and 64-bit
packages installed simultaneously now, your /usr/bin/gdb will be 32-bit.

I'm using a local build of the 0.195 kernel with some extra debugging added, and
it persists here. I can't get a backtrace of the target task (without extra
hacking) because it exits immediately. GDB quits with the BUG(), and the
inferior immediately dies with SIGTRAP.

With the patch at http://david.woodhou.se/ptrace-hack.patch I see the following
output when I use 64-bit GDB... the first four invocations with 'regs partial'
would have caused the BUG() before my patch.

genregs_get 0-7, regs partial
genregs_get 256-263, regs partial
genregs_get 0-7, regs partial
genregs_get 256-263, regs partial
genregs_get 0-7, regs full
genregs_get 256-263, regs full
genregs_get 0-7, regs full
genregs_get 256-263, regs full
genregs_get 0-7, regs full
genregs_get 256-263, regs full
genregs_get 0-7, regs full
genregs_get 256-263, regs full

I do see this on ppc32 kernels too -- but the CHECK_FULL_REGS() test doesn't
cause a BUG() there -- it just prints a message. In fact I think I've seen it on
ppc32 for a while, but only ever when I've been busy chasing something else in
gdb (go figure).
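
For readers following along, here is a minimal userspace sketch of the check being discussed. It is not the kernel source: struct model_regs, model_full_regs() and model_check_full_regs() are hypothetical names, and the only facts taken from this thread are that the low bit of regs->trap marks a partial register set, that ppc64 hits BUG() on it, and that ppc32 only prints a message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's pt_regs; only 'trap' matters here. */
struct model_regs {
    unsigned long trap;
};

/* Possible outcomes of the modelled check. */
enum check_result { CHECK_OK, CHECK_WARN, CHECK_BUG };

/* Low bit of trap set means only a partial register set was saved. */
bool model_full_regs(const struct model_regs *regs)
{
    assert(regs != NULL);
    return (regs->trap & 1UL) == 0;
}

/* Models the behaviour described above: a partial set is fatal on ppc64
 * (BUG) but only produces a message on ppc32. */
enum check_result model_check_full_regs(const struct model_regs *regs,
                                        bool is_ppc64)
{
    if (model_full_regs(regs))
        return CHECK_OK;
    return is_ppc64 ? CHECK_BUG : CHECK_WARN;
}
```

With a trap value whose low bit is set (0xc01 is used in the test below purely as an illustrative value), the model reports CHECK_BUG for ppc64 and CHECK_WARN for ppc32, which matches the difference between the two kernels reported here.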

Comment 8 Roland McGrath 2007-09-24 11:34:17 UTC

Yeah, that's how the install came (two gdb rpms, the wrong one won).  But I'm
highly suspicious of such things from many past horrors, so I checked and
rediddled it correctly on autopilot before I even began.

Can you verify that it happens for you with some exact binary that I have?
14bd16ccc6c6da555ea914a2edb42164  /bin/true
from coreutils-6.9-6.fc8.ppc, e.g.

If our gdbs and kernels and all match, then perhaps it is a hardware variation
that uses different kernel code or something?  My machine is a dual-G5 Mac.

With your debugging hack, it should be easy to stick in
show_stack(current,NULL);show_stack(target,NULL); in the partial regs case.  The
target's backtrace will probably clue me in.

Comment 9 David Woodhouse 2007-09-24 11:41:27 UTC

Yes, I have exactly the same /bin/true binary, and it happens there too. About
to reboot with the show_stack() calls.

My machine is also a dual-G5 Mac. Would be interesting to see what output _you_
get with my debugging code.

Comment 10 Roland McGrath 2007-09-24 11:47:09 UTC

Another thing I should have suggested for more possibly-informative spew is
#define DEBUG 1 at the top of kernel/ptrace.c.

In my runs I think we can be pretty sure your code would never say "partial",
because it would hit the BUG_ON in the kernel I'm testing.

Comment 11 David Woodhouse 2007-09-24 11:47:52 UTC

Created attachment 204051
Debug output

Comment 12 David Woodhouse 2007-09-24 12:03:03 UTC

The above debug output already has DEBUG defined in kernel/ptrace.c

It seems that these 'partial' reg sets all happen when the inferior is still in
utrace_quiescent(), before it's even started up. But that's strange. When we
enter sys_fork() or sys_clone(), we _do_ save the full registers -- and
copy_thread() _checks_ that -- you'd have got a BUG() if you tried to clone
without having the full registers present to copy into the child.

Dunno why your system would be different to mine. Do you have auditd running?
That could well force us to save a full regset all of the time, although given
my previous paragraph I still don't quite see how that would help.

You don't have a 'special' way of forking the inferior for utrace, do you? And
somehow manage to create the new thread without tripping over the same BUG()
happening in copy_thread()? Confused...

Comment 13 Roland McGrath 2007-09-24 12:11:12 UTC

So it's the exec report.  D'oh, I looked at this code before as the top suspect
but misread arch/powerpc/kernel/process.c:start_thread.  I think this could
happen in an upstream kernel when using PTRACE_O_TRACEEXEC (which gdb might not
use, but from the traces it looks like it does).
AFAICT, start_thread never clears regs->trap after it's clobbered all the
registers.  I can't see why it would be different on my machine than on yours.
Ah, the upstream kernel has the CHECK_FULL_REGS BUG_ON for PEEKUSR but not
GETREGS, which is what gdb probably uses now.  A case using PTRACE_O_TRACEEXEC,
the child exec'ing so it's stopped in ptrace_notify, and then PTRACE_PEEKUSR to
fetch a register should hit the oops in the upstream kernel.

The upstream kernel should have CHECK_FULL_REGS for its GETREGS/SETREGS requests
too, they were probably just omitted accidentally when those were added/revamped
recently.  Then it would crash with gdb too, and the right fix is for
start_thread to clear regs->trap to 0.

But I could be missing something about all this since I still have no
explanation for why my machine does not hit the bug.
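
The sequence described in this comment can be sketched roughly as follows. This is not the attached test case, only an illustration under stated assumptions: peek_at_exec_stop() is a hypothetical helper, offset 0 of the user area is an arbitrary choice of register, and the code uses the glibc spelling PTRACE_PEEKUSER for the request. On an affected kernel the peek step is where the oops would be expected, so it should only be run on a machine you are prepared to crash.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: trace a child with PTRACE_O_TRACEEXEC, wait for its
 * exec report, then try PTRACE_PEEKUSER on it.
 * Returns 0 if the exec stop was reached, -1 otherwise. The errno from the
 * peek goes to *peek_errno (0 means the peek itself succeeded, and *word
 * then holds the value read). */
int peek_at_exec_stop(const char *path, long *word, int *peek_errno)
{
    int status;
    int rc = -1;
    pid_t pid;

    assert(path != NULL && word != NULL && peek_errno != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: request tracing, stop so the parent can set options, exec. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(126);
        raise(SIGSTOP);
        execl(path, path, (char *)NULL);
        _exit(127);
    }

    /* Parent: wait for the initial SIGSTOP, then ask for exec reports. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        goto out;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACEEXEC) != 0)
        goto out;
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) != 0)
        goto out;

    /* The next stop should be the exec report. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        goto out;
    if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
        goto out;
    rc = 0;

    /* Fetch one word of the user area while stopped at the exec report. */
    errno = 0;
    *word = ptrace(PTRACE_PEEKUSER, pid, (void *)0, NULL);
    *peek_errno = errno;

out:
    /* Never let the child run on; kill and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return rc;
}
```

A caller would invoke it as peek_at_exec_stop("/bin/true", &word, &err) and check both the return value and err. The peek result is reported separately because not every architecture implements this request.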

Comment 14 Roland McGrath 2007-09-24 12:14:33 UTC

I did have auditd running (default in the Fedora install).  After chkconfig
auditd off; reboot, I do get the oops.

Comment 15 Roland McGrath 2007-09-24 12:37:36 UTC

Created attachment 204101
standalone test case

This test case makes the upstream kernel crash too.

Comment 16 David Woodhouse 2007-09-24 12:53:31 UTC

As expected, with 'regs->trap &= ~1;' added to start_thread, I see 'regs full'
in all cases in my debugging output.
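
A small userspace model may help show why that one-line change is enough. It is a sketch, not the submitted patch: struct exec_regs, exec_regs_full() and model_start_thread() are hypothetical names, and the register layout is reduced to what the illustration needs. The point it illustrates is that exec replaces every register value, so the register set is complete afterwards, yet the partial-set flag left in trap from syscall entry still says otherwise unless it is cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for pt_regs: general registers, program counter, trap. */
struct exec_regs {
    unsigned long gpr[32];
    unsigned long nip;
    unsigned long trap;
};

/* Low bit of trap set means "partial register set". */
bool exec_regs_full(const struct exec_regs *regs)
{
    assert(regs != NULL);
    return (regs->trap & 1UL) == 0;
}

/* Models what exec does to the register block: every register is reset for
 * the new program. With apply_fix the partial flag is cleared as well,
 * mirroring the 'regs->trap &= ~1;' change tested above. */
void model_start_thread(struct exec_regs *regs, unsigned long entry,
                        bool apply_fix)
{
    assert(regs != NULL);
    memset(regs->gpr, 0, sizeof regs->gpr);
    regs->nip = entry;
    if (apply_fix)
        regs->trap &= ~1UL;
}
```

Starting from a trap value with the low bit set, the model still reports a partial set after model_start_thread() without the fix, and a full set with it. The values used in the test are illustrative only.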

Comment 17 Roland McGrath 2007-09-24 13:07:18 UTC

Good.  I have tested patches for upstream doing that, and will write them up and
submit them in the morning (nearing dawn here now), as well as put the fix into
rawhide pending upstream inclusion.  It turns out that the way gdb uses the
features will never provoke the bug on the upstream kernel (but my comment 15
test case will).  utrace does some things more uniformly and so trips this
underlying bug in more cases.

Comment 18 David Woodhouse 2007-09-24 13:31:27 UTC

Thanks. Your patch is probably a candidate for 2.6.23.

Comment 19 Roland McGrath 2007-09-24 23:42:34 UTC

gdb --args /bin/sh -c 'exec /bin/true' (and "run") is also a sufficient test
case for the upstream bug.

Comment 20 Roland McGrath 2007-09-25 07:51:37 UTC

2.6.23-0.202.rc8.fc8 fixes it
