Bug 418231 - ppc64: utrace(?): Stack overflows on ptrace testsuite
Summary: ppc64: utrace(?): Stack overflows on ptrace testsuite
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: ppc64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Roland McGrath
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-12-10 16:04 UTC by Jan Kratochvil
Modified: 2011-03-18 04:48 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-18 04:48:59 UTC
Target Upstream Version:
Embargoed:


Attachments
ppc64 stack increase 4x (from 16KB to 64KB) (3.16 KB, patch)
2007-12-10 16:04 UTC, Jan Kratochvil

Description Jan Kratochvil 2007-12-10 16:04:42 UTC
Description of problem:
Unprivileged userland ptrace tests can crash the ppc64 kernel.
Kernel panic dumps point to stack corruption, and a larger stack works around the
problem.
Either the default 16KB ppc64 kernel stack is too small or utrace uses too much stack.
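For context, here is a minimal userspace model of the layout involved. It assumes the ppc64 2.6.18 convention: THREAD_SHIFT is 14 (16KB stacks), struct thread_info occupies the lowest addresses of the THREAD_SIZE-aligned stack area, the stack grows down towards it, and current_thread_info() is obtained by masking the stack pointer. Under that layout, a stack that runs past its bottom overwrites thread_info first. The helper names and the thread_info size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16KB ppc64 kernel stacks (THREAD_SHIFT 14).  A 64KB stack,
 * as in the attached 4x patch, would correspond to a shift of 16. */
#define THREAD_SHIFT 14
#define THREAD_SIZE  (UINT64_C(1) << THREAD_SHIFT)

/* Model of current_thread_info(): round the stack pointer down to the
 * THREAD_SIZE-aligned base, where struct thread_info is assumed to live. */
static uint64_t thread_info_base(uint64_t sp)
{
    return sp & ~(THREAD_SIZE - 1);
}

/* Hypothetical helper: bytes left between sp and the end of thread_info.
 * A negative result means the stack has already grown into thread_info. */
static int64_t stack_headroom(uint64_t sp, uint64_t thread_info_size)
{
    assert(thread_info_size < THREAD_SIZE);
    return (int64_t)(sp - thread_info_base(sp)) - (int64_t)thread_info_size;
}
```

For example, the sp value c0000000593bbcf0 from the do_exit() dump later in this report masks to c0000000593b8000. In this model, once sp itself drops below the base, the mask points into the neighbouring block. An overflow therefore tends to show up as corruption somewhere else rather than as a clean error.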

Version-Release number of selected component (if applicable):
kernel-2.6.18-58.el5.utrace1.ppc64
(kernel-2.6.18-58.el5.utrace2.ppc64 crashes the same way but the dumps included
here match the utrace1 build)
(kernel-2.6.18-58.el5.ppc64 cannot be tested well as userland locks up too early)

How reproducible:
Usually during the first `make check'.

Steps to Reproduce:
1. Download http://sourceware.org/systemtap/wiki/utrace/tests .
2. make check, specifically using:
   i=0; while :; do date --iso=seconds; TESTTIME=$((10 * 60)) make check; i=$((i + 1)); echo "$i"; done

Actual results:
kernel-2.6.18-58.el5.utrace1.ppc64
http://porkchop.devel.redhat.com/brewroot/scratch/roland/task_1067117/
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc00000000006e6d8
cpu 0x1: Vector: 300 (Data Access) at [c0000000593bba70]
    pc: c00000000006e6d8: .do_exit+0x4cc/0xa14
    lr: c00000000006e6a8: .do_exit+0x49c/0xa14
    sp: c0000000593bbcf0
   msr: 8000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000005ac57310
  paca    = 0xc000000000475000
    pid   = 2216, comm = tee
enter ? for help
1:mon>

Expected results:
No crash.

Additional info:
This crash in do_exit() indicates corruption of `struct thread_info', which points
to a stack overflow.  The other crashes usually showed only general memory
corruption.  The -debug kernel does not print anything useful either.
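One way to confirm that a code path is running out of stack is to measure peak stack usage. The following is a minimal sketch of the idea behind the kernel's CONFIG_DEBUG_STACK_USAGE option: the stack is zero-filled when allocated, and free space is estimated by counting the words above thread_info that were never written. It operates on an ordinary array standing in for a stack area, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a kernel stack area: words [0, ti_words) hold
 * thread_info, the stack grows down from the last word, and the whole
 * area was zero-filled when it was allocated. */

/* Free stack in bytes: count the still-zero words directly above
 * thread_info.  The first nonzero word marks the deepest point the stack
 * has reached (assuming no frame stored an all-zero word there). */
static size_t stack_free_bytes(const unsigned long *area, size_t nwords,
                               size_t ti_words)
{
    size_t i = ti_words;

    assert(ti_words <= nwords);
    while (i < nwords && area[i] == 0)
        i++;
    return (i - ti_words) * sizeof(unsigned long);
}

/* Simulate call chains that have touched `depth` words below the top. */
static void simulate_usage(unsigned long *area, size_t nwords, size_t depth)
{
    for (size_t k = 0; k < depth && k < nwords; k++)
        area[nwords - 1 - k] = 0xdeadbeefUL;
}
```

A result of 0 means the deepest frame reached the word directly above thread_info. On a real kernel, anything deeper would already be overwriting thread_info.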

A long time ago I wrote a proper stack checker with an unmapped page below the
stack (allocated by vmap()), but only for x86.  I also have a simplified version
for x86_64 but nothing for ppc/ppc64.  The x86_64 patch would be easy to port,
though.
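The idea behind an unmapped-page-below checker is that the page directly under the stack has no valid mapping. An overflow then faults at the offending access instead of silently corrupting thread_info or a neighbouring allocation. The sketch below is only a userspace analogue and is not the x86 or x86_64 patch. It assumes Linux, uses mmap() and mprotect(PROT_NONE) in place of vmap(), and the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

/* SIGSEGV handler: jump back to the probe instead of terminating. */
static void probe_segv_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Map one guard page plus `usable` bytes and make the lowest page
 * inaccessible.  Returns the lowest usable address (a downward-growing
 * stack would overflow into the guard page), or NULL on failure. */
static char *alloc_guarded_area(size_t usable)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    char *p = mmap(NULL, (size_t)page + usable, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mprotect(p, (size_t)page, PROT_NONE) != 0) {
        munmap(p, (size_t)page + usable);
        return NULL;
    }
    return p + page;
}

/* Returns 1 if a write to addr faults, 0 if it succeeds. */
static int write_faults(volatile char *addr)
{
    struct sigaction sa, old;
    int faulted = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(probe_env, 1) == 0)
        *addr = 1;      /* faults if addr lies in the guard page */
    else
        faulted = 1;    /* reached via siglongjmp from the handler */

    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}
```

With a 16KB area, writes to the returned base and to base + 16383 succeed. A write to base - 1 lands in the guard page and faults.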

Doubling the stack size (to 32KB) still crashed at the same place, although only
during `make check' run #9 (with the 16KB stack it crashes during run #1):
2.6.18-58.el5.utrace2ppcstack2x:
Unable to handle kernel paging request for data at address 0x65b21ca8
Faulting instruction address: 0xc0000000000891d8
cpu 0x0: Vector: 300 (Data Access) at [c000000065bef8d0]
    pc: c0000000000891d8: .debug_mutex_add_waiter+0x4c/0x6c
    lr: c00000000034fb0c: .__mutex_lock_interruptible_slowpath+0x108/0x33c
    sp: c000000065befb50
   msr: 8000000000001032
   dar: 65b21ca8
 dsisr: 42000000
  current = 0xc000000065b21550
  paca    = 0xc000000000465000
    pid   = 4597, comm = make
enter ? for help
0:mon> ?

With a patch increasing the stack size 4x (to 64KB), it has passed 15 `make check'
runs so far, but this may also be a false positive.  Still, the testing as a whole
indicates that the ppc problem is related to a stack overflow.

The kernel build 2.6.18-58.el5.utrace2ppcstack4x.ppc64 with the attached patch is at:
  http://porkchop.devel.redhat.com/brewroot/scratch/jkratoch/task_1072746/

Comment 1 Jan Kratochvil 2007-12-10 16:04:42 UTC
Created attachment 282881 [details]
ppc64 stack increase 4x (from 16KB to 64KB)

Comment 2 Jan Kratochvil 2007-12-10 22:33:28 UTC
As 2.6.18-58.el5.utrace2ppcstack4x.ppc64 crashed in RHTS Job 11941, increasing
the stack size is probably not a solution, although it delays the crash a lot.

x86 stack overflow patch:
http://people.redhat.com/jkratoch/kernel-stackoverflow-x86-2005.patch

x86_64 stack overflow patch (simple):
http://people.redhat.com/jkratoch/kernel-stackoverflow-x86_64.patch

AFAIK this ppc64 crash is not tracked anywhere else for utrace yet, so I am
keeping this Bug open.

RHTS Job 11941 - 2.6.18-58.el5.utrace2ppcstack4x:
list_del corruptio

RHTS Job 11921 - 2.6.18-58.el5.utrace2ppcstack2x:
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c0000000780df950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000

RHTS Job 11871 - 2.6.18-58.el5.utrace1
kernel BUG in check_dead_utrace at kernel/utrace.c:328!
cpu 0x0: Vector: 700 (Program Check) at [c00000000269f7e0]
    pc: c0000000000ae0c4: .check_dead_utrace+0x178/0x22c
    lr: c0000000000aec44: .wake_quiescent+0x94/0x1dc
    sp: c00000000269fa60
   msr: 8000000000029032
  current = 0xc0000000764a5b60
  paca    = 0xc000000000474e00
    pid   = 18121, comm = late-ptrace-may
kernel BUG in check_dead_utrace at kernel/utrace.c:328!
enter ? for help
0:mon>

RHTS Job 11868 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc00000000006e6d8
cpu 0x1: Vector: 300 (Data Access) at [c0000000593bba70]
    pc: c00000000006e6d8: .do_exit+0x4cc/0xa14
    lr: c00000000006e6a8: .do_exit+0x49c/0xa14
    sp: c0000000593bbcf0
   msr: 8000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000005ac57310
  paca    = 0xc000000000475000
    pid   = 2216, comm = tee
enter ? for help
1:mon>

RHTS Job 11852 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c00000004ca1b950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000000000067664: .copy_process+0x174/0x158c
    sp: c00000004ca1bbd0
   msr: 800000000000b032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc000000050870b40
  paca    = 0xc000000000474e00
    pid   = 2126, comm = runtests.sh
enter ? for help
0:mon>

RHTS Job 11791 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c00000002b1eb950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000000000067664: .copy_process+0x174/0x158c
    sp: c00000002b1ebbd0
   msr: 9000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000000806dce0
  paca    = 0xc000000000474e00
    pid   = 26101, comm = rhts-test-runne
enter ? for help
0:mon>


Comment 3 Jan Kratochvil 2011-03-18 04:48:59 UTC
kernel-2.6.18-238.el5.ppc64
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

After 24h it still has not crashed (on ibm-js22-vios-01-lp3.rhts.eng.bos.redhat.com);
it may have already been fixed.

