Description of problem:
During unprivileged userland ptrace tests one can crash the ppc64 kernel. The kernel panic dumps point to stack corruption, and a larger stack works around the problem. Either the default 16KB stack on ppc64 is too small or utrace is just too stack-hungry.

Version-Release number of selected component (if applicable):
kernel-2.6.18-58.el5.utrace1.ppc64
(kernel-2.6.18-58.el5.utrace2.ppc64 crashes the same way, but the dumps included here match the utrace1 build)
(kernel-2.6.18-58.el5.ppc64 is not well testable as userland locks up too early)

How reproducible:
Usually during the first `make check'.

Steps to Reproduce:
1. Download http://sourceware.org/systemtap/wiki/utrace/tests .
2. make check, specifically using:
   i=0;while :;do date --iso=seconds;TESTTIME=$[10 * 60] make check;i=$[$i+1];echo $i;done

Actual results:
kernel-2.6.18-58.el5.utrace1.ppc64
http://porkchop.devel.redhat.com/brewroot/scratch/roland/task_1067117/

Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc00000000006e6d8
cpu 0x1: Vector: 300 (Data Access) at [c0000000593bba70]
    pc: c00000000006e6d8: .do_exit+0x4cc/0xa14
    lr: c00000000006e6a8: .do_exit+0x49c/0xa14
    sp: c0000000593bbcf0
   msr: 8000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000005ac57310
  paca    = 0xc000000000475000
    pid   = 2216, comm = tee
enter ? for help
1:mon> _

Expected results:
No crash.

Additional info:
This crash in do_exit() indicates `struct thread_info' corruption, which in turn indicates a stack overflow. Other crashes usually indicated just general memory corruption. The -debug kernel also does not print anything useful.

I wrote a proper stack checker (an unmapped page below the stack, set up by vmap()) a long time ago, but only for x86. I also have a simplified version of it for x86_64, but nothing for ppc/ppc64. The x86_64 patch would be easily portable, though.
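The stack pointer in the do_exit() dump above can be checked against the 16KB stack layout. The following is only a rough sketch: it assumes that this kernel keeps `struct thread_info' at the lowest address of each THREAD_SIZE-aligned kernel stack and that the stack grows downward; the helper name stack_usage is made up for illustration.

```python
# Assumption: struct thread_info sits at the lowest address of each
# THREAD_SIZE-aligned kernel stack, and the stack grows downward toward it.
THREAD_SIZE = 16 * 1024  # default ppc64 kernel stack size named in this report


def stack_usage(sp, thread_size=THREAD_SIZE):
    """Return (stack base, bytes in use, bytes between base and sp).

    The last value still includes the thread_info structure itself,
    whose size is not given in this report.
    """
    base = sp & ~(thread_size - 1)   # round sp down to the stack alignment
    free = sp - base                 # room left before reaching the base
    used = thread_size - free        # depth of the stack at this moment
    return base, used, free


# "sp:" value taken from the do_exit() dump
base, used, free = stack_usage(0xc0000000593bbcf0)
print(f"base=0x{base:x} used={used} free={free}")
```

Under these assumptions the stack is only a few hundred bytes deep at the moment of the fault. That would be consistent with an overflow that happened earlier, in a deeper call chain, and left thread_info damaged for do_exit() to trip over later, but it does not prove it.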
Increasing the stack size 2x (to 32KB) still crashed, although only during `make check' run #9 (with the 16KB stack it crashes during run #1) and at a different place:

2.6.18-58.el5.utrace2ppcstack2x:
Unable to handle kernel paging request for data at address 0x65b21ca8
Faulting instruction address: 0xc0000000000891d8
cpu 0x0: Vector: 300 (Data Access) at [c000000065bef8d0]
    pc: c0000000000891d8: .debug_mutex_add_waiter+0x4c/0x6c
    lr: c00000000034fb0c: .__mutex_lock_interruptible_slowpath+0x108/0x33c
    sp: c000000065befb50
   msr: 8000000000001032
   dar: 65b21ca8
 dsisr: 42000000
  current = 0xc000000065b21550
  paca    = 0xc000000000465000
    pid   = 4597, comm = make
enter ? for help
0:mon> ?

With the attached patch increasing the stack size 4x (to 64KB), it has passed 15 `make check' runs so far, but that may also be a false positive. Still, the testing as a whole indicates the ppc problem is related to a stack overflow.

Kernel build 2.6.18-58.el5.utrace2ppcstack4x.ppc64 with the attached patch is at:
http://porkchop.devel.redhat.com/brewroot/scratch/jkratoch/task_1072746/
Created attachment 282881
ppc64 stack increase 4x (from 16KB to 64KB)
Since 2.6.18-58.el5.utrace2ppcstack4x.ppc64 crashed in RHTS Job 11941, increasing the stack size is probably not a solution, although it delays the crash considerably.

x86 stack overflow patch:
http://people.redhat.com/jkratoch/kernel-stackoverflow-x86-2005.patch
x86_64 stack overflow patch (simple):
http://people.redhat.com/jkratoch/kernel-stackoverflow-x86_64.patch

This ppc64 crash is AFAIK not tracked anywhere else for utrace so far, so I am keeping this Bug open.

RHTS Job 11941 - 2.6.18-58.el5.utrace2ppcstack4x:
list_del corruptio

RHTS Job 11921 - 2.6.18-58.el5.utrace2ppcstack2x:
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c0000000780df950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000

RHTS Job 11871 - 2.6.18-58.el5.utrace1
kernel BUG in check_dead_utrace at kernel/utrace.c:328!
cpu 0x0: Vector: 700 (Program Check) at [c00000000269f7e0]
    pc: c0000000000ae0c4: .check_dead_utrace+0x178/0x22c
    lr: c0000000000aec44: .wake_quiescent+0x94/0x1dc
    sp: c00000000269fa60
   msr: 8000000000029032
  current = 0xc0000000764a5b60
  paca    = 0xc000000000474e00
    pid   = 18121, comm = late-ptrace-may
kernel BUG in check_dead_utrace at kernel/utrace.c:328!
enter ? for help
0:mon>

RHTS Job 11868 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc00000000006e6d8
cpu 0x1: Vector: 300 (Data Access) at [c0000000593bba70]
    pc: c00000000006e6d8: .do_exit+0x4cc/0xa14
    lr: c00000000006e6a8: .do_exit+0x49c/0xa14
    sp: c0000000593bbcf0
   msr: 8000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000005ac57310
  paca    = 0xc000000000475000
    pid   = 2216, comm = tee
enter ? for help
1:mon>

RHTS Job 11852 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c00000004ca1b950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000000000067664: .copy_process+0x174/0x158c
    sp: c00000004ca1bbd0
   msr: 800000000000b032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc000000050870b40
  paca    = 0xc000000000474e00
    pid   = 2126, comm = runtests.sh
enter ? for help
0:mon>

RHTS Job 11791 - 2.6.18-58.el5.utrace1
Unable to handle kernel paging request for data at address 0x004a8850
Faulting instruction address: 0xc000000000067784
cpu 0x0: Vector: 300 (Data Access) at [c00000002b1eb950]
    pc: c000000000067784: .copy_process+0x294/0x158c
    lr: c000000000067664: .copy_process+0x174/0x158c
    sp: c00000002b1ebbd0
   msr: 9000000000009032
   dar: 4a8850
 dsisr: 40000000
  current = 0xc00000000806dce0
  paca    = 0xc000000000474e00
    pid   = 26101, comm = rhts-test-runne
enter ? for help
0:mon>
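To make the pattern across the RHTS jobs above easier to see, here is a small sketch that tallies the crash sites and faulting data addresses. The table is transcribed by hand from the dumps in this comment (truncated entries are left truncated); the names CRASHES and tally are made up for illustration.

```python
from collections import Counter

# (RHTS job, kernel build suffix, crash site or message, faulting data address)
# Transcribed from the dumps in this comment; None where no address was printed.
CRASHES = [
    ("11941", "utrace2ppcstack4x", "list_del corruptio", None),
    ("11921", "utrace2ppcstack2x", ".copy_process+0x294", 0x004a8850),
    ("11871", "utrace1", ".check_dead_utrace+0x178", None),
    ("11868", "utrace1", ".do_exit+0x4cc", 0x004a8850),
    ("11852", "utrace1", ".copy_process+0x294", 0x004a8850),
    ("11791", "utrace1", ".copy_process+0x294", 0x004a8850),
]


def tally(crashes):
    """Count how often each crash site and each data address occurs."""
    by_site = Counter(site for _, _, site, _ in crashes)
    by_addr = Counter(addr for _, _, _, addr in crashes if addr is not None)
    return by_site, by_addr


by_site, by_addr = tally(CRASHES)
for site, count in by_site.most_common():
    print(f"{count}x {site}")
for addr, count in by_addr.most_common():
    print(f"{count}x data address 0x{addr:08x}")
```

Every data access fault in this list reports the same address, 0x004a8850, at two different code sites and on both the 16KB and 32KB stack builds. That may hint at one recurring bad pointer value rather than random corruption, but the dumps alone do not show where the value comes from.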
kernel-2.6.18-238.el5.ppc64
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

After 24h it still has not crashed (ibm-js22-vios-01-lp3.rhts.eng.bos.redhat.com); it may have already been fixed.