Description of problem:
Got a crash today:

do_IRQ: stack overflow:360
call trace:
[<02107aa0>] do_IRQ+0x46/0x225
[<022111b7>]

Nothing else.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.5-1.327 on a UP machine running with nmi_watchdog=1, with lvm2, xfs and NFSv4.
Created attachment 99855
Stack overflow call trace, showing fault in NFS->IP path

Call trace/oops for the case where the machine did not repeatedly fault. The NFS mount is TCP, but the problem occurs with a UDP mount too.
Created attachment 99856
Tail of output of continuously repeating oops

This is the tail of what was in my scroll buffer for the case where the machine faulted (I think similarly to the other trace, as I saw ip_finish and sock_sendmsg scroll by) and then repeatedly and continuously faulted trying to print the oops, until I power cycled.
I have a similar problem with kernel-2.6.5-1.322 and kernel-2.6.5-1.343. I can reliably trigger a stack overflow by running apt-get dist-upgrade: as apt reads in its package lists, which are located on an NFS mount, the kernel blows up with a stack overflow.

AFAICT it is somewhere down the nfs_writepage -> ... -> inet_sendmsg -> ... -> ip_finish_output path each time. However, I can't say for 100% sure, as I have only managed to capture one oops from the beginning (see below). 9 times out of 10 the kernel ends up oopsing over and over again, with show_registers ... do_fault ... show_registers (etc.) in the stack traces, and it scrolls very quickly. The trace I have where it did not repeatedly oops appears to be similar to what I see as the first oops on screen for these repeating cases, but I can't say for sure because it scrolls too fast.

See the attachments: one is what was left in my scroll buffer for the case of a repeating fault, where I did not manage to catch the beginning of the series of faults. The other is the case where it did not repeatedly fault, showing what is probably similar to the initial fault for the repeating cases.

Initially I had the /var/cache/apt NFS mount mounted over TCP, and the trace is with a TCP mount, but the problem occurs with UDP mounts too. I have also tried booting with acpi=off and vdso=0, just to rule those out, and the problem still occurs. The problem does not occur with older kernel versions, e.g. kernel-2.6.3-1.96.

The machine is a Compaq Deskpro 4000, Pentium 233MMX, 192MB RAM, 200M swap on local disk, ThunderLAN onboard NIC, with the last available ROMPaq (BIOS) from Compaq for this series of Deskpro installed.
Is there any chance an 8kB stack version of the kernel could be made available? Or at least that the config option for 8kB stack could be reinstated so that I could build such a kernel from the src rpm?
NFSv4 has a known big stack abuser; Dave Jones has a fix pending for that.
The newer 2.6.5-1.358 kernel appears to have resolved this issue. Thanks!