From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)

Description of problem:
When running a memory intensive job (calculating large (thousands of digits) integers), the kernel crashes with this bug. I do a clean boot, run a test case and 15-20 minutes later, boing. This case works on Mandrake 8.0 and 8.1 (same hardware). I thought it was just crashing with cupsd, but after disabling cupsd, it crashed with sendmail.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Boot system.
2. Log in as non-privileged user.
3. Run test case to calculate large exponential.

Actual Results: Kernel panic.

Expected Results: No kernel panic.

Additional info:
If you are really going to use the info, I'll run the case again and scribble down all the register data, stack trace, etc. As noted, this happens with different programs. But it always happens. And it always worked with Mandrake 8.0 and 8.1 on the same hardware. I've seen other bug reports with this error diagnostic, but these appeared to be tied to running one program. In this case, the system crashes when I run any really big test case, but does so in a different process. The bug appears to be related to memory use.
Are you using the latest released kernel (2.4.9-13)?
Yes, I'm using the latest kernel (I've installed all bug fixes). This bug also was present in vanilla 7.2.
OK, the most relevant information is the things that look like function names in the backtrace. The rest (the numbers) is of secondary importance (and not even needed if the names are decent). The names are simpler to write down too. It would help if you could at least give the first few names.
I would like to see the output of dmesg. Arjan seems to assume that the system is not responsive after the oops, but that is not necessarily true. If the box stays up, running "dmesg >/tmp/xxx" may yield valuable information.
I think that you misunderstand what's happening in the bug. When I run a memory intensive app (an integer library test case which computes a large (thousands of bytes) exponential), the system becomes unstable due to virtual memory bugs. Apps fail, and eventually the kernel panics, usually in an interrupt handler. When an app fails, I get the error message I posted: unable to handle kernel NULL pointer dereference at vaddr xxxxxxxx (varies), with OOPS = 0002. The call stack varies with the app that failed. For sendmail, I got: system_call() -> error_code() -> do_page_fault() -> sys_socket_call() -> sys_socket() -> sys_connect() -> unix_stream_connect() -> sock_wmalloc() -> alloc_skb() -> file_map_nopage(). For diskcheck, I got: system_call() -> sys_execve() -> getname() -> do_execve() -> search_binary_handler() -> load_elf_binary() -> do_generic_file_read() -> update_atime() -> __mark_inode_dirty() -> __insmod_ext3_S.text_L43392() -> journal_stop_R6B8E4838() -> __insmod_ext3_S.text_L43392() -> handle_mm_fault() -> load_elf_binary(). The process that crashes is usually a daemon (cupsd, sendmail, etc.) and not one of my programs. Once the memory bugs start appearing, dmesg itself crashes, which makes it impossible to examine the kernel message buffer. Most other commands (e.g. sync) crash as well. Eventually, the kernel panics in an interrupt handler with an error message like "kernel panic aiee, killing interrupt handler" (this can't be a good thing!)
Only the first oops is important. Please get the dmesg with "dmesg >/tmp/xxx" immediately after the first oops. Kill your "memory intensive" programs. Don't wait until the second oops or an oops in the interrupt handler. Then run ksymoops with "ksymoops </tmp/xxx >/tmp/yyy". Attach both /tmp/xxx and /tmp/yyy to this bug, but do not drop them into the comment box! Oopses after the first one are useless for analysis; in fact, you need to prevent them from happening before your dmesg buffer overflows. The call traces that you listed in the previous comments are not entirely useless, but an actual dmesg of the FIRST oops would be better.
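The two commands above could be wrapped in a small helper so they can be run in one step right after the first oops. This is only a sketch: the function name capture_first_oops and its optional output-directory argument are made up for illustration, and it assumes dmesg is readable by the current user and that ksymoops, if installed, is on the PATH.

```shell
#!/bin/sh
# Sketch only: save the kernel log after the first oops, then decode it.
# capture_first_oops and its directory argument are illustrative names,
# not part of any existing tool.

capture_first_oops() {
    outdir="${1:-/tmp}"
    raw="$outdir/xxx"      # raw kernel log, as in: dmesg >/tmp/xxx
    decoded="$outdir/yyy"  # decoded oops, as in: ksymoops </tmp/xxx >/tmp/yyy

    # Save the kernel message buffer first, before later oopses overwrite it.
    dmesg >"$raw" || return 1

    # Decode the saved log only if ksymoops is available.
    if command -v ksymoops >/dev/null 2>&1; then
        ksymoops <"$raw" >"$decoded"
    else
        echo "ksymoops not found; attach $raw on its own" >&2
    fi
}
```

After sourcing the file, calling capture_first_oops with no argument writes /tmp/xxx and /tmp/yyy, the two files to attach. Saving the raw log before decoding means there is still something to attach if the decoding step fails.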
You still misunderstand. By the time the first error occurs, the system is so unstable that killing the test case via Ctrl-C causes a system crash. I can run a background process to check the kernel buffers periodically but that also usually causes a system crash. If I try to use X, the system freezes completely after 15-20 minutes, and is totally unresponsive to anything but the power switch. These problems didn't occur with Mandrake 8.0 or 8.1. (8.1 had other issues.) And I think that you're still assuming that the bug is synchronous with the process which crashes, which obviously isn't the case since different processes crash (usually system daemons). The process/system crashes are just symptoms - the problem is probably with malloc/kernel heap management. In any event, I don't have any more time to waste on this.