Red Hat Bugzilla – Bug 58180
kernel oops (2): null pointer dereference
Last modified: 2007-04-18 12:38:57 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)
Description of problem:
When running a memory intensive job (calculating large (thousands of digits)
integers), the kernel crashes with this bug. I do a clean boot, run a test case
and 15-20 minutes later, boing. This case works on Mandrake 8.0 and 8.1 (same
hardware). I thought it was just crashing with cupsd, but after disabling
cupsd, it crashed with sendmail.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
2.Log in as non-privileged user.
3.Run test case to calculate large exponential.
Actual Results: Kernel panic.
Expected Results: No kernel panic.
If you are really going to use the info, I'll run the case again and scribble
down all the register data, stack trace, etc. As noted, this happens with
different programs. But it always happens. And it always worked with Mandrake
8.0 and 8.1 on the same hardware. I've seen other
bug reports with this error diagnostic, but these appeared to be tied to
running one program. In this case, the system crashes when I run any really big
test case, but does so in a different process. The bug appears to be related to
Are you using the latest released kernel (2.4.9-13) ?
Yes, I'm using the latest kernel (I've installed all bug fixes).
This bug also was present in vanilla 7.2.
OK, the most relevant information is the things that look like function names in
the backtrace. The rest (the numbers) is of secondary importance (and if the
names are decent, not even needed)... the names are simpler to write down too.
It would help if you could at least give the first few names.
I would like to see the output of the dmesg.
Arjan seems to assume that the system is not responsive
after the oops, but that is not necessarily true.
If the box stays up, running "dmesg >/tmp/xxx"
may yield valuable information.
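A minimal sketch of how that capture could be automated, in case nobody is at the keyboard when the oops happens: poll the kernel ring buffer and save it the moment an oops message appears. The helper name `save_dmesg_on_oops`, the `DMESG_CMD` override, and the polling parameters are all hypothetical; the marker string is taken from the "unable to handle kernel NULL pointer dereference" message reported in this bug.

```shell
#!/bin/sh
# Hypothetical helper, not part of any Red Hat tool.
# DMESG_CMD can be overridden (for example to test against a saved file).
DMESG_CMD=${DMESG_CMD:-dmesg}
# Assumption: the oops message in the ring buffer contains this text.
OOPS_MARKER='Unable to handle kernel'

# save_dmesg_on_oops OUTFILE [INTERVAL_SECONDS] [MAX_POLLS]
# Returns 0 once a dump containing the marker has been written to OUTFILE,
# 1 if MAX_POLLS polls pass without seeing the marker (0 = poll forever).
save_dmesg_on_oops() {
    out=$1
    interval=${2:-5}
    max_polls=${3:-0}
    polls=0
    while :; do
        # Dump the ring buffer to a scratch file first.
        $DMESG_CMD > "$out.tmp" 2>/dev/null
        if grep -q "$OOPS_MARKER" "$out.tmp"; then
            # Keep the dump and flush it to disk before anything else fails.
            mv "$out.tmp" "$out"
            sync
            return 0
        fi
        polls=$((polls + 1))
        if [ "$max_polls" -gt 0 ] && [ "$polls" -ge "$max_polls" ]; then
            rm -f "$out.tmp"
            return 1
        fi
        sleep "$interval"
    done
}

# Example (commented out so the file can be sourced safely):
# save_dmesg_on_oops /tmp/xxx 5
```

Whether such a loop survives long enough to write the file depends on how unstable the box already is, so treat it as something to try rather than a guaranteed capture.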
I think that you misunderstand what's happening in this bug. What happens is that
when running a memory intensive app (an integer library test case which
computes a large (thousands of bytes) exponential), the system will become
unstable due to virtual memory bugs. Apps will fail, and eventually the kernel
will panic, usually in an interrupt handler.
When an app fails, I get the error message I posted - unable to handle kernel
NULL pointer dereference at vaddr xxxxxxxx (varies), with OOPS = 0002. The call
stack varies with the app that failed. For sendmail, I got: system_call() ->
error_code() -> do_page_fault() -> sys_socket_call() -> sys_socket() ->
sys_connect() -> unix_stream_connect() -> sock_wmalloc() -> alloc_skb() ->
file_map_nopage(). For diskcheck I got: system_call() -> sys_execve() ->
getname() -> do_execve() -> search_binary_handler() -> load_elf_binary() ->
do_generic_file_read() -> update_atime() -> __mark_inode_dirty() ->
__insmod_ext3_S.text_L43392() -> journal_stop_R6B8E4838() ->
__insmod_ext3_S.text_L43392() -> handle_mm_fault() -> load_elf_binary(). The
process that crashes is usually a daemon (cupsd, sendmail, etc) and not one of
After the memory bugs start appearing, dmesg will then crash, which makes it
impossible to examine the kernel msg buffer. Most other commands (sync) crash
as well. Eventually, the kernel will panic in an interrupt handler with an
error message like "kernel panic aiee, killing interrupt handler" (this can't
be a good thing!)
Only the first oops is important. Please get the dmesg
with "dmesg >/tmp/xxx" immediately after the first oops.
Kill your "memory intensive" programs.
Don't wait until the second oops or oops in the interrupt handler.
Then run ksymoops with "ksymoops </tmp/xxx >/tmp/yyy".
Attach both /tmp/xxx and /tmp/yyy to this bug,
but do not drop them into the comment box!
Oopses after the first one are useless for analysis;
in fact, you need to prevent them from happening before your
dmesg buffer overflows.
The call traces that you listed in the previous comments
are not entirely useless, but an actual dmesg of the FIRST
oops would be better.
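If the saved dmesg already contains more than one oops, the first one can be isolated before it is handed to ksymoops. The following is only a sketch: `first_oops` is a hypothetical helper, and it assumes that every oops in the log starts with a line containing "Unable to handle kernel", as in the message quoted earlier in this bug.

```shell
#!/bin/sh
# first_oops [FILE]
# Print the first oops from a saved dmesg dump (or from stdin): everything
# from the first marker line up to, but not including, the next marker line.
# Assumption: each oops begins with a line matching the marker below.
first_oops() {
    awk '
        /Unable to handle kernel/ {
            seen++
            if (seen > 1) exit    # second oops reached: stop here
        }
        seen == 1 { print }       # only print while inside the first oops
    ' "$@"
}

# Example (commented out so the file can be sourced safely):
# first_oops /tmp/xxx > /tmp/first
# ksymoops < /tmp/first > /tmp/yyy
```

Lines logged before the first oops are dropped, so the unfiltered /tmp/xxx is still worth attaching alongside the filtered output.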
You still misunderstand. By the time the first error occurs, the system is so
unstable that killing the test case via Ctrl-C causes a system crash. I can run
a background process to check the kernel buffers periodically but that also
usually causes a system crash. If I try to use X, the system freezes completely
after 15-20 minutes, and is totally unresponsive to anything but the power
switch. These problems didn't occur with Mandrake 8.0 or 8.1. (8.1 had other
issues.) And I think that you're still assuming that the bug is synchronous
with the process which crashes, which obviously isn't the case since different
processes crash (usually system daemons). The process/system crashes are just
symptoms - the problem is probably with malloc/kernel heap management.
In any event, I don't have any more time to waste on this.