Bug 58180
Summary: | kernel oops (2): null pointer dereference | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Douglas H. Steves <dhs> |
Component: | kernel | Assignee: | Pete Zaitcev <zaitcev> |
Status: | CLOSED WORKSFORME | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.2 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i586 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2002-01-24 04:37:57 UTC | Type: | --- |
Description
Douglas H. Steves
2002-01-10 17:03:20 UTC
Are you using the latest released kernel (2.4.9-13)?

Yes, I'm using the latest kernel (I've installed all bug fixes). This bug was also present in vanilla 7.2.

OK, the most relevant information is the things that look like function names in the backtrace. The rest (the numbers) is of secondary importance (and not even needed if the names are decent), and the names are simpler to write down too. It would help if you could give at least the first few names.

I would like to see the output of dmesg. Arjan seems to assume that the system is not responsive after the oops, but that is not necessarily true. If the box stays up, running "dmesg >/tmp/xxx" may yield valuable information.

I think that you misunderstand what's happening in the bug. When running a memory-intensive app (an integer library test case which computes a large (thousands of bytes) exponential), the system becomes unstable due to virtual memory bugs. Apps fail, and eventually the kernel panics, usually in an interrupt handler. When an app fails, I get the error message I posted: unable to handle kernel NULL pointer dereference at vaddr xxxxxxxx (varies), with OOPS = 0002. The call stack varies with the app that failed.

For sendmail, I got: system_call() -> error_code() -> do_page_fault() -> sys_socket_call() -> sys_socket() -> sys_connect() -> unix_stream_connect() -> sock_wmalloc() -> alloc_skb() -> file_map_nopage().

For diskcheck, I got: system_call() -> sys_execve() -> getname() -> do_execve() -> search_binary_handler() -> load_elf_binary() -> do_generic_file_read() -> update_atime() -> __mark_inode_dirty() -> __insmod_ext3_S.text_L43392() -> journal_stop_R6B8E4838() -> __insmod_ext3_S.text_L43392() -> handle_mm_fault() -> load_elf_binary().

The process that crashes is usually a daemon (cupsd, sendmail, etc.) and not one of my programs. Once the memory bugs start appearing, dmesg itself crashes, which makes it impossible to examine the kernel message buffer. Most other commands (sync, for example) crash as well. Eventually, the kernel panics in an interrupt handler with an error message like "kernel panic aiee, killing interrupt handler" (this can't be a good thing!).

Only the first oops is important. Please get the dmesg with "dmesg >/tmp/xxx" immediately after the first oops. Kill your "memory intensive" programs. Don't wait until the second oops or an oops in the interrupt handler. Then run ksymoops with "ksymoops </tmp/xxx >/tmp/yyy". Attach both /tmp/xxx and /tmp/yyy to this bug, but do not drop them into the comment box! Oopses after the first one are useless for analysis; in fact, you need to prevent them from happening before your dmesg buffer overflows. The call traces that you listed in the previous comments are not entirely useless, but an actual dmesg of the FIRST oops would be better.

You still misunderstand. By the time the first error occurs, the system is so unstable that killing the test case via Ctrl-C causes a system crash. I can run a background process to check the kernel buffers periodically, but that also usually causes a system crash. If I try to use X, the system freezes completely after 15-20 minutes and is totally unresponsive to anything but the power switch. These problems didn't occur with Mandrake 8.0 or 8.1. (8.1 had other issues.) And I think that you're still assuming that the bug is synchronous with the process which crashes, which obviously isn't the case, since different processes crash (usually system daemons). The process/system crashes are just symptoms; the problem is probably with malloc/kernel heap management. In any event, I don't have any more time to waste on this.
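For anyone retracing the capture steps requested above, here is a minimal sketch of a background poller that saves the kernel buffer as soon as the first oops shows up and then decodes it. The file names /tmp/xxx and /tmp/yyy are the ones used in the thread; the function names, the 5-second polling interval and the grep pattern are assumptions made for illustration, and the reporter notes that a background checker of this kind usually crashed the affected machine as well.

```shell
#!/bin/sh
# Sketch only: function names, polling interval and grep pattern are
# assumptions, not something given in this bug report.

# Succeed if the log file "$1" contains what looks like an oops report.
has_oops() {
    grep -q -i -e 'oops' -e 'unable to handle kernel' "$1"
}

# Poll the kernel ring buffer; on the first oops, keep the raw dump,
# decode it with ksymoops if that tool is installed, flush to disk, stop.
capture_first_oops() {
    raw=${1:-/tmp/xxx}
    decoded=${2:-/tmp/yyy}
    while :; do
        dmesg > "$raw"
        if has_oops "$raw"; then
            if command -v ksymoops >/dev/null 2>&1; then
                ksymoops < "$raw" > "$decoded"
            fi
            sync
            return 0
        fi
        sleep 5   # assumed interval; shorten it if the box dies quickly
    done
}
```

One way to use it would be to source the file and start `capture_first_oops &` before launching the memory-intensive test, so that /tmp/xxx already holds the first oops if later commands stop working.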