Bug 58180

Summary:	kernel oops (2): null pointer dereference
Product:	[Retired] Red Hat Linux	Reporter:	Douglas H. Steves <dhs>
Component:	kernel	Assignee:	Pete Zaitcev <zaitcev>
Status:	CLOSED WORKSFORME	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.2
Target Milestone:	---
Target Release:	---
Hardware:	i586
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2002-01-24 04:37:57 UTC	Type:	---

Description Douglas H. Steves 2002-01-10 17:03:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)

Description of problem:
When running a memory intensive job (calculating large (thousands of digits) 
integers), the kernel crashes with this bug. I do a clean boot, run a test case 
and 15-20 minutes later, boing. This case works on Mandrake 8.0 and 8.1 (same 
hardware). I thought it was just crashing with cupsd, but after disabling 
cupsd, it crashed with sendmail.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot system.
2. Log in as non-privileged user.
3. Run test case to calculate large exponential.

Actual Results:  Kernel panic.

Expected Results:  No kernel panic.

Additional info:

If you are really going to use the info, I'll run the case again and scribble 
down all the register data, stack trace, etc. As noted, this happens with 
different programs. But it always happens. And it always worked with Mandrake 
8.0 and 8.1 on the same hardware. I've seen other
bug reports with this error diagnostic, but these appeared to be tied to 
running one program. In this case, the system crashes when I run any really big 
test case, but does so in a different process. The bug appears to be related to 
memory use.

Comment 1 Arjan van de Ven 2002-01-10 17:06:42 UTC

Are you using the latest released kernel (2.4.9-13) ?

Comment 2 Douglas H. Steves 2002-01-11 17:19:53 UTC

Yes, I'm using the latest kernel (I've installed all bug fixes).
This bug also was present in vanilla 7.2.

Comment 3 Arjan van de Ven 2002-01-11 17:21:54 UTC

OK, the most relevant information is the things that look like function names in
the backtrace. The rest (the numbers) is of secondary importance (and if the
names are decent, not even needed)... the names are simpler to write down too.
It would help if you could at least give the first few names.

Comment 4 Pete Zaitcev 2002-01-17 21:40:31 UTC

I would like to see the output of dmesg.
Arjan seems to assume that the system is not responsive
after the oops, but that is not necessarily true.
If the box stays up, running "dmesg >/tmp/xxx"
may yield valuable information.

Comment 5 Douglas H. Steves 2002-01-22 18:59:24 UTC

I think that you misunderstand what's happening in the bug. What happens is that 
when running a memory intensive app (an integer library test case which 
computes a large (thousands of bytes) exponential), the system will become 
unstable due to virtual memory bugs. Apps will fail, and eventually the kernel
will panic, usually in an interrupt handler.
When an app fails, I get the error message I posted - unable to handle kernel 
NULL pointer dereference at vaddr xxxxxxxx (varies), with OOPS = 0002. The call 
stack varies with the app that failed. For sendmail, I got: system_call() -> 
error_code() -> do_page_fault() -> sys_socket_call() -> sys_socket() -> 
sys_connect() -> unix_stream_connect() -> sock_wmalloc() -> alloc_skb() -> 
file_map_nopage(). For diskcheck I got: system_call() -> sys_execve() -> 
getname() -> do_execve() -> search_binary_handler() -> load_elf_binary() -> 
do_generic_file_read() -> update_atime() -> __mark_inode_dirty() -> 
__insmod_ext3_S.text_L43392() -> journal_stop_R6B8E4838() -> 
__insmod_ext3_S.text_L43392() -> handle_mm_fault() -> load_elf_binary(). The 
process that crashes is usually a daemon (cupsd, sendmail, etc) and not one of 
my programs.
After the memory bugs start appearing, dmesg will then crash, which makes it 
impossible to examine the kernel msg buffer. Most other commands (sync) crash 
as well. Eventually, the kernel will panic in an interrupt handler with an 
error message like "kernel panic aiee, killing interrupt handler" (this can't 
be a good thing!)

Comment 6 Pete Zaitcev 2002-01-22 19:20:39 UTC

Only the first oops is important. Please get the dmesg
with "dmesg >/tmp/xxx" immediately after the first oops.
Kill your "memory intensive" programs.
Don't wait until the second oops or oops in the interrupt handler.
Then run ksymoops with "ksymoops </tmp/xxx >/tmp/yyy".
Attach both /tmp/xxx and /tmp/yyy to this bug,
but do not drop them into the comment box!
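
The capture steps above could be automated along these lines. This is only
a sketch: the helper names has_oops and save_first_oops are hypothetical, the
5-second default poll interval is an arbitrary assumption, and it presumes the
box stays responsive long enough for dmesg to run after the first oops.

```shell
#!/bin/sh
# Sketch only: has_oops and save_first_oops are hypothetical helper names.

# Succeeds if the text on stdin contains an i386 oops header line.
has_oops() {
    grep -q 'Unable to handle kernel'
}

# Save the kernel ring buffer to $1 (default /tmp/xxx) every $2 seconds
# (default 5, an assumed value) and stop at the first snapshot that
# contains an oops. Decode it into /tmp/yyy if ksymoops is installed.
save_first_oops() {
    out=${1:-/tmp/xxx}
    interval=${2:-5}
    while :; do
        dmesg > "$out"
        if has_oops < "$out"; then
            break
        fi
        sleep "$interval"
    done
    if command -v ksymoops >/dev/null 2>&1; then
        ksymoops < "$out" > /tmp/yyy
    fi
}
```

Started in the background before the test case (for example with
"save_first_oops /tmp/xxx 5 &"), this leaves the same /tmp/xxx and /tmp/yyy
files that are requested above.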

Other oopses after the first one are useless for analysis;
in fact, you need to prevent them from happening before your
dmesg buffer overflows.

The call traces that you listed in the previous comments
are not entirely useless, but an actual dmesg of the FIRST
oops would be better.
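
If a saved dmesg file already contains several oopses, the first one can be
cut out before it is decoded. A minimal sketch, assuming each oops begins
with a line containing "Unable to handle kernel" (as in the message quoted
in comment 5); first_oops is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: first_oops is a hypothetical helper name.

# Print the lines from the first oops header up to, but not including,
# the second one. Reads the files given as arguments, or stdin if none.
first_oops() {
    awk '/Unable to handle kernel/ { n++ } n == 1' "$@"
}
```

For example, "first_oops /tmp/xxx | ksymoops >/tmp/yyy" would decode only
the first report. Anything logged before the first header is dropped, so
the full /tmp/xxx is still worth attaching.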

Comment 7 Douglas H. Steves 2002-01-24 04:37:52 UTC

You still misunderstand. By the time the first error occurs, the system is so 
unstable that killing the test case via Ctrl-C causes a system crash. I can run 
a background process to check the kernel buffers periodically but that also 
usually causes a system crash. If I try to use X, the system freezes completely 
after 15-20 minutes, and is totally unresponsive to anything but the power 
switch. These problems didn't occur with Mandrake 8.0 or 8.1. (8.1 had other
issues.) And I think that you're still assuming that the bug is synchronous 
with the process which crashes, which obviously isn't the case since different 
processes crash (usually system daemons). The process/system crashes are just 
symptoms - the problem is probably with malloc/kernel heap management.
In any event, I don't have any more time to waste on this.