Bug 53360
Summary: | kernel NULL pointer dereference Oops: 0002 / System lockup | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | William W. Austin <waustin> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brock Organ <borgan> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.1 | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-09-30 15:39:10 UTC | Type: | --- |
Description
William W. Austin
2001-09-07 13:11:04 UTC
First: could you try passing "ide=nodma" on the lilo prompt? Sometimes it seems using IDE DMA silently corrupts things. Also, could you try "mem=xxM", where xx is the amount of RAM in the machine (in MB) minus 2? Minus 2 because sometimes the BIOS lies a bit about the last megabytes of memory. Also, trying to get the 2.4.3-12 kernel to work would be useful; remember to re-make the initrd for 2.4.3-12!

A) Will try "ide=nodma" on the lilo prompt and will also try mem=382M (real mem=384).
B) Already did the initrd, of course (have done this before) -- the other machines now running 7.1 upgraded to 2.4.3-12 OK, but the main overall difference is that they have no SCSI at all, whereas this one is mixed (Adaptec 2940 UW, plus 2940 U -- because [expletive deleted] scanner insists that it can't live on the same bus as anything else -- and a 45 GB IBM drive). I should have info later today (not at home at the moment and can't access the lilo prompt from a DSL line...) Thanks

OK, tried both boot options, first separately then together -- same result. BTW, is there any way to tell from the message whether this could be a H/W problem rather than a S/W bug?

The message indicates memory corruption; that can be caused either by a kernel bug or by hardware (although both 2.4.2-2 and 2.4.3-12 aren't "bad" kernels; the number of bug reports like yours is very, very small, and often it ends up being hardware). It's worth checking whether the CPU fan still turns or whether it has a lot of dust that prevents air circulation.

FWIW, the reason 2.4.3-12 wouldn't boot is that it doesn't like the 3rd wide SCSI drive on my 2940UW. I bit the bullet (it's only a 4.3 GB drive) and pulled it, and can run the 2.4.3-12 kernel that way. I'm trying to re-create the problem under 2.4.3-12 at this point (at first with no additional args to lilo) and will update as it goes. Thanks for the feedback -- it helps.
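For reference, the two boot options discussed above can be worked out as follows. This is only a minimal sketch of the arithmetic: the helper name `lilo_boot_args` and its parameters are hypothetical, and the 2 MB reserve is simply the rule of thumb given in this thread.

```python
def lilo_boot_args(ram_mb, reserve_mb=2, disable_ide_dma=True):
    """Build the kernel arguments suggested in this report.

    ram_mb          -- installed RAM in megabytes (384 on the reporter's machine)
    reserve_mb      -- megabytes held back in case the BIOS misreports the
                       top of memory (2, per the suggestion in this thread)
    disable_ide_dma -- whether to include "ide=nodma"
    """
    if ram_mb <= reserve_mb:
        raise ValueError("ram_mb must be larger than reserve_mb")
    args = []
    if disable_ide_dma:
        args.append("ide=nodma")
    args.append(f"mem={ram_mb - reserve_mb}M")
    return " ".join(args)


print(lilo_boot_args(384))  # ide=nodma mem=382M
```

At the lilo prompt these arguments follow the image label, for example `linux ide=nodma mem=382M`, where the label `linux` is an assumption about the local lilo.conf.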
One slight change: examining logs, etc., many of the error messages centered on the drive which the 2.4.3-12 kernel did not like. To make a long story short, after removing that drive from the system, the number of lockups decreased slightly (subsequent tests: that drive is not dead). However, I also ended up having to replace the controller card as well. The lockups are now far fewer -- I suspect a hardware problem which (a) corrupted memory and (b) killed the controller card AND the drive. I am still getting lockups, however, and am testing with all non-essential boards removed from the system.

Here is an excerpt from the log file containing the error message which was the last thing logged before the system froze:

> Sep 11 04:07:33 entropy kernel: invalid operand: 0000
> Sep 11 04:07:33 entropy kernel: CPU: 0
> Sep 11 04:07:33 entropy kernel: EIP: 0010:[prune_dcache+109/336]
> Sep 11 04:07:33 entropy kernel: EIP: 0010:[<c0143c5d>]
> Sep 11 04:07:33 entropy kernel: EFLAGS: 00010206
> Sep 11 04:07:33 entropy kernel: eax: 00800000 ebx: c5b28380 ecx: d4a63dc0 edx: c5b28500
> Sep 11 04:07:33 entropy kernel: esi: c5b28360 edi: c194fe6c ebp: 00008e51 esp: c1959f74
> Sep 11 04:07:33 entropy kernel: ds: 0018 es: 0018 ss: 0018
> Sep 11 04:07:33 entropy kernel: Process kswapd (pid: 4, stackpage=c1959000)
> Sep 11 04:07:33 entropy kernel: Stack: c137d200 c012b906 c137d1e4 000009a9 c1958000 00000010 000009a9 c012bb13
> Sep 11 04:07:33 entropy kernel: 00010f00 00000004 00000034 00000004 c0143ff1 0000e952 c012bbae 00000004
> Sep 11 04:07:33 entropy kernel: 00000004 00010f00 ffffffff 00000004 0008e000 c012bc4b 00000004 00000000
> Sep 11 04:07:33 entropy kernel: Call Trace: [refill_inactive_scan+150/256] [refill_inactive+115/176] [shrink_dcache_memory+33/64] [do_try_to_free_pages+94/128] [kswapd+123/288]
> Sep 11 04:07:33 entropy kernel: Call Trace: [<c012b906>] [<c012bb13>] [<c0143ff1>] [<c012bbae>] [<c012bc4b>]
> Sep 11 04:07:33 entropy kernel: [do_linuxrc+0/224] [do_linuxrc+0/224] [kernel_thread+38/48] [kswapd+0/288]
> Sep 11 04:07:33 entropy kernel: [<c0105000>] [<c0105000>] [<c0105596>] [<c012bbd0>]
> Sep 11 04:07:33 entropy kernel:
> Sep 11 04:07:33 entropy kernel: Code: 0f 0b 8d 56 18 8b 4a 04 8b 46 18 89 48 04 89 01 89 56 18 89
> Sep 11 04:07:33 entropy kernel: invalid operand: 0000

To me it is beginning to look like a hardware problem, not a software issue, but any suggestions concerning tracking it down would be greatly appreciated.

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/
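A note on reading the kswapd log excerpt quoted in this report: the log gives each frame twice, once symbolically as `[name+offset/size]` and once as a raw address `[<hex>]`. The sketch below pairs the two in order of appearance and derives each function's start address, which is a quick way to check that the symbolic and raw traces agree. It is a hypothetical helper, not part of any kernel tooling; it assumes the offsets and sizes are decimal, as they appear here, and `SAMPLE` is abbreviated from the excerpt. It does not by itself distinguish a hardware fault from a software bug.

```python
import re

# Symbolic frame: [name+offset/size], with decimal offset and size.
SYMBOL_RE = re.compile(r"\[([A-Za-z_]\w*)\+(\d+)/(\d+)\]")
# Raw frame: [<hex address>].
ADDRESS_RE = re.compile(r"\[<([0-9a-fA-F]+)>\]")


def pair_frames(log_text):
    """Pair symbolic frames with raw addresses in order of appearance."""
    symbols = SYMBOL_RE.findall(log_text)
    addresses = ADDRESS_RE.findall(log_text)
    if len(symbols) != len(addresses):
        raise ValueError("symbolic and raw frame counts differ")
    frames = []
    for (name, offset, size), addr in zip(symbols, addresses):
        address = int(addr, 16)
        frames.append({
            "name": name,
            "offset": int(offset),
            "size": int(size),
            "address": address,
            # Start of the function = frame address minus offset into it.
            "start": address - int(offset),
        })
    return frames


# Abbreviated from the excerpt in this report (timestamps and host removed).
SAMPLE = """\
kernel: EIP: 0010:[prune_dcache+109/336]
kernel: EIP: 0010:[<c0143c5d>]
kernel: Call Trace: [shrink_dcache_memory+33/64] [kswapd+123/288]
kernel: Call Trace: [<c0143ff1>] [<c012bc4b>]
kernel: [kswapd+0/288]
kernel: [<c012bbd0>]
"""

if __name__ == "__main__":
    for frame in pair_frames(SAMPLE):
        print(f"{frame['name']:<22} +{frame['offset']:<4} "
              f"start=0x{frame['start']:08x}")
```

In this excerpt the two kswapd frames (`kswapd+123` at `c012bc4b` and `kswapd+0` at `c012bbd0`) resolve to the same start address, so the symbolic and raw traces are at least mutually consistent.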