Bug 53360

Summary: kernel NULL pointer dereference Oops: 0002 System lockup
Product: [Retired] Red Hat Linux Reporter: William W. Austin <waustin>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE QA Contact: Brock Organ <borgan>
Severity: high Docs Contact:
Priority: medium    
Version: 7.1   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:10 UTC Type: ---

Description William W. Austin 2001-09-07 13:11:04 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.2-2 i686)

Description of problem:
I recently upgraded my last 7.0 machine to 7.1.  About 2 days after the
install I started getting first temporary freezes, then full lockups on
this system.  The lockups seem to come at random, although running a couple
of things (xlock, or a large file ftp -- apparently unrelated, of course)
will reliably lock it up.  (System unresponsive to keyboard, mouse, ping,
telnet, etc.)

I redid the installation and for almost 2 days, no problems -- and suddenly
the problem is back in full.

Here is a pretty typical excerpt from /var/log/messages (the machine name
is "entropy"):

> Sep  5 08:51:51 entropy kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000020
> Sep  5 08:51:51 entropy kernel:  printing eip:
> Sep  5 08:51:51 entropy kernel: c0134b87
> Sep  5 08:51:51 entropy kernel: pgd entry c5a02000: 0000000000000000
> Sep  5 08:51:51 entropy kernel: pmd entry c5a02000: 0000000000000000
> Sep  5 08:51:51 entropy kernel: ... pmd not present!
> Sep  5 08:51:51 entropy kernel: Oops: 0002
> Sep  5 08:51:51 entropy kernel: CPU:    0
> Sep  5 08:51:51 entropy kernel: EIP:    0010:[__remove_from_lru_list+23/112]
> Sep  5 08:51:51 entropy kernel: EIP:    0010:[<c0134b87>]
> Sep  5 08:51:51 entropy kernel: EFLAGS: 00010206
> Sep  5 08:51:51 entropy kernel: eax: 00800000   ebx: c5310b40   ecx: c5310b40   edx: 00000000
> Sep  5 08:51:51 entropy kernel: esi: c5310b40   edi: c5310b40   ebp: 00000000   esp: c17d5f28
> Sep  5 08:51:51 entropy kernel: ds: 0018   es: 0018   ss: 0018
> Sep  5 08:51:51 entropy kernel: Process kswapd (pid: 4, stackpage=c17d5000)
> Sep  5 08:51:51 entropy kernel: Stack: c0134c6d c5310b40 00000000 c0137640 c5310b40 00000003 00000001 caddec20
> Sep  5 08:51:52 entropy kernel:        000001dc c012c88e 00000000 c11608e0 c5310b40 00000000 c012c15e c11608e0
> Sep  5 08:51:52 entropy kernel:        00000000 00000143 00000000 00000004 00000000 00000071 00000000 0000010d
> Sep  5 08:51:52 entropy kernel: Call Trace: [__remove_from_queues+45/48] [try_to_free_buffers+112/384] [free_shortage+30/144] [page_launder+1006/2432] [free_shortage+30/144] [do_try_to_free_pages+53/128] [kswapd+123/288]
> Sep  5 08:51:52 entropy kernel: Call Trace: [<c0134c6d>] [<c0137640>] [<c012c88e>] [<c012c15e>] [<c012c88e>] [<c012ca85>] [<c012cb4b>]
> Sep  5 08:51:52 entropy kernel:        [empty_bad_page+0/4096] [empty_bad_page+0/4096] [kernel_thread+38/48] [kswapd+0/288]
> Sep  5 08:51:52 entropy kernel:        [<c0105000>] [<c0105000>] [<c0107596>] [<c012cad0>]
> Sep  5 08:51:52 entropy kernel:
> Sep  5 08:51:52 entropy kernel: Code: 89 42 20 8b 41 20 8b 51 24 89 50 24 8b 44 24 08 8d 14 85 00
> Sep  5 08:51:53 entropy kernel: kernel BUG at exit.c:465!
> Sep  5 08:51:53 entropy kernel: invalid operand: 0000
> Sep  5 08:51:53 entropy kernel: CPU:    0
> Sep  5 08:51:53 entropy kernel: EIP:    0010:[do_exit+541/560]
> Sep  5 08:51:53 entropy kernel: EIP:    0010:[<c0118a3d>]

The system is a P3/866MHz w/394Mb ram, 2 adaptec scsi cards, 3com nic, agp
video (matrox g450), and isa soundblaster 64, plus one EIDE (udma/66) drive,
'normal' keyboard and ps/2 wheel mouse.  Running everything stock from the
7.1 release with current updates from updates.redhat.com *EXCEPT* I'm still
running the 2.4.2-2 kernel. [Reason: I have downloaded the 2.4.3-12 kernels
and installed them, but they (both i686 and i386) get serious problems when
they try to get to the scsi drives -- I figured it's still the same scsi
problem which plagued 7.1 originally, and yes, I *am* using the aic7xxx_mod
adaptec driver.] USB not used.

Here is the output of lsmod:
> #lsmod
> Module                  Size  Used by
> mga                    95984   2
> agpgart                23392   3
> nls_iso8859-1           2880   3 (autoclean)
> nls_cp437               4384   3 (autoclean)
> vfat                    9392   3 (autoclean)
> fat                    32672   0 (autoclean) [vfat]
> nfs                    79008   7 (autoclean)
> lockd                  52464   1 (autoclean) [nfs]
> sunrpc                 61328   1 (autoclean) [nfs lockd]
> binfmt_misc             6400   1
> vmnet                  18320   3
> vmmon                  18032   0 (unused)
> 3c59x                  25344   1 (autoclean)
> ipchains               38976   0 (unused)
> st                     26016   0 (unused)
> sb                      7856   0
> sb_lib                 36016   0 [sb]
> uart401                 6768   0 [sb_lib]
> sound                  62688   0 [sb_lib uart401]
> soundcore               4464   5 [sb_lib sound]
> aic7xxx_mod           125472  14
> sd_mod                 11680  14
> scsi_mod               95072   3 [st aic7xxx_mod sd_mod]

I am not certain that this is a bug -- it could also be a hardware problem,
but I have tried to find out and cannot tell.  My other machines here
(different configurations, of course) run 7.1 fine -- I have swapped all
boards, drives, etc. from this box to another box and could not reproduce
the problem there, so I am leaning towards a bug as the culprit here instead
of the h/w.  FWIW, I never had this problem under 7.0 and the box (another
boot partition) runs win98 and nt4.0 without a problem (yeah, I know, but
who *wants* to...)

Any help on this one would be greatly appreciated.


Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
ANY of the following by itself has triggered the freeze/lockup:
1. run xlock 
2. rcp  or ftp a large file
3. edit a large file
4. run netscape and try to go to http://www.redat.com
	

Actual Results:  (respectively)
1. System locks up requiring a hard boot.
2. same
3. same
4. same

Expected Results:  (respectively)
1. xlock runs
2. file is copied
3. file is editable
4. http://www.redat.com comes up in browser

Additional info:

(I included it all under the description above -- not sure which goes where
on this one)

This happens about 4-5 times per hour.  It would probably happen more
often, but it takes > 10 minutes to finish fsck'ing > 70 Gb of disks.

Comment 1 Arjan van de Ven 2001-09-07 13:17:48 UTC
First: could you try passing "ide=nodma" on the lilo prompt? Sometimes it seems
using IDE DMA silently corrupts things. Also, could you try "mem=XXM", where XX
is the amount of RAM in the machine (in MB) minus 2? Minus 2 because sometimes
the BIOS lies a bit about the last megabytes of memory.

Also, trying to get the 2.4.3-12 kernel to work would be useful; remember to
re-make the initrd for 2.4.3-12!
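
The arithmetic behind the "mem=" suggestion can be sketched as follows. This is only an illustration: the helper name lilo_boot_args and the "linux" image label are assumptions made for the example, not anything taken from this report or from LILO itself.

```python
def lilo_boot_args(ram_mb, reserve_mb=2, disable_ide_dma=True):
    """Build the extra kernel arguments suggested in the comment above.

    ram_mb is the installed RAM in megabytes; reserve_mb is how much to
    hold back from the top of memory (2 MB in the suggestion).
    This helper is hypothetical and exists only for illustration.
    """
    if ram_mb <= reserve_mb:
        raise ValueError("ram_mb must be larger than reserve_mb")
    args = []
    if disable_ide_dma:
        args.append("ide=nodma")
    # Tell the kernel to use slightly less memory than the BIOS reports.
    args.append(f"mem={ram_mb - reserve_mb}M")
    return " ".join(args)


if __name__ == "__main__":
    # "linux" is an assumed image label; use whatever label lilo.conf defines.
    print("linux " + lilo_boot_args(384))
```

For a machine with 384 MB of RAM this yields the arguments "ide=nodma mem=382M", to be typed after the image label at the boot prompt.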

Comment 2 William W. Austin 2001-09-07 16:15:10 UTC
A) Will try "ide=nodma" on the lilo prompt and will also try mem=382M (real
mem=384)

B) Already did the initrd, of course (have done this before) -- the other
machines now running 7.1 upgraded to 2.4.3-12 OK but the main overall difference
is that they have no scsi altogether, whereas this one is mixed (adaptec 2940
UW, plus 2940 U -- because [expletive deleted] scanner insists that it can't
live on same bus as anything else -- and 45Gb ibm drive).

I should have info later today (not at home at the moment and can't access lilo
prompt from dsl line...)  Thanks


Comment 3 William W. Austin 2001-09-07 19:19:24 UTC
OK, tried both boot options, first separately then together -- same result.

BTW, is there any way to tell from the message whether this could be a H/W
problem rather than a S/W bug?

Comment 4 Arjan van de Ven 2001-09-07 19:23:53 UTC
The message indicates memory corruption; that can be caused either by a kernel
bug or by hardware (although both 2.4.2-2 and 2.4.3-12 aren't "bad" kernels; the
number of bug reports like yours is very, very small, and often it ends up being
hardware).

It's worth checking to see if the CPU fan still turns or if it has a lot of dust
that prevents air-circulation.
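
As an aside for anyone reading the first trace in the description: on i386 the number after "Oops:" is the page-fault error code, printed in hex, and its low three bits say what kind of access faulted. The sketch below decodes those bits; the function names are made up for illustration. It cannot by itself distinguish a hardware fault from a kernel bug.

```python
def parse_oops_code(line):
    """Extract the hex error code from a log line such as '... Oops: 0002'."""
    text = line.split("Oops:", 1)[1].strip()
    return int(text, 16)


def decode_page_fault_code(code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = fault raised in kernel mode, 1 = in user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    line = "Sep  5 08:51:51 entropy kernel: Oops: 0002"
    print(decode_page_fault_code(parse_oops_code(line)))
```

For "Oops: 0002" this gives a kernel-mode write to a page that is not present, which agrees with the rest of that trace: the "pmd not present!" line, the faulting address 00000020, and the first instruction in the Code line (89 42 20, a store of eax to offset 0x20 from edx) executed with edx equal to 00000000.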

Comment 5 William W. Austin 2001-09-07 21:10:57 UTC
FWIW, the reason the 2.4.3-12 won't/wouldn't boot is that it doesn't like the
3rd wide scsi drive on my 2940uw.  I bit the bullet (it's only a 4.3 Gb drive)
and pulled it and can run the 2.4.3-12 kernel that way.  I'm trying to re-create
the problem under 2.4.3-12 at this point (at first with no additional args to
lilo) and will update as it goes.
Thanks for the feedback -- it helps.

Comment 6 William W. Austin 2001-09-11 16:01:57 UTC
One slight change:  examining logs, etc., many of the error messages centered on
the drive which the 2.4.3-12 kernel did not like.  To make a long story short,
after removing that drive from the system, the number of lockups decreased
slightly (subsequent tests: that drive is not dead).  However, I also ended up
having to replace the controller card as well.  The lockups are now far fewer --
I suspect a hardware problem which (a) corrupted memory and (b) killed the
controller card AND the drive.  I am still getting lockups, however, and am
testing with non-absolutely-necessary boards removed from the system.

Here is an excerpt from the log file containing the error message which was the
last thing logged before the system froze:

> Sep 11 04:07:33 entropy kernel: invalid operand: 0000
> Sep 11 04:07:33 entropy kernel: CPU:    0
> Sep 11 04:07:33 entropy kernel: EIP:    0010:[prune_dcache+109/336]
> Sep 11 04:07:33 entropy kernel: EIP:    0010:[<c0143c5d>]
> Sep 11 04:07:33 entropy kernel: EFLAGS: 00010206
> Sep 11 04:07:33 entropy kernel: eax: 00800000   ebx: c5b28380   ecx: d4a63dc0   edx: c5b28500
> Sep 11 04:07:33 entropy kernel: esi: c5b28360   edi: c194fe6c   ebp: 00008e51   esp: c1959f74
> Sep 11 04:07:33 entropy kernel: ds: 0018   es: 0018   ss: 0018
> Sep 11 04:07:33 entropy kernel: Process kswapd (pid: 4, stackpage=c1959000)
> Sep 11 04:07:33 entropy kernel: Stack: c137d200 c012b906 c137d1e4 000009a9 c1958000 00000010 000009a9 c012bb13
> Sep 11 04:07:33 entropy kernel:        00010f00 00000004 00000034 00000004 c0143ff1 0000e952 c012bbae 00000004
> Sep 11 04:07:33 entropy kernel:        00000004 00010f00 ffffffff 00000004 0008e000 c012bc4b 00000004 00000000
> Sep 11 04:07:33 entropy kernel: Call Trace: [refill_inactive_scan+150/256] [refill_inactive+115/176] [shrink_dcache_memory+33/64] [do_try_to_free_pages+94/128] [kswapd+123/288]
> Sep 11 04:07:33 entropy kernel: Call Trace: [<c012b906>] [<c012bb13>] [<c0143ff1>] [<c012bbae>] [<c012bc4b>]
> Sep 11 04:07:33 entropy kernel:    [do_linuxrc+0/224] [do_linuxrc+0/224] [kernel_thread+38/48] [kswapd+0/288]
> Sep 11 04:07:33 entropy kernel:    [<c0105000>] [<c0105000>] [<c0105596>] [<c012bbd0>]
> Sep 11 04:07:33 entropy kernel:
> Sep 11 04:07:33 entropy kernel: Code: 0f 0b 8d 56 18 8b 4a 04 8b 46 18 89 48 04 89 01 89 56 18 89
> Sep 11 04:07:33 entropy kernel:  invalid operand: 0000

To me it is beginning to look like a hardware problem, not a software issue, but
any suggestions concerning tracking it down would be greatly appreciated.


Comment 7 Bugzilla owner 2004-09-30 15:39:10 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/