Description of problem:
At around 6:40 local time, while one of our machines was essentially
idle, the kernel oopsed and the machine went down.
An attempted automatic reboot (nobody was around) ended with
-- System halted
Only later, after the machine had been powered down manually, was it
possible to power it up and restart it.
Here is the decoded oops.
Unable to handle kernel NULL pointer dereference at virtual address 00000005
*pde = 00000000
EIP: 0010:[<c0116a3a>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
eax: 00000001 ebx: 00200292 ecx: 00000002 edx: dd97c03c
esi: cd97c000 edi: cd97c008 ebp: 00000000 esp: d390ff1c
ds: 0018 es: 0018 ss: 0018
Process gnome-smproxy (pid: 23583, stackpage=d390f000)
Stack: dd97c038 c0146a6e 00000000 d6a68340 00000001 c0146e26 d390ff54 d390ff54
00000020 d390e000 7fffffff 00000006 00000000 00000006 00000000 cd97c000
00000001 bffff7f4 deb1dd58 00000006 c01471a9 00000006 d390ff90 d390ff8c
Call Trace: [<c0146a6e>] poll_freewait [kernel] 0x2e (0xd390ff20))
[<c0146e26>] do_select [kernel] 0x226 (0xd390ff30))
[<c01471a9>] sys_select [kernel] 0x339 (0xd390ff6c))
[<c010893b>] system_call [kernel] 0x33 (0xd390ffc0))
Code: 89 48 04 89 01 53 9d 5b c3 8d b6 00 00 00 00 8d bc 27 00 00
>>EIP; c0116a3a <remove_wait_queue+a/20> <=====
Trace; c0146a6e <poll_freewait+2e/50>
Trace; c0146e26 <do_select+226/240>
Trace; c01471a9 <sys_select+339/480>
Trace; c010893b <system_call+33/38>
Code; c0116a3a <remove_wait_queue+a/20>
Code; c0116a3a <remove_wait_queue+a/20> <=====
0: 89 48 04 mov %ecx,0x4(%eax) <=====
Code; c0116a3d <remove_wait_queue+d/20>
3: 89 01 mov %eax,(%ecx)
Code; c0116a3f <remove_wait_queue+f/20>
5: 53 push %ebx
Code; c0116a40 <remove_wait_queue+10/20>
6: 9d popf
Code; c0116a41 <remove_wait_queue+11/20>
7: 5b pop %ebx
Code; c0116a42 <remove_wait_queue+12/20>
8: c3 ret
Code; c0116a43 <remove_wait_queue+13/20>
9: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
Code; c0116a49 <remove_wait_queue+19/20>
f: 8d bc 27 00 00 00 00 lea 0x0(%edi,1),%edi
Version-Release number of selected component (if applicable):
Is the hardware for this machine known good? Does it pass an overnight run of
memtest86? The fact that the reboot failed with an invalid CRC strongly hints at
a hardware fault such as bad memory, or possibly the CPU overheating.
> Is the hardware for this machine known good?
Well, it has been in continuous use for the last two years and this is the
first incident of this sort (some three weeks after 2.4.18-18.7.x
was installed). In other words, so far the hardware has looked good. :-)
It is running for now after a power-down and reboot.
memtest86 has not been run so far, and that is not easy to do, as the
machine is quite far from my desk. :-) It is still an option, but not
one that is easy to arrange.
As of today, it looks like this oops was really caused by a broken CPU fan.
I will monitor the situation further.
It definitely was a broken fan.