Description of problem:

On one machine, around 6:40 local time, when the computer in question was not really doing anything, the kernel oopsed and the machine went down. An attempted automatic reboot (nobody was around) ended with

Uncompressing Linux....
crc error
 -- System halted

Only later, after the machine was powered down manually, was it possible to power it up and restart it.

Here is the decoded oops:

Unable to handle kernel NULL pointer dereference at virtual address 00000005
 printing eip:
c0116a3a
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c0116a3a>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210092
eax: 00000001   ebx: 00200292   ecx: 00000002   edx: dd97c03c
esi: cd97c000   edi: cd97c008   ebp: 00000000   esp: d390ff1c
ds: 0018   es: 0018   ss: 0018
Process gnome-smproxy (pid: 23583, stackpage=d390f000)
Stack: dd97c038 c0146a6e 00000000 d6a68340 00000001 c0146e26 d390ff54 d390ff54
       00000020 d390e000 7fffffff 00000006 00000000 00000006 00000000 cd97c000
       00000001 bffff7f4 deb1dd58 00000006 c01471a9 00000006 d390ff90 d390ff8c
Call Trace: [<c0146a6e>] poll_freewait [kernel] 0x2e (0xd390ff20))
[<c0146e26>] do_select [kernel] 0x226 (0xd390ff30))
[<c01471a9>] sys_select [kernel] 0x339 (0xd390ff6c))
[<c010893b>] system_call [kernel] 0x33 (0xd390ffc0))
Code: 89 48 04 89 01 53 9d 5b c3 8d b6 00 00 00 00 8d bc 27 00 00

>>EIP; c0116a3a <remove_wait_queue+a/20>   <=====
Trace; c0146a6e <poll_freewait+2e/50>
Trace; c0146e26 <do_select+226/240>
Trace; c01471a9 <sys_select+339/480>
Trace; c010893b <system_call+33/38>
Code;  c0116a3a <remove_wait_queue+a/20>
00000000 <_EIP>:
Code;  c0116a3a <remove_wait_queue+a/20>   <=====
   0:   89 48 04                  mov    %ecx,0x4(%eax)   <=====
Code;  c0116a3d <remove_wait_queue+d/20>
   3:   89 01                     mov    %eax,(%ecx)
Code;  c0116a3f <remove_wait_queue+f/20>
   5:   53                        push   %ebx
Code;  c0116a40 <remove_wait_queue+10/20>
   6:   9d                        popf
Code;  c0116a41 <remove_wait_queue+11/20>
   7:   5b                        pop    %ebx
Code;  c0116a42 <remove_wait_queue+12/20>
   8:   c3                        ret
Code;  c0116a43 <remove_wait_queue+13/20>
   9:   8d b6 00 00 00 00         lea    0x0(%esi),%esi
Code;  c0116a49 <remove_wait_queue+19/20>
   f:   8d bc 27 00 00 00 00      lea    0x0(%edi,1),%edi

Version-Release number of selected component (if applicable):
2.4.18-18.7.x
Is the hardware for this machine known good? Does it pass an overnight run of memtest86? The fact that a boot failed with an invalid CRC strongly hints at a hardware problem such as bad memory, or possibly the CPU overheating.
> Is the hardware for this machine known good?

Well, it has been in continuous use for the last two years and this is the first incident of that sort (some three weeks after 2.4.18-18.7.x was installed). In other words, so far the hardware has looked good. :-) It is running for now after a power-down and reboot.

memtest86 has not been run so far, and that is not easy to do as the machine is quite far from my desk. :-) It is still an open option, but not one that is easy to arrange.
As of today, it looks like this oops was really caused by a broken CPU fan. I will monitor the situation further.
It definitely was a broken fan.