Bug 79884

Summary: Oops with 2.4.18-18.7.x
Product: [Retired] Red Hat Linux Reporter: Michal Jaegermann <michal>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2003-01-11 18:27:49 UTC Type: ---

Description Michal Jaegermann 2002-12-17 20:05:24 UTC
Description of problem:

On one machine, around 6:40 local time, when the computer in question was
essentially idle, the kernel oopsed and the machine went down.
An automatic reboot attempt (nobody was around) ended with

Uncompressing Linux....

crc error

-- System halted

Only later, after the machine had been powered down manually, was it
possible to power it up and restart.

Here is a decoded oops.

Unable to handle kernel NULL pointer dereference at virtual address 00000005
 printing eip:
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c0116a3a>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210092
eax: 00000001   ebx: 00200292   ecx: 00000002   edx: dd97c03c
esi: cd97c000   edi: cd97c008   ebp: 00000000   esp: d390ff1c
ds: 0018   es: 0018   ss: 0018
Process gnome-smproxy (pid: 23583, stackpage=d390f000)
Stack: dd97c038 c0146a6e 00000000 d6a68340 00000001 c0146e26 d390ff54 d390ff54 
       00000020 d390e000 7fffffff 00000006 00000000 00000006 00000000 cd97c000 
       00000001 bffff7f4 deb1dd58 00000006 c01471a9 00000006 d390ff90 d390ff8c 
Call Trace: [<c0146a6e>] poll_freewait [kernel] 0x2e (0xd390ff20))
[<c0146e26>] do_select [kernel] 0x226 (0xd390ff30))
[<c01471a9>] sys_select [kernel] 0x339 (0xd390ff6c))
[<c010893b>] system_call [kernel] 0x33 (0xd390ffc0))
Code: 89 48 04 89 01 53 9d 5b c3 8d b6 00 00 00 00 8d bc 27 00 00 

>>EIP; c0116a3a <remove_wait_queue+a/20>   <=====
Trace; c0146a6e <poll_freewait+2e/50>
Trace; c0146e26 <do_select+226/240>
Trace; c01471a9 <sys_select+339/480>
Trace; c010893b <system_call+33/38>
Code;  c0116a3a <remove_wait_queue+a/20>
00000000 <_EIP>:
Code;  c0116a3a <remove_wait_queue+a/20>   <=====
   0:   89 48 04                  mov    %ecx,0x4(%eax)   <=====
Code;  c0116a3d <remove_wait_queue+d/20>
   3:   89 01                     mov    %eax,(%ecx)
Code;  c0116a3f <remove_wait_queue+f/20>
   5:   53                        push   %ebx
Code;  c0116a40 <remove_wait_queue+10/20>
   6:   9d                        popf   
Code;  c0116a41 <remove_wait_queue+11/20>
   7:   5b                        pop    %ebx
Code;  c0116a42 <remove_wait_queue+12/20>
   8:   c3                        ret    
Code;  c0116a43 <remove_wait_queue+13/20>
   9:   8d b6 00 00 00 00         lea    0x0(%esi),%esi
Code;  c0116a49 <remove_wait_queue+19/20>
   f:   8d bc 27 00 00 00 00      lea    0x0(%edi,1),%edi
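
For readers checking the decode: the reported fault address can be reproduced from the register dump. The faulting instruction stores %ecx at 0x4(%eax), and eax is 00000001, so the write lands at 00000005. The sketch below redoes that arithmetic and decodes the "Oops: 0002" code using the standard i386 page-fault bits. The function and variable names are hypothetical, and reading eax and ecx as corrupted list pointers is an assumption, not something the report states.

```python
# Values copied from the oops report; nothing here is measured or new.
eax = 0x00000001
ecx = 0x00000002
oops_code = 0x0002

# Faulting instruction: mov %ecx,0x4(%eax)  -> write ecx to address eax + 4.
DISPLACEMENT = 0x4  # taken from the disassembly line


def decode_oops_code(code: int) -> dict:
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 = page not present
        "write": bool(code & 0x2),         # 1 = fault on a write
        "user_mode": bool(code & 0x4),     # 0 = fault in kernel mode
    }


fault_address = (eax + DISPLACEMENT) & 0xFFFFFFFF
decoded = decode_oops_code(oops_code)

print(f"fault address: {fault_address:08x}")
print(decoded)
```

The computed address matches the "virtual address 00000005" line, and the decoded code describes a kernel-mode write to a page that is not present. Values as small as 1 and 2 are not plausible kernel pointers, which fits corrupted data better than a logic error, though the trace alone cannot prove that.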

Version-Release number of selected component (if applicable):

Comment 1 Ben LaHaise 2002-12-17 20:37:02 UTC
Is the hardware for this machine known good?  Does it pass an overnight run of
memtest86?  The fact that a boot failed with an invalid crc strongly hints at
faulty hardware, such as bad memory or possibly the cpu overheating.
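
To illustrate why a crc failure points at hardware: the "crc error" at boot presumably comes from a CRC-32 check on the decompressed kernel image, and a CRC-32 changes if even one bit of the data is flipped in memory. The sketch below is only a demonstration with Python's zlib module. The image contents, the helper name and the bit position are made up, and it is not the boot decompressor's code.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical stand-in for a decompressed kernel image.
image = bytes(range(256)) * 64
expected_crc = zlib.crc32(image)

# Simulate a single bad memory bit at an arbitrary position.
corrupted = flip_bit(image, 12345)
crc_ok = zlib.crc32(corrupted) == expected_crc

print("crc ok" if crc_ok else "crc error")
```

An image that was intact on disk and booted fine before should only fail this kind of check if the data was damaged while being read or unpacked, which is why bad RAM or an overheating CPU are the first suspects.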

Comment 2 Michal Jaegermann 2002-12-17 22:49:47 UTC
> Is the hardware for this machine known good?

Well, it has been in continuous use for the last two years and this is the
first incident of that sort (some three weeks after 2.4.18-18.7.x
was installed). In other words, so far the hardware looked good. :-)
It has been running for now after a power-down and reboot.

memtest86 has not been run so far, and that is not easy to do as the machine
is quite far from my desk. :-)  It remains an open option, just not one
that is easy to arrange.

Comment 3 Michal Jaegermann 2002-12-21 17:44:29 UTC
As of today, it looks like this oops was really caused by a broken CPU fan.
I will monitor the situation further.

Comment 4 Michal Jaegermann 2003-01-11 18:27:49 UTC
It definitely was a broken fan.