Bug 74588

Summary: Unable to handle kernel paging request
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: Philip Pokorny <ppokorny>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: jwright
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:57 UTC
Attachments: console log

Description Philip Pokorny 2002-09-27 03:06:09 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
Several identical dual P4 Xeon machines with 6 GB of memory, running Red Hat
Linux 7.3 with the latest errata kernel (2.4.18-10bigmem), crash frequently.

There are two distinct unhandled kernel paging requests. These may be
different bugs; if so, forgive me...

The first is non-fatal and occurred on a 2.4.18-5bigmem kernel. bb-local.sh is
a "Big Brother" system-monitoring script run once a minute from cron.

Sep 14 10:38:21 nc4 kernel: Unable to handle kernel paging request at virtual address 420b4ced
Sep 14 10:38:21 nc4 kernel:  printing eip:
Sep 14 10:38:21 nc4 kernel: 420b4ced
Sep 14 10:38:21 nc4 kernel: *pde = 00000000
Sep 14 10:38:21 nc4 kernel: Oops: 0004
Sep 14 10:38:21 nc4 kernel: e1000 eepro100 usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod
Sep 14 10:38:21 nc4 kernel: CPU:    0
Sep 14 10:38:21 nc4 kernel: EIP:    0023:[<420b4ced>]    Tainted: P
Sep 14 10:38:21 nc4 kernel: EFLAGS: 00010246
Sep 14 10:38:21 nc4 kernel:
Sep 14 10:38:21 nc4 kernel: EIP is at Using_Versions [] 0x420b4cec (2.4.18-5bigmem)
Sep 14 10:38:21 nc4 kernel: eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 4213030c
Sep 14 10:38:21 nc4 kernel: esi: 4212ec44   edi: 00000000   ebp: bfffd1c8   esp: bfffd19c
Sep 14 10:38:21 nc4 kernel: ds: 002b   es: 002b   ss: 002b
Sep 14 10:38:21 nc4 kernel: Process bb-local.sh (pid: 5191,stackpage=ea883000)
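
On 2.4 i386 kernels the "Oops:" value in these dumps is the hardware page-fault error code, and the low two bits of the segment selector on the "EIP:" line give the privilege ring the faulting code ran in. A minimal Python sketch, assuming the IA-32 bit meanings (the helper names are illustrative, not from any tool), decodes the values seen in this report:

```python
# Decode the i386 page-fault error code printed as "Oops: NNNN" and the
# privilege level (RPL) held in the low bits of a segment selector such as
# the "0023" in "EIP: 0023:[<...>]". Helper names are illustrative only.

def decode_pf_error(code: int) -> dict:
    """Return the meaning of the low three page-fault error-code bits."""
    return {
        "present": bool(code & 0x1),  # 0 = page not present, 1 = protection violation
        "write": bool(code & 0x2),    # 0 = read access, 1 = write access
        "user": bool(code & 0x4),     # 0 = fault in kernel mode, 1 = user mode
    }

def selector_ring(selector: int) -> int:
    """The low two bits of a segment selector are its requested privilege level."""
    return selector & 0x3

def describe(oops: str, cs: str) -> str:
    bits = decode_pf_error(int(oops, 16))
    access = "write" if bits["write"] else "read"
    cause = "protection violation" if bits["present"] else "not-present page"
    mode = "user" if bits["user"] else "kernel"
    return f"{mode}-mode {access}, {cause}, CS ring {selector_ring(int(cs, 16))}"

if __name__ == "__main__":
    # Values copied from the oopses in this report.
    print(describe("0004", "0023"))  # user-mode read, not-present page, CS ring 3
    print(describe("0002", "0010"))  # kernel-mode write, not-present page, CS ring 0
```

Read this way, the bb-local.sh oops is a ring-3 read of an unmapped page (consistent with the kernel symbol lookup failing as "Using_Versions []"), while the do_IRQ oopses in the console log are kernel-mode writes.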


Other crashes (with both 2.4.18-5bigmem and 2.4.18-10bigmem) are fatal and
typically recursive. Only recently did we capture a serial console dump of a
recursive crash; the console log is attached. It appears that the process
table (or some associated data structure) is getting trashed. The result is
that unguarded references to semaphores and mutexes generate exceptions due to
invalid addresses.

Ultimately, the stack overflows, or something really evil happens and the box
stops completely.
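
On 2.4 i386 kernels each task's task_struct sits at the base of an 8 KB-aligned 8 KB block whose top holds that task's kernel stack, and "current" is found by clearing the low 13 bits of esp. Assuming that layout holds for this bigmem build, a short Python sketch (hypothetical helper names) applied to the esp values of the first two do_IRQ oopses quoted under "Additional info" shows where the task_struct lives and how much room the nested faults have left:

```python
# Sketch, assuming the Linux 2.4 i386 layout: an 8 KB (THREAD_SIZE) block,
# 8 KB aligned, with task_struct at its base and the kernel stack growing
# down from the top of the block toward it.
THREAD_SIZE = 8192

def task_struct_base(esp: int) -> int:
    """Clear the low 13 bits of esp, as 2.4's get_current() does."""
    return esp & ~(THREAD_SIZE - 1) & 0xFFFFFFFF

def stack_headroom(esp: int) -> int:
    """Bytes from esp down to the *start* of the task_struct (not its end)."""
    return esp - task_struct_base(esp)

if __name__ == "__main__":
    # esp values from the first two do_IRQ oopses in the console log.
    first, nested = 0xD1307E1C, 0xD1307CA0
    print(hex(task_struct_base(first)), hex(task_struct_base(nested)))  # 0xd1306000 0xd1306000
    print(first - nested)          # 380 bytes used by one nested oops
    print(stack_headroom(nested))  # 7328 bytes left above the block base
```

Both faults run on the same task's stack. At roughly 380 bytes per nested oops, about 19 more levels would reach the base of the block, the same order as the ~20 cycles reported; since the task_struct itself occupies the bottom of that space, it would be overwritten somewhat sooner.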

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Run system for several weeks.

Actual Results:  System crashes

Expected Results:  System runs flawlessly...

Additional info:

Unable to<1>Unable to handle kernel paging request at virtual address a4a663b0
 printing eip:
c010a6df
*pde = 00000000
Oops: 0002
e1000 e100 usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod
CPU:    1111573280
EIP:    0010:[<c010a6df>]    Not tainted
EFLAGS: 00010086

EIP is at do_IRQ [kernel] 0x2f (2.4.18-10bigmem)
eax: f91abc00   ebx: 00000000   ecx: c03e3d60   edx: 00000018
esi: c0364800   edi: 00000000   ebp: 42414320   esp: d1307e1c
ds: 0018   es: 0018   ss: 0018
Process ,"1.5FT SCSI CABLE DB68 VHDCI M  MICRO
DB68M","66.05","66.08","0.00",0,"757120136224"
"09/10/2002","CABLES TO GO","02469","101004","NULL MODEM ADPT DB25M
DB25F","5.28","5.28","0.00",38,"757120024699"
"01/12/2001","CABLES TO GO","16202","101018","BAY 7825 LUCENT 25FT
CONSOLE-MODEM","0.00","0.00","0.00",0,"757120162025"
"02/28/2001","CABLES TO GO","16207","101022","15FT CABLE BAY 7219 LUCENT DB15M
V.35M","0.00","0.00","0.00",0,"757120162074"
"02/28/2001","CABLES TO GO","16208","101023","15FT BAY 7120 LUCENT DB15M V.35M
V.25 BIS","0.00","0.00","0.00",0,"757120162083"
"02/28/2001","CABLES TO GO","16210","101025","15FT BAY 7138 LUCENT HD44M DB25M
RZ DTR","0.00","0.00","0.00",0,"757120162101"
"01/12/2001","CABLES TO GO","16213","101028","BAY 7826 LUCENT 15FT HD44M-DB25M
V.25BIS","0.00","0.00","0.00",0,"757120162131"
"01/12/2001","CABLES TO GO","16214","101029","BAY 7224 LUCENT 15FT HD44M-DB15M
X.21","0.00","0.00","0.00",0,"757120162148"
"01/12/2001","CABLES TO GO","16216","101031","BAY 910-2 LUCENT
Stack: 2c223241 00000005 c03e3d60 c03b0d05 00000009 c023c2e4 00000005 c03e3d60
       000003fd c03e3d60 c03b0d05 00000009 c03e3d00 4f4e0018 4c4c0018 ffffff00
       c018ce6a 00000010 00000202 000f3fe3 c01923f8 c03e3d60 00000005 0000000d
Call Trace: [<c018ce6a>] serial_in [kernel] 0x2a
[<c01923f8>] serial_console_write [kernel] 0x68
[<c011c8a6>] __call_console_drivers [kernel] 0x46
[<c011ca1b>] call_console_drivers [kernel] 0xeb
[<c011cc7e>] release_console_sem [kernel] 0x4e
[<c011cbc8>] printk [kernel] 0x128
[<c01178b1>] do_page_fault [kernel] 0x81
[<c0117830>] do_page_fault [kernel] 0x0
[<c0117adb>] do_page_fault [kernel] 0x2ab
[<c0117830>] do_page_fault [kernel] 0x0
[<c0108d4c>] error_code [kernel] 0x34
[<c0117830>] do_page_fault [kernel] 0x0
[<c01178b1>] do_page_fault [kernel] 0x81


Code: ff 04 85 b0 73 3b c0 f0 fe 8b 10 48 36 c0 0f 88 86 08 00 00
 <1>Unable to handle kernel paging request at virtual address a4a66470
 printing eip:
c010a6df
*pde = 00000000
Oops: 0002
e1000 e100 usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod
CPU:    1111573280
EIP:    0010:[<c010a6df>]    Not tainted
EFLAGS: 00010086

EIP is at do_IRQ [kernel] 0x2f (2.4.18-10bigmem)
eax: f91abc30   ebx: 00001800   ecx: 00000001   edx: 00000018
esi: c0366000   edi: 00000030   ebp: 42414320   esp: d1307ca0
ds: 0018   es: 0018   ss: 0018
Process [snipped]
Stack: c02f3304 00000002 00000000 d1307de8 d1307de8 c023c2e4 00000002 00000001
       f61a9f64 00000000 d1307de8 d1307de8 c025556a d1300018 c0110018 ffffff30
       c01091d5 00000010 00000282 00000000 c010a6df c0244c20 c0117b68 0000000b
Call Trace: [<c0110018>] check_pcibios [kernel] 0x88
[<c01091d5>] die [kernel] 0x75
[<c010a6df>] do_IRQ [kernel] 0x2f
[<c0117b68>] do_page_fault [kernel] 0x338
[<c0117830>] do_page_fault [kernel] 0x0
[<c0108d4c>] error_code [kernel] 0x34
[<c010a6df>] do_IRQ [kernel] 0x2f
[<c018ce6a>] serial_in [kernel] 0x2a
[<c01923f8>] serial_console_write [kernel] 0x68
[<c011c8a6>] __call_console_drivers [kernel]0x46
[<c011ca1b>] call_console_drivers [kernel] 0xeb
[<c011cc7e>] release_console_sem [kernel] 0x4e
[<c011cbc8>] printk [kernel] 0x128
[<c01178b1>] do_page_fault [kernel] 0x81
[<c0117830>] do_page_fault [kernel] 0x0
[<c0117adb>] do_page_fault [kernel] 0x2ab
[<c0117830>] do_page_fault [kernel] 0x0
[<c0108d4c>] error_code [kernel] 0x34
[<c0117830>] do_page_fault [kernel] 0x0
[<c01178b1>] do_page_fault [kernel] 0x81


Code: ff 04 85 b0 73 3b c0 f0 fe 8b 10 48 36 c0 0f 88 86 08 00 00
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
 <1>Unable to handle kernel paging request at virtual address 0179dc7c
 printing eip:


.....  You get the idea.  It goes on for 20 cycles until the stack gets clobbered.
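
The "Code:" bytes at the faulting EIP begin with ff 04 85 b0 73 3b c0, which by the standard IA-32 ModRM/SIB rules is incl 0xc03b73b0(,%eax,4): increment a 32-bit counter in a table at 0xc03b73b0 indexed by eax. The Python sketch below (handling only this one encoding; it is not a general decoder) reproduces both faulting addresses from the logged eax values, and checks that eax equals the nonsensical "CPU: 1111573280" times 224, modulo 2^32. If 224 is NR_IRQS on this IO-APIC build and the table is the per-CPU interrupt counter updated in do_IRQ (both assumptions), the fault would come from do_IRQ indexing that table by a CPU number read from a corrupted task_struct.

```python
# Sketch: decode the one instruction at the faulting EIP and recompute the
# address it touched. Only the "FF /0, SIB, no base, disp32" form seen in
# this report is handled; this is not a general x86 decoder.
import struct

MASK32 = 0xFFFFFFFF
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_incl_sib_disp32(code: bytes):
    """Return (scale, index_register, disp32) for 'incl disp32(,%reg,scale)'."""
    opcode, modrm, sib = code[0], code[1], code[2]
    if opcode != 0xFF or (modrm >> 3) & 7 != 0:
        raise ValueError("not FF /0 (INC r/m32)")
    if modrm >> 6 != 0 or modrm & 7 != 4:
        raise ValueError("expected mod=00, rm=100 (SIB byte follows)")
    if sib & 7 != 5 or (sib >> 3) & 7 == 4:
        raise ValueError("expected base=101 (disp32, no base) and an index register")
    scale = 1 << (sib >> 6)
    index = REGS[(sib >> 3) & 7]
    (disp,) = struct.unpack("<I", code[3:7])  # little-endian displacement
    return scale, index, disp

def effective_address(disp: int, index_value: int, scale: int) -> int:
    """32-bit address arithmetic wraps modulo 2**32."""
    return (disp + index_value * scale) & MASK32

if __name__ == "__main__":
    scale, index, disp = decode_incl_sib_disp32(bytes.fromhex("ff0485b0733bc0"))
    print(f"incl 0x{disp:08x}(,%{index},{scale})")  # incl 0xc03b73b0(,%eax,4)
    for eax in (0xF91ABC00, 0xF91ABC30):              # eax from the two oopses
        print(hex(effective_address(disp, eax, scale)))  # 0xa4a663b0, then 0xa4a66470
    bogus_cpu = 1111573280                            # "CPU:" line in both oopses
    print(hex(bogus_cpu), hex(bogus_cpu * 224 & MASK32))  # 0x42414320 0xf91abc00
```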

The problems were worse with HyperThreading turned on (4 virtual CPUs) and Big
Brother running.

We downloaded, compiled, and installed the latest E1000 and E100 drivers from
Intel, because gigabit Ethernet traffic was implicated in some early testing.
The workload is a large set of Perl programs doing database-to-HTML processing
for web pages. I can attach SAR data if helpful.
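
Several values in the do_IRQ oopses look like text rather than kernel data. A quick check is to read them as little-endian 32-bit words (x86 byte order); the Python sketch below does this for the bogus CPU number (equal to ebp, 42414320) and the first word of the first "Stack:" line. The fragments it prints, " CAB" and A2",, resemble the quoted, comma-separated "CABLES TO GO" records shown in the garbled "Process" line. That resemblance is an observation about the bytes, not proof of which process or code path wrote them.

```python
import struct

def words_as_text(*words: int) -> str:
    """Pack 32-bit words in x86 (little-endian) order; show printable ASCII, '.' otherwise."""
    raw = b"".join(struct.pack("<I", w) for w in words)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)

if __name__ == "__main__":
    bogus_cpu = 1111573280                    # "CPU:" line; same value as ebp 0x42414320
    print(repr(words_as_text(bogus_cpu)))     # ' CAB'
    print(repr(words_as_text(0x2C223241)))    # 'A2",'  (first word of the first Stack: line)
```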

Comment 1 Philip Pokorny 2002-09-27 03:08:12 UTC
Created attachment 77424 [details]
console log

Comment 2 Bugzilla owner 2004-09-30 15:39:57 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/