Bug 18510

Summary:	kernel 2.2.16-22 freeze
Product:	[Retired] Red Hat Linux	Reporter:	Bernhard Ege <bme>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.0	CC:	mw, turchi, waananen
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2003-06-05 21:51:07 UTC	Type:	---

Description Bernhard Ege 2000-10-06 10:37:47 UTC

I have had the kernel freeze on me several times with no indication of why in
the log. Then I had the idea of leaving virtual console 1 displayed to see
whether I could catch the kernel crash. I did, and I put the result through
scripts/ksymoops to clarify it a bit:

unable to handle kernel paging request at virtual address ffffffff
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
oops: 0000
cpu: 0
eip: 0010:[<ffffffff>]
eflags: 00210286
eax: 0000000f ebx: c7e471c0 ecx: 00000000 edx: 00000001
esi: c885b9bc edi: c022c3a4 ebp: c0247f4c esp: c0247f18
ds: 0018 es: 0018 ss: 0018
stack: 00000001 c7e471c0 c01127ae c7e471c0 00000001 c027dea0 00259eae c022c3a4
       00000001 c010ae3a 00000000 00000000 00000000 c0247f60 c01196d9 00000000
       c0246000 c010b19a 00000e00 c010ae60 00000000 c0246000 00000000 c0246000
call trace: [<c01127ae>] [<c010ae3a>] [<c01196d9>] [<c010b19a>] [<c010ae60>] [<c01088dd>] [<c0106000>]
            [<c0108900>] [<c010a06c>] [<c0106000>] [<c0106077>] [<c0106000>] [<c0100175>]
code: bad eip value.
Warning: trailing garbage ignored on Code: line
  Text: 'code: bad eip value.'
  Garbage: 'ip value.'
Oops_code_values invalid value 0xbad in Code line, not a multiple of 2 digits, value ignored
Oops_code_values invalid value 0xe in Code line, not a multiple of 2 digits, value ignored

>>EIP: ffffffff <END_OF_CODE+37649e23/???
Trace: c01127ae <timer_bh+2be/404>
Trace: c010ae3a <do_8259A_IRQ+9a/a8>
Trace: c01196d9 <do_bottom_half+49/70>
Trace: c010b19a <do_IRQ+3a/3c>
Trace: c010ae60 <common_interrupt+18/20>
Trace: c01088dd <cpu_idle+5d/6c>
Trace: c0106000 <get_options+0/70>
Trace: c0108900 <sys_idle+14/20>
Code:  ffffffff <END_OF_CODE+37649e23/???      00000000 <_EIP>: <===

aiee, killing interrupt handler
kernel panic: attempted to kill the idle task!
in swapper task - not syncing

1737 warnings and 5 errors issued.  Results may not be reliable.


As suggested by a third party, I should be able to find the offending
module (if it is a module) this way:

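#!/bin/sh
# Disassemble every module object and look for a symbol whose
# offset ends in 9bc (the last three hex digits of ESI in the oops).
cd /lib/modules/2.2.16-22 || exit 1
for i in `find . -name '*.o'`; do
        echo "$i"
        objdump --disassemble-all --reloc "$i" | grep '^0.*9bc <'
done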

The 9bc in the pattern comes from the last three hex digits of the ESI
register (c885b9bc), which I was told is used by a function call. The only
valid match was this:

./misc/agpgart.o
000009bc <agp_generic_remove_memory>:
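
The search pattern above can be derived mechanically from the register dump. The following is a minimal sketch, assuming the intent was to match the low 12 bits of ESI (its last three hex digits); the helper name low12 is hypothetical and not part of the original advice.

```shell
#!/bin/sh
# low12 is a hypothetical helper: print the low 12 bits of a hex value,
# which are exactly its last three hex digits.
low12() {
        printf '%03x\n' $(( 0x$1 & 0xfff ))
}

# ESI value taken from the oops above.
low12 c885b9bc
```

The printed suffix is what the grep pattern in the script looks for at the end of a symbol's offset. A match on three hex digits is only a hint, not proof that the module is at fault.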

This is the AGP part of the /var/log/messages file:

Oct  5 15:01:11 overmind kernel: Linux agpgart interface v0.99 (c) Jeff Hartmann
Oct  5 15:01:11 overmind kernel: agpgart: Maximum main memory to use for agp memory: 96M
Oct  5 15:01:11 overmind kernel: agpgart: Detected AMD Irongate chipset
Oct  5 15:01:11 overmind kernel: agpgart: AGP aperture is 64M @ 0xe0000000

The strange thing is that my system is much more stable (if not completely
stable; it is hard to say without waiting 14 days) when I disable USB in the
BIOS, which prevents usb-ohci and usbcore from being loaded. When they are
loaded, 1-3 freezes occur per day.

I am using the nvidia drivers for XFree86, but have seen the kernel freeze
even with the nv driver (XF86 driver).

I have disabled AGP usage in XF86Config-4, and lsmod confirms this: it shows
agpgart loaded but not used by anything.
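
A check like the one described can be scripted. This is a minimal sketch, assuming lsmod prints the use count in its third column after a one-line header; the helper name unused_modules is hypothetical.

```shell
#!/bin/sh
# unused_modules is a hypothetical helper: read lsmod-style output on stdin
# and print the names of modules whose use count (third column) is 0.
unused_modules() {
        awk 'NR > 1 && $3 == 0 { print $1 }'
}

# Usage: lsmod | unused_modules
```

If agpgart appears in the output, it is loaded but nothing currently holds a reference to it.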

output from lspci:

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-751 [Irongate] System Controller (rev 23)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-751 [Irongate] AGP Bridge (rev 01)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-756 [Viper] ISA (rev 01)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-756 [Viper] IDE (rev 03)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-756 [Viper] ACPI (rev 03)
00:07.4 USB Controller: Advanced Micro Devices [AMD] AMD-756 [Viper] USB (rev 06)
00:08.0 Multimedia audio controller: Yamaha Corporation YMF-724F [DS-1 Audio Controller] (rev 03)
00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139 (rev 10)
01:05.0 VGA compatible controller: nVidia Corporation NV11 (rev a1)

I am not using the kernel YMF724 driver (which causes kernel hangs as well,
but that bug is reported by someone else).

In this particular kernel freeze, the nvidia driver and the usb driver were
both using the same interrupt (they were the only ones sharing an interrupt).
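
Interrupt sharing of this kind can be spotted in /proc/interrupts. This is a minimal sketch, assuming the kernel lists the handlers on a shared line separated by commas; the helper name shared_irqs is hypothetical.

```shell
#!/bin/sh
# shared_irqs is a hypothetical helper: read /proc/interrupts-style text on
# stdin and print only the numbered IRQ lines that list more than one
# handler (handlers on a shared line are separated by ", ").
shared_irqs() {
        grep -E '^ *[0-9]+:.*, '
}

# Usage: shared_irqs < /proc/interrupts
```

Each line printed names an IRQ together with every driver registered on it, which makes it easy to see whether two suspect drivers share a line.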

regards,

Bernhard Ege

Comment 1 dmike 2000-10-09 20:36:34 UTC

I probably had exactly the same problem on my computer. When I left it idle
long enough, it just froze with a blank screen and nothing in the logs.
It happened only with the shipped 2.2.16-22 kernel; after I upgraded to
2.4.0-test9, the problem seems to be gone. My CPU is a Pentium III 500MHz
(Katmai) on an MSI 6163 motherboard (440BX) with TNT2 video (nv driver; I think
the agpgart module was loaded too). Other hardware is a 3c905B and an sb PCI
128, all working well. I thought it was somehow related to power management,
because the system froze only after being idle, but adding
Option "NoPM" "true" to XF86Config didn't change anything.
If necessary I can restore the old kernel and perform more testing.

Mike

Comment 2 Thomas Dodd 2000-11-10 15:30:05 UTC

I've seen the same problems with USB.
Looks like a bug in the USB back port.
Unloading the USB modules stopped the
oopses here.

	-Thomas

Comment 3 mw 2000-11-10 18:18:46 UTC

My system is an AMD Athlon 600MHz with an nvidia TNT2-Vanta, running the stock
kernel from 7.0. An additional remark: my system freezes even if I disable the
onboard USB support in the BIOS. This happens 2 minutes after starting X.

When I had USB enabled and the usb module loaded, my system froze as soon as
kudzu started, so I could never boot my system.

Mate

Comment 4 mw 2000-11-13 19:40:38 UTC

It seems that if I add

append="x86_serial_nr=1"

to lilo.conf, my system is fine (previously, it froze 2 minutes after starting
X). On the other hand, if I enable USB in the BIOS, my system freezes
immediately when I start X.

Mate