Bug 37503

Summary:	[lmsensors] Unresponsive system after "kernel BUG at memory.c:358!"
Product:	[Retired] Red Hat Linux	Reporter:	Nitin Dahyabhai <nitind>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED ERRATA	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.1	CC:	alan
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2003-06-09 17:09:05 UTC	Type:	---

Description Nitin Dahyabhai 2001-04-25 03:20:30 UTC

While running sensors (lm_sensors) on a via686* and rebuilding the kernel
SRPM, this error was thrown to the console.  The comments within the sources
make it seem non-fatal and rare, but I hit it with less than 15 minutes of
uptime.  I then halted both programs and restarted them.  A third vc that I
was logged into would no longer start programs, and although init caught my
ctrl-alt-del and started sending TERM signals, CROND ran its 10-minute jobs
twice before I finally cut power.

I saw similar behavior once during the beta while doing the same thing,
but I was away for hours while this was left running and could not check
the logs.

Reproducible: Sometimes
Steps to Reproduce:
1. Run a CPU intensive task (suggest
   "while true ; do rpm --rebuild --clean kernel.src.rpm ; done")
2. Run 'watch sensors|tail' at the same time
3. Wait.
	

Actual Results:  Occasional hard lockups.

Expected Results:  Frequent hard lockups.

Hardware was:
  Athlon 1GHz (10x100) at rated clock, Abit KT7-RAID m/b
  512MB of PC133 RAM with CAS3 setting
 
output:
===
kernel BUG at memory.c:358!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c0121d5e>]
EFLAGS: 00010282
eax: 0000001c   ebx: d000de60   ecx: 00000008   edx: 00000000
esi: 00000000   edi: df46f500   ebp: 00000000   esp: dabd5ef0
ds: 0018   es: 0018   ss: 0018
Process cc1 (pid: 12737, stackpage=dabd5000)
Stack: c020cbbb c020cd7d 00000166 00000001 00153000 00000001 00153000 d5c62408
       00000000 00000000 d5c62000 00000001 dabd4000 dabd4000 d9e036a0 d9e036a0
       00030002 cfe57320 c0123ff3 c1893c14 d9e036a0 d9e036a0 00000292 d000de60
Call Trace: [<c020cbbb>] [<c020cd7d>] [<c0123ff3>] [<c0124698>] [<c0114b86>] [<c01188d9>] [<c01243db>]
       [<c0112f30>] [<c010901b>]

Code: 0f 0b 83 c4 0c 8d b6 00 00 00 00 8d bc 27 00 00 00 00 8b 44
===
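
For anyone triaging this, the fields that matter in the dump above can be pulled out mechanically before feeding the addresses to ksymoops or a System.map lookup. The following is only a sketch: the function name `parse_oops` and the dictionary layout are made up for illustration, and the sample text is a shortened copy of the oops above. It also checks one detail that looks consistent with how 2.4-era BUG() is believed to work (a printk of the file name and line number just before the trap): the third word on the stack, 0x166, equals the reported line number 358.

```python
import re

# Shortened copy of the oops from the description.
OOPS = """\
kernel BUG at memory.c:358!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c0121d5e>]
Stack: c020cbbb c020cd7d 00000166 00000001 00153000 00000001 00153000 d5c62408
Call Trace: [<c020cbbb>] [<c020cd7d>] [<c0123ff3>] [<c0124698>] [<c0114b86>] [<c01188d9>] [<c01243db>]
       [<c0112f30>] [<c010901b>]
"""

def parse_oops(text):
    """Hypothetical helper: extract BUG location, EIP, stack words, call trace."""
    bug = re.search(r"kernel BUG at (\S+):(\d+)!", text)
    eip = re.search(r"EIP:\s+[0-9a-f]{4}:\[<([0-9a-f]{8})>\]", text)
    stack = re.search(r"Stack:\s+(.*)", text)
    # The call trace may wrap over several lines, so scan everything after it.
    trace_text = text.split("Call Trace:", 1)[1] if "Call Trace:" in text else ""
    return {
        "file": bug.group(1) if bug else None,
        "line": int(bug.group(2)) if bug else None,
        "eip": int(eip.group(1), 16) if eip else None,
        "stack": [int(w, 16) for w in stack.group(1).split()] if stack else [],
        "trace": [int(a, 16) for a in re.findall(r"\[<([0-9a-f]{8})>\]", trace_text)],
    }

info = parse_oops(OOPS)
print(info["file"], info["line"], hex(info["eip"]))
print([hex(a) for a in info["trace"]])
# Third stack word compared with the line number from the BUG message.
print(hex(info["stack"][2]), info["stack"][2] == info["line"])
```

If that reading of BUG() is right, the first two call trace entries (c020cbbb and c020cd7d) are probably the printk string arguments left on the stack rather than real return addresses, so the meaningful trace likely starts at c0123ff3.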

Comment 1 Arjan van de Ven 2001-04-25 08:10:31 UTC

Does the machine run fine if you don't have the lmsensors module loaded?

Comment 2 Nitin Dahyabhai 2001-04-27 04:07:13 UTC

No, I haven't been able to reproduce that exact error.
I ran into the following two errors while trying to reproduce it, but I suspect
a partially hardware-related cause at this point.

Unable to handle kernel paging request at virtual address 7564705d
 printing eip:
c01339b8
pgd entry cf86a754: 0000000000000000
pmd entry cf86a754: 0000000000000000
... pmd not present!
Oops: 0000
CPU:    0
EIP:    0010:[<c01339b8>]
EFLAGS: 00010206
eax: 75646f6d   ebx: d5b831a0   ecx: cfbc0320   edx: 00001000
esi: 00001000   edi: 00000000   ebp: 00001000   esp: cf86df90
ds: 0018   es: 0018   ss: 0018
Process cc1 (pid: 4280, stackpage=cf86d000)
Stack: cf86dfbc dea5f640 cfb28000 dea5f640 cf86c000 00000000 cf86c000 00000006
       cf86c000 401508e0 4015e000 bfffe648 c010901b 00000000 4015e000 00001000
       401508e0 4015e000 bfffe648 00000003 0000002b 0000002b 00000003 40101f44
Call Trace: [<c010901b>]

Code: f6 80 f0 00 00 00 01 74 0a 6a 01 50 e8 c7 50 01 00 5f 5d 89

kernel BUG at page_alloc.c:90!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c012d0ea>]
EFLAGS: 00010282
eax: 0000001f   ebx: 00000000   ecx: 00000005   edx: 00000000
esi: 00000000   edi: c10cc384   ebp: 00000000   esp: ce1b3ea8
ds: 0018   es: 0018   ss: 0018
Process rm (pid: 23050, stackpage=ce1b3000)
Stack: c020ea7b c020ec89 0000005a c10cc384 c0135b93 c10cc384 00000000 c10cc384
       c10cc384 c10cc384 c10cc384 00000000 c0124d6c c10cc384 00000000 c19e3a00
       c285b1a0 000001a0 c285b00c ce1b3f0c cc913108 00000000 bfffed68 c0124e2d
Call Trace: [<c020ea7b>] [<c020ec89>] [<c0135b93>] [<c0124d6c>] [<c0124e2d>] [<c0147b57>] [<c0158b4f>]
       [<c0158b61>] [<c014643c>] [<c013fb6c>] [<c013fc39>] [<c010901b>]

Code: 0f 0b 83 c4 0c 8b 47 18 83 e0 20 74 16 6a 5c 68 89 ec 20 c0
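
A quick sanity check on the paging-request oops earlier in this comment may help judge the hardware theory. The sketch below assumes the Code: line starts at the faulting EIP (believed true for 2.4 oopses, but not confirmed here) and decodes only that first instruction by hand. It shows that the faulting address is eax plus the instruction's displacement, and prints the bytes of eax as they would sit in memory.

```python
import struct

eax = 0x75646f6d          # eax from the register dump
fault_addr = 0x7564705d   # address from "Unable to handle kernel paging request"
code = bytes.fromhex("f6 80 f0 00 00 00 01")  # first 7 bytes of the Code: line

# Opcode f6 /0 with ModRM byte 0x80 encodes `testb $imm8, disp32(%eax)`.
# The displacement is the little-endian 32-bit value after the ModRM byte.
assert code[0] == 0xF6 and code[1] == 0x80
disp = struct.unpack_from("<I", code, 2)[0]
imm = code[6]

print(f"testb ${imm:#x}, {disp:#x}(%eax)")
print(hex(eax + disp), eax + disp == fault_addr)
# eax in little-endian memory order, shown as raw bytes.
print(struct.pack("<I", eax))
```

The last line prints b'modu', so the pointer the kernel dereferenced looks like ASCII text rather than a kernel address. That fits a structure having been overwritten with string data, though it does not by itself say whether the overwrite came from software or from the hardware.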

Comment 3 Alan Cox 2003-06-09 17:09:05 UTC

Closing. Seems to match the VIA chipset flaw that 2.4.9 and later work around.
If it still occurs, feel free to re-open.