From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)

Description of problem:
This failure happened during testing for another VM issue explained in bugzilla 100739. You may want to check that issue for history and a description of the test environment. I am breaking this out as requested by MKJ. We have only seen this issue on SMP boxes, but most of the testing has been on SMP boxes.

Version-Release number of selected component (if applicable):
kernel-2.4.20-19.9.3

How reproducible:
Sometimes

Steps to Reproduce:
1. Run newburn on a system with a VNC server activated for about 2 days.

Actual Results: System panic in page_referenced().

Expected Results: No panic.

Additional info:
This has been reproduced on a 2.4.20-19.9.3 based kernel. The kernel was recompiled with a 1.18h megaraid driver to support the PERC4/DC that was in some of the test cases. It has been reproduced on earlier kernel versions; those crashes are included in the tarball in a subdirectory.
Created attachment 93578 [details] Panics and partial objdumps of kernels.
Created attachment 93579 [details] JPG of oops

Excerpt from call trace:
  launder_page 0x1f2
  refill_inactive_zone 0x51e
  rebalance_dirty_zone
  rebalance_inactive_zone
  rebalance_inactive
  do_try_to_free_pages_kswapd
From "older-panics":

  EIP is at page_referenced [kernel] 0xe5 (2.4.20-9rhsmp)
                                           ^^^^^^^^^^^^^

So this is a longer-standing problem that was NOT raised as a show-stopper when we asked for a list of *all* show-stoppers before starting this exercise.

PLEASE, if you are going to use a camera to record oops messages, do it at a higher resolution than 640x480. That's screen resolution, and that means that your images are barely legible. Stopping the camera down to lowest resolution (or using a decade-old camera or a webcam or some similar junk) just makes the job harder. I know that Dell sells real digital cameras...
In regards to comment #3: The issue was regressed to see if they would occur under the latest errata. Especially since there was a code change in page_referenced() that seems like it might have fixed the issue.
So what was the previous unique bugzilla# you had reported this under?
This issue was seen during Taroon development and subsequently corrected. I'm not sure we ever saw it on x86 though ...
IIRC the solution was to make cpu_relax() include a barrier, which in x86 means making rep_nop() include a memory barrier. Could you please try that ?
@@ -517,7 +518,7 @@ struct microcode {
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
 static inline void rep_nop(void)
 {
-	__asm__ __volatile__("rep;nop");
+	__asm__ __volatile__("rep;nop" ::: "memory");
 }

 #define cpu_relax() rep_nop()
I will try this patch tonight... but shouldn't I see more spin lock failures if this was the case?
Okay, I must be missing something. I just don't see how this can help when the problem is occurring on a single physical CPU system with hyper-threading. Can you dig up exactly why this was put in Taroon? Was the processor/chipset vendor involved?
The cpu_relax() has to include a memory barrier, otherwise the compiler is under no obligation to reload the variable from memory, and the system could spin forever in this loop, not unlike what you've seen happening... On x86 it usually doesn't trigger, but on some other architectures it was immediately noticeable. In the 2.5 kernel cpu_relax() includes a barrier on all architectures.
Yes. But this problem is happening on a single physical CPU with HT on. It has a shared cache between the CPUs.
Besides, x86 is a cache-coherent architecture. Memory barriers should only be needed for ordering of cache writes, not for ensuring cache coherency. If that is the problem, then the CPU/chipset vendors need to know and fix the issue.
Please see bugzilla 100739 comment #58 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100739#c58) for information that might pertain to this issue as well.
Please see bugzilla 100739, comment #64, for a patch for this issue. The patch has been verified on 15 machines running newburn for 3 days; one would normally see 6+ failures in that time frame on those same test machines. Testing will continue while code review verifies the fix of this race condition.
OK, so I was wrong, it's all one big happy bug family... *** This bug has been marked as a duplicate of 100739 ***
Opening up per Dell request (rh).
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.