85275 – Summit-kernel hangs under heavy load.

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	123572

Reported:	2003-02-27 11:29 UTC by Juergen Siats
Modified:	2007-11-30 22:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-09-29 00:42:06 UTC
Target Upstream Version:
Embargoed:

Attachments
bug fix for race in IPI code (1.65 KB, patch) 2004-03-30 22:40 UTC, Christopher Li

Description Juergen Siats 2003-02-27 11:29:26 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
I have installed Red Hat AS 2.1 on a Fujitsu Siemens Computers Primergy T850
with 8 CPUs and 32 GB of main memory. The Primergy T850 is the same hardware as
the IBM x440.
After installing RH AS 2.1 out of the box, I installed kernel 2.4.9-e.12:
rpm -Uvh kernel-doc-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-headers-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-source-2.4.9-e.12.i386.rpm
rpm -ivh kernel-summit-2.4.9-e.12.i686.rpm

Additionally I use the boot option "notsc".

The system hangs approximately 30 minutes after starting our test suite
"Platform Load Test". The graphical console is frozen, the keyboard and mouse
do not work, and the machine does not respond to ping.
I cannot find any error messages.

I also reported this error against the beta kernel versions in
bugzilla #83715.
The test suite and some additional information are attached to
bugzilla #83715.

I think you can reproduce the error on an IBM x440 system because it is the same
hardware.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. See bugzilla #83715

Additional info:

Comment 1 Arjan van de Ven 2003-02-27 11:31:26 UTC

Please read
http://www.redhat.com/support/techsupport/production/GSS_caveat.html

Comment 2 Juergen Siats 2003-02-28 15:07:27 UTC

I ran the same test with 16 GB of main memory and XFree86 4.1.0-29, and the
system hangs again.

Comment 3 Juergen Siats 2003-03-04 07:41:52 UTC

I ran some tests with XFree86 4.1.0-44 and the same error occurs. Sometimes I
find this message in /var/log/messages:
Unable to handle kernel NULL pointer dereference at virtual address 00000007
Please see also bugzilla #85341

Comment 4 Christopher Li 2004-03-30 22:40:35 UTC

Created attachment 98982
bug fix for race in IPI code

I am not sure whether this bug is fixed in the latest summit kernel.
We ran into this problem as well. It turned out to be a bug in the
summit kernel.

The following code is quoted from smp_call_function() in the summit kernel.

#else
		/* Weirdly, the cpu_relax below breaks NUMA-Q boxes. */
		if (!clustered_apic_logical)
			cpu_relax();
#endif

No: cpu_relax() is just the pause instruction, so it should not break NUMA boxes.
That comment is a strong indication that there is a timing race elsewhere.
It turns out that smp_call_function_interrupt() sets the finished bit
regardless of whether the call is sync or async.

Assume an async call is followed by a very slow sync call.
The sync call starts before the async call has finished.
Then the async call finishes before the sync call does.
When the async call sets the finished bit, the caller of the
sync call mistakenly thinks the sync call has finished.
So the caller continues and releases the context data while the sync
call is still running on it!

do_ccupdate_local is what I am talking about.

BTW, this bug does not exist in the official kernel.
The official kernel implements smp_call_function correctly.
It is something introduced by the summit kernel.

After applying the attached patch, we could abuse the IBM x440
with a big-memory VMware stress test for weeks, and the kernel has not
oopsed yet. It has some other problems recovering from a swap storm, but
that is to be expected given how much memory is overcommitted.

Comment 6 Jason Baron 2004-09-21 20:40:56 UTC

Looks like this is a dup of #111219. Will double-check.

Comment 8 Larry Woodman 2005-09-29 00:42:06 UTC

We can't really make changes to the AS2.1 kernel at this point.  Please let me
know if this is OK on RHEL3.

Larry Woodman

Comment 9 Juergen Siats 2005-09-29 06:45:28 UTC

There are no problems with RHEL3.
