Bug 85275

Summary: Summit-kernel hangs under heavy load.
Product: Red Hat Enterprise Linux 2.1
Reporter: Juergen Siats <juergen.siats>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED CANTFIX
Severity: high
Priority: medium
Version: 2.1
CC: dff
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-09-29 00:42:06 UTC
Bug Blocks: 123572    
Attachments:
bug fix for race in IPI code (flags: none)

Description Juergen Siats 2003-02-27 11:29:26 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
I have installed Red Hat AS 2.1 on a Fujitsu Siemens Computers Primergy T850 
with 8 CPUs and 32 GB of main memory. The Primergy T850 is the same hardware as 
the IBM x440.
After installing RH AS 2.1 out of the box, I installed kernel 2.4.9-e.12:
rpm -Uvh kernel-doc-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-headers-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-source-2.4.9-e.12.i386.rpm
rpm -ivh kernel-summit-2.4.9-e.12.i686.rpm

Additionally I use the boot option "notsc".

The system hangs approximately 30 minutes after starting our test suite "Platform 
Load Test". The graphical console is frozen, the keyboard and mouse don't work, and 
the machine no longer answers a ping.
I cannot find any error messages.

I also reported this error against the beta kernel versions in 
bugzilla #83715.
The test suite and some additional information are attached to 
bugzilla #83715.

I think you can reproduce the error on an IBM x440 system because it is the same 
hardware.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. See bugzilla #83715
    

Additional info:

Comment 1 Arjan van de Ven 2003-02-27 11:31:26 UTC
Please read
http://www.redhat.com/support/techsupport/production/GSS_caveat.html

Comment 2 Juergen Siats 2003-02-28 15:07:27 UTC
I ran the same test with 16 GB of main memory and XFree86 4.1.0-29, and the system 
hung again.

Comment 3 Juergen Siats 2003-03-04 07:41:52 UTC
I ran some tests with XFree86 4.1.0-44 and the same error happens. Sometimes I 
find this message in /var/log/messages:
Unable to handle kernel NULL pointer dereference at virtual address 00000007
Please also see bugzilla #85341.

Comment 4 Christopher Li 2004-03-30 22:40:35 UTC
Created attachment 98982
bug fix for race in IPI code

I am not sure whether this bug is fixed in the latest summit kernel.
We ran into this problem as well. It turned out to be a bug in the
summit kernel.

The following code is quoted from smp_call_function in the summit kernel.

#else
		/* Weirdly, the cpu_relax below breaks NUMA-Q boxes. */
		if (!clustered_apic_logical)
			cpu_relax();
#endif

No, cpu_relax() is just the pause instruction; it should not break NUMA boxes.
That comment is a pretty good indication that there is a timing race elsewhere.
It turns out that smp_call_function_interrupt() sets the finished bit
whether the call is sync or async.

Let's assume an async call is followed by a very slow sync call.
The sync call starts before the async call has finished.
Then the async call finishes before the sync call does.
When the async call sets the finished bit, the caller of the sync
call mistakenly thinks the sync call has finished.
So the caller continues and releases the context data while the sync
call is still running on it!
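
The interleaving described above can be replayed as a small single-threaded C model. This is only a sketch under stated assumptions, not the summit kernel code and not the attached patch: the struct, the function names, the use of a counter instead of a bit, and the choice of two remote CPUs are all hypothetical and exist only to make the ordering visible.

```c
#include <assert.h>
#include <stdbool.h>

#define REMOTE_CPUS 2  /* assumption: two CPUs besides the caller */

/* Hypothetical stand-in for the per-call bookkeeping. */
struct call_data {
    int started;   /* remote CPUs that have entered the handler */
    int finished;  /* completion reports received */
    bool wait;     /* true for a sync call, false for an async call */
};

/* Shared pointer that every handler looks at. */
struct call_data *current_call;

/* Handler entry: remember which call this CPU is servicing. */
struct call_data *handler_enter(void)
{
    struct call_data *mine = current_call;
    mine->started++;
    return mine;
}

/* Modelled bug: report completion on whatever call is current now,
 * regardless of whether the serviced call was sync or async. */
void handler_exit_buggy(struct call_data *mine)
{
    (void)mine;
    current_call->finished++;
}

/* Modelled fix: only a sync call reports completion. */
void handler_exit_fixed(struct call_data *mine)
{
    if (mine->wait)
        mine->finished++;
}

/* Replays the interleaving. Returns true if the sync caller would see
 * "all finished" while CPU 1 is still inside the sync function. */
bool caller_released_early(void (*handler_exit)(struct call_data *))
{
    struct call_data async_call = { 0, 0, false };
    struct call_data sync_call = { 0, 0, true };
    struct call_data *cpu1, *cpu2;

    /* Async call: both CPUs enter, CPU 2 finishes at once, CPU 1 is slow. */
    current_call = &async_call;
    cpu1 = handler_enter();
    cpu2 = handler_enter();
    assert(async_call.started == REMOTE_CPUS); /* async caller returns here */
    handler_exit(cpu2);

    /* Sync call is issued while CPU 1 is still in the async function. */
    current_call = &sync_call;
    cpu2 = handler_enter();   /* CPU 2 takes the sync IPI */
    handler_exit(cpu1);       /* CPU 1 finally leaves the async handler */
    cpu1 = handler_enter();   /* CPU 1 takes the sync IPI, runs slowly */
    assert(sync_call.started == REMOTE_CPUS);
    handler_exit(cpu2);       /* CPU 2 finishes the sync function */

    /* CPU 1 has not left the sync handler yet. */
    (void)cpu1;
    return sync_call.finished == REMOTE_CPUS;
}
```

In this model, caller_released_early(handler_exit_buggy) returns true because the late async exit on CPU 1 is counted against the sync call, while caller_released_early(handler_exit_fixed) returns false because the caller is still waiting for CPU 1.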

do_ccupdate_local is the case I am talking about.

BTW, this bug does not exist in the official kernel.
The official kernel does smp_call_function correctly.
It is something introduced by the summit kernel.

After applying the attached patch, we can abuse the IBM x440
with a big-memory VMware stress test for weeks and the kernel has not
oopsed yet. It has some other problems recovering from a swap storm, but
that is kind of expected given how much memory is overcommitted.

Comment 6 Jason Baron 2004-09-21 20:40:56 UTC
Looks like this is a dup of #111219. Will double check.

Comment 8 Larry Woodman 2005-09-29 00:42:06 UTC
We can't really make changes to the AS2.1 kernel at this point. Please let me
know if this is OK on RHEL3.

Larry Woodman


Comment 9 Juergen Siats 2005-09-29 06:45:28 UTC
There are no problems with RHEL3.