Red Hat Bugzilla – Bug 85275
Summit-kernel hangs under heavy load.
Last modified: 2007-11-30 17:06:52 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)
Description of problem:
I have installed Red Hat AS 2.1 on a Fujitsu Siemens Computers Primergy T850
with 8 CPUs and 32 GB of main memory. The Primergy T850 is the same hardware as
the IBM x440.
After installing RH AS 2.1 out of the box, I installed kernel 2.4.9-e.12:
rpm -Uvh kernel-doc-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-headers-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-source-2.4.9-e.12.i386.rpm
rpm -ivh kernel-summit-2.4.9-e.12.i686.rpm
Additionally I use the boot option "notsc".
The system hangs approximately 30 minutes after starting our test suite "Platform
Load Test". The graphical console is frozen, the keyboard and mouse do not
respond, and the machine no longer answers ping.
I cannot find any error messages.
I also reported this error against the beta kernel versions under the
The test suite and some additional information can be found as an attachment of
I think you can reproduce the error on an IBM x440 system because it is the same
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. See bugzilla #83715
I did the same test with 16 GB main memory and XFree86 4.1.0-29 and the system
I ran some tests with XFree86 4.1.0-44 and the same error occurs. Sometimes I
find this message in /var/log/messages:
Unable to handle kernel NULL pointer dereference at virtual address 00000007
Please see also bugzilla #85341
Created attachment 98982 [details]
bug fix for race in IPI code
I am not sure whether this bug is fixed in the latest summit kernel.
We ran into this problem as well. It turned out to be a bug in the
The following code is quoted from the summit kernel's smp_call_function.
/* Weirdly, the cpu_relax below breaks NUMA-Q boxes. */
No, cpu_relax() is just the pause instruction; it should not break NUMA boxes.
It is a pretty good indication that there is a timing race somewhere else.
It turns out that smp_call_function_interrupt() sets the finished bit
regardless of whether the call is sync or async.
Assume an async call is followed by a very slow sync call.
The second (sync) call starts before the first (async) call has finished.
Then the first async call finishes before the second one does.
When the async call sets the finished bit, the caller of the second
sync call mistakenly thinks the sync call has finished.
So the caller continues and releases the context data while the sync
call is still running on it!
do_ccupdate_local is what I am talking about.
BTW, this bug does not exist in the official kernel.
The official kernel does smp_call_function correctly.
It is something introduced by the summit kernel.
After applying the attached patch, we can abuse the IBM x440
with a big-memory VMware stress test for weeks and the kernel has not
OOPSed yet. It still has some other problems recovering from a swap storm, but
that is kind of expected given how heavily memory is overcommitted.
Looks like this is a dup of #111219. Will double check.
We can't really make changes to the AS2.1 kernel at this point. Please let me
know if this is OK on RHEL3.
There are no problems with RHEL3.