Bug 85275
Summary: | Summit-kernel hangs under heavy load. | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Juergen Siats <juergen.siats> | ||||
Component: | kernel | Assignee: | Larry Woodman <lwoodman> | ||||
Status: | CLOSED CANTFIX | QA Contact: | |||||
Severity: | high | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 2.1 | CC: | dff | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | i686 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2005-09-29 00:42:06 UTC | Type: | --- | ||||
Bug Blocks: | 123572 | ||||||
Attachments: |
|
Description
Juergen Siats
2003-02-27 11:29:26 UTC
I did the same test with 16 GB main memory and XFree86 4.1.0-29 and the system hangs again.
I did some tests with XFree86 4.1.0-44 and the same error happens.
Sometimes I find this message in /var/log/messages:
Unable to handle kernel NULL pointer dereference at virtual address 00000007
Please see also bugzilla #85341
Created attachment 98982
bug fix for race in IPI code
I am not sure whether this bug is fixed in the latest summit kernel.
We ran into this problem as well. It turned out to be a bug in the
summit kernel.
The following code is quoted from the summit kernel's smp_call_function.
#else
/* Weirdly, the cpu_relax below breaks NUMA-Q boxes. */
if (!clustered_apic_logical)
cpu_relax();
#endif
No, cpu_relax() is just the pause instruction; it should not break NUMA boxes.
It is a pretty good indication that there is a timing race elsewhere.
It turns out that smp_call_function_interrupt() sets the finished bit
regardless of whether the call is sync or async.
Let's assume there is an async call followed by a very slow sync call.
The second (sync) call starts before the first (async) call has finished.
Then the first async call finishes before the second one does.
When the async call sets the finished bit, the caller of the second
sync call mistakenly thinks the sync call has finished.
So the caller continues and releases the context data while the sync
call is still running on it!
do_ccupdate_local is what I am talking about.
BTW, this bug does not exist in the official kernel.
The official kernel does smp_call_function correctly.
It is something introduced by the summit kernel.
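The sequence described above can be replayed in a small single-threaded model. This is only a sketch, not the summit kernel's code: `call_slot`, `begin_call`, `complete_call` and `sync_caller_released_early` are hypothetical names made up for illustration, and the model assumes one shared descriptor and a single remote CPU. It shows why unconditionally marking completion lets a late async call satisfy the wait of a later sync call, and why marking it only for sync calls avoids that.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the call descriptor shared by all callers. */
struct call_slot {
    int wait;      /* 1 if the current caller waits for completion (sync) */
    int finished;  /* number of remote CPUs that reported completion */
};

static struct call_slot slot;

/* Caller side: publish a new call in the shared slot. */
static void begin_call(int wait)
{
    slot.wait = wait;
    slot.finished = 0;
}

/*
 * Remote side: runs when the handler for a call issued with `call_wait`
 * returns. The buggy variant always reports completion; the fixed variant
 * reports it only for sync calls.
 */
static void complete_call(int call_wait, bool buggy)
{
    if (buggy || call_wait)
        slot.finished++;
}

/*
 * Replays the scenario: async call, then a slow sync call, then the async
 * handler finishing late. Returns true if the sync caller would stop
 * waiting (and release its context) while the sync handler is still running.
 */
static bool sync_caller_released_early(bool buggy)
{
    begin_call(0);            /* 1. async call issued, caller does not wait */
    begin_call(1);            /* 2. sync call issued before the async handler ends */
    complete_call(0, buggy);  /* 3. the late async handler finishes now */
    /* The sync handler has not finished; the caller polls for 1 remote CPU. */
    return slot.finished >= 1;
}
```

In this model `sync_caller_released_early(true)` returns true (the caller is released too early), while `sync_caller_released_early(false)` returns false (the caller keeps waiting for the sync handler).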
After applying the attached patch, we can abuse the IBM x440
with a big-memory VMware stress test for weeks and the kernel has not
OOPSed yet. It has some other problems recovering from a swap storm, but
that is kind of expected given how much memory is overcommitted.
Looks like this is a dup of #111219. Will double check.
We can't really make changes to the AS2.1 kernel at this point. Please let me know if this is OK on RHEL3.
Larry Woodman
There are no problems with RHEL3.