Bug 202205 - system freeze under linpack stress
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.2
Hardware: x86_64 Linux
Priority: medium Severity: high
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks:
 
Reported: 2006-08-11 11:31 EDT by Benedikt Schaefer
Modified: 2012-06-20 09:17 EDT (History)

Doc Type: Bug Fix
Last Closed: 2012-06-20 09:17:01 EDT
Attachments
Call trace from crashed machine (3.92 KB, text/plain)
2006-08-11 11:31 EDT, Benedikt Schaefer
Another error log from another machine with the same symptoms (3.51 KB, text/plain)
2006-08-14 08:53 EDT, Benedikt Schaefer
raid1_mempool_race.patch (820 bytes, patch)
2006-08-15 12:29 EDT, Erich Focht

Description Benedikt Schaefer 2006-08-11 11:31:51 EDT
Description of problem:
The system freezes after a few hours of running the linpack stress test.
It is then no longer reachable over the network (no rsh, no ping).

Version-Release number of selected component (if applicable):
Test System:
Tyan GT24 (S2891)
 Bios: v2.02n10
 BMC: rev. 17
 CPUs: 2 x AMD Opteron 280
 Mem: 8 x 1GB PC3200
 HDD: 2 x 80GB SATA Hitachi
 Interconnect: 1 x Infiniband HCA (Voltaire)

OS: RHEL4U2
kernel: 2.6.9-22.0.2


How reproducible:
It doesn't happen every time.

Steps to Reproduce:
1. Run linpack over a 24-hour period.

Actual results:
Some servers crash, some do not.

Expected results:
No crash.

Additional info:
We think that BIO_MEMPOOL is exhausted while the application keeps issuing more bios.
During the linpack stress test the system swaps about 3 GB.
Attached is the call trace from one crashed machine.
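
The exhaustion hypothesis above can be pictured with a toy model. This is only a sketch and not kernel code: `toy_pool`, `toy_pool_get` and `toy_stuck_requests` are hypothetical names, and it assumes, purely for illustration, that each request needs two elements from the same fixed reserve before it can complete and give both back. Whether the real bio mempool is used this way on the affected machines is not established in this report.

```c
/* Toy model, NOT kernel code: a fixed reserve of preallocated elements. */
struct toy_pool {
    int free_elems; /* elements still available in the reserve */
};

/* Take one element. Returns 1 on success and 0 if the reserve is empty
 * (the point where a blocking allocator would put the caller to sleep). */
int toy_pool_get(struct toy_pool *p)
{
    if (p->free_elems == 0)
        return 0;
    p->free_elems--;
    return 1;
}

/* Assumption for illustration: every request needs two elements from the
 * same pool before it can complete and return both. Returns the number of
 * requests left holding one element with no way to obtain a second. */
int toy_stuck_requests(int pool_size, int nr_requests)
{
    struct toy_pool pool = { pool_size };
    int holding_one = 0;

    /* Every request first grabs one element, as long as any are left. */
    for (int i = 0; i < nr_requests; i++)
        if (toy_pool_get(&pool))
            holding_one++;

    /* If the reserve is now empty, no holder can get its second element,
     * so nothing is ever returned and all holders wait forever. If at
     * least one element is left, one holder can finish and return both of
     * its elements, which lets the remaining holders finish in turn. */
    if (pool.free_elems == 0)
        return holding_one;
    return 0;
}
```

With a reserve of 4 elements, 3 concurrent requests all finish, while 4 or more leave 4 requests stuck. That is the general shape of a deadlock on an exhausted pool under heavy I/O and swapping.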
Comment 1 Benedikt Schaefer 2006-08-11 11:31:52 EDT
Created attachment 134028 [details]
Call trace from crashed machine
Comment 2 Benedikt Schaefer 2006-08-14 08:53:36 EDT
Created attachment 134135 [details]
Another error log from another machine with the same symptoms

Here is another error log from a different machine (same type) which also crashed.
Comment 3 Erich Focht 2006-08-14 12:23:23 EDT
Looks like the patch submitted in the LKML thread
"[PATCH] dm: Fix deadlock under high i/o load in raid1 setup."
is addressing exactly this issue.
http://opensubscriber.com/message/linux-kernel%40vger.kernel.org/4640513.html

Any chance this goes into the RHEL4 kernel?

regards,
Erich
Comment 4 Larry Woodman 2006-08-14 14:04:57 EDT
Did you get a crash dump from this machine?  The reason I ask is because it
looks like the machine took an NMI watchdog timeout panic because this CPU was
stuck in a spinlock with interrupts disabled.  Evidently someone else has the
zone->lock so this CPU starved without taking timer interrupts long enough to
incur the NMI watchdog crash.


static struct page *
buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
{
...
        if (page == NULL) {
                spin_lock_irqsave(&zone->lock, flags);
                page = __rmqueue(zone, order);
                spin_unlock_irqrestore(&zone->lock, flags);
        }
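
As a rough illustration of the mechanism described in this comment, the sketch below models a lockup check that fires when a CPU's timer count stops advancing. It is a simplified user-space model, not the kernel's NMI watchdog code: `toy_watchdog`, `toy_watchdog_check` and the threshold are hypothetical names and values chosen for the example.

```c
#include <stdbool.h>

/* Toy model of a per-CPU lockup check; names and threshold are illustrative. */
struct toy_watchdog {
    unsigned long last_timer_count; /* timer count seen at the previous check */
    int stalled_checks;             /* consecutive checks without progress */
    int threshold;                  /* stalled checks tolerated before a lockup is declared */
};

/* Meant to be called periodically from a source that cannot be masked (in
 * the real kernel, the NMI). Returns true once the timer count has not
 * moved for `threshold` consecutive checks. */
bool toy_watchdog_check(struct toy_watchdog *wd, unsigned long timer_count)
{
    if (timer_count != wd->last_timer_count) {
        /* Timer interrupts are still being handled: reset the stall count. */
        wd->last_timer_count = timer_count;
        wd->stalled_checks = 0;
        return false;
    }
    wd->stalled_checks++;
    return wd->stalled_checks >= wd->threshold;
}
```

A CPU spinning in spin_lock_irqsave() with interrupts off never increments its timer count, so in this model every check sees the same value until the threshold is reached.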
Comment 5 Benedikt Schaefer 2006-08-15 02:21:58 EDT
No, we have no crash dump for this machine.
Comment 6 Benedikt Schaefer 2006-08-15 02:46:57 EDT
I have looked at the patch posted by Erich, but I'm not sure it will help us
because we are not using dm-raid; we are using mdadm. Could this effect also
happen with mdadm, or is it a bug only in the dm-raid package?
Comment 7 Erich Focht 2006-08-15 12:29:10 EDT
Created attachment 134221 [details]
raid1_mempool_race.patch

In theory this patch should solve the issue in drivers/md/raid1.c similarly to
what was posted to LKML. My attempt to reproduce the bug led straight into
another lockup (ext3 related). Will check bugzilla for something similar and
eventually post the report in another ticket...
Comment 8 Benedikt Schaefer 2006-08-17 02:29:40 EDT
It looks like another bug report exists with the same problem (bug report #149088).
Comment 9 Larry Woodman 2006-08-17 11:10:16 EDT
Erich, did you verify that the patch in comment #7 fixed this problem?  The NMI
watchdog panic attached in comment #2 is certainly a different problem but this
patch might be the fix for memory allocation failure attached in comment #1 and
that might very well cause the system to hang.

Larry Woodman
Comment 10 Erich Focht 2006-08-18 04:42:40 EDT
Hi Larry,
I'm trying to reproduce the first (kswapd-related) freeze but haven't succeeded
yet. It's a pretty rare event. I'm still trying with the original kernel. Actually
this should occur faster on single-core machines (IMHO), so we switched
testing to single-core nodes. Once the reproducer works, I'll try with the
patch. And keep you updated, of course.
Regards,
Erich
Comment 11 Larry Woodman 2006-12-01 15:27:39 EST
Erich or Benedikt, can you try increasing /proc/sys/vm/min_free_kbytes to 4
times its default value and see if this prevents this hang from happening?  This
is what was done in the upstream kernel and does prevent the system from totally
exhausting RAM.

Thanks, Larry Woodman
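
For anyone trying the suggestion in comment 11, the following is a minimal sketch of reading the current value and computing four times it. The helper names `read_kbytes` and `suggested_min_free_kbytes` are made up for this example, and it assumes the value currently in /proc/sys/vm/min_free_kbytes is still the boot-time default and has not already been tuned.

```c
#include <stdio.h>

/* Reads a single non-negative integer from `path` (for example
 * "/proc/sys/vm/min_free_kbytes"). Returns -1 on any error. */
long read_kbytes(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1 || value < 0)
        value = -1;
    fclose(f);
    return value;
}

/* Comment 11 suggests 4 times the default value. */
long suggested_min_free_kbytes(long current_default)
{
    return current_default * 4;
}
```

Calling `suggested_min_free_kbytes(read_kbytes("/proc/sys/vm/min_free_kbytes"))` gives the number to try, provided the read did not return -1. Applying it requires root, for example by writing the number into the same file; a value written this way does not persist across reboots.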
 
Comment 12 Jiri Pallich 2012-06-20 09:17:01 EDT
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
