Description of problem:
System freezes while the linpack stress test has been running for a few hours. The system is then no longer reachable over the network (no rsh, no ping).

Version-Release number of selected component (if applicable):
Test system: Tyan GT24 (S2891)
BIOS: v2.02n10
BMC: rev. 17
CPUs: 2 x AMD Opteron 280
Mem: 8 x 1GB PC3200
HDD: 2 x 80GB SATA Hitachi
Interconnect: 1 x InfiniBand HCA (Voltaire)
OS: RHEL4U2
Kernel: 2.6.9-22.0.2

How reproducible:
It doesn't happen every time.

Steps to Reproduce:
1. Run linpack for over 24h.

Actual results:
Some servers crash, some do not.

Expected results:
No crash.

Additional info:
We think that the BIO mempool is full, but the application keeps submitting more bios. During the linpack stress test the system swaps about 3GB. Attached is the call trace from one crashed machine.
Created attachment 134028 [details] Call trace from crashed machine
Created attachment 134135 [details] another error log from another machine with similar symptoms

Here is another error log from a different machine (same type) which also crashed.
Looks like the patch submitted in the LKML thread "[PATCH] dm: Fix deadlock under high i/o load in raid1 setup." addresses exactly this issue:
http://opensubscriber.com/message/linux-kernel%40vger.kernel.org/4640513.html

Any chance this goes into the RHEL4 kernel?

Regards,
Erich
Did you get a crash dump from this machine? The reason I ask is that it looks like the machine took an NMI watchdog timeout panic because this CPU was stuck in a spinlock with interrupts disabled. Evidently someone else held the zone->lock, so this CPU starved without taking timer interrupts long enough to trigger the NMI watchdog panic:

    static struct page *
    buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
    {
            ...
            if (page == NULL) {
                    spin_lock_irqsave(&zone->lock, flags);
                    page = __rmqueue(zone, order);
                    spin_unlock_irqrestore(&zone->lock, flags);
            }
No, we have no crash dump for this machine.
I have looked at the patch posted by Erich, but I'm not sure it will help us, because we are not using dm-raid; we are using mdadm. Could this issue also happen with mdadm, or is the bug specific to dm-raid?
Created attachment 134221 [details] raid1_mempool_race.patch

In theory this patch should solve the issue in drivers/md/raid1.c, similarly to what was posted to LKML. My attempt to reproduce the bug led straight into another lockup (ext3 related). I will check bugzilla for something similar and eventually post the report in another ticket...
It looks like another bug report exists for the same problem (bug #149088).
Erich, did you verify that the patch in comment #7 fixes this problem? The NMI watchdog panic attached in comment #2 is certainly a different problem, but this patch might be the fix for the memory allocation failure attached in comment #1, and that could very well cause the system to hang.

Larry Woodman
Hi Larry,

I'm trying to reproduce the first (kswapd-related) freeze but haven't succeeded yet; it's a pretty rare event. I'm still trying with the original kernel. This should actually occur faster on single-core machines (IMHO), so we switched testing to single-core nodes. Once the reproducer works, I'll try with the patch. And I'll keep you updated, of course.

Regards,
Erich
Erich or Benedikt, can you try increasing /proc/sys/vm/min_free_kbytes to 4 times its default value and see if this prevents this hang from happening? This is what was done in the upstream kernel and does prevent the system from totally exhausting RAM. Thanks, Larry Woodman
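For anyone wanting to try Larry's suggestion, a minimal sketch of the tuning step follows. The default value of min_free_kbytes is computed at boot from the amount of installed RAM, so read the current value rather than hard-coding one; this change requires root and does not persist across a reboot (add vm.min_free_kbytes to /etc/sysctl.conf for that).

```shell
#!/bin/sh
# Read the current (default) min_free_kbytes value; it varies with RAM size.
default=$(cat /proc/sys/vm/min_free_kbytes)

# Quadruple it, as suggested above, to keep more free memory in reserve.
echo $((default * 4)) > /proc/sys/vm/min_free_kbytes

# Confirm the new setting took effect.
cat /proc/sys/vm/min_free_kbytes
```

Since this only changes a watermark, it is safe to apply on a running system and revert by writing the old value back.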
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.