54821 – 2.4.9-6smp crash when running Cerberus with 1GB RAM

Summary: 2.4.9-6smp crash when running Cerberus with 1GB RAM

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-10-19 18:55 UTC by David
Modified:	2007-04-18 16:37 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-06-07 18:46:24 UTC
Embargoed:

Description David 2001-10-19 18:55:17 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
Machine: Dell PowerEdge 2550, dual 1GHz P3, 1GB RAM, Perc3/Di RAID-5
Kernel: The recently released 2.4.9-6smp, upgraded from 2.4.3-12smp
Software: All errata applied. Running the Cerberus test v1.3.0pre3.
Abstract: Not long into the test all h-ll breaks loose, kswapd runs 
forever, no progress is made, the scsi subsystem times out and finally the 
system hangs with a panic. Same test with previous official kernel release 
from RedHat, 2.4.3-12smp, works fine except for a few bounce buffer 
shortages.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot Dell SMP system with 1GB ram using kernel 2.4.9-6smp.
2. Start cerberus test.
3. Crumble, crackle, crash.

Actual Results:  kswapd started running forever, the machine stopped 
making any progress, impossible to login, degrading, raid-array stopped 
doing work, scsi errors (timeouts), kernel panic, crash.

Expected Results:  As when doing the same using 2.4.3-12smp, i.e. chug 
along, with a load of about 11.30, for days.

Additional info:

I have performed the exact same test on two other, very similar, machines. 
One with the only difference of having 512MB RAM instead of 1GB and the 
other having 512MB RAM and a RAID-1 array instead of a RAID-5 array. All 
machines are Dell PowerEdge 25x0, with dual P3's.

This *seems* to be related to high-mem. I am *not* a kernel hacker so 
obviously I don't know. I tend to read some of the more interesting 
threads on lkml, though, and I do recall several issues with the 
combination of high load and high-mem. This just looks related.

The machines are about to be put into production and I tend to stress test 
them before that to be fairly sure about their ability to keep up with the 
production load (multi-site web-serving, heavily updated PostgreSQL 
database, real-time graphing, etc.).

The reason for upgrading the kernel is obviously to get rid of the 
recently published local root-exploit.

Comment 1 Arjan van de Ven 2001-10-19 19:52:23 UTC

What firmware version is the megaraid card?
My personal testbox has 1GB with megaraid and no problems at all so far... but I
have a recent firmware version

Comment 2 Arjan van de Ven 2001-10-19 20:04:31 UTC

also, any information about the "panic" would be most welcome

Comment 3 David 2001-10-19 20:57:31 UTC

It's not a megaraid card, it's a Perc3/Di, i.e. an aacraid card.
If I understood correctly, the megaraids are from AMI and
the Perc3/Di's are from Adaptec. Different drivers, right?

The Perc3/Di has the latest officially available firmware
from Dell. Controller BIOS: 2.5-0, build #2991, and controller
firmware build #2991 as well. Since the machine is quite new,
it has the latest system bios as well (A05).

Unfortunately I have no details from the panic available,
i.e no stacktrace, oops symbols or similar. Not good...

I do recall having similar problems, albeit not leading
up to a system crash, when testing a previously bought
machine of the exact same type and configuration using
the original kernel in RH7.1 (2.4.2-something).

What happened then was that under certain workloads, e.g
a bonnie++ run combined with another memory and disk-
intensive task the load would go through the roof, with
kswapd seemingly just spinning. I managed to kill the
processes that time, but no such luck this time.

That problem went away after upgrading to 2.4.3, just
as this one after downgrading back to 2.4.3.

I hope this helps a little bit. I am away from the
machines right now, but they are (all three of them)
running Cerberus continuously over the weekend.

So far, no problems related to the stress test, except
for this little one in /var/log/messages:

Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers.

That one appeared pretty early in the Cerberus run.
Nothing more related in the logs for 8 hours since.
I have no other similar machines (SMP, 1GB RAM, HW-RAID),
but non-Dell, to run comparative tests on.
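
For anyone repeating this test, here is a minimal sketch of how the
log could be checked for the message quoted above. The function name
count_kernel_messages is made up for illustration, and the regular
expression assumes the classic syslog line layout shown in that
excerpt. Only the bounce-buffer message text comes from the real log.

```python
import re
from collections import Counter

# Only this message text is taken from the log excerpt in this report.
BOUNCE_SHORTAGE = "mm: critical shortage of bounce buffers"

# Assumed syslog layout: "Oct 19 15:53:07 host kernel: message"
KERNEL_LINE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s+(.*)$")


def count_kernel_messages(lines, needles=(BOUNCE_SHORTAGE,)):
    """Count kernel syslog lines that contain each needle string."""
    counts = Counter()
    for line in lines:
        match = KERNEL_LINE.match(line)
        if not match:
            continue  # not a kernel message, skip it
        for needle in needles:
            if needle in match.group(1):
                counts[needle] += 1
    return counts
```

It can be fed an open file, e.g. count_kernel_messages(open("/var/log/messages")),
and other message substrings can be passed through the needles argument.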

The one that crashed is still running, with a load
of 11.44 at this time.

Comment 4 David 2001-10-20 08:29:22 UTC

Hmm...as a matter of fact, this problem is different from
the one related to 2.4.2-something I described above.

In the 2.4.2 case, the kswapd got all cpu time and spun.

In the 2.4.9-6smp case, the kswapd was running with a
significant portion of the cpu time, but not actually
*all* of it.
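
To put a number on "a significant portion" versus "all" of the cpu
time, something like the sketch below could be used. It is an
illustration only: the helper names cpu_ticks and cpu_share are made up,
the field positions follow the /proc/<pid>/stat layout documented in
proc(5), and the default of 100 clock ticks per second is an assumption
(os.sysconf("SC_CLK_TCK") reports the real value).

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces, so
    # split after the last ')'. The remaining fields start at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def cpu_share(ticks_before, ticks_after, interval_s, ticks_per_s=100):
    """Fraction of one CPU used between two samples interval_s seconds apart."""
    return (ticks_after - ticks_before) / (interval_s * ticks_per_s)
```

Sampling the kswapd line twice, a known number of seconds apart, and
passing both tick counts to cpu_share gives a value near 1.0 when the
thread is spinning on a whole cpu and a lower value otherwise.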

Based on this second thought, my personal suspicion is
turned towards the scsi subsystem, with perhaps the
aacraid driver as the one that actually hung. A very
big difference in this crash was the rather long list
of scsi timeouts that did not occur with 2.4.2. During
those timeouts the RAID-array was seemingly sitting
idle (no activity lights on the drives). The aacraid
version seems to be newer in 2.4.9-6smp compared to
2.4.3-12smp. Could that, or something else that's been
changed in the scsi subsystem, be the source of this?

Comment 5 Alan Cox 2003-06-07 18:46:24 UTC

I rewrote the aacraid driver in the end. The old one had an out-of-memory hang
case that fits this description, so assuming fixed
