Red Hat Bugzilla – Bug 54821
2.4.9-6smp crash when running Cerberus with 1GB RAM
Last modified: 2007-04-18 12:37:40 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Description of problem:
Machine: Dell PowerEdge 2550, dual 1GHz P3, 1GB RAM, Perc3/Di RAID-5
Kernel: The recently released 2.4.9-6smp, upgraded from 2.4.3-12smp
Software: All errata applied. Running the Cerberus test v1.3.0pre3.
Abstract: Not long into the test all h-ll breaks loose, kswapd runs
forever, no progress is made, the scsi subsystem times out and finally the
system hangs with a panic. Same test with previous official kernel release
from RedHat, 2.4.3-12smp, works fine except for a few bounce buffer
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot Dell SMP system with 1GB RAM using kernel 2.4.9-6smp.
2. Start cerberus test.
3. Crumble, crackle, crash.
Actual Results: kswapd started running forever, the machine stopped
making any progress, impossible to log in, degrading, raid-array stopped
doing work, scsi errors (timeouts), kernel panic, crash.
Expected Results: As when doing the same using 2.4.3-12smp, i.e. chug
along, with a load of about 11.30, for days.
I have performed the exact same test on two other, very similar, machines.
One with the only difference of having 512MB RAM instead of 1GB and the
other having 512MB RAM and a RAID-1 array instead of a RAID-5 array. All
machines are Dell PowerEdge 25x0, with dual P3's.
This *seems* to be related to high-mem. I am *not* a kernel hacker so
obviously I don't know. I tend to read some of the more interesting
threads on lkml, though, and I do recall several issues with the
combination of high load and high-mem. This just looks related.
The machines are about to be put into production and I tend to stress test
them before that to be fairly sure about their ability to keep up with the
production load (multi-site web-serving, heavily updated PostgreSQL
database, real-time graphing, etc.).
The reason for upgrading the kernel is obviously to get rid of the
recently published local root-exploit.
What firmware version is the megaraid card?
My personal testbox is 1GB with megaraid and no problems at all so far... but I
have a recent firmware version.
Also, any information about the "panic" would be most welcome.
It's not a megaraid card, it's a Perc3/Di, i.e. an aacraid card.
If I understood correctly, the megaraids are from AMI and
the Perc3/Di's are from Adaptec. Different drivers, right?
The Perc3/Di has the latest officially available firmware
from Dell. Controller BIOS: 2.5-0, build #2991, and controller
firmware build #2991 as well. Since the machine is quite new,
it has the latest system bios as well (A05).
Unfortunately I have no details from the panic available,
i.e. no stack trace, oops symbols or similar. Not good...
I do recall having similar problems, albeit not leading
up to a system crash, when testing a previously bought
machine of the exact same type and configuration using
the original kernel in RH7.1 (2.4.2-something).
What happened then was that under certain workloads, e.g.
a bonnie++ run combined with another memory- and disk-
intensive task, the load would go through the roof, with
kswapd seemingly just spinning. I managed to kill the
processes that time, but no such luck this time.
That problem went away after upgrading to 2.4.3, just
as this one did after downgrading back to 2.4.3.
I hope this helps a little bit. I am away from the
machines right now, but they are (all three of them)
running Cerberus continuously over the weekend.
So far, no problems related to the stress test, except
for this little one in /var/log/messages:
Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers.
That one appeared pretty early in the Cerberus run.
Nothing more related in the logs for 8 hours since.
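For anyone wanting to check their own logs for the same warning, here is a
minimal sketch in Python. It assumes the classic syslog line layout shown in
the message above; the function name find_bounce_warnings is made up for
illustration and is not part of any existing tool.

```python
import re

# Matches the syslog line format quoted in this report, e.g.
# "Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers."
# Assumption: timestamp, host, then "kernel:" as in a stock /var/log/messages.
BOUNCE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"mm: critical shortage of bounce buffers"
)


def find_bounce_warnings(lines):
    """Return (timestamp, host) pairs for each bounce buffer warning found."""
    hits = []
    for line in lines:
        match = BOUNCE_RE.match(line)
        if match:
            hits.append((match.group("stamp"), match.group("host")))
    return hits


# The first line is the one quoted above; the second is an unrelated
# placeholder line that should be ignored.
sample = [
    "Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers.\n",
    "Oct 19 15:54:00 db2 kernel: some other message\n",
]
print(find_bounce_warnings(sample))
```

To run it against a real log, pass an open file object (for instance
/var/log/messages) instead of the sample list.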
I have no other similar machines (SMP, 1GB RAM, HW-RAID),
but non-Dell, to run comparative tests on.
The one that crashed is still running, with a load
of 11.44 at this time.
Hmm...as a matter of fact, this problem is different from
the one related to 2.4.2-something I described above.
In the 2.4.2 case, the kswapd got all cpu time and spun.
In the 2.4.9-6smp case, the kswapd was running with a
significant portion of the cpu time, but not actually
*all* of it.
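To put a number on "all" versus "a significant portion", one option is to
sample kswapd's accumulated CPU ticks from /proc/<pid>/stat twice and divide
by the elapsed time. The sketch below is only an illustration: the helper
names are hypothetical, the sample line is synthetic (not taken from the
affected machine), and the tick rate of 100 per second is an assumption that
should be checked on the system under test.

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')' and count fields from there.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 14 (utime) is index 11
    # and field 15 (stime) is index 12.
    return int(after_comm[11]) + int(after_comm[12])


def cpu_share(ticks_before, ticks_after, interval_s, hz=100):
    """Fraction of one CPU used between two samples taken interval_s apart.

    hz is the number of clock ticks per second; 100 is an assumption here.
    """
    return (ticks_after - ticks_before) / (interval_s * hz)


# Synthetic stat line for illustration only: utime=120, stime=880.
line = "5 (kswapd) S 1 1 1 0 -1 64 0 0 0 0 120 880 0 0 20 0 1 0 10 0 0"
print(cpu_ticks(line))
print(cpu_share(1000, 1500, 10.0))
```

A share close to 1.0 would correspond to kswapd spinning on one CPU, as in
the 2.4.2 case; a clearly lower value would match what was seen here.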
Based on this second thought, my personal suspicion is
turned towards the scsi subsystem, with perhaps the
aacraid driver as the one that actually hung. A very
big difference in this crash was the rather long list
of scsi timeouts that did not occur with 2.4.2. During
those timeouts the RAID-array was seemingly sitting
idle (no activity lights on the drives). The aacraid
version seems to be newer in 2.4.9-6smp compared to
2.4.3-12smp. Could that, or something else that's been
changed in the scsi subsystem, be the source of this?
I rewrote the aacraid driver in the end. The old one had an out of memory hang
case that fits this description, so I am assuming this is fixed.