From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
Machine: Dell PowerEdge 2550, dual 1GHz P3, 1GB RAM, Perc3/Di RAID-5
Kernel: The recently released 2.4.9-6smp, upgraded from 2.4.3-12smp
Software: All errata applied. Running the Cerberus test v1.3.0pre3.
Abstract: Not long into the test all h-ll breaks loose: kswapd runs forever, no progress is made, the scsi subsystem times out and finally the system hangs with a panic. The same test with the previous official kernel release from RedHat, 2.4.3-12smp, works fine except for a few bounce buffer shortages.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Boot Dell SMP system with 1GB RAM using kernel 2.4.9-6smp.
2. Start cerberus test.
3. Crumble, crackle, crash.

Actual Results:
kswapd started running forever, the machine stopped making any progress and it became impossible to log in. Things then degraded further: the raid-array stopped doing work, scsi errors (timeouts) appeared, and the system ended in a kernel panic and crash.

Expected Results:
The same as with 2.4.3-12smp, i.e. chug along, with a load of about 11.30, for days.

Additional info:
I have performed the exact same test on two other, very similar, machines. One differs only in having 512MB RAM instead of 1GB; the other has 512MB RAM and a RAID-1 array instead of a RAID-5 array. All machines are Dell PowerEdge 25x0, with dual P3s.

This *seems* to be related to high-mem. I am *not* a kernel hacker so obviously I don't know. I tend to read some of the more interesting threads on lkml, though, and I do recall several issues with the combination of high load and high-mem. This just looks related.

The machines are about to be put into production and I tend to stress test them before that to be fairly sure about their ability to keep up with the production load (multi-site web-serving, heavily updated PostgreSQL database, real-time graphing, etc.).
The reason for upgrading the kernel is obviously to get rid of the recently published local root-exploit.
What firmware version is the megaraid card? My personal testbox has 1GB with megaraid and no problems at all so far... but I have a recent firmware version.
Also, any information about the "panic" would be most welcome.
It's not a megaraid card, it's a Perc3/Di, i.e. an aacraid card. If I understood correctly, the megaraids are from AMI and the Perc3/Di's are from Adaptec. Different drivers, right?

The Perc3/Di has the latest officially available firmware from Dell. Controller BIOS: 2.5-0, build #2991, and controller firmware build #2991 as well. Since the machine is quite new, it has the latest system BIOS as well (A05).

Unfortunately I have no details from the panic available, i.e. no stacktrace, oops symbols or similar. Not good...

I do recall having similar problems, albeit not leading up to a system crash, when testing a previously bought machine of the exact same type and configuration using the original kernel in RH7.1 (2.4.2-something). What happened then was that under certain workloads, e.g. a bonnie++ run combined with another memory- and disk-intensive task, the load would go through the roof, with kswapd seemingly just spinning. I managed to kill the processes that time, but no such luck this time. That problem went away after upgrading to 2.4.3, just as this one did after downgrading back to 2.4.3.

I hope this helps a little bit. I am away from the machines right now, but they are (all three of them) running Cerberus continuously over the weekend. So far, no problems related to the stress test, except for this little one in /var/log/messages:

Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers.

That one appeared pretty early in the Cerberus run. Nothing more related in the logs for 8 hours since.

I have no other similar machines (SMP, 1GB RAM, HW-RAID), but non-Dell, to run comparative tests on. The one that crashed is still running, with a load of 11.44 at this time.
Hmm... as a matter of fact, this problem is different from the one related to 2.4.2-something I described above. In the 2.4.2 case, kswapd got all the cpu time and spun. In the 2.4.9-6smp case, kswapd was running with a significant portion of the cpu time, but not actually *all* of it.

On second thought, my suspicion turns towards the scsi subsystem, with perhaps the aacraid driver as the part that actually hung. A very big difference in this crash was the rather long list of scsi timeouts, which did not occur with 2.4.2. During those timeouts the RAID-array was seemingly sitting idle (no activity lights on the drives).

The aacraid version seems to be newer in 2.4.9-6smp compared to 2.4.3-12smp. Could that, or something else that's been changed in the scsi subsystem, be the source of this?
I rewrote the aacraid driver in the end. The old one had an out-of-memory hang case that fits this description, so I am assuming this is fixed.
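
For anyone re-running the stress test to confirm the fix, here is a minimal sketch for tallying the bounce-buffer warning quoted earlier in this report, grouped by hour. The log-line layout is assumed from the single /var/log/messages line quoted in this thread, and the function name `count_bounce_shortages` is made up for illustration.

```python
import re
from collections import Counter

# Assumed syslog layout, taken from the one line quoted in this report:
#   "Oct 19 15:53:07 db2 kernel: mm: critical shortage of bounce buffers."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)

# Message text exactly as it appears in the report.
NEEDLE = "critical shortage of bounce buffers"


def count_bounce_shortages(lines):
    """Count kernel bounce-buffer warnings per hour.

    Takes any iterable of log lines and returns a Counter keyed by
    the timestamp truncated to the hour, e.g. "Oct 19 15".
    """
    per_hour = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and NEEDLE in match.group("msg"):
            # Drop the trailing ":MM:SS" to bucket by hour.
            per_hour[match.group("stamp")[:-6]] += 1
    return per_hour
```

Passing an open file handle for /var/log/messages (or a saved copy of it) to `count_bounce_shortages` should show whether the warnings stay rare, as on 2.4.3-12smp, or pile up before a hang.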