Red Hat Bugzilla – Bug 61497
Disk IO on LSI U320 SCSI controller hangs machine
Last modified: 2007-04-18 12:41:02 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019
Description of problem:
On a slimmerlot w/ hampton beta2 (2.4.18-0.1smp), bios A00, 1GB ram, 2GB swap,
and LSI 39320 controller as boot device...
Intensive disk IO (cpcmp) causes machine to lock up after about an hour.
Running ./cpcmp.pl 9 testbig.iso ./a,./b,20,5
(Five streams using a 650MB iso file)
I had over 25GB of disk space, plenty to run this amount of cpcmp
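For anyone without access to cpcmp.pl, the kind of load described above can be approximated with a short script. This is only a rough sketch: cpcmp.pl itself is not shown in this report, so the function names (copy_and_compare, run_streams) and the copy-then-compare loop are assumptions about what it does, not its actual implementation.

```python
import filecmp
import os
import shutil
import threading


def copy_and_compare(source, dest_dir, passes, stream_id, failures):
    """Copy `source` into `dest_dir` `passes` times, comparing each copy byte for byte."""
    for n in range(passes):
        dest = os.path.join(dest_dir, f"stream{stream_id}_pass{n}.tmp")
        shutil.copyfile(source, dest)
        # shallow=False forces a content comparison instead of a stat() comparison.
        if not filecmp.cmp(source, dest, shallow=False):
            failures.append((stream_id, n))
        os.remove(dest)


def run_streams(source, dest_dirs, passes, streams):
    """Run `streams` concurrent copy/compare loops; return (stream, pass) mismatches."""
    failures = []  # list.append is safe to call from several threads in CPython
    threads = []
    for i in range(streams):
        # Spread the streams across the destination directories round-robin.
        dest_dir = dest_dirs[i % len(dest_dirs)]
        t = threading.Thread(
            target=copy_and_compare,
            args=(source, dest_dir, passes, i, failures),
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return failures
```

Something like run_streams("testbig.iso", ["./a", "./b"], passes=20, streams=5) would be the nearest equivalent of the run above, assuming the trailing 20 and 5 in the cpcmp command line are the pass count and stream count.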
No serial console output, no kernel panic; capslock and numlock lights still
active. Can change VCs, but login times out. The SysRq listing can be brought up,
but showtasks and showmem do not execute. SysRq reboot worked.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install hampton beta2 on a slimmerlot with 39320 as boot device
2. Run cpcmp on a filesystem on 39320 drive
3. Wait for machine to hang
Actual Results: Machine hangs
Expected Results: No machine hang
With beta3 (2.4.18-0.4smp), disk IO running extremely slow.
SysRq showtasks and showmemory do not execute or show any output.
Unable to log in to another VC.
Top shows the five cpcmp streams are still running and an 'ls -al' shows the
sizes of the copied files are changing very slowly.
Disk activity light is solidly lit.
Top shows Cpu Utilization is 97% idle, all but 9MB of 1GB of ram is in use, and
none of the 2GB of swap is in use.
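Since logging in becomes impossible once the problem starts, it may help to record the memory and swap figures from a session that is already open. The following is a minimal sketch that assumes the "Name: value kB" lines of /proc/meminfo; the helper names and the sample numbers are illustrative only (chosen to mirror the figures above) and are not output captured from the test machine.

```python
def parse_meminfo(text):
    """Return {field: kB} for every 'Name: <number> kB' line in /proc/meminfo text."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip summary/header lines and fields that carry no kB unit.
        if sep and len(fields) == 2 and fields[0].isdigit() and fields[1] == "kB":
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Reduce the parsed fields to the two numbers discussed in this report."""
    return {
        "mem_free_mb": values["MemFree"] // 1024,
        "swap_used_mb": (values["SwapTotal"] - values["SwapFree"]) // 1024,
    }


# Illustrative values only: 1GB RAM with 9MB free, 2GB swap untouched.
SAMPLE = """\
MemTotal:      1048576 kB
MemFree:          9216 kB
SwapTotal:     2097152 kB
SwapFree:      2097152 kB
"""

if __name__ == "__main__":
    print(summarize(parse_meminfo(SAMPLE)))
```

On a live system the text would come from open("/proc/meminfo").read() instead of SAMPLE, called in a loop with a timestamp so the trend is visible afterwards.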
Does this still happen with 2.4.18-0.16?
Yes, same behavior as in beta3. Tested with 2.4.18-0.16smp.
Have you reported this to LSI?
Yes, they're trying to reproduce it. Seems they don't have SlimMerlots...
We need to try reproducing it on other platforms to rule that out.
So I tried this on a napa and got less clear-cut results than I was
hoping for. On the slimmerlot, I would come in after letting it run
overnight and the 5 cpcmp streams would only be at pass 5 or 6, showing
very clearly that something is wrong.
On the napa, the 5 streams were at pass 35 this morning, however all the
other signs of the problem were still there: unable to log in to another
VC, all but 9MB of 1GB ram in use, sysrq showtask and showmem options do
not execute. Had this been another driver I definitely would have
expected it to have finished with its 50 passes by the next day, so I
definitely think the driver needs some tweaking. But it does appear to
be running better on the napa than the slimmerlot, which I do not
From: Michael K. Johnson [mailto:email@example.com]
More specifically than before: the driver sleeps with spinlocks held.
This is never allowed, and because LSI knows the driver better than
we do (trivially) it is far better for them to fix it and send us a
*patch* rather than a new driver version.
From: Brauer, Jonathan [mailto:firstname.lastname@example.org]
Sent: Monday, April 15, 2002 10:33 PM
I've tried to reproduce in Chris' lab, without success.
What FW and Option ROM are you running on the HBA? You can find
this on the revision field of the POST screen. If you have any questions or
concerns, Chris is willing to go up and take a look.
If the revision field for the HBA on the POST screen says 900 or
less, then we probably have a root cause already found and solved on this
There is an erratum on RCC chipsets that required a FW workaround,
which was implemented in FW 1000 (FW-00.00.10.00).
Chris, thanks for the update. Clay, please check the version numbers.
However, I don't believe this addresses Red Hat's concern about the driver
sleeping while holding spinlocks. Have you examined that?
This is what I'm seeing during post:
LSILogic 1020/1030 [ 102] 10000
Ran two instances of IObasher against a scsi drive connected to the 22320 card
on a slimmerlot with 1GB ram, 2GB swap w/ Pensacola RC1 installed. It ran over
the weekend just fine with no system responsiveness issues or hangs.
The scsi drive used contained the OS.
IObasher is a program I got from our IO Engineering that has been used to
quickly reproduce this issue on hampton, 7.2, and 7.1 on 6400 and slimmerlot.
It's just one executable file that when started creates a results file and a tmp
file while in use. I'll attach it for you.
BTW, the fw and bios on this card was 5.01.05. This is the latest rev that IO
Created attachment 54849 [details]
IObasher disk stress test
Running cpcmp on NAPA with similar configuration, after the test runs for a few
hours, the following message is displayed continuously:
ENOMEM in do_get_write_access, retrying
ENOMEM in journal_alloc_journal_head, retrying
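To see which of these paths is retrying most often, the messages can be tallied from a saved console or syslog capture. A small sketch follows; the function name count_enomem is made up for illustration, and the sample text simply repeats the two messages quoted above rather than reproducing a real log.

```python
import re
from collections import Counter

# Matches the message format quoted above: "ENOMEM in <function>, retrying".
ENOMEM_RE = re.compile(r"ENOMEM in (\w+), retrying")


def count_enomem(log_text):
    """Count ENOMEM retry messages per reporting function."""
    return Counter(m.group(1) for m in ENOMEM_RE.finditer(log_text))


# Illustrative sample only; the repetition counts are not from the real log.
SAMPLE_LOG = """\
ENOMEM in do_get_write_access, retrying
ENOMEM in journal_alloc_journal_head, retrying
ENOMEM in do_get_write_access, retrying
"""

if __name__ == "__main__":
    for func, count in count_enomem(SAMPLE_LOG).most_common():
        print(func, count)
```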
System with LSI U320 controller shows the same symptom as described
above. 'free' reports plenty of free memory while the message is displayed.
Other configurations I am testing are:
1. LSI U320 and RH7.3
2. LSI U320 and RH7.2
3. Adaptec AIC-7899 (Napa onboard) and RH7.2
Only item 1 locks up the system (as described above). The other two become slow
sometimes but can still execute commands.
Does this happen in the limbo+latest rawhide kernel still?
Any update on this, Clay?
Ran the original cpcmp scenario overnight (20 count, 5 stream, 650MB iso), with
sda on the LSI 320 card. Sda contained the OS (Milan Beta4). System was
extremely slow and I was unable to log in to another VC. RAM was completely in
use and no swap was in use, however it did finish by this morning and system
responsiveness immediately returned.
Clay will re-run with either Milan beta5 kernel or latest rawhide, which has a
newer mptfusionh driver.
Tried 2.4.18-10.98smp overnight with same result. The cpcmp did complete
successfully. I was unable to log in to another virtual console while it was
running. The original issue locked up the machine before the cpcmp ever
finished, whereas with Milan the machine at least finishes even if it is very slow.
Same behavior w/ 2.4.18-10.98smp on a discovery w/ onboard lsi 320 scsi.
Tried same cpcmp setup on a discovery onboard u320, and on a slimmerlot with an
add-in u320, only this time I hit a non-OS disk with the disk IO. System
responsiveness was not an issue at all with either system when doing this. Our
IO team recommended I try the non-OS disk, as they see basically the same
responsiveness issue in Windows when pounding the OS disk. I will keep this in
mind during future disk stress testing.
I am satisfied that the original issue (a complete hang) is no longer occurring,
and that the latest results (poor system response) are expected behavior.