From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2

Description of problem:
On a slimmerlot with hampton beta2 (2.4.18-0.1smp), BIOS A00, 1GB RAM, 2GB swap, and an LSI 39320 controller as the boot device, intensive disk IO (cpcmp) causes the machine to lock up after about an hour. I ran ./cpcmp.pl 9 testbig.iso ./a,./b,20,5 (five streams using a 650MB ISO file) with over 25GB of free disk space, plenty to run this amount of cpcmp. There was no serial console output and no kernel panic; the capslock and numlock lights were still active and I could change VCs, but login timed out. The SysRq listing can be brought up, but showtasks and showmem do not execute. SysRq reboot worked.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Install hampton beta2 on a slimmerlot with the 39320 as the boot device
2. Run cpcmp on a filesystem on a 39320 drive
3. Wait for the machine to hang

Actual Results: The machine hangs.

Expected Results: No machine hang.

Additional info:
With beta3 (2.4.18-0.4smp), disk IO runs extremely slowly. SysRq showtasks and showmemory do not execute or show any output, and I am unable to log in to another VC. top shows the five cpcmp streams still running, and an 'ls -al' shows the sizes of the copied files changing very slowly. The disk activity light is solidly lit. top reports CPU utilization at 97% idle, all but 9MB of the 1GB of RAM in use, and none of the 2GB of swap in use.
Does this still happen with 2.4.18-0.16?
Yes, same behavior as in beta3. Tested with 2.4.18-0.16smp.
Have you reported this to LSI?
Yes, they're trying to reproduce it. Seems they don't have SlimMerlots... We need to try reproducing it on other platforms to rule that out.
So I tried this on a napa and got less clear-cut results than I was hoping for. On the slimmerlot, I would come in after letting it run overnight and the 5 cpcmp streams would only be at pass 5 or 6, showing very clearly that something is wrong. On the napa, the 5 streams were at pass 35 this morning; however, all the other signs of the problem were still there: unable to log in to another VC, all but 9MB of 1GB RAM in use, and the SysRq showtask and showmem options do not execute. Had this been another driver, I definitely would have expected it to finish its 50 passes by the next day, so I think the driver needs some tweaking. But it does appear to run better on the napa than on the slimmerlot, which I do not understand.
From: Michael K. Johnson [johnsonm] More specifically than before: the driver sleeps with spinlocks held. This is never allowed, and because LSI knows the driver better than we do (trivially) it is far better for them to fix it and send us a *patch* rather than a new driver version.
From: Brauer, Jonathan [jbrauer]
Sent: Monday, April 15, 2002 10:33 PM

Matt,

I've tried to reproduce in Chris' lab, without success. What FW and Option ROM are you running on the HBA? You can find this in the revision field of the POST screen. If you have any questions or concerns, Chris is willing to go up and take a look. If the revision field for the HBA on the POST screen says 900 or less, then we probably have a root cause already found and solved on this issue. There is an errata on RCC chipsets which required a FW workaround, which was implemented in FW 1000 (FW-00.00.10.00).

-------------------

Chris, thanks for the update. Clay, please check the version numbers. However, I don't believe this addresses Red Hat's concern about the driver sleeping while holding spinlocks. Have you examined that?

-Matt
This is what I'm seeing during POST:
MPTBios-5.00 Dell Build
Product: LSILogic 1020/1030 [ 102]  Rev: 10000
Ran two instances of IObasher against a SCSI drive connected to the 22320 card on a slimmerlot with 1GB RAM and 2GB swap, with Pensacola RC1 installed. It ran over the weekend just fine with no system responsiveness issues or hangs. The SCSI drive used contained the OS. IObasher is a program I got from our IO Engineering that has been used to quickly reproduce this issue on hampton, 7.2, and 7.1 on 6400 and slimmerlot. It's a single executable that, when started, creates a results file and a tmp file while running. I'll attach it for you. BTW, the FW and BIOS on this card were 5.01.05, the latest rev that IO Engineering has.
Created attachment 54849 [details] IObasher disk stress test
Running cpcmp on a NAPA with a similar configuration, after the test runs for a few hours the following messages are displayed continuously:

ENOMEM in do_get_write_access, retrying
ENOMEM in journal_alloc_journal_head, retrying

A system with an LSI U320 controller shows the same symptom as described above. 'free' reports plenty of free memory while the messages are displaying. Other configurations I am testing are:
1. LSI U320 and RH7.3
2. LSI U320 and RH7.2
3. Adaptec AIC-7899 (Napa onboard) and RH7.2
Only item 1 locks up the system (as described above). The other two sometimes become slow but can still execute commands.
Does this happen in the limbo+latest rawhide kernel still?
Any update on this, Clay?
Ran the original cpcmp scenario overnight (20 count, 5 streams, 650MB ISO) with sda on the LSI 320 card; sda contained the OS (Milan Beta4). The system was extremely slow and I was unable to log in to another VC. RAM was completely in use and no swap was in use; however, the test did finish by this morning and system responsiveness immediately returned.
Clay will re-run with either the Milan beta5 kernel or the latest rawhide, which has a newer mptfusion driver.
Tried 2.4.18-10.98smp overnight with the same result. The cpcmp did complete successfully, but I was unable to log in to another virtual console while it was running. The original issue locked up the machine before cpcmp ever finished, whereas with Milan the machine at least finishes, even if it is very unresponsive.
Same behavior with 2.4.18-10.98smp on a discovery with onboard LSI 320 SCSI.
Tried the same cpcmp setup on a discovery onboard U320, and on a slimmerlot with an add-in U320, only this time I directed the disk IO at a non-OS disk. System responsiveness was not an issue at all on either system. Our IO team recommended I try a non-OS disk, as they see basically the same responsiveness issue in Windows when pounding the OS disk. I will keep this in mind during future disk stress testing. I am satisfied that the original issue (a complete hang) is no longer occurring, and that the latest result (poor system response) is expected behavior.