Bug 61497 - Disk IO on LSI U320 SCSI controller hangs machine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 61901 67218
Reported: 2002-03-20 18:04 UTC by Clay Cooper
Modified: 2007-04-18 16:41 UTC (History)
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2002-08-22 15:35:03 UTC
Embargoed:


Attachments
IObasher disk stress test (55.63 KB, application/octet-stream)
2002-04-22 15:54 UTC, Clay Cooper

Description Clay Cooper 2002-03-20 18:04:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019
Netscape6/6.2

Description of problem:
On a slimmerlot w/ hampton beta2 (2.4.18-0.1smp), bios A00, 1GB ram, 2GB swap,
and LSI 39320 controller as boot device...

Intensive disk IO (cpcmp) causes machine to lock up after about an hour.

Running ./cpcmp.pl 9 testbig.iso ./a,./b,20,5
(Five streams using a 650MB iso file)

I had over 25GB of disk space, plenty to run this amount of cpcmp
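
For readers without cpcmp.pl, the kind of load involved can be approximated with a minimal Python sketch. This is not the cpcmp.pl script itself: its internals are not shown in this report, so the function names, the threading approach, and the copy-then-compare loop below are assumptions that only mirror the "streams" and "count" parameters mentioned here.

```python
import filecmp
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def copy_compare_stream(source, work_dir, stream_id, passes):
    """Copy `source` into work_dir and compare the copy, `passes` times.

    Returns the number of passes whose copy did not match the source.
    (Hypothetical helper; not taken from cpcmp.pl.)
    """
    mismatches = 0
    target = os.path.join(work_dir, f"stream{stream_id}.copy")
    for _ in range(passes):
        shutil.copyfile(source, target)
        # shallow=False forces a full content comparison, not just a stat() check.
        if not filecmp.cmp(source, target, shallow=False):
            mismatches += 1
        os.remove(target)
    return mismatches


def run_stress(source, work_dir, streams=5, passes=20):
    """Run `streams` concurrent copy/compare loops; return total mismatches."""
    os.makedirs(work_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [
            pool.submit(copy_compare_stream, source, work_dir, i, passes)
            for i in range(streams)
        ]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Small self-check with a 1 MiB file; the report used a 650MB ISO.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.bin")
        with open(src, "wb") as fh:
            fh.write(os.urandom(1024 * 1024))
        total = run_stress(src, os.path.join(tmp, "work"), streams=2, passes=3)
        print("mismatches:", total)
```

Pointing `source` at a large ISO and `work_dir` at a filesystem on the controller under test gives a comparable sustained read/write load, though the timing will not match the original tool.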

No serial console output, no kernel panic, capslock and numlock light still
active, can change VC's but login times out.  Sysrq listing can be brought up
but showtasks and showmem do not execute.  Sysrq reboot worked.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install hampton beta2 on a slimmerlot with 39320 as boot device
2. Run cpcmp on a filesystem on 39320 drive
3. Wait for machine to hang

Actual Results:  Machine hangs

Expected Results:  No machine hang

Additional info:

Comment 1 Clay Cooper 2002-03-22 15:24:04 UTC
With beta3 (2.4.18-0.4smp), disk IO running extremely slow.
Sysrq showtasks and showmem do not execute or show any output.
Unable to log in to another VC.  
Top shows the five cpcmp streams are still running and an 'ls -al' shows the
sizes of the copied files are changing very slowly.
Disk activity light is solidly lit.
Top shows Cpu Utilization is 97% idle, all but 9MB of 1GB of ram is in use, and
none of the 2GB of swap is in use.
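
For anyone checking the memory figures above: a minimal Python sketch, assuming the counters are read from /proc/meminfo in the usual "Key: value kB" form, that separates buffers and page cache from other use. Whether the memory on the affected machine was actually held by cache is not established in this report, and the sample numbers in the sketch are illustrative only, not measurements from this system.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of integers."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines and anything without a leading numeric value.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Split total memory into free, cache (buffers + page cache), and other use."""
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    free = info.get("MemFree", 0)
    other = info["MemTotal"] - free - cache
    return {"free_kb": free, "cache_kb": cache, "other_kb": other}


if __name__ == "__main__":
    # Illustrative numbers only (assumption), not taken from the affected machine.
    sample = (
        "MemTotal:      1030000 kB\n"
        "MemFree:          9000 kB\n"
        "Buffers:         40000 kB\n"
        "Cached:         900000 kB\n"
        "SwapTotal:     2000000 kB\n"
        "SwapFree:      2000000 kB\n"
    )
    print(memory_breakdown(parse_meminfo(sample)))
```

If most of the "in use" figure turns out to be cache, the low free-memory number by itself would be normal under heavy file copying and would not explain the hang.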



Comment 2 Arjan van de Ven 2002-04-08 15:54:00 UTC
does this still happen with 2.4.18-0.16 ?

Comment 3 Clay Cooper 2002-04-09 12:40:19 UTC
Yes, same behavior as in beta3.  Tested with 2.4.18-0.16smp.

Comment 4 Michael K. Johnson 2002-04-09 16:37:59 UTC
Have you reported this to LSI?

Comment 5 Matt Domsch 2002-04-09 17:39:28 UTC
Yes, they're trying to reproduce it.  Seems they don't have SlimMerlots...
We need to try reproducing it on other platforms to rule that out.

Comment 6 Clay Cooper 2002-04-10 13:58:02 UTC
So I tried this on a napa and got less clear-cut results than I was
hoping for.  On the slimmerlot, I would come in after letting it run
overnight and the 5 cpcmp streams would only be at pass 5 or 6, showing
very clearly that something is wrong.  
On the napa, the 5 streams were at pass 35 this morning, however all the
other signs of the problem were still there:  unable to log in to another
VC, all but 9MB of 1GB ram in use, sysrq showtask and showmem options do
not execute.  Had this been another driver I definitely would have
expected it to have finished with its 50 passes by the next day, so I
definitely think the driver needs some tweaking.  But it does appear to
be running better on the napa than the slimmerlot, which I do not
understand.


Comment 7 Matt Domsch 2002-04-12 14:08:19 UTC
From: Michael K. Johnson [johnsonm]

More specifically than before: the driver sleeps with spinlocks held.
This is never allowed, and because LSI knows the driver better than
we do (trivially) it is far better for them to fix it and send us a
*patch* rather than a new driver version.



Comment 8 Matt Domsch 2002-04-16 03:40:39 UTC
From: Brauer, Jonathan [jbrauer]
Sent: Monday, April 15, 2002 10:33 PM

Matt,
	I've tried to reproduce in Chris' lab, without success.

	What FW and Option ROM are you running on the HBA?  You can find
this on the revision field of the POST screen.  If you have any questions or
concerns, Chris is willing to go up and take a look.

	If the revision field for the HBA on the POST screen says 900 or
less, then we probably have a root cause already found and solved on this
issue.

	There is an errata on RCC chipsets which required a FW work around
which was implemented in FW 1000 (FW-00.00.10.00).

-------------------
Chris, thanks for the update.  Clay, please check the version numbers.
However, I don't believe this addresses Red Hat's concern about the driver 
sleeping while holding spinlocks.  Have you examined that?
-Matt

Comment 9 Clay Cooper 2002-04-16 13:01:41 UTC
This is what I'm seeing during post:

MPTBios-5.00
Dell Build

            Product             Rev
LSILogic    1020/1030 [  102]   10000



Comment 10 Clay Cooper 2002-04-22 15:52:39 UTC
Ran two instances of IObasher against a scsi drive connected to the 22320 card
on a slimmerlot with 1GB ram, 2GB swap w/ Pensacola RC1 installed. It ran over
the weekend just fine with no system responsiveness issues or hangs.  
The scsi drive used contained the OS.

IObasher is a program I got from our IO Engineering that has been used to
quickly reproduce this issue on hampton, 7.2, and 7.1 on 6400 and slimmerlot. 
It's just one executable file that when started creates a results file and a tmp
file while in use.  I'll attach it for you.

BTW, the fw and bios on this card was 5.01.05.  This is the latest rev that IO
Engineering has.

Comment 11 Clay Cooper 2002-04-22 15:54:40 UTC
Created attachment 54849
IObasher disk stress test

Comment 12 Tesfamariam Michael 2002-05-14 19:53:52 UTC
Running cpcmp on NAPA with similar configuration, after the test runs for a few 
hours, the following message is displayed continuously:
   ENOMEM in do_get_write_access, retrying
   ENOMEM in journal_alloc_journal_head, retrying
System with LSI U320 controller shows the same symptom as described 
above. 'free' reports plenty of free memory while the message is being 
displayed. 

Other configurations I am testing are:
 1. LSI U320 and RH7.3
 2. LSI U320 and RH7.2
 3. Adaptec AIC-7899 (Napa onboard) and RH7.2
Only item 1 locks up the system (as described above). The other two become slow 
sometimes but can still execute commands.

Comment 13 Arjan van de Ven 2002-07-11 11:36:44 UTC
Does this happen in the limbo+latest rawhide kernel still?

Comment 14 John A. Hull 2002-08-12 17:01:25 UTC
Any update on this Clay??

Comment 15 Clay Cooper 2002-08-13 15:11:58 UTC
Ran the original cpcmp scenario overnight (20 count, 5 stream, 650MB iso), with
sda on the LSI 320 card. Sda contained the OS (Milan Beta4).  System was
extremely slow and I was unable to log in to another VC.  Ram was completely in
use and no swap was in use.  However, the run did finish by this morning and
system responsiveness immediately returned.

Comment 16 Matt Domsch 2002-08-13 18:26:15 UTC
Clay will re-run with either Milan beta5 kernel or latest rawhide, which has a 
newer mptfusion driver.

Comment 17 Clay Cooper 2002-08-14 18:16:25 UTC
Tried 2.4.18-10.98smp overnight with same result.  The cpcmp did complete
successfully.  I was unable to log in to another virtual console while it was
running.  The original issue locked up the machine before the cpcmp ever
finished, whereas with Milan the machine at least finishes even if it is very
unresponsive.

Comment 18 Clay Cooper 2002-08-14 20:33:48 UTC
Same behavior w/ 2.4.18-10.98smp on a discovery w/ onboard lsi 320 scsi.

Comment 19 Clay Cooper 2002-08-22 15:34:54 UTC
Tried same cpcmp setup on a discovery onboard u320, and on a slimmerlot with an
add-in u320, only this time I hit a non-OS disk with the disk IO.  System
responsiveness was not an issue at all with either system when doing this.  Our
IO team recommended I try the non-OS disk, as they see basically the same
responsiveness issue in Windows when pounding the OS disk.  I will keep this in
mind during future disk stress testing.


I am satisfied that the original issue (a complete hang) is no longer occurring,
and that the latest results (poor system response) are expected behavior.

Comment 20 Michael Fulbright 2002-12-20 17:38:25 UTC
Time tracking values updated

