Bug 50593

Summary: (SCSI IPS)Netfinity 4500R, ServeRAID 4L, firmware 4.70, kernel-2.4.3-12 hangs after a day
Product: [Retired] Red Hat Linux Reporter: rosa
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE QA Contact: Brock Organ <borgan>
Severity: high Docs Contact:
Priority: medium    
Version: 7.1   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
URL: http://netfinity.lanceerplaats.nl/
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2003-06-19 07:26:31 UTC Type: ---
Description Flags
Summary of the history of this case. none

Description rosa 2001-08-01 13:41:14 UTC
Description of problem:
All Netfinity 4500R, ServeRAID 4L, firmware 4.70.17 with redhat 7.1 
+ updates will hang after about a day.

No users, no network, no serial, no cronjobs other than the standard 
ones shipped with RedHat *minimal install* like updatedb, logrotate etc.
Only a display attached.

How reproducible:

Steps to Reproduce:
1. Take a new Netfinity 4500R + ServeRAID 4L card + three 18.4 GB disks
2. Install (insert CD and reboot, the rest is automatic):
a) IBM UpdateExpress  CD as per IBM website 
b) IBM ServeRAID 4.70 CD as per IBM website 
   http://www.pc.ibm.com/qtechinfo/MIGR-4X7R6P.html, iso is at
c) Redhat 7.1 Linux   CD + updates (+ kickstart)

Actual Results:  After about a day the machine will crash.

Expected Results:  The machine should have stayed up !

Additional info:

1) Several `(ips0) Resetting controller' entries in kernel log
2) `device events' counters continuously increasing for the SCSI 
   disk devices
3) On several occasions, after a cold boot, the machine rebooted 
   immediately after displaying the SCSI messages *). So far the
   second boot has always succeeded.
4) After one to two days, left unattended, the machine hangs.

*) right after this:
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.4/5.2.0
       <Adaptec AIC-7899 Ultra 160/m SCSI host adapter>
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.4/5.2.0
       <Adaptec AIC-7899 Ultra 160/m SCSI host adapter>
(Normally it would show ... Loading ips module)
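
The reset entries mentioned in item 1 can be counted from a saved kernel log with a small script. This is only a sketch: the function name count_resets is made up for illustration, and the pattern assumes the message text appears exactly as quoted above, whatever timestamp or host prefix the log line carries.

```python
import re

# Matches the message quoted in this report, e.g. "(ips0) Resetting controller".
# Assumption: the controller is always named ips<N> in the log line.
RESET_RE = re.compile(r"\(ips(\d+)\) Resetting controller")


def count_resets(log_text):
    """Return a dict mapping controller name (e.g. 'ips0') to reset count."""
    counts = {}
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            name = "ips" + match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts
```

It could be fed the contents of whichever file the kernel messages are logged to (the path depends on the syslog setup), and run periodically to see whether the count keeps growing before a hang.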

Here's an overview of different netfinities:

 server        device events          firmware  longest    redhat        kernel
               (disk1, disk2, disk3)  version   uptime     version       version
 number1        0,  0,  0             4.00.06   225 days   6.2+updates   2.2.16-3smp
 number2        0,  0,  0             4.50.05   >100 days  6.2+updates   2.2.16-3smp
 number3        0,  1,  0             4.50.05   >100 days  6.2+updates   2.2.16-3smp

 replacement   10, 16, 31             4.70.17   1 day      7.1+updates   2.4.3-12
 senttothelab  14, 19, 45             4.70.17   2 days     7.1           2.4.2-2

All machines are dual-CPU Netfinity 4500R with ServeRAID 4L and 512MB or 1GB
of main memory, except for number1, which only has a ServeRAID 3L.
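
To make the pattern in the overview easier to see, the device event counts can be totalled per firmware version. The sketch below just copies the rows from the table; the names ROWS and events_by_firmware are made up for illustration. Note that the kernel and Red Hat versions differ between the groups as well, so this shows a correlation with firmware 4.70.17 and nothing more.

```python
# Rows copied from the overview table:
# (server, device events for disk1..disk3, firmware version)
ROWS = [
    ("number1", (0, 0, 0), "4.00.06"),
    ("number2", (0, 0, 0), "4.50.05"),
    ("number3", (0, 1, 0), "4.50.05"),
    ("replacement", (10, 16, 31), "4.70.17"),
    ("senttothelab", (14, 19, 45), "4.70.17"),
]


def events_by_firmware(rows):
    """Sum the device events of all disks, grouped by firmware version."""
    totals = {}
    for _server, events, firmware in rows:
        totals[firmware] = totals.get(firmware, 0) + sum(events)
    return totals


for firmware, total in sorted(events_by_firmware(ROWS).items()):
    print(firmware, total)
```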

Below I'll attach a history of what happened prior to this.
Previous report (way too much detail, piled up as we went along) is at 

A quote from there:
`There are lots of folks now using Red Hat 7.1 and are not seeing this'

Could anybody running this same configuration *) for longer than a
week please, please send me a note !

*) I.e. netfinity 4500R (aka xSeries 340 eServer), ServeRAID 4L,
   firmware 4.70.17, redhat 7.1 


Comment 1 rosa 2001-08-01 13:45:43 UTC
Created attachment 25772 [details]
Summary of the history of this case.

Comment 2 rosa 2003-06-19 00:02:56 UTC
The issue has been resolved since the end of 2001. Around that time IBM released 
bugfix RAID firmware 4.80.26.
The machine has since been running without a problem under a load of 2-3 for 
well over a year now. Initially 2.4.9 would cause it to swap aggressively,
which sometimes made it crawl, but it never went down. That swap problem
went away with an upgrade to 2.4.18.

Sorry for not updating! Only when Alan changed the subject on Tue, 10 Jun 2003 
did I notice that it was still open.