Bug 126998

Summary: Machine hangs in less than one hour
Product: Red Hat Enterprise Linux 3
Reporter: Pierre Fumery <pierre.fumery>
Component: kernel
Assignee: Tom Coughlan <coughlan>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: jbaron, lwoodman, petrides, riel, sdenham, tburke
Hardware: ia64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-12-20 20:55:35 UTC
Attachments:
- sysreport result (flags: none)

Description Pierre Fumery 2004-06-30 12:04:36 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko/20040514

Description of problem:
This issue was raised after the "low memory" bug was fixed, so some
preliminary information can be found in that earlier report,
BZ #121029.

We tested both kernel-2.4.21-15.5.EL and kernel-2.4.21-15.11.EL with
the same bad result: the machine hangs in less than one hour.


The following comments and results were copied from BZ #121029. The
tar file can be retrieved from BZ #121029.

trace files after HANG (with "echo m > /proc/sysrq-trigger")

The attached compressed tar file contains trace files from a dbgen
hang on an NS5160.

This last test was run with traces taken every 30 seconds, each
including "echo m > /proc/sysrq-trigger".
The machine "broke" after 40 minutes.

The tar file includes:
- meminfo.sh: script that takes the traces
- meminfo.txt: output from meminfo.sh
- top.txt: output of the "top" command run during the test
- messages: /var/log/messages saved after rebooting the machine.
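
The meminfo.sh script itself is only available in the tar file on BZ #121029. As a rough illustration of the sampling described above (a snapshot every 30 seconds plus "echo m > /proc/sysrq-trigger"), a minimal sketch might look like the following. The function name take_samples, its arguments, and the use of /proc/meminfo as the snapshot source are assumptions, not taken from the original script.

```shell
# Hypothetical stand-in for meminfo.sh; NOT the script from BZ #121029.
#
# take_samples COUNT INTERVAL OUTFILE [SYSRQ_TRIGGER]
#   COUNT          number of samples to take
#   INTERVAL       seconds to sleep between samples (the report used 30)
#   OUTFILE        file the snapshots are appended to
#   SYSRQ_TRIGGER  file that receives "m"; defaults to /proc/sysrq-trigger
take_samples() {
    count=$1
    interval=$2
    out=$3
    trigger=${4:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            echo "=== sample $i: $(date) ==="
            # Userspace view of memory; skipped if the file is unreadable.
            if [ -r /proc/meminfo ]; then
                cat /proc/meminfo
            fi
        } >> "$out"
        # Ask the kernel to dump its memory state into the kernel log.
        # This needs root and the SysRq interface; skipped otherwise.
        if [ -w "$trigger" ]; then
            echo m > "$trigger"
        fi
        i=$((i + 1))
        # Do not sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

Run as root, something like `take_samples 120 30 meminfo.txt` would cover one hour at the 30-second interval used here. The SysRq memory dump goes to the kernel log (typically visible in /var/log/messages), not to the output file.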



Version-Release number of selected component (if applicable):
kernel-2.4.21-15.11.EL

How reproducible:
Always

Steps to Reproduce:
1. Get the "dbgen" test which has been sent over to Red Hat.
2. Run this "dbgen" test.
3. More detailed information can be found in BZ #121029.
    

Actual Results:  The machine hangs every time this test is performed.

Expected Results:  The machine should be under load but should stay responsive.

Additional info:

As the SCSI card could be a suspect (see BZ #121029), here is its
description:

- SCSI Adapter= Adaptec
     Content of /proc/pci:

  Bus  4, device   1, function  0:
    SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 1).
      IRQ 54.
      Master Capable.  Latency=64.  Min Gnt=40.Max Lat=25.
      I/O at 0xc400 [0xc4ff].
      Non-prefetchable 64 bit memory at 0xfa6fe000 [0xfa6fefff].
  Bus  4, device   1, function  1:
    SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (#2) (rev 1).
      IRQ 55.
      Master Capable.  Latency=64.  Min Gnt=40.Max Lat=25.
      I/O at 0xc800 [0xc8ff].
      Non-prefetchable 64 bit memory at 0xfa6ff000 [0xfa6fffff].

Comment 1 Pierre Fumery 2004-06-30 12:11:42 UTC
Hi Tim,
Are you aware of any known problem like this? Could you reassign this
issue to the right people in your team?
Thanks.

Comment 2 Susan Denham 2004-06-30 21:20:26 UTC
Pierre:

As you saw in a mail message from Tim in response to your report of a
problem with the LSI22320-R adapters on your IA64 
platforms (on both RHEL2.1-U4 and RHEL3-U1), we've got a pre-beta RHEL
3 U3 kernel (call it "U3betaRC" for U3betaReleaseCandidate) available
for you to test that may also address the issue that you've reported
in this bugzilla.

Location of the kernel:   ftp://people.redhat.com/tburke/.pre_u3

We'll be waiting for your feedback....

Sue

Comment 3 Tom Coughlan 2004-07-02 21:30:04 UTC
I see from the syslogs that you have storage on mpt fusion, QLogic,
and aic7xxx adapters.  I gather that your system is installed on the
mpt fusion disks, and the dbgen test is running exclusively on the
aic7xxx disks. Are the QLogic disks idle?  

Please describe your storage configuration, and whether there is
anything running other than dbgen.  If you could run sysreport and
post the results that would be helpful as well.

Thanks. 

Comment 4 Pierre Fumery 2004-07-05 09:20:00 UTC
The other problem with LSI22320-R adapters (IT #43391) prevented us from
going further with testing the new kernel Tim provided.

However, I have asked for the further information you requested in your
note above (Comment #3).


Comment 5 Claude BRUNET 2004-07-05 11:39:01 UTC
Created attachment 101637
sysreport result

The various dbgen processes write to various file systems. Some of these are on
a SCSI disk subsystem (SR0812 - Chaparral, accessed through an Adaptec SCSI
adapter with the aic7xxx driver); the others are on a Fibre Channel disk
subsystem (FDA2300 - NEC iStorage, accessed through QLogic QLA2340 adapters
with the qla2300 driver).

The file systems used by the dbgen processes do not appear in the attached
sysreport because we have since started other tests on this server.

Comment 6 Pierre Fumery 2004-07-07 14:58:11 UTC
A workaround has been found for the LSI22320-R adapter boot problem (IT
#43391), and BZ #127385 has been opened to get it fixed in RHEL3-U3.

But we're still waiting for a new RHEL3-betaU3 version to check for a
potential improvement on this defect.

Comment 7 Pierre Fumery 2004-11-30 14:33:49 UTC
This issue has been fixed in the RHEL3-U4 kernel (beta versions). It can
be closed now.
We'll open another bug if the same problem shows up in the G.A. version,
but we don't expect such a regression.

Comment 8 Ernie Petrides 2004-12-03 03:35:21 UTC
Thank you for the information, Pierre.  I will revert the state of this bug
to MODIFIED, since U4 is not yet released.  It will automatically be changed
to CLOSED/ERRATA by the Errata System when U4 becomes available on RHN.

Comment 11 Ernie Petrides 2004-12-03 23:16:30 UTC
Larry's fix was committed to the RHEL3 U4 patch pool on 18-Oct-2004
(in kernel version 2.4.21-22.EL).
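
Since the fix went into kernel version 2.4.21-22.EL, one way to tell whether a given machine still runs an affected kernel is to compare its release string against that version. The following is a minimal sketch only: the function name kernel_has_fix is hypothetical, and it compares just the numeric fields of the release string, ignoring suffixes such as ".EL". The errata notice remains the authoritative reference.

```shell
# kernel_has_fix RELEASE [FIXED]
# Prints "yes" if RELEASE is numerically at or after FIXED, else "no".
# FIXED defaults to 2.4.21-22.EL, the version named in this comment.
kernel_has_fix() {
    printf '%s %s\n' "$1" "${2:-2.4.21-22.EL}" | awk '{
        # Split both release strings into their numeric fields,
        # e.g. "2.4.21-15.11.EL" becomes 2, 4, 21, 15, 11.
        n = split($1, have, /[^0-9]+/)
        m = split($2, want, /[^0-9]+/)
        max = (n > m) ? n : m
        # Compare field by field; missing fields count as 0.
        for (i = 1; i <= max; i++) {
            h = have[i] + 0
            w = want[i] + 0
            if (h > w) { print "yes"; exit }
            if (h < w) { print "no"; exit }
        }
        print "yes"
    }'
}
```

On the machine under test this could be called as `kernel_has_fix "$(uname -r)"`. For the two kernels named in the original report, 2.4.21-15.5.EL and 2.4.21-15.11.EL, it prints "no".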

Comment 12 John Flanagan 2004-12-20 20:55:35 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html