Bug 112436

Summary:	aacraid + kernel2.4.21-4.0.1.ELsmp + x86_64 == crash
Product:	Red Hat Enterprise Linux 3	Reporter:	Jeff Thomas <jeff>
Component:	kernel	Assignee:	Tom Coughlan <coughlan>
Status:	CLOSED CANTFIX	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.0	CC:	ckloiber, jparadis, petrides, riel, stakagi, tao
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-09-19 13:56:00 UTC	Type:	---

Description Jeff Thomas 2003-12-19 17:49:04 UTC

Description of problem:

When using an Adaptec 2200S RAID card in an IBM e325 dual-Opteron
system, building a filesystem on the disk array causes the machine
to lock up with:

 
    MCG_STATUS: unrecoverable
    Northbridge Machine Check exception b40000000005001b 0
    Uncorrectable condition
    Unrecoverable condition
    Northbridge status b40000000005001b
    GART error 11
    Lost an northbridge error
    NB error address 00000000eff60000
    Error uncorrected
    Address: 00000000eff60000
    MCE at EIP ffffffff8010de3e ESP ffffffff80633fc8
    CPU 0: Machine Check Exception: 0000000000000000
    Kernel panic: Unable to continue
    In idle task - not syncing
 
I have tried both the stock SMP kernel with driver 1.1.2 and a
custom kernel with driver 1.1.4 (from Adaptec's sources), with the
same results.


Version-Release number of selected component (if applicable):

IBM e325, dual Opteron, 5 GB memory.  Adaptec 2200S card, firmware
4.0-4[6008].  The aacraid driver in the Red Hat-supplied
2.4.21-4.0.1.ELsmp kernel is 1.1.2; Adaptec has driver source on their
web site for 1.1.4, which behaves better than 1.1.2 but still has this
problem.


How reproducible:

Readily reproducible.

Steps to Reproduce:
1. Build an external RAID group (mine is 550GB) using the Adaptec
BIOS utility. Allow the array initialization to complete.
2. Use fdisk to create a partition of the entire device
3. Run mkfs -t ext3 /dev/sdb1
4. Wait
5. Note system crash on console
  
Actual results:

System crashes/hangs

Expected results:

Mountable filesystem

Additional info:

My RAID is 9 x 73G Seagate drives split across 2 channels,
built as RAID 5.

Comment 1 Jeff Thomas 2004-01-06 15:03:22 UTC

I had also opened service request 277987 on this. A tech there
responded suggesting that I add "nomce" to the kernel command line.
This is effective in eliminating the crash, and the systems appear
stable.  I am not familiar with the details of the machine check
exception, but if this is a valid fix then please close this ticket.
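
For anyone applying the same workaround, a quick way to confirm that the option actually reached the running kernel is to look for it in /proc/cmdline. The following is only a minimal sketch: the helper name has_boot_param and the sample command line are made up for illustration, and it assumes "nomce" appears as a standalone, whitespace-separated token.

```python
def has_boot_param(cmdline, param):
    """Return True if `param` appears as a standalone token in a kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    # Hypothetical sample line; on a live system read /proc/cmdline instead:
    #   cmdline = open("/proc/cmdline").read()
    cmdline = "ro root=LABEL=/ nomce"
    print("nomce active:", has_boot_param(cmdline, "nomce"))
```

Matching whole tokens rather than substrings avoids a false positive from an unrelated option that merely contains the text "nomce".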

Comment 2 Shinya Takagi 2004-09-07 05:54:22 UTC

My customer has also reported a similar case after updating the kernel
to 2.4.21-20.EL.x86_64 on an Opteron system.

Sep  7 04:02:23 opteron kernel: Northbridge status a60000010005001b
Sep  7 04:02:23 opteron kernel: GART error 11
Sep  7 04:02:23 opteron kernel: Lost an northbridge error
Sep  7 04:02:23 opteron kernel: NB status: unrecoverable
Sep  7 04:02:23 opteron kernel: NB error address 00000000fbf61258
Sep  7 04:02:23 opteron kernel: Error uncorrected

Comment 6 David Bond 2004-12-22 20:17:57 UTC

Documentation for AMD Opteron MCE architecture may be found at

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF

This appears to decode to be a GART TLB Error with a valid cause 
address of 00000000fbf61258.
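
For reference, the decoding can be sketched roughly as follows. This assumes the standard x86 machine-check status layout (flag bits 63 down to 57, an extended error code in bits 19:16, and a 16-bit error code whose TLB form is 0000 0000 0001 TTLL). The function name and the returned dictionary are my own, and the reading of extended code 5 as a GART error is taken from the kernel messages in this report, not from the code.

```python
# Flag bits of a machine-check status register (bit position -> short name).
FLAG_BITS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # overflow flag
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # miscellaneous register is valid
    58: "ADDRV",  # address register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status):
    """Split a 64-bit status value into flags, extended code and error code."""
    flags = {name: bool((status >> bit) & 1) for bit, name in FLAG_BITS.items()}
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        "extended_error_code": (status >> 16) & 0xF,
        "error_code": error_code,
        "error_class": "other",
    }
    # TLB errors are encoded as 0000 0000 0001 TTLL.
    if (error_code >> 4) == 0x001:
        result["error_class"] = "TLB"
        result["tt"] = (error_code >> 2) & 0x3  # transaction type field
        result["ll"] = error_code & 0x3         # cache level field
    return result


if __name__ == "__main__":
    # The two Northbridge status values quoted in this report.
    for raw in (0xB40000000005001B, 0xA60000010005001B):
        print(f"{raw:016x}", decode_status(raw))
```

Both status values in this report carry extended code 5 and the TLB error-code pattern, and both have ADDRV set, which is what makes the reported address worth checking.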

Given the address (it's very near where I would expect mmio space 
would be allocated) I would take a look in /proc/iomem and see if the 
controller in question has memory near this address.

Comment 7 Tom Coughlan 2004-12-22 21:03:09 UTC

Thanks for the information David.

Shinya, Jeff, 

Please check /proc/iomem to see if the Adaptec 2200 has memory at the
address shown in the machine check. Also please check with Adaptec and
make sure you have the latest firmware for that board.
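
If it helps, the /proc/iomem check can be scripted along these lines. This is only a sketch: it assumes the usual "start-end : name" hexadecimal line format, the helper names are invented here, and the SAMPLE text (including the device name) is made up purely to show the output shape. It is not data from either affected system.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name" with hex bounds.
_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")

# Made-up sample for illustration only; not taken from this bug report.
SAMPLE = """\
00000000-0009fbff : System RAM
00100000-dfffffff : System RAM
fbf00000-fbffffff : PCI Bus #02
  fbf60000-fbf6ffff : example-device (made up)
fec00000-fec003ff : reserved
"""


def parse_iomem(text):
    """Return a list of (start, end, name) tuples parsed from iomem-style text."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            regions.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3).strip()))
    return regions


def regions_near(regions, addr, slack=0):
    """Return regions containing `addr`, or lying within `slack` bytes of it."""
    return [r for r in regions if r[0] - slack <= addr <= r[1] + slack]


if __name__ == "__main__":
    # On the affected machine, use: text = open("/proc/iomem").read()
    for start, end, name in regions_near(parse_iomem(SAMPLE), 0xFBF61258):
        print(f"{start:08x}-{end:08x} : {name}")
```

Run it with the address from the machine check (00000000eff60000 or 00000000fbf61258) and, if nothing contains the address exactly, a non-zero slack to list neighbouring regions.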

Comment 8 Tom Coughlan 2005-09-19 13:56:00 UTC

Since we have not received the feedback we requested, we will assume the problem
was not reproducible or has been fixed in a later update for this product.

Users who have experienced this problem are encouraged to upgrade to the latest
update release. If this issue is still reproducible, please see the Red Hat
Global Support Services page on our website for technical support options:
https://www.redhat.com/support

If you have a telephone based support contract, you may contact Red Hat at
1-888-GO-REDHAT for technical support for the problem you are experiencing.