Bug 250624 - No Boot /Hang response for PCI-E errors on a QS21
Summary: No Boot /Hang response for PCI-E errors on a QS21
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: ppc64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 250611
Blocks:
 
Reported: 2007-08-02 15:12 UTC by Robbie Williamson
Modified: 2007-11-30 22:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-08-02 16:30:16 UTC
Type: ---
Embargoed:



Description Robbie Williamson 2007-08-02 15:12:32 UTC
+++ This bug was initially created as a clone of Bug #250611 +++

+++ This bug was initially created as a clone of Bug #249667 +++

Description of problem:
When the Axon PCIe root complexes used in the IBM QS21 systems detect PCI errors
(e.g. poisoned TLP, CRC error, etc.), they assert an interrupt that has to be
caught by Linux.

The "driver" will dump out some registers, then panic. It is an extra file in
arch/powerpc/platforms/cell and does not impact other platforms.
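
A rough userspace sketch of that dump-then-panic flow, for illustration only. It
is not the IBM patch: the real handler reads Axon-specific registers that are
not shown in this report. As a stand-in, the sketch decodes the three error bits
of the generic PCI configuration-space Status register (offset 0x06). The
function names and the FatalBusError exception (standing in for the kernel's
panic()) are hypothetical.

```python
# Error bits of the standard PCI Status register (config space offset 0x06).
PCI_STATUS_PARITY = 0x0100            # master data parity error
PCI_STATUS_SIG_SYSTEM_ERROR = 0x4000  # device signaled a system error (SERR)
PCI_STATUS_DETECTED_PARITY = 0x8000   # device detected a parity error (PERR)

ERROR_BITS = {
    PCI_STATUS_PARITY: "master data parity error",
    PCI_STATUS_SIG_SYSTEM_ERROR: "signaled system error (SERR)",
    PCI_STATUS_DETECTED_PARITY: "detected parity error (PERR)",
}


class FatalBusError(RuntimeError):
    """Hypothetical stand-in for panic() in this userspace model."""


def decode_status(status):
    """Return the names of the error bits set in a 16-bit status value."""
    return [name for bit, name in sorted(ERROR_BITS.items()) if status & bit]


def handle_error_interrupt(status):
    """Dump the register contents, then 'panic' if any error bit is set."""
    errors = decode_status(status)
    print(f"PCI status = 0x{status:04x}")
    for name in errors:
        print(f"  {name}")
    if errors:
        raise FatalBusError("; ".join(errors))
```

Calling handle_error_interrupt(0x4000) prints the register value and the SERR
line, then raises FatalBusError; a status with no error bits set returns
normally.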

Without the patches to support this error reporting, these systems will hang on
boot in the face of PCI errors.

IBM System Integration Test(SIT) has defined this defect as an SIT exit gate.
QS21 GA will be delayed by every day the fix is not available in RHEL 5.1.


Version-Release number of selected component (if applicable):
2.6.18-8.EL

How reproducible:
100% given appropriate test hardware.

Steps to Reproduce:
1. To be provided by IBM
  
Actual results:
Hang/no boot response.

Expected results:
Correct error reporting & resultant panic if fatal.

Additional info:
Hardware for testing is being delivered to Westford (?) as soon as IBM resolve
final firmware issues.

-- Additional comment from breeves on 2007-07-26 06:59 EST --
Created an attachment (id=160005)
proposed patch from IBM


-- Additional comment from breeves on 2007-07-26 07:01 EST --
Created an attachment (id=160006)
proposed patch from IBM [2/3]


-- Additional comment from breeves on 2007-07-26 07:02 EST --
Created an attachment (id=160007)
proposed patch from IBM [3/3]


-- Additional comment from tao on 2007-07-26 12:05 EST --
------- Additional Comments From smoser.com (prefers email at
ssmoser.com)  2007-07-26 12:02 EDT -------
(In reply to comment #27)
> Sorry, I accidentally picked the wrong rpm. Now it works for PCIe. Still have to
> verify for PCI-X though (on a different machine).

Have you been able to do that?


This event sent from IssueTracker by Glen Johnson 
 issue 126663

-- Additional comment from tao on 2007-07-26 12:41 EST --
----- Additional Comments From Jens.Osterkamp.com (prefers email at
jens.com)  2007-07-26 12:37 EDT -------
Yes, it works for PCI-X also. 



-- Additional comment from pm-rhel on 2007-07-26 13:07 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

-- Additional comment from smoser on 2007-07-26 14:19 EST --
posted:
http://post-office.corp.redhat.com/archives/rhkernel-list/2007-July/thread.html#00836

-- Additional comment from tao on 2007-07-26 17:26 EST --
----- Additional Comments From bherren.com (prefers email at
benh.com)  2007-07-26 17:21 EDT -------
Wait, this bugzilla entry is still missing a patch that's already upstream but
not backported yet. I'll attach it today.



-- Additional comment from jturner on 2007-07-27 11:37 EST --
Patches (at least the ones posted to this point) are POWER specific.  QE
withholding ack based on:

1) need the missing patch referred to in comment 11
2) need testing results from patches applied to current Red Hat code
3) need IBM commitment on testing

-- Additional comment from tao on 2007-07-27 21:20 EST --
------- Additional Comments From smoser.com (prefers email at
ssmoser.com)  2007-07-27 21:17 EDT -------
(In reply to comment #34)
> Wait, this bugzilla entry is still missing a patch that's already upstream but
> not backported yet. I'll attach it today.
> 
Just a reminder, we're still waiting on this. 

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech


-- Additional comment from tao on 2007-07-27 21:30 EST --
----- Additional Comments From bherren.com (prefers email at
benh.com)  2007-07-27 21:28 EDT -------
Sorry for the confusion, the fix I'm talking about is the one that was
submitted in a separate entry on bug #36932 (mpic protected sources). The
comment on the latter is a bit misleading, as that patch applies not only to
the DDR errors but also to the PCI-X/PCIe ones, afaik.



-- Additional comment from smoser on 2007-07-30 09:10 EST --
(In reply to comment #12)
> Patches (at least the ones posted to this point) are POWER specific.  QE
> withholding ack based on:
> 
> 1) need the missing patch referred to in comment 11

This was a misunderstanding, probably my fault.  As Ben mentioned above, he
opened RH bug 249910 (LTC bug 36932) to address the additional issue.  There are
no further changes needed for this bug.

> 2) need testing results from patches applied to current Red Hat code

Red Hat comment 5 above mentions Jens Osterkamp's test.  He tested and verified
for both PCI-X and PCIe.  The kernel he verified with was built using brew
(http://brewweb.devel.redhat.com/brew/taskinfo?taskID=887483).  It contains the
patches as submitted to rhkernel-list applied to 2.6.18-36.EL (just for the
record, it also includes patches for RH bugs 242937 and 247658).

> 3) need IBM commitment on testing

Unless I'm mistaken, IBM has agreed to testing for all Cell platforms.


Does that address all your concerns?

-- Additional comment from breeves on 2007-07-30 09:20 EST --
Thanks Scott - all fine from my side

-- Additional comment from jturner on 2007-07-30 09:39 EST --
QE ack for the exception, then.

-- Additional comment from robbiew.com on 2007-08-02 10:58 EST --
The soon-to-be released QS21 Cell/B.E. BladeServer from IBM is supposed to
support F7, so IBM would really appreciate it if a kernel update with this patch
could be made available to F7 users.

-- Additional comment from robbiew.com on 2007-08-02 11:06 EST --
Created an attachment (id=160528)
simple patch to panic when SERR or PERR occurs on PCI-X


-- Additional comment from robbiew.com on 2007-08-02 11:07 EST --
Created an attachment (id=160529)
simple patch to panic when an error occurs on PCIe

Comment 1 Robbie Williamson 2007-08-02 15:13:55 UTC
The QS21 is also supported on Fedora Core 6, so IBM would like this patch
included in the next kernel update, if possible.  Do we need to provide a backport?

Comment 2 Robbie Williamson 2007-08-02 16:30:16 UTC
Just realized that IBM can resolve this as we provide a kernel with the Cell SDK
supported on FC6. Closing.

