Bug 250611 - No Boot /Hang response for PCI-E errors on a QS21
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel
Version: 7
Hardware: ppc64 Linux
Priority: low
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On: 249667
Blocks: 250624
Reported: 2007-08-02 10:16 EDT by Robbie Williamson
Modified: 2007-11-30 17:12 EST

Doc Type: Bug Fix
Last Closed: 2007-08-02 12:28:51 EDT


Attachments
simple patch to panic when SERR or PERR occurs on PCI-X (5.01 KB, patch)
2007-08-02 11:06 EDT, Robbie Williamson
simple patch to panic when an error occurs on PCIe (5.10 KB, patch)
2007-08-02 11:07 EDT, Robbie Williamson

Description Robbie Williamson 2007-08-02 10:16:12 EDT
+++ This bug was initially created as a clone of Bug #249667 +++

Description of problem:
When the Axon PCIe root complexes used in the IBM QS21 systems detect PCI errors
(e.g. poisoned TLP, CRC error), they assert an interrupt that has to be
caught by Linux.

The "driver" will dump out some registers, then panic. It is an extra file in
arch/powerpc/platforms/cell and does not impact other platforms.

Without the patches to support this error reporting, these systems will hang on
boot in the face of PCI errors.
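
The behavior described above (dump the error registers, then panic if the error is fatal) can be sketched in user space as follows. This is only an illustration: the bit names and the fatal/non-fatal split are assumptions made for the example, not the Axon register layout and not the code in the attached patches.

```c
#include <assert.h>   /* assert() is used by the usage checks noted below */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status bits; the real Axon registers are not given in this report. */
#define ERR_POISONED_TLP (1u << 0)
#define ERR_CRC          (1u << 1)
#define ERR_CORRECTABLE  (1u << 2)

/* Assumption: poisoned TLP and CRC errors are fatal, anything else is not. */
static bool pcie_error_is_fatal(uint32_t status)
{
    return (status & (ERR_POISONED_TLP | ERR_CRC)) != 0;
}

/* Stand-in for the interrupt handler: dump the status register, then
 * return true when the caller should panic instead of continuing. */
static bool handle_pcie_error(uint32_t status)
{
    fprintf(stderr, "PCIe error status: 0x%08x\n", (unsigned)status);
    return pcie_error_is_fatal(status);
}
```

In this sketch, `handle_pcie_error(ERR_POISONED_TLP)` asks for a panic, while `handle_pcie_error(ERR_CORRECTABLE)` does not; a real handler would call the kernel's `panic()` rather than return a flag.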

IBM System Integration Test(SIT) has defined this defect as an SIT exit gate.
QS21 GA will be delayed by every day the fix is not available in RHEL 5.1.


Version-Release number of selected component (if applicable):
2.6.18-8.EL

How reproducible:
100% given appropriate test hardware.

Steps to Reproduce:
1. To be provided by IBM
  
Actual results:
Hang/no boot response.

Expected results:
Correct error reporting & resultant panic if fatal.

Additional info:
Hardware for testing is being delivered to Westford (?) as soon as IBM resolves
final firmware issues.

-- Additional comment from breeves@redhat.com on 2007-07-26 06:59 EST --
Created an attachment (id=160005)
proposed patch from IBM


-- Additional comment from breeves@redhat.com on 2007-07-26 07:01 EST --
Created an attachment (id=160006)
proposed patch from IBM [2/3]


-- Additional comment from breeves@redhat.com on 2007-07-26 07:02 EST --
Created an attachment (id=160007)
proposed patch from IBM [3/3]


-- Additional comment from tao@redhat.com on 2007-07-26 12:05 EST --
------- Additional Comments From smoser@us.ibm.com (prefers email at
ssmoser@us.ibm.com)  2007-07-26 12:02 EDT -------
(In reply to comment #27)
> Sorry, I accidentally picked the wrong rpm. Now it works for PCIe. Still have to
> verify for PCI-X though (on a different machine).

Have you been able to do that?


This event sent from IssueTracker by Glen Johnson 
 issue 126663

-- Additional comment from tao@redhat.com on 2007-07-26 12:41 EST --
----- Additional Comments From Jens.Osterkamp@de.ibm.com (prefers email at
jens@de.ibm.com)  2007-07-26 12:37 EDT -------
Yes, it works for PCI-X also. 

-- Additional comment from pm-rhel@redhat.com on 2007-07-26 13:07 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

-- Additional comment from smoser@redhat.com on 2007-07-26 14:19 EST --
posted:
http://post-office.corp.redhat.com/archives/rhkernel-list/2007-July/thread.html#00836

-- Additional comment from tao@redhat.com on 2007-07-26 17:26 EST --
----- Additional Comments From bherren@au1.ibm.com (prefers email at
benh@au1.ibm.com)  2007-07-26 17:21 EDT -------
Wait, this bugzilla entry is still missing a patch that's already upstream but
not backported yet. I'll attach it today.

-- Additional comment from jturner@redhat.com on 2007-07-27 11:37 EST --
Patches (at least the ones posted to this point) are POWER specific.  QE
withholding ack based on:

1) need the missing patch referred to in comment 11
2) need testing results from patches applied to current Red Hat code
3) need IBM commitment on testing

-- Additional comment from tao@redhat.com on 2007-07-27 21:20 EST --
------- Additional Comments From smoser@us.ibm.com (prefers email at
ssmoser@us.ibm.com)  2007-07-27 21:17 EDT -------
(In reply to comment #34)
> Wait, this bugzilla entry is still missing a patch that's already upstream but
> not backported yet. I'll attach it today.
>
Just a reminder, we're still waiting on this.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

-- Additional comment from tao@redhat.com on 2007-07-27 21:30 EST --
----- Additional Comments From bherren@au1.ibm.com (prefers email at
benh@au1.ibm.com)  2007-07-27 21:28 EDT -------
Sorry for the confusion, the fix I'm talking about is the one that was submitted
in a separate entry on bug #36932 (mpic protected sources). The comment on the
latter is a bit misleading, as that patch doesn't only apply to the DDR errors
but also to the PCI-X/PCIe ones, afaik.

-- Additional comment from smoser@redhat.com on 2007-07-30 09:10 EST --
(In reply to comment #12)
> Patches (at least the ones posted to this point) are POWER specific.  QE
> withholding ack based on:
> 
> 1) need the missing patch referred to in comment 11

This was a misunderstanding, probably my fault.  As Ben mentioned above, he
opened RH bug 249910 (LTC bug 36932) to address the additional issue.  There are
no further changes needed for this bug.

> 2) need testing results from patches applied to current Red Hat code

Red Hat comment 5 above mentions Jens Osterkamp's test.  He tested and verified
for both PCI-X and PCIe.  The kernel he verified with was built using brew
(http://brewweb.devel.redhat.com/brew/taskinfo?taskID=887483).  It contains the
patches as submitted to rhkernel-list applied to 2.6.18-36.EL (just for the
record, it also includes patches for RH bugs 242937 and 247658).

> 3) need IBM commitment on testing

Unless I'm mistaken, IBM has agreed to testing for all Cell platforms.


Does that address all your concerns?

-- Additional comment from breeves@redhat.com on 2007-07-30 09:20 EST --
Thanks Scott - all fine from my side

-- Additional comment from jturner@redhat.com on 2007-07-30 09:39 EST --
QE ack for the exception, then.
Comment 1 Robbie Williamson 2007-08-02 10:58:11 EDT
The soon-to-be released QS21 Cell/B.E. BladeServer from IBM is supposed to
support F7, so IBM would really appreciate it if a kernel update with this patch
could be made available to F7 users.
Comment 2 Robbie Williamson 2007-08-02 11:06:10 EDT
Created attachment 160528
simple patch to panic when SERR or PERR occurs on PCI-X
Comment 3 Robbie Williamson 2007-08-02 11:07:34 EDT
Created attachment 160529
simple patch to panic when an error occurs on PCIe
Comment 4 Robbie Williamson 2007-08-02 11:15:29 EDT
Does IBM need to provide a backport of the patches?
