Bug 437384 - [QLogic/IBM 5.3 bug] EEH error on JS22 results in panic w/qla2xx 8.02.00-k5-rhel5.2-02 driver
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: ppc64
OS: All
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Marcus Barrow
QA Contact: Martin Jenner
Keywords: OtherQA, Reopened
Depends On: 434857
Blocks: 415811
 
Reported: 2008-03-13 15:48 EDT by IBM Bug Proxy
Modified: 2009-06-19 22:42 EDT

Doc Type: Bug Fix
Last Closed: 2008-04-09 20:37:35 EDT


Attachments
fix EEH (6.92 KB, patch), 2008-03-24 13:06 EDT, Marcus Barrow
Test/Debug Notes_0325 (13.61 KB, text/plain), 2008-03-25 15:01 EDT, IBM Bug Proxy
EEH mask out (1.58 KB, patch), 2008-03-27 02:07 EDT, Marcus Barrow


External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 43135 None None None Never

Description IBM Bug Proxy 2008-03-13 15:48:34 EDT
=Comment: #0=================================================
Richard A. Lary <rlary@us.ibm.com> - 2008-03-13 15:29 EDT
---Problem Description---
EEH error on JS22 results in panic w/qla2xx 8.02.00-k5-rhel5.2-02 driver

 
Contact Information = rlary@us.ibm.com
 
---uname output---
2.6.18-84.el5
 
Machine Type = JS22
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
Inject error using IBM errinjct tool:
# errinjct ioa-bus-error -f 16 -s scsi_host/host1; fdisk -l /dev/sdb

 
---Kernel - Drivers Component Data--- 
Stack trace output:
    Call Trace:
[C00000000FFCBA20] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCBAC0] [C00000000004FD60] .eeh_dn_check_failure+0x108/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Kernel panic - not syncing: EEH: MMIO halt (2) on device:0003:00:01.0

 <3>Badness in smp_call_function at arch/powerpc/kernel/smp.c:224
Call Trace:
[C00000000FFCB510] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCB5B0] [C0000000003BBEF4] .program_check_exception+0x1cc/0x588
[C00000000FFCB680] [C0000000000047F4] program_check_common+0xf4/0x100
--- Exception: 700 at .smp_call_function+0x2c/0x20c
    LR = .panic+0x98/0x1b0
[C00000000FFCB970] [C00000000FFCBA20] 0xc00000000ffcba20 (unreliable)
[C00000000FFCBA20] [C00000000006B788] .panic+0x98/0x1b0
[C00000000FFCBAC0] [C00000000004FD90] .eeh_dn_check_failure+0x138/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Rebooting in 180 seconds..
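
For readers wondering how a register read in qla24xx_soft_reset ends up in eeh_check_failure: as far as I understand the pSeries EEH design, a frozen PCI slot returns all-ones on MMIO loads, and the read path treats an all-ones result as a possible EEH event and then checks the slot state. The sketch below is a userspace mock of that idea only, not kernel code. The names mock_slot, slot_frozen and mock_readl are hypothetical and do not appear in the kernel or the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI slot; not a kernel structure. */
struct mock_slot {
    bool     frozen;     /* true once an EEH error has isolated the slot */
    uint32_t reg_value;  /* value the device returns while healthy */
    int      checks;     /* how many times the slow-path check ran */
};

/* Slow path: in the trace above this role is played by eeh_check_failure(). */
static bool slot_frozen(struct mock_slot *slot)
{
    slot->checks++;
    return slot->frozen;
}

/*
 * Mock 32-bit MMIO read. A frozen slot returns all-ones, so only an
 * all-ones result triggers the slot-state check.
 * Returns true if the value can be trusted, false if the slot is frozen.
 */
static bool mock_readl(struct mock_slot *slot, uint32_t *out)
{
    uint32_t val = slot->frozen ? UINT32_MAX : slot->reg_value;

    *out = val;
    if (val == UINT32_MAX && slot_frozen(slot))
        return false;  /* caller should stop touching the device */
    return true;
}
```

In the mock the caller gets a chance to back off. In the trace above, the check runs from interrupt context inside the firmware dump path, and eeh_dn_check_failure calls panic instead of returning.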
 

 
Oops output:
  no
 
System Dump Info:
  The system is not configured to capture a system dump.
 
*Additional Instructions for rlary@us.ibm.com: 
-Post a private note with access information to the machine that the bug is 
occurring on. 
-Attach sysctl -a output to the bug.
=Comment: #2=================================================
Richard A. Lary <rlary@us.ibm.com> - 2008-03-13 15:32 EDT
QLogic will supply patches to Red Hat to eliminate the panic.  These patches 
depend on the patch in Red Hat bugzilla 434857, which initializes 
pdev->error_state to a sane value.

Patches will be submitted directly to Red Hat by Marcus Barrow 
(mbarrow@redhat.com)
Comment 2 Andrius Benokraitis 2008-03-20 13:32:09 EDT
Marcus - please post the patch ASAP when you get it...
Comment 3 Marcus Barrow 2008-03-24 13:06:56 EDT
Created attachment 298907 [details]
fix EEH
Comment 4 IBM Bug Proxy 2008-03-24 13:56:50 EDT
------- Comment From rlary@us.ibm.com 2008-03-24 13:52 EDT-------
Thank you for the patch. I will apply it to the RHEL5.2-snap1 2.6.18-85.el5
kernel and test on a p6 JS22 blade.
Comment 6 IBM Bug Proxy 2008-03-24 23:56:28 EDT
------- Comment From rlary@us.ibm.com 2008-03-24 23:50 EDT-------
I was able to apply the patch to the 2.6.18-85.el5 kernel and the
8.02.00-k5-rhel5.2-03 qla2xxx driver (with two minor tweaks); however, testing
is blocked because the IBM tool used to inject EEH errors is non-functional on
RHEL5.2 (LTC bugzilla 42373).
Comment 7 Marcus Barrow 2008-03-25 08:55:07 EDT
Hi Richard, I found one issue: my tree and Red Hat's diverged because they took an old version of 
a patch. It was a whitespace issue; this part of the patch should be reversed, or I should just change 
my tree back. It looks like this:
-                                       if (fcport->login_retry == 0 &&
-                                           status != QLA_SUCCESS)
+                                       if (fcport->login_retry == 0 && status != QLA_SUCCESS)


Can you describe the other change you needed? I want these patches to apply cleanly...

Comment 8 IBM Bug Proxy 2008-03-25 09:24:40 EDT
------- Comment From rlary@us.ibm.com 2008-03-25 09:18 EDT-------
Marcus the other tweak was in qla2xxx_version.h, I was confused by the .rej
file contents:
***************
*** 7,13 ****
/*
* Driver version
*/
- #define QLA2XXX_VERSION      "8.02.00-k5-rhel5.2-04"

#define QLA_DRIVER_MAJOR_VER  8
#define QLA_DRIVER_MINOR_VER  1
--- 7,13 ----
/*
* Driver version
*/
+ #define QLA2XXX_VERSION      "8.02.00-k5-rhel5.2-05"

#define QLA_DRIVER_MAJOR_VER  8
#define QLA_DRIVER_MINOR_VER  1

But the issue seemed to be that the driver version in the RHEL5.2-2008013
kernel was 8.02.00-k5-rhel5.2-03.
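
The rejected hunk above comes down to a string mismatch: the hunk asks to remove the "-04" version line, but the tree being patched still carries "-03", so the line to be removed is not found. A minimal sketch of that comparison follows. The helper name hunk_line_matches is hypothetical, and this is only an illustration of the check, not how the patch tool is implemented.

```c
#include <string.h>

/*
 * Returns 1 if the line a hunk wants to remove is exactly the line
 * present in the tree, 0 otherwise. A hunk whose removed line does
 * not match cannot be applied and ends up in a .rej file.
 */
static int hunk_line_matches(const char *tree_line, const char *hunk_old_line)
{
    return strcmp(tree_line, hunk_old_line) == 0;
}
```

Called with the tree's "-03" line and the hunk's "-04" line, the helper returns 0, which corresponds to the rejection described above.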
Comment 9 Marcus Barrow 2008-03-25 09:50:28 EDT
Great. There is a firmware update as version -04. We may have this issue now, but I am hoping that our 
new policy of bumping the version number for every patch, especially through the snapshot cycles, will work 
better for you and everyone else.

More numbers, but maybe now things will be clearer...

Thanks.

Comment 10 IBM Bug Proxy 2008-03-25 10:48:44 EDT
------- Comment From rlary@us.ibm.com 2008-03-25 10:42 EDT-------
Removing blocking LTC bugzilla

------- Comment From rlary@us.ibm.com 2008-03-25 10:46 EDT-------
I am able to inject EEH errors now. I am seeing unexpected behaviour in the way
qla2xxx is responding to EEH errors, investigating.  We may need QLogic to
set up access to one of their JS22 blades in case we need additional test/debug.
Additional details forthcoming...
Comment 12 IBM Bug Proxy 2008-03-25 15:01:03 EDT
Created attachment 299058 [details]
Test/Debug Notes_0325

Test and debug notes for current driver and patches.  It would be good for
QLogic to replicate the issues I am seeing on the JS22 that QLogic has.
Comment 13 Marcus Barrow 2008-03-25 15:12:53 EDT
It appeared to me that the following line in pci_scan_device() performs the proper initialization:

./drivers/pci/probe.c:  dev->error_state = pci_channel_io_normal;


Does applying the SLES 10 fix cure your problem? I am happy to apply it if required...
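
To illustrate why the initial value of error_state matters, here is a small userspace mock. It assumes the channel states are numbered 1 (normal), 2 (frozen) and 3 (permanent failure), as I recall them from enum pci_channel_state in linux/pci.h; treat those numbers as an assumption. The mock_ types and functions are hypothetical and are not taken from the kernel or the qla2xxx patches.

```c
/* Assumed to mirror enum pci_channel_state; the values are an assumption. */
enum mock_channel_state {
    MOCK_IO_NORMAL       = 1,
    MOCK_IO_FROZEN       = 2,
    MOCK_IO_PERM_FAILURE = 3,
};

/* Hypothetical stand-in for the error_state field of struct pci_dev. */
struct mock_pci_dev {
    unsigned int error_state;
};

/* What the quoted line in pci_scan_device() does for a newly scanned device. */
static void mock_scan_init(struct mock_pci_dev *dev)
{
    dev->error_state = MOCK_IO_NORMAL;
}

/* A driver-side guard: only touch registers while the channel is normal. */
static int mock_mmio_allowed(const struct mock_pci_dev *dev)
{
    return dev->error_state == MOCK_IO_NORMAL;
}
```

A zero-filled device has error_state 0, which matches none of the three states, so a guard like mock_mmio_allowed would refuse I/O on a perfectly healthy device until something sets the field to normal.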


Comment 15 Marcus Barrow 2008-03-27 02:07:54 EDT
Created attachment 299284 [details]
EEH mask out
Comment 16 Marcus Barrow 2008-03-27 02:10:42 EDT
I attached a patch to mask out the EEH code until it is known to be working. The patch bumps the version
number (because it's the same patch I submitted to the Red Hat kernel), which is not correct for testing 
if you apply it out of sequence... 


Comment 17 IBM Bug Proxy 2008-03-27 11:49:23 EDT
------- Comment From rlary@us.ibm.com 2008-03-27 11:47 EDT-------
(In reply to comment #20)
> ------- Comment From mbarrow@redhat.com 2008-03-27 02:10 EST-------
> I attached a patch to mask out the EEH code until it is known to be working.
> The patch bumps the version number (because it's the same patch I submitted
> to the Red Hat kernel), which is not correct for testing use if you use this
> out of sequence...

Thank you, will test and update bugzilla.
Comment 19 Marcus Barrow 2008-04-08 12:33:18 EDT
We would like to defer this feature to rhel 5.3, to allow more testing and evaluation.

Thank-you

Comment 20 IBM Bug Proxy 2008-04-08 12:50:51 EDT
------- Comment From rlary@us.ibm.com 2008-04-08 12:44 EDT-------
(In reply to comment #26)
> ------- Comment From mbarrow@redhat.com 2008-04-08 12:33 EST-------
> We would like to defer this feature to rhel 5.3, to allow more testing and
> evaluation.

Marcus,
The feature being deferred to RHEL 5.3 is the support for PCI EEH.  For RHEL5.2
we do need Red Hat to pick up the "EEH mask out" patch attached above, which
disables the qla2xxx driver EEH code for RHEL5.2 and changes the driver version
to 8.02.00-k5-rhel5.2-06.

I was not sure if that is what you were saying in your comment.
Comment 21 Andrius Benokraitis 2008-04-08 15:12:44 EDT
Marcus, I don't think this is a feature since this causes a system crash.
Comment 22 IBM Bug Proxy 2008-04-08 15:33:06 EDT
------- Comment From rlary@us.ibm.com 2008-04-08 15:26 EDT-------
(In reply to comment #28)
> ------- Comment From andriusb@redhat.com 2008-04-08 15:12 EST-------
> Marcus, I don't think this is a feature since this causes a system crash.

To clarify, PCI EEH handling in the qla2xxx driver was one of a list of new
features for the RHEL5.2 qla2xxx driver.  During testing it was discovered that
there were some deficiencies in the underlying kernel EEH driver which were
standing in the way of fixing the qla2xxx EEH code in time to make RHEL5.2.
IBM requested QLogic to disable the EEH code callback, which was accomplished
by the patch named "EEH mask out". This patch effectively renders the driver's
EEH code harmless by disabling the driver callback, thus, in a sense, fixing
the panic.
Further work on the kernel EEH code and qla2xxx driver code will resume after
RHEL5.2 in advance of RHEL5.3.
I have tested the patch above by applying it myself; however, I was
waiting to mark the patch tested until it appeared in the next Red Hat snapshot
release.
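
The contents of the "EEH mask out" patch are not shown in this report, so the following is only a sketch of one plausible way a driver callback can be disabled: leaving the driver's error-handler pointer unset, so that the recovery code finds no callback to invoke. All mock_ names are hypothetical. The layout loosely follows the err_handler member of struct pci_driver, which is an assumption about how the real patch works.

```c
#include <stddef.h>

/* Hypothetical, simplified counterpart of struct pci_error_handlers. */
struct mock_error_handlers {
    int (*error_detected)(void *dev);
};

/* Hypothetical, simplified counterpart of struct pci_driver. */
struct mock_driver {
    const char *name;
    const struct mock_error_handlers *err_handler; /* NULL = masked out */
};

enum mock_recovery {
    MOCK_RECOVER_VIA_DRIVER,     /* call into the driver's EEH callbacks */
    MOCK_RECOVER_WITHOUT_DRIVER, /* driver EEH code is never entered */
};

/* Example callback; a real one would quiesce I/O and report a result. */
static int mock_error_detected(void *dev)
{
    (void)dev;
    return 0;
}

/* Decide whether the driver's callbacks take part in recovery. */
static enum mock_recovery mock_pick_recovery(const struct mock_driver *drv)
{
    if (drv == NULL || drv->err_handler == NULL ||
        drv->err_handler->error_detected == NULL)
        return MOCK_RECOVER_WITHOUT_DRIVER;
    return MOCK_RECOVER_VIA_DRIVER;
}
```

With err_handler left NULL, the driver's EEH code is never reached, which is the sense in which the comment above calls it harmless.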
Comment 23 Andrius Benokraitis 2008-04-08 15:54:30 EDT
rlary: Looks like we don't need the mask-out patch, since the original patch to
enable EEH (feature) was never committed to 5.2 to begin with. So as Tom stated
to me on the phone, "if we have both patches, we're safe; but if we have neither
of the patches, we're safe." Since neither have been committed to the tree, we
should be OK for 5.2, and to defer this entire BZ to 5.3 per Marcus.

I would assume this bug would get duped to a more general qla2xxx bugzilla for
5.3 when it is created by QLogic. In the meantime, we'll leave this open and
proposed for 5.3.
Comment 24 IBM Bug Proxy 2008-04-09 15:33:14 EDT
------- Comment From rlary@us.ibm.com 2008-04-09 15:26 EDT-------
(In reply to comment #30)
> ------- Comment From andriusb@redhat.com 2008-04-08 15:54 EST-------
> rlary: Looks like we don't need the mask-out patch, since the original patch
> to enable EEH (feature) was never committed to 5.2 to begin with. So as Tom
> stated to me on the phone, "if we have both patches, we're safe; but if we
> have neither of the patches, we're safe." Since neither have been committed
> to the tree, we should be OK for 5.2, and to defer this entire BZ to 5.3 per
> Marcus.
> I would assume this bug would get duped to a more general qla2xxx bugzilla for
> 5.3 when it is created by QLogic. In the meantime, we'll leave this open and
> proposed for 5.3.

Sorry I did not reply sooner, I thought the above comment was a statement
rather than a question.

I have discussed the 'neither patch' option with Marcus Barrow.  We agree
that option will be satisfactory for RHEL5.2 and we will resume work on a more
complete EEH solution for RHEL5.3.

This bugzilla can be deferred to RHEL5.3
Comment 25 Marcus Barrow 2008-04-09 20:37:35 EDT
The EEH work has been retracted from the 5.2 kernel. Since this work will be re-submitted after further 
testing, these and other fixes will be included in that work and this bugzilla is no longer needed.

So closing this as NOTABUG.

Comment 26 IBM Bug Proxy 2008-04-11 16:25:20 EDT
------- Comment From rlary@us.ibm.com 2008-04-11 16:19 EDT-------
Marking this bugzilla as closed.  Work on this bug has been deferred to
RHEL5.3.  Work will be tracked in [Bug 44012]  - RH 253267: [QLogic 5.3 feat]
[1/n] Update qla2xxx - PCI EE error handling support
