Bug 437384

Summary: [QLogic/IBM 5.3 bug] EEH error on JS22 results in panic w/qla2xx 8.02.00-k5-rhel5.2-02 driver
Product: Red Hat Enterprise Linux 5
Reporter: IBM Bug Proxy <bugproxy>
Component: kernel
Assignee: Marcus Barrow <mbarrow>
Status: CLOSED NOTABUG
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Docs Contact:
Priority: high
Version: 5.2
CC: andriusb, bpeters, coughlan, mbarrow, qlogic-redhat-ext, seokmann.ju
Target Milestone: rc
Keywords: OtherQA, Reopened
Target Release: ---
Hardware: ppc64
OS: All
Doc Type: Bug Fix
Last Closed: 2008-04-10 00:37:35 UTC
Bug Depends On: 434857    
Bug Blocks: 415811    
Attachments:
Description              Flags
fix EEH                  none
Test/Debug Notes_0325    none
EEH mask out             none

Description IBM Bug Proxy 2008-03-13 19:48:34 UTC
=Comment: #0=================================================
Richard A. Lary <rlary.com> - 2008-03-13 15:29 EDT
---Problem Description---
EEH error on JS22 results in panic w/qla2xx 8.02.00-k5-rhel5.2-02 driver

 
Contact Information = rlary.com
 
---uname output---
2.6.18-84.el5
 
Machine Type = JS22
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
Inject error using IBM errinjct tool:
# errinjct ioa-bus-error -f 16 -s scsi_host/host1; fdisk -l /dev/sdb

 
---Kernel - Drivers Component Data--- 
Stack trace output:
    Call Trace:
[C00000000FFCBA20] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCBAC0] [C00000000004FD60] .eeh_dn_check_failure+0x108/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Kernel panic - not syncing: EEH: MMIO halt (2) on device:0003:00:01.0

 <3>Badness in smp_call_function at arch/powerpc/kernel/smp.c:224
Call Trace:
[C00000000FFCB510] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCB5B0] [C0000000003BBEF4] .program_check_exception+0x1cc/0x588
[C00000000FFCB680] [C0000000000047F4] program_check_common+0xf4/0x100
--- Exception: 700 at .smp_call_function+0x2c/0x20c
    LR = .panic+0x98/0x1b0
[C00000000FFCB970] [C00000000FFCBA20] 0xc00000000ffcba20 (unreliable)
[C00000000FFCBA20] [C00000000006B788] .panic+0x98/0x1b0
[C00000000FFCBAC0] [C00000000004FD90] .eeh_dn_check_failure+0x138/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Rebooting in 180 seconds..
 

 
Oops output:
  no
 
System Dump Info:
  The system is not configured to capture a system dump.
 
*Additional Instructions for rlary.com: 
-Post a private note with access information to the machine that the bug is 
occurring on. 
-Attach sysctl -a output to the bug.
=Comment: #2=================================================
Richard A. Lary <rlary.com> - 2008-03-13 15:32 EDT
QLogic will supply patches to Red Hat that eliminate the panic.  These patches 
depend on the patch in Red Hat bugzilla 434857, which initializes 
pdev->error_state to a sane value.

Patches will be submitted directly to Red Hat by Marcus Barrow 
(mbarrow)
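
To illustrate why driver-side EEH handling would depend on a sane
pdev->error_state, here is a minimal userspace sketch in C. It is not the
qla2xxx code and not the patch from bugzilla 434857. All mock_* names are
hypothetical. Only the numeric values 1 to 3 follow the kernel's
enum pci_channel_state; the zero "unset" state is an assumption about what a
never-initialized field would hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock channel states. Values 1..3 mirror enum pci_channel_state;
 * MOCK_IO_UNSET (0) is an assumption modelling a never-initialized field. */
enum mock_channel_state {
    MOCK_IO_UNSET = 0,
    MOCK_IO_NORMAL = 1,
    MOCK_IO_FROZEN = 2,
    MOCK_IO_PERM_FAILURE = 3
};

/* Hypothetical stand-in for the one struct pci_dev field that matters here. */
struct mock_pci_dev {
    enum mock_channel_state error_state;
};

/* Models a bus scan that puts the field into the normal state. */
void mock_scan_device(struct mock_pci_dev *pdev)
{
    assert(pdev != NULL);
    pdev->error_state = MOCK_IO_NORMAL;
}

/* Driver-side guard: touch device registers only while the channel is normal. */
bool mock_mmio_allowed(const struct mock_pci_dev *pdev)
{
    return pdev != NULL && pdev->error_state == MOCK_IO_NORMAL;
}
```

In this model a device whose field was never initialized fails the guard even
though nothing is wrong with it, so any recovery logic keyed on the field
would misjudge a healthy device. This is an illustration of the dependency,
not a diagnosis of the panic above.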

Comment 2 Andrius Benokraitis 2008-03-20 17:32:09 UTC
Marcus - please post the patch ASAP when you get it...

Comment 3 Marcus Barrow 2008-03-24 17:06:56 UTC
Created attachment 298907 [details]
fix EEH

Comment 4 IBM Bug Proxy 2008-03-24 17:56:50 UTC
------- Comment From rlary.com 2008-03-24 13:52 EDT-------
Thank you for the patch. I will apply it to the RHEL5.2-snap1 2.6.18-85.el5
kernel and test on a p6 JS22 blade.

Comment 6 IBM Bug Proxy 2008-03-25 03:56:28 UTC
------- Comment From rlary.com 2008-03-24 23:50 EDT-------
I was able to apply the patch to the 2.6.18-85.el5 kernel and the
8.02.00-k5-rhel5.2-03 qla2xxx driver (with two minor tweaks). However, testing
is blocked because the IBM tool used to inject EEH errors is non-functional on
RHEL5.2 (LTC bugzilla 42373).

Comment 7 Marcus Barrow 2008-03-25 12:55:07 UTC
Hi Richard, I found one issue: my tree and Red Hat's diverged because they took an old version of 
a patch. It was a whitespace issue; this part of the patch should be reversed, or I should just change 
my tree back. It looks like this:
-                                       if (fcport->login_retry == 0 &&
-                                           status != QLA_SUCCESS)
+                                       if (fcport->login_retry == 0 && status != QLA_SUCCESS)


Can you describe the other change you needed? I want these patches to apply cleanly...



Comment 8 IBM Bug Proxy 2008-03-25 13:24:40 UTC
------- Comment From rlary.com 2008-03-25 09:18 EDT-------
Marcus, the other tweak was in qla2xxx_version.h. I was confused by the .rej
file contents:
***************
*** 7,13 ****
/*
* Driver version
*/
- #define QLA2XXX_VERSION      "8.02.00-k5-rhel5.2-04"

#define QLA_DRIVER_MAJOR_VER  8
#define QLA_DRIVER_MINOR_VER  1
--- 7,13 ----
/*
* Driver version
*/
+ #define QLA2XXX_VERSION      "8.02.00-k5-rhel5.2-05"

#define QLA_DRIVER_MAJOR_VER  8
#define QLA_DRIVER_MINOR_VER  1

But the issue seemed to be that the driver version in the RHEL5.2-2008013
kernel was 8.02.00-k5-rhel5.2-03.

Comment 9 Marcus Barrow 2008-03-25 13:50:28 UTC
Great. There is a firmware update as version -04. We may have this issue now, but I am hoping that our 
new policy of bumping the version number for every patch, especially through the snapshot cycles, will 
work better for you and everyone else.

More numbers, but maybe now things will be clearer...

Thanks.



Comment 10 IBM Bug Proxy 2008-03-25 14:48:44 UTC
------- Comment From rlary.com 2008-03-25 10:42 EDT-------
Removing blocking LTC bugzilla

------- Comment From rlary.com 2008-03-25 10:46 EDT-------
I am able to inject EEH errors now. I am seeing unexpected behaviour in the way
qla2xxx responds to EEH errors and am investigating.  We may need QLogic to
set up access to one of their JS22 blades in case we need additional test/debug.
Additional details forthcoming...

Comment 12 IBM Bug Proxy 2008-03-25 19:01:03 UTC
Created attachment 299058 [details]
Test/Debug Notes_0325

Test and debug notes for the current driver and patches.  It would be good for
QLogic to replicate the issues I am seeing on the JS22 that QLogic has.

Comment 13 Marcus Barrow 2008-03-25 19:12:53 UTC
It appeared to me that the following line in pci_scan_device() performs the proper initialization:

./drivers/pci/probe.c:  dev->error_state = pci_channel_io_normal;


Does applying the SLES 10 fix cure your problem? I am happy to apply it if required...




Comment 15 Marcus Barrow 2008-03-27 06:07:54 UTC
Created attachment 299284 [details]
EEH mask out

Comment 16 Marcus Barrow 2008-03-27 06:10:42 UTC
I attached a patch to mask out the EEH code until it is known to be working. The patch bumps the version
number (because it's the same patch I submitted to the Red Hat kernel), which is not correct for testing 
if you use this out of sequence...




Comment 17 IBM Bug Proxy 2008-03-27 15:49:23 UTC
------- Comment From rlary.com 2008-03-27 11:47 EDT-------
(In reply to comment #20)
> ------- Comment From mbarrow 2008-03-27 02:10 EST-------
> I attached a patch to mask out the EEH code until it is known to be working.
> The patch bumps the version number (because it's the same patch I submitted
> to the Red Hat kernel), which is not correct for testing if you use this out
> of sequence...

Thank you, will test and update bugzilla.

Comment 19 Marcus Barrow 2008-04-08 16:33:18 UTC
We would like to defer this feature to rhel 5.3, to allow more testing and evaluation.

Thank-you



Comment 20 IBM Bug Proxy 2008-04-08 16:50:51 UTC
------- Comment From rlary.com 2008-04-08 12:44 EDT-------
(In reply to comment #26)
> ------- Comment From mbarrow 2008-04-08 12:33 EST-------
> We would like to defer this feature to rhel 5.3, to allow more testing and
> evaluation.

Marcus,
The feature being deferred to RHEL 5.3 is the support for PCI EEH.  For RHEL5.2
we do need Red Hat to pick up the EEH mask out patch attachment above, which
disables the qla2xxx driver EEH code for RHEL5.2 and changes the driver version
to 8.02.00-k5-rhel5.2-06.

I was not sure if that is what you were saying in your comment.

Comment 21 Andrius Benokraitis 2008-04-08 19:12:44 UTC
Marcus, I don't think this is a feature since this causes a system crash.

Comment 22 IBM Bug Proxy 2008-04-08 19:33:06 UTC
------- Comment From rlary.com 2008-04-08 15:26 EDT-------
(In reply to comment #28)
> ------- Comment From andriusb 2008-04-08 15:12 EST-------
> Marcus, I don't think this is a feature since this causes a system crash.

To clarify, PCI EEH handling in the qla2xxx driver was one of a list of new
features for the RHEL5.2 qla2xxx driver.  During testing it was discovered that
deficiencies in the underlying kernel EEH driver were standing in the way of
fixing the qla2xxx EEH code in time to make RHEL5.2.
IBM requested QLogic to disable the EEH code callback, which was accomplished
by the patch named "EEH mask out". This patch effectively renders the driver's
EEH code harmless by disabling the driver callback, thus, in a sense, fixing
the panic.
Further work on the kernel EEH code and qla2xxx driver code will resume after
RHEL5.2 in advance of RHEL5.3.
I have tested the patch above by applying the patch myself, however I was
waiting to mark the patch tested until it appeared in the next Red Hat snapshot
release.
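
The effect of disabling the driver callback can be sketched in userspace C.
This is a model built on assumptions, not the "EEH mask out" attachment
itself: the mock_* types, result codes, and functions are hypothetical,
loosely shaped after the kernel's struct pci_error_handlers and the
err_handler pointer in struct pci_driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, loosely modelled on pci_ers_result_t. */
enum mock_ers_result {
    MOCK_ERS_NONE = 0,       /* no driver callback was consulted */
    MOCK_ERS_NEED_RESET = 1  /* driver asked for a slot reset */
};

/* Hypothetical stand-in for struct pci_error_handlers. */
struct mock_error_handlers {
    enum mock_ers_result (*error_detected)(void *dev);
};

/* Hypothetical stand-in for the err_handler pointer in struct pci_driver. */
struct mock_pci_driver {
    const char *name;
    const struct mock_error_handlers *err_handler;
};

int mock_callback_calls = 0;  /* counts driver callback invocations */

enum mock_ers_result mock_error_detected(void *dev)
{
    (void)dev;
    mock_callback_calls++;
    return MOCK_ERS_NEED_RESET;
}

const struct mock_error_handlers mock_handlers = {
    .error_detected = mock_error_detected
};

/* Driver with its error-recovery callback registered. */
const struct mock_pci_driver mock_driver_enabled = {
    .name = "mock-eeh-enabled", .err_handler = &mock_handlers
};

/* Driver with the callback masked out: nothing is registered. */
const struct mock_pci_driver mock_driver_masked = {
    .name = "mock-eeh-masked", .err_handler = NULL
};

/* Models the core reporting an error: only a registered callback runs. */
enum mock_ers_result mock_report_error(const struct mock_pci_driver *drv,
                                       void *dev)
{
    assert(drv != NULL);
    if (drv->err_handler == NULL || drv->err_handler->error_detected == NULL)
        return MOCK_ERS_NONE;
    return drv->err_handler->error_detected(dev);
}
```

With the masked driver, mock_report_error() returns without ever entering
driver code, which is the sense in which unregistering the callback makes the
driver's recovery path harmless. How the real kernel then treats a driver
without handlers is outside this sketch.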

Comment 23 Andrius Benokraitis 2008-04-08 19:54:30 UTC
rlary: Looks like we don't need the mask-out patch, since the original patch to
enable EEH (feature) was never committed to 5.2 to begin with. So as Tom stated
to me on the phone, "if we have both patches, we're safe; but if we have neither
of the patches, we're safe." Since neither have been committed to the tree, we
should be OK for 5.2, and to defer this entire BZ to 5.3 per Marcus.

I would assume this bug would get duped to a more general qla2xxx bugzilla for
5.3 when it is created by QLogic. In the meantime, we'll leave this open and
proposed for 5.3.

Comment 24 IBM Bug Proxy 2008-04-09 19:33:14 UTC
------- Comment From rlary.com 2008-04-09 15:26 EDT-------
(In reply to comment #30)
> ------- Comment From andriusb 2008-04-08 15:54 EST-------
> rlary: Looks like we don't need the mask-out patch, since the original patch
> to enable EEH (feature) was never committed to 5.2 to begin with. So as Tom
> stated to me on the phone, "if we have both patches, we're safe; but if we
> have neither of the patches, we're safe." Since neither have been committed
> to the tree, we should be OK for 5.2, and to defer this entire BZ to 5.3 per
> Marcus.
> I would assume this bug would get duped to a more general qla2xxx bugzilla for
> 5.3 when it is created by QLogic. In the meantime, we'll leave this open and
> proposed for 5.3.

Sorry I did not reply sooner, I thought the above comment was a statement
rather than a question.

I have discussed the 'neither patches' option with Marcus Barrow.  We agree
that option will be satisfactory for RHEL5.2 and we will resume work on a more
complete EEH solution for RHEL5.3.

This bugzilla can be deferred to RHEL5.3

Comment 25 Marcus Barrow 2008-04-10 00:37:35 UTC
The EEH work has been retracted from the 5.2 kernel. Since this work will be re-submitted after further 
testing, these and other fixes will be included in that work and this bugzilla is no longer needed.

So closing this as NOTABUG.



Comment 26 IBM Bug Proxy 2008-04-11 20:25:20 UTC
------- Comment From rlary.com 2008-04-11 16:19 EDT-------
Marking this bugzilla as closed.  Work on this bug has been deferred to
RHEL5.3.  Work will be tracked in [Bug 44012]  - RH 253267: [QLogic 5.3 feat]
[1/n] Update qla2xxx - PCI EE error handling support