=Comment: #0=================================================
Richard A. Lary <rlary.com> - 2008-03-13 15:29 EDT
---Problem Description---
EEH error on JS22 results in panic w/qla2xxx 8.02.00-k5-rhel5.2-02 driver

Contact Information = rlary.com

---uname output---
2.6.18-84.el5

Machine Type = JS22

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Inject error using IBM errinjct tool:
# errinjct ioa-bus-error -f 16 -s scsi_host/host1; fdisk -l /dev/sdb

---Kernel - Drivers Component Data---
Stack trace output:
Call Trace:
[C00000000FFCBA20] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCBAC0] [C00000000004FD60] .eeh_dn_check_failure+0x108/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Kernel panic - not syncing: EEH: MMIO halt (2) on device:0003:00:01.0
<3>Badness in smp_call_function at arch/powerpc/kernel/smp.c:224
Call Trace:
[C00000000FFCB510] [C000000000010378] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FFCB5B0] [C0000000003BBEF4] .program_check_exception+0x1cc/0x588
[C00000000FFCB680] [C0000000000047F4] program_check_common+0xf4/0x100
--- Exception: 700 at .smp_call_function+0x2c/0x20c
    LR = .panic+0x98/0x1b0
[C00000000FFCB970] [C00000000FFCBA20] 0xc00000000ffcba20 (unreliable)
[C00000000FFCBA20] [C00000000006B788] .panic+0x98/0x1b0
[C00000000FFCBAC0] [C00000000004FD90] .eeh_dn_check_failure+0x138/0x29c
[C00000000FFCBB70] [C000000000050104] .eeh_check_failure+0xe0/0x108
[C00000000FFCBBF0] [D00000000047BF60] .qla24xx_soft_reset+0x450/0x49c [qla2xxx]
[C00000000FFCBCA0] [D000000000483454] .qla24xx_fw_dump+0x120c/0x1370 [qla2xxx]
[C00000000FFCBD70] [D000000000478088] .qla24xx_intr_handler+0x1c4/0x570 [qla2xxx]
[C00000000FFCBE40] [C0000000000AA9C0] .handle_IRQ_event+0x7c/0xf8
[C00000000FFCBEF0] [C0000000000ACB44] .handle_fasteoi_irq+0x100/0x1bc
[C00000000FFCBF90] [C000000000026CD4] .call_handle_irq+0x1c/0x2c
[C00000001C1BEEF0] [C00000000000D0A4] .do_IRQ+0xf4/0x1a4
[C00000001C1BEF80] [C0000000000044F4] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
    LR = .qla24xx_start_scsi+0x650/0x720 [qla2xxx]
[C00000001C1BF270] [D00000000047306C] .qla24xx_start_scsi+0x610/0x720 [qla2xxx] (unreliable)
[C00000001C1BF340] [D000000000462B94] .qla24xx_queuecommand+0x164/0x22c [qla2xxx]
[C00000001C1BF3F0] [D0000000001D0EE8] .scsi_dispatch_cmd+0x294/0x370 [scsi_mod]
[C00000001C1BF490] [D0000000001D8FF0] .scsi_request_fn+0x340/0x494 [scsi_mod]
[C00000001C1BF530] [C0000000001A587C] .__generic_unplug_device+0x54/0x6c
[C00000001C1BF5B0] [C0000000001A7710] .generic_unplug_device+0x30/0x78
[C00000001C1BF640] [C0000000001A82D8] .blk_backing_dev_unplug+0x90/0xa8
[C00000001C1BF6D0] [C0000000000F109C] .block_sync_page+0x78/0x90
[C00000001C1BF750] [C0000000000B7658] .sync_page+0x74/0x98
[C00000001C1BF7D0] [C0000000003B9824] .__wait_on_bit_lock+0x8c/0x110
[C00000001C1BF870] [C0000000000B7474] .__lock_page+0x74/0x98
[C00000001C1BF950] [C0000000000B8204] .do_generic_mapping_read+0x254/0x4e8
[C00000001C1BFAA0] [C0000000000B8F70] .__generic_file_aio_read+0x164/0x20c
[C00000001C1BFB70] [C0000000000BA59C] .generic_file_read+0x94/0xcc
[C00000001C1BFCF0] [C0000000000EF94C] .vfs_read+0x118/0x200
[C00000001C1BFD90] [C0000000000EFE30] .sys_read+0x4c/0x8c
[C00000001C1BFE30] [C0000000000086A4] syscall_exit+0x0/0x40
Rebooting in 180 seconds..

Oops output: no

System Dump Info: The system is not configured to capture a system dump.

*Additional Instructions for rlary.com:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

=Comment: #2=================================================
Richard A. Lary <rlary.com> - 2008-03-13 15:32 EDT
QLogic will supply Red Hat with patches to eliminate the panic. These patches depend on the patch in Red Hat bugzilla 434857, which initializes pdev->error_state to a sane value. Marcus Barrow (mbarrow) will submit the patches directly to Red Hat.
Marcus - please post the patch ASAP when you get it...
Created attachment 298907 [details] fix EEH
------- Comment From rlary.com 2008-03-24 13:52 EDT-------
Thank you for the patch. I will apply it to the RHEL5.2-snap1 2.6.18-85.el5 kernel and test on a p6 JS22 blade.
------- Comment From rlary.com 2008-03-24 23:50 EDT-------
I was able to apply the patch to the 2.6.18-85.el5 kernel and the 8.02.00-k5-rhel5.2-03 qla2xxx driver (with two minor tweaks). However, testing is blocked because the IBM tool used to inject EEH errors is non-functional on RHEL5.2; see LTC bugzilla 42373.
Hi Richard,

I found one issue: my tree and Red Hat's diverged because they took an old version of a patch. It was a whitespace issue; this part of the patch should be reversed, or I should just change my tree back. It looks like this:

-	if (fcport->login_retry == 0 &&
-	    status != QLA_SUCCESS)
+	if (fcport->login_retry == 0 && status != QLA_SUCCESS)

Can you describe the other change you needed? I want these patches to apply cleanly...
------- Comment From rlary.com 2008-03-25 09:18 EDT-------
Marcus, the other tweak was in qla2xxx_version.h. I was confused by the .rej file contents:

***************
*** 7,13 ****
  /*
   * Driver version
   */
- #define QLA2XXX_VERSION "8.02.00-k5-rhel5.2-04"
  #define QLA_DRIVER_MAJOR_VER 8
  #define QLA_DRIVER_MINOR_VER 1
--- 7,13 ----
  /*
   * Driver version
   */
+ #define QLA2XXX_VERSION "8.02.00-k5-rhel5.2-05"
  #define QLA_DRIVER_MAJOR_VER 8
  #define QLA_DRIVER_MINOR_VER 1

But the issue seemed to be that the driver version in the RHEL5.2-2008013 kernel was 8.02.00-k5-rhel5.2-03.
Great. There is a firmware update as version -04. We may have this issue now, but I am hoping that our new policy of bumping the version number for every patch, especially through the snapshot cycles, will work better for you and everyone else. More numbers, but maybe now things will be clearer... Thanks.
------- Comment From rlary.com 2008-03-25 10:42 EDT-------
Removing blocking LTC bugzilla
------- Comment From rlary.com 2008-03-25 10:46 EDT-------
I am able to inject EEH errors now. I am seeing unexpected behaviour in the way qla2xxx responds to EEH errors and am investigating. We may need QLogic to set up access to one of their JS22 blades in case we need additional test/debug. Additional details forthcoming...
Created attachment 299058 [details] Test/Debug Notes_0325
Test and debug notes for the current driver and patches. It would be good for QLogic to replicate the issues I am seeing on the JS22 that QLogic has.
It appeared to me that the following line in pci_scan_device() performs the proper initialization:

./drivers/pci/probe.c: dev->error_state = pci_channel_io_normal;

Does applying the SLES 10 fix cure your problem? I am happy to apply it if required...
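For readers following the error_state discussion, here is a minimal user-space sketch of why the initialization matters. It assumes a driver gates MMIO on the channel state being "normal". Every name prefixed with mock_ is hypothetical and exists only for illustration, the numeric values are assumed to mirror the kernel's pci_channel_state enum, and none of this is the qla2xxx or PCI core code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mock channel states; values are ASSUMED to match the kernel enum. */
enum mock_channel_state {
    mock_io_normal = 1,
    mock_io_frozen = 2,
    mock_io_perm_failure = 3,
};

/* Hypothetical stand-in for the one field of struct pci_dev we care about. */
struct mock_pci_dev {
    int error_state;
};

/* Mimics the assignment quoted from pci_scan_device(): start out "normal". */
static void mock_scan_device(struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    memset(dev, 0, sizeof(*dev));
    dev->error_state = mock_io_normal;
}

/* Hypothetical driver-side guard: only touch MMIO on a normal channel. */
static bool mock_mmio_allowed(const struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    return dev->error_state == mock_io_normal;
}
```

With this guard, a zero-filled device that was never passed through mock_scan_device() is refused, because 0 is not one of the defined states. That is the kind of situation an explicit initialization is meant to rule out.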
Created attachment 299284 [details] EEH mask out
I attached a patch to mask out the EEH code until it is known to be working. The patch bumps the version number (because it's the same patch I submitted to the Red Hat kernel), which is not correct for testing use if you use this out of sequence...
------- Comment From rlary.com 2008-03-27 11:47 EDT-------
(In reply to comment #20)
> ------- Comment From mbarrow 2008-03-27 02:10 EST-------
> I attached a patch to mask out the EEH code until it is known to be working. The patch bumps the version
> number (because it's the same patch I submitted to the Red Hat kernel), which is not correct for testing use if
> you use this out of sequence...

Thank you, will test and update bugzilla.
We would like to defer this feature to RHEL 5.3 to allow more testing and evaluation. Thank you.
------- Comment From rlary.com 2008-04-08 12:44 EDT-------
(In reply to comment #26)
> ------- Comment From mbarrow 2008-04-08 12:33 EST-------
> We would like to defer this feature to rhel 5.3, to allow more testing and evaluation.

Marcus,

The feature being deferred to RHEL 5.3 is the support for PCI EEH. For RHEL5.2 we do need Red Hat to pick up the EEH mask out patch attached above, which disables the qla2xxx driver EEH code for RHEL5.2 and changes the driver version to 8.02.00-k5-rhel5.2-06.

I was not sure whether that is what you were saying in your comment.
Marcus, I don't think this is a feature since this causes a system crash.
------- Comment From rlary.com 2008-04-08 15:26 EDT-------
(In reply to comment #28)
> ------- Comment From andriusb 2008-04-08 15:12 EST-------
> Marcus, I don't think this is a feature since this causes a system crash.

To clarify, PCI EEH handling in the qla2xxx driver was one of a list of new features for the RHEL5.2 qla2xxx driver. During testing it was discovered that deficiencies in the underlying kernel EEH driver stood in the way of fixing the qla2xxx EEH code in time to make RHEL5.2.

IBM requested that QLogic disable the EEH callback, which was accomplished by the patch named "EEH mask out". This patch effectively renders the driver's EEH code harmless by disabling the driver callback, thus, in a sense, fixing the panic.

Further work on the kernel EEH code and qla2xxx driver code will resume after RHEL5.2, in advance of RHEL5.3.

I have tested the patch above by applying it myself; however, I was waiting to mark the patch tested until it appeared in the next Red Hat snapshot release.
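The idea of rendering recovery code harmless by disabling a callback can be sketched in plain C. The attachment itself is not shown in this thread, so the following is a generic illustration of an optional callback table, not the actual "EEH mask out" patch. All mock_ names are hypothetical, and the structure only loosely imitates how a PCI driver registers its error handlers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for an error-recovery callback. */
enum mock_ers_result { MOCK_ERS_NONE, MOCK_ERS_NEED_RESET };

/* Hypothetical table of optional error callbacks. */
struct mock_err_handlers {
    enum mock_ers_result (*error_detected)(void *dev);
};

/* Hypothetical driver descriptor; err_handler == NULL means "masked out". */
struct mock_driver {
    const char *name;
    const struct mock_err_handlers *err_handler;
};

static int mock_callback_runs; /* counts how often the driver code ran */

static enum mock_ers_result mock_error_detected(void *dev)
{
    (void)dev;
    mock_callback_runs++;
    return MOCK_ERS_NEED_RESET;
}

static const struct mock_err_handlers mock_handlers = {
    .error_detected = mock_error_detected,
};

/* Core-side dispatch: only call into the driver if a callback is registered. */
static enum mock_ers_result mock_report_error(const struct mock_driver *drv,
                                              void *dev)
{
    assert(drv != NULL);
    if (drv->err_handler == NULL || drv->err_handler->error_detected == NULL)
        return MOCK_ERS_NONE;
    return drv->err_handler->error_detected(dev);
}
```

A driver declared with a NULL err_handler never has its recovery code entered, whatever state that code is in. A driver that points err_handler at mock_handlers gets the callback as usual.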
rlary: Looks like we don't need the mask-out patch, since the original patch to enable EEH (feature) was never committed to 5.2 to begin with. So as Tom stated to me on the phone, "if we have both patches, we're safe; but if we have neither of the patches, we're safe." Since neither has been committed to the tree, we should be OK for 5.2, and can defer this entire BZ to 5.3 per Marcus.

I would assume this bug would get duped to a more general qla2xxx bugzilla for 5.3 when it is created by QLogic. In the meantime, we'll leave this open and proposed for 5.3.
------- Comment From rlary.com 2008-04-09 15:26 EDT-------
(In reply to comment #30)
> ------- Comment From andriusb 2008-04-08 15:54 EST-------
> rlary: Looks like we don't need the mask-out patch, since the original patch to
> enable EEH (feature) was never committed to 5.2 to begin with. So as Tom stated
> to me on the phone, "if we have both patches, we're safe; but if we have neither
> of the patches, we're safe." Since neither have been committed to the tree, we
> should be OK for 5.2, and to defer this entire BZ to 5.3 per Marcus.
> I would assume this bug would get duped to a more general qla2xxx bugzilla for
> 5.3 when it is created by QLogic. In the meantime, we'll leave this open and
> proposed for 5.3.

Sorry I did not reply sooner; I thought the above comment was a statement rather than a question.

I have discussed the "neither patch" option with Marcus Barrow. We agree that option will be satisfactory for RHEL5.2, and we will resume work on a more complete EEH solution for RHEL5.3.

This bugzilla can be deferred to RHEL5.3.
The EEH work has been retracted from the 5.2 kernel. Since this work will be re-submitted after further testing, these and other fixes will be included in that work and this bugzilla is no longer needed. So closing this as NOTABUG.
------- Comment From rlary.com 2008-04-11 16:19 EDT------- Marking this bugzilla as closed. Work on this bug has been deferred to RHEL5.3. Work will be tracked in [Bug 44012] - RH 253267: [QLogic 5.3 feat] [1/n] Update qla2xxx - PCI EE error handling support