We are requesting that Red Hat release a kernel errata fix with the following updates to the lpfc driver (moving it from rev 8.1.10.9 to 8.1.10.12):

1) Correct a memory corruption case

Symptoms:
- System hang after LUN discovery.
- General protection fault while running on debug kernel.
- Null pointer dereference while running on debug kernel.
- Unable to handle kernel paging request in lpfc_get_scsi_buf.
- Slab corruption while running on debug kernel.

Cause: A number of FC discovery completion routines referenced and changed the lpfc_nodelist structures even after the structures had been freed.

Resolution: Corrected erroneous structure references.

2) Fix incorrect queue tag handling

Symptoms: Tape backup software fails due to tape drives reporting "Illegal Request". The error has been reported by Fujitsu/Siemens, IBM, Dell, and EMC. The tape drives most recently reported to hit this problem are the IBM Ultrium LTO3 and LTO4 models.

Cause: The lpfc driver incorrectly interpreted the cmd->tag field from the midlayer, believing it was the SCSI command priority (simple, ordered, head of queue). Instead, it is supposed to be a SCSI-2 tag value. If the lpfc driver saw a tag value of 33 or 34, it would set the command tag value to ORDERED or HEAD_OF_QUEUE; otherwise it would be set to simple. Tape drives do not support ORDERED or HEAD_OF_QUEUE, and thus would fail the I/O. A failed I/O is catastrophic for a tape: the tape stops streaming and effectively loses positioning, making recovery very difficult. This error was exacerbated by a midlayer bug that did not set the cmd->tag value prior to calling the SCSI and LLD prep_io functions. In many cases there was junk in the field, making it susceptible to the tag value change.

Resolution: Correct driver to use scsi_populate_tag_msg() to obtain the actual tag value. The initial fix came from Fujitsu.
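For readers following item 2, here is a minimal userspace sketch of the misinterpretation. It is not the driver code: struct model_cmd, the MODEL_* names, and all three functions are hypothetical stand-ins, and model_populate_tag_msg() only imitates the general shape of scsi_populate_tag_msg() (fill a two-byte message, return its length, or 0 when the command is untagged). The numeric codes are the SCSI-2 queue tag message values, which is why 33 and 34 are the troublesome numbers.

```c
#include <stdint.h>

/* SCSI-2 queue tag message codes. */
enum {
    MODEL_SIMPLE_TAG  = 0x20, /* decimal 32 */
    MODEL_HEAD_TAG    = 0x21, /* decimal 33 */
    MODEL_ORDERED_TAG = 0x22  /* decimal 34 */
};

/* Hypothetical stand-in for the midlayer command. */
struct model_cmd {
    uint8_t tag;     /* per-command tag number; may hold junk */
    int     tagged;  /* nonzero if the device uses tagged queuing */
    int     ordered; /* nonzero if ordered tags were requested */
};

/* Old behavior as described above: the tag number is mistaken for the
 * queue type, so a tag that happens to equal 33 or 34 changes the type. */
int buggy_queue_type(const struct model_cmd *cmd)
{
    switch (cmd->tag) {
    case MODEL_HEAD_TAG:
        return MODEL_HEAD_TAG;
    case MODEL_ORDERED_TAG:
        return MODEL_ORDERED_TAG;
    default:
        return MODEL_SIMPLE_TAG;
    }
}

/* Model of a populate-tag-message helper: msg[0] carries the queue type,
 * msg[1] the tag number. Returns the message length, 0 if untagged. */
int model_populate_tag_msg(const struct model_cmd *cmd, uint8_t msg[2])
{
    if (!cmd->tagged)
        return 0;
    msg[0] = cmd->ordered ? MODEL_ORDERED_TAG : MODEL_SIMPLE_TAG;
    msg[1] = cmd->tag;
    return 2;
}

/* Fixed behavior: take the queue type from the message, never from the
 * tag number. Returns 0 when the command is untagged. */
int fixed_queue_type(const struct model_cmd *cmd)
{
    uint8_t msg[2];

    if (model_populate_tag_msg(cmd, msg) == 0)
        return 0;
    return msg[0];
}
```

With a simple-tagged command whose tag number happens to be 33, buggy_queue_type() reports head of queue, while fixed_queue_type() reports a simple tag regardless of the tag number.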
3) SCSI errors due to erroneous FCP Recovery indicators

Symptoms: SCSI I/Os failed or were rejected by targets because the FCP CMD indicated FCP Recovery, even though FCP Recovery was not negotiated. Reported by IBM.

Cause: The driver was not clearing the command data structure; thus, on a configuration with a tape device, FCP Recovery bits would still be set.

Resolution: Driver properly reinitializes all command data structures.
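The failure mode in item 3 can be illustrated with a small userspace sketch, assuming a command buffer that is reused across I/Os. Everything here is hypothetical (struct model_io_buf, MODEL_FCP_RECOVERY, and both prep functions); it only models the idea of a stale flag surviving reuse, not the actual lpfc data structures.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_FCP_RECOVERY 0x01u /* hypothetical flag bit */

/* Hypothetical reusable command buffer. */
struct model_io_buf {
    uint32_t flags; /* per-command control bits */
    uint32_t lun;   /* destination LUN of the current command */
};

/* Old behavior as described above: the buffer is not cleared, so a flag
 * set for an earlier command leaks into the next command. */
void prep_without_reset(struct model_io_buf *buf, uint32_t lun,
                        int recovery_negotiated)
{
    buf->lun = lun;
    if (recovery_negotiated)
        buf->flags |= MODEL_FCP_RECOVERY;
}

/* Fixed behavior: reinitialize the whole structure before filling it in. */
void prep_with_reset(struct model_io_buf *buf, uint32_t lun,
                     int recovery_negotiated)
{
    memset(buf, 0, sizeof(*buf));
    buf->lun = lun;
    if (recovery_negotiated)
        buf->flags |= MODEL_FCP_RECOVERY;
}
```

If the same buffer is first prepared for a target that negotiated recovery and then for one that did not, prep_without_reset() leaves the recovery bit set on the second command, while prep_with_reset() does not.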
Created attachment 260281 [details] patches for 3 bugs described. Updates driver from 8.1.10.9 to 8.1.10.12
James, just want to confirm that these fixes are already proposed in bug 252989 for RHEL 5.2. Correct?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
------- Comment From rlary.com 2007-11-15 15:39 EDT------- IBM requests Red Hat consider Emulex request to include 8.1.10.12 driver, which includes the fix for this issue, as well as a fix for another tape device issue into a future RHEL5.1 Update Release. These two tape related issues potentially affect a large base of customers using IBM tape storage systems and Tivoli Storage Manager for enterprise data backup.
Andrius, The driver proposed for 5.2 has these bug fixes, and further feature content. -- james
EMC requests that Red Hat consider 8.1.10.12 for either a driver update to RHEL 5.1 or inclusion in the next errata kernel for RHEL 5.1. The case of memory corruption is of great concern.
Andrius, by when will Red Hat decide whether this driver update will be included in a RHEL 5.1 errata kernel? I have numerous OEMs asking me this question on a daily basis. Thank you, Laurie
Ludek, what's the timeframe for the next 5.1.z spin in which this may be included?
Laurie, can you tell me which OEMs have tested this patch? I'd like to have them included on this bugzilla as well for verification of the bug.
Chip/Laurie, QE has asked for testing results for the patch as well before this can be proposed for async errata.
IBM and Dell have tested the patch. EMC and Fujitsu should also be on the list for verification.
I'm gathering the test results together now. It's a culmination of testing from end customers who have hit the issue, OEMs and Emulex testing over the past month. Should have it for you later today.
Test kernels are available from http://people.redhat.com/coldwell/kernel/bugs/385351/ Chip
HP: Emulex CR26988
System hang after LUN discovery. Memory corruption reported by debug kernel. Found during Hazard C16 testing running up to 4 servers to 3 storage arrays, with I/O and bus resets running on 2 servers and continuous reboots on another server. Both HP and Emulex have verified this fix in their test labs.

Dell: Emulex Case #88431
Dell/CommVault seeing write failures under RHEL5 with LP1150 to IBM LTO3/LTO4 tape libraries. Dell has verified the patch provided by Emulex in their test environment.

IBM: Case #40294 and N33956
I/O errors with IBM LTO3 and LTO4 tape libraries. Issue identified as an lpfc bug in SCSI priority queue tag handling. The patch was provided to the end customer, who verified that it resolves the issue in their production environment. The IBM TSM Tivoli test team verified that the patch resolves the issue in their test beds.

Please let me know if you require any further info.
Laurie
Created attachment 281251 [details] patch against z-stream 2.6.18-53.1.4 kernel This patch combines the patches in the obsoleted tarball.
The test kernel passes my basic sanity checks. If any of the partners have reproducers for the specific bugs, it would be good to have them test the kernel also, or send us a formula for reproducing the bug. I don't think we have HP's "hazard" test suite in house. Chip
To reproduce the memory corruption issue, use the following test configuration:
1. Two servers; one should be an HP blade server (ProLiant BL460C).
2. Share 255 EVA LUNs across both servers.
3. Reboot the blade server every minute.
4. Run 5 dt threads on all 255 LUNs.
5. Issue bus resets every minute from the non-rebooting server.

To reproduce the tag queue issue, run a tape backup app to an IBM LTO3 or LTO4 tape device. OEMs reported/verified this with Tivoli and CommVault tape backup apps.
*** This bug has been marked as a duplicate of 252989 ***