Red Hat Bugzilla – Bug 385351
Severe issues in 5.1 lpfc driver: Request update to 18.104.22.168
Last modified: 2008-01-07 11:36:24 EST
We are requesting that Red Hat release a kernel errata fix with the following
updates to the lpfc driver (moving it from rev 22.214.171.124 to 126.96.36.199):
1) Correct a Memory Corruption case
- System hang after lun discovery.
- General protection fault while running on debug kernel
- Null pointer dereference while running on debug kernel
- Unable to handle kernel paging request in lpfc_get_scsi_buf
- Slab corruption while running on debug kernel
A number of FC discovery completion routines referenced and changed
the lpfc_nodelist structures even after the structures had been freed.
Corrected erroneous structure references.
2) Fix incorrect queue tag handling
Tape backup software fails because tape drives report "Illegal Request".
The error has been reported by Fujitsu/Siemens, IBM, Dell, and EMC.
The tape drives most often reported to hit this problem recently are
the IBM Ultrium LTO3 and LTO4 models.
The lpfc driver incorrectly interpreted the cmd->tag field from the
midlayer, believing it was the SCSI command priority (simple, ordered,
head of queue). Instead, it is supposed to be a scsi-2 tag value.
If the lpfc driver saw a tag value of 33 or 34, it would set the
command tag value to ORDERED or HEAD_OF_QUEUE; otherwise it would be
set to SIMPLE. Tape drives do not support ORDERED or HEAD_OF_QUEUE,
and so would fail the I/O. A failed I/O is catastrophic for a tape, as
the tape stops streaming and effectively loses positioning, making
recovery very difficult.
This error was exacerbated by a midlayer bug that did not set the
cmd->tag value prior to calling the SCSI and LLD prep_io functions.
In many cases, there was junk in the field, making it susceptible
to the tag value change.
Corrected the driver to use scsi_populate_tag_msg() to obtain the actual tag value.
The initial fix came from Fujitsu.
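The misinterpretation described in item 2 can be illustrated with a small Python model. This is a sketch only, not the lpfc source: the function names are hypothetical, and the constants are the SCSI-2 queue tag message codes (0x20 simple, 0x21 head of queue, 0x22 ordered), which correspond to the decimal values 33 and 34 mentioned above. The fixed variant assumes that scsi_populate_tag_msg() hands back the queue-type message separately from the tag number.

```python
# Illustrative model only; function names are hypothetical, not lpfc code.
SIMPLE_QUEUE_TAG = 0x20   # decimal 32
HEAD_OF_QUEUE_TAG = 0x21  # decimal 33
ORDERED_QUEUE_TAG = 0x22  # decimal 34


def attribute_for(msg_type):
    """Map a queue-type message code to a task attribute name."""
    if msg_type == HEAD_OF_QUEUE_TAG:
        return "HEAD_OF_QUEUE"
    if msg_type == ORDERED_QUEUE_TAG:
        return "ORDERED"
    return "SIMPLE"


def buggy_task_attribute(tag_number):
    """Bug: the per-command tag number is treated as a queue-type code."""
    return attribute_for(tag_number)


def fixed_task_attribute(msg_type, tag_number):
    """Fix: only the queue-type message decides the attribute.

    tag_number is accepted to show that it plays no part in the decision.
    """
    return attribute_for(msg_type)


if __name__ == "__main__":
    for tag in (5, 33, 34):
        print(tag,
              buggy_task_attribute(tag),
              fixed_task_attribute(SIMPLE_QUEUE_TAG, tag))
```

In this model, a command that happens to carry tag number 33 or 34 is sent as HEAD_OF_QUEUE or ORDERED by the buggy path, while the fixed path keeps it SIMPLE as long as the queue-type message is simple.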
3) SCSI errors due to erroneous FCP Recovery indicators
SCSI I/Os were errored/rejected by targets because the FCP CMD indicated
FCP Recovery, even though FCP Recovery was not negotiated.
Reported by IBM.
The driver was not clearing the command data structure, so in a
configuration that includes a tape device, FCP Recovery bits would still be set.
The driver now properly reinitializes all command data structures.
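The stale-state problem in item 3 can be sketched in Python as a command buffer that is reused between I/Os. The buffer size, the control byte offset, and the FCP_RECOVERY_BIT value are all hypothetical, chosen only to show why a buffer that is recycled without being cleared can carry a flag from an earlier command into a later one.

```python
# Illustrative model only; layout and constants are hypothetical.
CMD_SIZE = 32            # assumed size of the command payload
CTL_OFFSET = 10          # assumed offset of the control byte
FCP_RECOVERY_BIT = 0x01  # assumed bit meaning "FCP Recovery requested"


def prep_command(buf, wants_recovery, clear_first):
    """Prepare a reused command buffer and report the recovery flag.

    With clear_first=False the previous contents survive, which is the
    behavior described in the report. With clear_first=True the buffer
    is zeroed in place before any field is set.
    """
    if clear_first:
        buf[:] = bytes(len(buf))
    if wants_recovery:
        buf[CTL_OFFSET] |= FCP_RECOVERY_BIT
    return bool(buf[CTL_OFFSET] & FCP_RECOVERY_BIT)


if __name__ == "__main__":
    buf = bytearray(CMD_SIZE)
    print(prep_command(buf, wants_recovery=True, clear_first=False))
    print(prep_command(buf, wants_recovery=False, clear_first=False))
    print(prep_command(buf, wants_recovery=False, clear_first=True))
```

The second call models a command that did not ask for FCP Recovery but inherits the bit from the previous use of the buffer. The third call models the corrected behavior, where the structure is reinitialized first.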
Created attachment 260281 [details]
patches for 3 bugs described. Updates driver from 188.8.131.52 to 184.108.40.206
James, just want to confirm that these fixes are already proposed in bug 252989
for RHEL 5.2. Correct?
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
------- Comment From email@example.com 2007-11-15 15:39 EDT-------
IBM requests Red Hat consider Emulex request to include 220.127.116.11 driver, which
includes the fix for this issue, as well as a fix for another tape device issue
into a future RHEL5.1 Update Release.
These two tape related issues potentially affect a large base of customers
using IBM tape storage systems and Tivoli Storage Manager for enterprise data
The driver proposed for 5.2 has these bug fixes, and further feature content.
EMC requests that Red Hat consider 18.104.22.168 for either a driver update to RHEL
5.1 or inclusion to the next errata kernel for RHEL 5.1. The case of memory
corruption is of great concern.
By when will Red Hat decide whether this driver update will be included in a
RHEL5.1 errata kernel? I have numerous OEMs asking me this question on a
Ludek, what's the timeframe for the next 5.1.z spin that this may be included?
Laurie, can you tell me which OEMs have tested this patch? I'd like to have them
included on this bugzilla as well for verification of the bug.
Chip/Laurie, QE has asked for testing results for the patch as well before this
can be proposed for async errata.
IBM and Dell have tested the patch. EMC and Fujitsu should also be on the
list for verification.
I'm gathering the test results together now. It's a culmination of testing
from end customers who have hit the issue, OEMs and Emulex testing over the
past month. Should have it for you later today.
Test kernels are available from
HP Emulex CR26988 System hang after lun discovery. Memory corruption reported
by debug kernel. Found during Hazard C16 testing running up to 4 servers to 3
storage arrays. Running I/O and bus resets on 2 servers, running continuous
reboots on another server. Both HP and Emulex have verified this fix in their
Dell Emulex Case #88431 Dell/CommVault seeing write failure under RHEL5 with
LP1150 to IBM LTO3/LTO4 tape libraries. Dell has verified the patch provided
by Emulex in their test environment.
IBM Case #40294 and N33956 I/O errors w/ IBM LTO3 and LTO4 tape libraries.
Issue identified as lpfc bug w/SCSI priority Queue tag handling. Provided
patch to end customer who verified the patch resolves the issue in their
production environment. IBM TSM Tivoli test team verified the patch resolves
the issue in their test beds.
Please let me know if you require any further info.
Created attachment 281251 [details]
patch against z-stream 2.6.18-53.1.4 kernel
This patch combines the patches in the obsoleted tarball.
The test kernel passes my basic sanity checks. If any of the partners have reproducers for the specific
bugs, it would be good to have them test the kernel also, or send us a formula for reproducing the bug. I
don't think we have HP's "hazard" test suite in house.
To reproduce the memory corruption issue, use the following test configuration:
1. Two servers, one should be an HP blade server (ProLiant BL460c).
2. Share 255 EVA LUNs across both servers.
3. Reboot the blade server every minute.
4. Run 5 dt threads on all 255 LUNs.
5. Issue bus resets every minute from the non-rebooting server.
To reproduce the tag queue issue, run a tape backup app to an IBM LTO3 or LTO4
tape device. OEMs reported/verified this with Tivoli and CommVault tape backup
*** This bug has been marked as a duplicate of 252989 ***