Bug 385351 - Severe issues in 5.1 lpfc driver: Request update to 8.1.10.12
Status: CLOSED DUPLICATE of bug 252989
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Platform: All Linux
Severity: urgent
Assigned To: Chip Coldwell
QA Contact: Martin Jenner
Keywords: GSSApproved, ZStream
Depends On: 252989
Blocks: 217217
Reported: 2007-11-15 14:31 EST by James Smart
Modified: 2008-01-07 11:36 EST
Doc Type: Bug Fix
Last Closed: 2007-12-14 11:36:39 EST
Attachments
patches for 3 bugs described. Updates driver from 8.1.10.9 to 8.1.10.12 (4.71 KB, application/octet-stream)
2007-11-15 14:31 EST, James Smart
patch against z-stream 2.6.18-53.1.4 kernel (9.81 KB, patch)
2007-12-07 09:48 EST, Chip Coldwell
Description James Smart 2007-11-15 14:31:33 EST
We are requesting that Red Hat release a kernel errata fix with the following
updates to the lpfc driver (moving it from rev 8.1.10.9 to 8.1.10.12):

1) Correct a Memory Corruption case
 Symptoms:
  - System hang after lun discovery.
  - General protection fault while running on debug kernel
  - Null pointer dereference while running on debug kernel
  - Unable to handle kernel paging request in lpfc_get_scsi_buf
  - Slab corruption while running on debug kernel

 Cause:
   A number of FC discovery completion routines referenced and changed
   the lpfc_nodelist structures even after the structures had been freed.
 
 Resolution:
   Corrected erroneous structure references.


2) Fix incorrect queue tag handling
 Symptoms:
   Tape backup software fails due to tape drives reporting "Illegal Request"

   Error has been reported by Fujitsu/Siemens, IBM, Dell, and EMC.

   The tape drives most frequently reported to hit this problem are the
   IBM Ultrium LTO3 and LTO4 models.

 Cause:
   The lpfc driver incorrectly interpreted the cmd->tag field from the
   midlayer as the SCSI command priority (simple, ordered, head of
   queue), when it actually holds a SCSI-2 tag value. If the lpfc driver
   saw a tag value of 33 or 34, it would set the command tag attribute
   to ORDERED or HEAD_OF_QUEUE; otherwise it would be set to SIMPLE.
   Tape drives do not support ORDERED or HEAD_OF_QUEUE and would
   therefore fail the I/O. A failed I/O is catastrophic for a tape: the
   tape stops streaming and effectively loses its position, making
   recovery very difficult.

   This error was exacerbated by a midlayer bug that did not set the
   cmd->tag value prior to calling the SCSI and LLD prep_io functions.
   In many cases, there was junk in the field, making it susceptible
   to the tag value change.

 Resolution:
   Corrected the driver to use scsi_populate_tag_msg() to obtain the
   actual tag message.

   The initial fix came from Fujitsu.

3) SCSI errors due to erroneous FCP Recovery indicators
 Symptoms:
   SCSI I/Os errored/rejected by targets because the FCP CMD indicated
   FCP Recovery, even though FCP Recovery was not negotiated.

   Reported by IBM.

 Cause:
   The driver was not clearing the command data structure; on
   configurations with a tape device, stale FCP Recovery bits could
   remain set.

 Resolution:
   The driver now properly reinitializes all command data structures.
Comment 1 James Smart 2007-11-15 14:31:33 EST
Created attachment 260281 [details]
patches for 3 bugs described. Updates driver from 8.1.10.9 to 8.1.10.12
Comment 3 Andrius Benokraitis 2007-11-15 14:55:56 EST
James, just want to confirm that these fixes are already proposed in bug 252989
for RHEL 5.2. Correct?
Comment 5 RHEL Product and Program Management 2007-11-15 15:15:07 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 7 IBM Bug Proxy 2007-11-15 15:40:34 EST
------- Comment From rlary@us.ibm.com 2007-11-15 15:39 EDT-------
IBM requests Red Hat consider Emulex request to include 8.1.10.12 driver, which
includes the fix for this issue, as well as a fix for another tape device issue
into a future RHEL5.1 Update Release.

These two tape related issues potentially affect a large base of customers
using IBM tape storage systems and Tivoli Storage Manager for enterprise data
backup.
Comment 9 James Smart 2007-11-15 16:53:50 EST
Andrius,

The driver proposed for 5.2 has these bug fixes, and further feature content.

-- james
Comment 17 Wayne Berthiaume 2007-11-19 10:38:26 EST
EMC requests that Red Hat consider 8.1.10.12 for either a driver update to RHEL 
5.1 or inclusion to the next errata kernel for RHEL 5.1. The case of memory 
corruption is of great concern.
Comment 18 laurie barry 2007-12-05 12:21:31 EST
Andrius,

When will Red Hat decide whether this driver update will be included in a 
RHEL5.1 errata kernel?  I have numerous OEMs asking me this question on a 
daily basis.

thank you
Laurie
Comment 20 Andrius Benokraitis 2007-12-06 10:24:10 EST
Ludek, what's the timeframe for the next 5.1.z spin that this may be included?
Comment 21 Andrius Benokraitis 2007-12-06 10:30:05 EST
Laurie, can you tell me which OEMs have tested this patch? I'd like to have them
included on this bugzilla as well for verification of the bug.
Comment 22 Andrius Benokraitis 2007-12-06 10:33:37 EST
Chip/Laurie, QE has asked for testing results for the patch as well before this
can be proposed for async errata. 
Comment 23 laurie barry 2007-12-06 10:39:57 EST
IBM and Dell have tested the patch.  EMC and Fujitsu should also be on the 
list for verification.
Comment 25 laurie barry 2007-12-06 10:47:53 EST
I'm gathering the test results together now.  It's a culmination of testing 
from end customers who have hit the issue, OEMs and Emulex testing over the 
past month.  Should have it for you later today.
Comment 28 Chip Coldwell 2007-12-07 09:42:00 EST
Test kernels are available from

http://people.redhat.com/coldwell/kernel/bugs/385351/

Chip

Comment 29 laurie barry 2007-12-07 09:44:26 EST
HP Emulex CR26988 System hang after lun discovery. Memory corruption reported 
by debug kernel.  Found during Hazard C16 testing running up to 4 servers to 3 
storage arrays.  Running I/O and bus resets on 2 servers, running continuous 
reboots on another server.  Both HP and Emulex have verified this fix in their 
test labs.

Dell Emulex Case #88431  Dell/CommVault seeing write failures under RHEL5 with 
LP1150 to IBM LTO3/LTO4 tape libraries.  Dell has verified the patch provided 
by Emulex in their test environment.

IBM Case #40294 and N33956 I/O errors w/ IBM LTO3 and LTO4 tape libraries. 
Issue identified as an lpfc bug with SCSI priority queue tag handling.  Provided 
the patch to the end customer, who verified that it resolves the issue in their 
production environment.  The IBM Tivoli Storage Manager (TSM) test team verified 
the patch resolves the issue in their test beds. 

Please let me know if you require any further info.

Laurie 
Comment 30 Chip Coldwell 2007-12-07 09:48:57 EST
Created attachment 281251 [details]
patch against z-stream 2.6.18-53.1.4 kernel

This patch combines the patches in the obsoleted tarball.
Comment 32 Chip Coldwell 2007-12-07 10:15:46 EST
The test kernel passes my basic sanity checks.  If any of the partners have reproducers for the specific 
bugs, it would be good to have them test the kernel also, or send us a formula for reproducing the bug.  I 
don't think we have HP's "hazard" test suite in house.

Chip

Comment 33 laurie barry 2007-12-07 10:54:15 EST
To reproduce the memory corruption issue, use the following test configuration:

1. Two servers, one of which should be an HP blade server (ProLiant BL460c).
2. Share 255 EVA LUNs across both servers.
3. Reboot the blade server every minute.
4. Run 5 dt threads on all 255 LUNs.
5. Issue bus resets every minute from the non-rebooting server.

To reproduce the queue tag issue, run a tape backup app to an IBM LTO3 or LTO4 
tape device.  OEMs reported/verified this with Tivoli and CommVault tape backup 
apps.
Comment 35 Chip Coldwell 2007-12-14 11:36:39 EST

*** This bug has been marked as a duplicate of 252989 ***
