Bug 385351

Summary: Severe issues in 5.1 lpfc driver: Request update to 8.1.10.12
Product: Red Hat Enterprise Linux 5
Reporter: James Smart <james.smart>
Component: kernel
Assignee: Chip Coldwell <coldwell>
Status: CLOSED DUPLICATE
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Priority: urgent
Version: 5.1
CC: andriusb, berthiaume_wayne, bugproxy, coughlan, dmair, emcnabb, jamie.wellnitz, laurie.barry, lsmid, marcobillpeter, rkenna, rlary
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: GSSApproved
Doc Type: Bug Fix
Last Closed: 2007-12-14 16:36:39 UTC
Bug Depends On: 252989    
Bug Blocks: 217217    
Attachments:
  Description                                                              Flags
  patches for 3 bugs described. Updates driver from 8.1.10.9 to 8.1.10.12  none
  patch against z-stream 2.6.18-53.1.4 kernel                              none

Description James Smart 2007-11-15 19:31:33 UTC
We are requesting that Red Hat release a kernel errata fix with the following
updates to the lpfc driver (moving it from rev 8.1.10.9 to 8.1.10.12):

1) Correct a Memory Corruption case
 Symptoms:
  - System hang after lun discovery.
  - General protection fault while running on debug kernel
  - Null pointer dereference while running on debug kernel
  - Unable to handle kernel paging request in lpfc_get_scsi_buf
  - Slab corruption while running on debug kernel

 Cause:
   A number of FC discovery completion routines referenced and changed
   the lpfc_nodelist structures even after the structures had been freed.
 
 Resolution:
   Corrected erroneous structure references.
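
 Illustrative sketch (not taken from the attached patch):
   The use-after-free pattern described above is commonly guarded against
   by reference counting, so that a completion routine holds its own
   reference to the structure it touches. The user-space C sketch below
   shows the idea under that assumption. The names struct node,
   node_alloc, node_get, node_put, discovery_complete and nodes_freed are
   all hypothetical; they are not the real lpfc_nodelist code, and the
   actual 8.1.10.12 change may take a different approach.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a discovery node; NOT the real lpfc_nodelist. */
struct node {
    int refcount;   /* number of outstanding users of this node */
    int state;      /* example field a completion routine might update */
};

/* Counts nodes released by node_put(); exists only to make the demo observable. */
static int nodes_freed;

/* Allocate a node; the caller owns the initial reference. */
static struct node *node_alloc(void)
{
    struct node *n = calloc(1, sizeof(*n));
    if (n)
        n->refcount = 1;
    return n;
}

/* Take an extra reference, e.g. before issuing a request that completes later.
 * Single-threaded sketch: real kernel code would use an atomic counter. */
static struct node *node_get(struct node *n)
{
    assert(n != NULL && n->refcount > 0);
    n->refcount++;
    return n;
}

/* Drop a reference; the node is freed only when the last user lets go. */
static void node_put(struct node *n)
{
    assert(n != NULL && n->refcount > 0);
    if (--n->refcount == 0) {
        nodes_freed++;
        free(n);
    }
}

/* Completion routine: it may touch the node because the issuer took a
 * reference with node_get() before the request went out. */
static void discovery_complete(struct node *n, int new_state)
{
    n->state = new_state;
    node_put(n);    /* release the reference held for the in-flight request */
}
```

   With this scheme, if the owner drops its reference while a request is
   still in flight, the memory stays valid until the completion routine
   has run and released the last reference.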


2) Fix incorrect queue tag handling
 Symptoms:
   Tape backup software fails due to tape drives reporting "Illegal Request"

   Error has been reported by Fujitsu/Siemens, IBM, Dell, and EMC.

   The tape drives most recently reported to encounter this problem
   are the IBM Ultrium LTO3 and LTO4 models.

 Cause :
   The lpfc driver incorrectly interpreted the cmd->tag field from the
   midlayer, believing it was the SCSI command priority (simple, ordered,
   head of queue).  Instead, it is supposed to be a scsi-2 tag value.
   If the lpfc driver saw a tag value of 33 or 34, it would set the
   command tag value to ORDERED or HEAD_OF_QUEUE, otherwise it would be
   set to simple. Tape drives do not support ORDERED or HEAD_OF_QUEUE,
   thus would fail the i/o. I/O failing is catastrophic for a tape, as
   the tape stops streaming and effectively loses positioning making
   recovery very difficult.

   This error was exacerbated by a midlayer bug that did not set the
   cmd->tag value prior to calling the SCSI and LLD prep_io functions.
   In many cases, there was junk in the field, making it susceptible
   to the tag value change.

 Resolution :
   Correct driver to use scsi_populate_tag_msg() to obtain actual tag value.

   The initial fix came from Fujitsu.

3) SCSI errors due to erroneous FCP Recovery indicators
 Symptoms:
   SCSI I/Os errored/rejected by targets due to the FCP CMD indicating
   FCP Recovery, even though FCP Recovery was not negotiated.

   Reported by IBM

 Cause:
   Driver was not clearing the command data structure; thus, in a config
   with a TAPE, FCP Recovery bits would still be set.

 Resolution:
   Driver properly reinitializes all command data structures.
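
 Illustrative sketch (not taken from the attached patch):
   The stale-bit problem above is typical of reused command buffers. A
   minimal user-space C sketch of the remedy, assuming a hypothetical
   struct fcp_cmd_buf and prep_cmd() (neither is the real lpfc layout or
   function), is to clear the whole buffer before filling it in for the
   next command:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified FCP command buffer; field names are illustrative. */
struct fcp_cmd_buf {
    unsigned char lun[8];
    unsigned char recovery_flags;  /* nonzero only if FCP Recovery was negotiated */
    unsigned char cdb[16];
};

/* Prepare a reused buffer for a new command. The memset guarantees that no
 * bits from a previous command (e.g. one sent to a tape) survive. */
static void prep_cmd(struct fcp_cmd_buf *buf, const unsigned char *cdb,
                     size_t cdb_len, int recovery_negotiated)
{
    assert(buf != NULL && cdb != NULL && cdb_len <= sizeof(buf->cdb));
    memset(buf, 0, sizeof(*buf));       /* wipe state left by the previous use */
    memcpy(buf->cdb, cdb, cdb_len);
    if (recovery_negotiated)
        buf->recovery_flags = 1;        /* set only for targets that negotiated it */
}
```

   Without the memset, a buffer last used for a target with FCP Recovery
   would carry the flag into a command for a target that never
   negotiated it.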

Comment 1 James Smart 2007-11-15 19:31:33 UTC
Created attachment 260281
patches for 3 bugs described. Updates driver from 8.1.10.9 to 8.1.10.12

Comment 3 Andrius Benokraitis 2007-11-15 19:55:56 UTC
James, just want to confirm that these fixes are already proposed in bug 252989
for RHEL 5.2. Correct?

Comment 5 RHEL Program Management 2007-11-15 20:15:07 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 IBM Bug Proxy 2007-11-15 20:40:34 UTC
------- Comment From rlary.com 2007-11-15 15:39 EDT-------
IBM requests Red Hat consider Emulex request to include 8.1.10.12 driver, which
includes the fix for this issue, as well as a fix for another tape device issue
into a future RHEL5.1 Update Release.

These two tape related issues potentially affect a large base of customers
using IBM tape storage systems and Tivoli Storage Manager for enterprise data
backup.

Comment 9 James Smart 2007-11-15 21:53:50 UTC
Andrius,

The driver proposed for 5.2 has these bug fixes, and further feature content.

-- james

Comment 17 Wayne Berthiaume 2007-11-19 15:38:26 UTC
EMC requests that Red Hat consider 8.1.10.12 for either a driver update to RHEL 
5.1 or inclusion to the next errata kernel for RHEL 5.1. The case of memory 
corruption is of great concern.

Comment 18 laurie barry 2007-12-05 17:21:31 UTC
Andrius,

By when will Red Hat decide whether this driver update will be included in a 
RHEL5.1 errata kernel?  I have numerous OEMs asking me this question on a 
daily basis.

thank you
Laurie

Comment 20 Andrius Benokraitis 2007-12-06 15:24:10 UTC
Ludek, what's the timeframe for the next 5.1.z spin that this may be included?

Comment 21 Andrius Benokraitis 2007-12-06 15:30:05 UTC
Laurie, can you tell me which OEMs have tested this patch? I'd like to have them
included on this bugzilla as well for verification of the bug.

Comment 22 Andrius Benokraitis 2007-12-06 15:33:37 UTC
Chip/Laurie, QE has asked for testing results for the patch as well before this
can be proposed for async errata. 

Comment 23 laurie barry 2007-12-06 15:39:57 UTC
IBM and Dell have tested the patch.  EMC and Fujitsu should also be on the 
list for verification.

Comment 25 laurie barry 2007-12-06 15:47:53 UTC
I'm gathering the test results together now.  It's a culmination of testing 
from end customers who have hit the issue, OEMs and Emulex testing over the 
past month.  Should have it for you later today.

Comment 28 Chip Coldwell 2007-12-07 14:42:00 UTC
Test kernels are available from

http://people.redhat.com/coldwell/kernel/bugs/385351/

Chip



Comment 29 laurie barry 2007-12-07 14:44:26 UTC
HP Emulex CR26988 System hang after lun discovery. Memory corruption reported 
by debug kernel.  Found during Hazard C16 testing running up to 4 servers to 3 
storage arrays.  Running I/O and bus resets on 2 servers, running continuous 
reboots on another server.  Both HP and Emulex have verified this fix in their 
test labs.

Dell Emulex Case #88431  Dell/CommVault seeing write failure under RHEL5 with 
LP1150 to IBM LTO3/LTO4 tape libraries.  Dell has verified the patch provided 
by Emulex in their test environment.

IBM Case #40294 and N33956 I/O errors w/ IBM LTO3 and LTO4 tape libraries. 
Issue identified as lpfc bug w/SCSI priority Queue tag handling.  Provided 
patch to end customer who verified the patch resolves the issue in their 
production environment.  IBM TSM Tivoli test team verified the patch resolves 
the issue in their test beds. 

Please let me know if you require any further info.

Laurie 

Comment 30 Chip Coldwell 2007-12-07 14:48:57 UTC
Created attachment 281251
patch against z-stream 2.6.18-53.1.4 kernel

This patch combines the patches in the obsoleted tarball.

Comment 32 Chip Coldwell 2007-12-07 15:15:46 UTC
The test kernel passes my basic sanity checks.  If any of the partners have reproducers for the specific 
bugs, it would be good to have them test the kernel also, or send us a formula for reproducing the bug.  I 
don't think we have HP's "hazard" test suite in house.

Chip



Comment 33 laurie barry 2007-12-07 15:54:15 UTC
to reproduce the memory corruption issue use the following test configuration

1. Two servers, one should be an HP blade server (Proliant BL460C)
2. Share 255 EVA luns across both servers.
3. Reboot the blade server every minute.
4. Run 5 dt threads on all 255 luns
5. Issue bus resets every minute from the non re-booting server.

to reproduce the tag queue issue run a tape backup app to an IBM LTO3 or LTO4 
tape device.  OEMs reported/verified this with Tivoli and CommVault tape backup 
apps.


Comment 35 Chip Coldwell 2007-12-14 16:36:39 UTC

*** This bug has been marked as a duplicate of 252989 ***