Bug 525350

Summary: stex scsi driver hangs/loops after scsi reset
Product: Red Hat Enterprise Linux 5 Reporter: Mark Goodwin <mgoodwin>
Component: kernel Assignee: David Milburn <dmilburn>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.4 CC: dmilburn, nmurray, qcai, tatsu-ab1
Target Milestone: rc Keywords: Regression
Target Release: 5.5   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-02-15 18:27:03 UTC Type: ---
Bug Depends On:    
Bug Blocks: 499522    
Attachments:
Description              Flags
customer supplied patch  none

Description Mark Goodwin 2009-09-24 02:09:42 UTC
Description of problem:
Regression introduced in RHEL5.4 GA in the stex SCSI HBA driver.
Issuing a SCSI reset to a Promise SuperTrak EX4650 results in
stex_mu_intr() looping forever, with 'lagging req' messages
in syslog. The customer has developed a patch and has verified
that it fixes the issue. In the associated IT, I have asked for
more explanation of how the patch fixes the bug, and I will post
their reply here once I get it.

Version-Release number of selected component (if applicable):
RHEL5.4 GA
 kernel-2.6.18-164.el5
 kernel-PAE-2.6.18-164.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.4 GA, and reboot.

2. Identify the sg device of the reset target with the sg_map command.

  # sg_map

3. Execute the sg_reset command.

  # sg_reset -d /dev/sgX

  Note: X is the sg device number of the reset target.

  sg_reset: starting device reset
  scsi 0:0:1:0:
  sg_reset: completed device reset
  stex(0000:08:00.0): lagging req
  stex(0000:08:00.0): lagging req

  Note: The "stex(0000:08:00.0): lagging req" message is output repeatedly and does not stop.

4. Execute a command that accesses the disk.

  # df

  The following message is printed out after a while.

  "rejecting I/O to offline device"

Actual results:
The "stex(0000:08:00.0): lagging req" message is output repeatedly and I/O is not executed.

Expected results:
The "stex(0000:08:00.0): lagging req" message is not output and I/O is executed.

Business impact:
The customer can no longer access the disk once a reset occurs.
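
The symptoms described in the steps above can be checked from the kernel log rather than by eye. The following is a minimal sketch, not part of sg3_utils or the stex driver: the helper names count_lagging_req and has_offline_reject are hypothetical, and the patterns assume the message text is exactly as quoted in this report.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.
# They read kernel log text on stdin.

# Count lines that look like the driver's "lagging req" message,
# e.g. "stex(0000:08:00.0): lagging req".
count_lagging_req() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, so "|| true" keeps the function usable
    # under "set -e".
    grep -c 'stex(.*): lagging req' || true
}

# Succeed if the log shows I/O being rejected because the device
# was taken offline.
has_offline_reject() {
    grep -q 'rejecting I/O to offline device'
}
```

For example, running "dmesg | count_lagging_req" twice a few seconds apart after step 3 should show a growing count on an affected system, and "dmesg | has_offline_reject" should succeed some time after step 4.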

Hardware info:
Machine : Express5800/R110a-1
CPU     : Intel(R) Xeon(R) CPU L3110 3.00GHz x 1
Memory  : 4GB
HBA     : Promise SuperTrak EX4650

Comment 1 Mark Goodwin 2009-09-24 02:16:19 UTC
Created attachment 362376
customer supplied patch

Customer-supplied patch. They have verified that it fixes the bug, but there
is no accompanying explanation. I have asked the following:
----
could we please have an explanation of how the
patch fixes the problem, and any possible side-effects.

e.g. the patch basically deletes some of the stex reset code,
does this now rely on other code to initialize some of the
hba fields such as hba->req_head, hba->req_tail, etc?

Was the hba->host->host_lock spinlock causing a deadlock in
the irq context, and thus causing stex_mu_intr() to loop
forever?
----

Comment 6 Kosuke TATSUKAWA 2010-02-10 06:18:10 UTC
In IT#337387
> I've asked engineering whether NEC can reply directly to comment #5 on this BZ to have a chance at making 5.5.
>
>> In order to have a chance at making 5.5, we need commitment from NEC to help test this. Please have them reply directly in the BZ if they are willing to do so.
>
> I would be very glad if you could help us verify the fix on your hardware.
>
> ---- comment #5 on BZ#535350 ----
> Looks like there is no such hardware that in-house QE can access. Can you please
> check with the customers if they can help test this once the Beta is out?
>
> BTW, since this is a regression, it should be proposed as a blocker.
> ----

Yes, NEC would be glad to help verify that this problem is fixed in the RHEL5.5 stex driver.
Please wait a little longer while we test the stex driver in RHEL5.5 beta.

Best regards.

Comment 7 Kosuke TATSUKAWA 2010-02-15 07:39:13 UTC
NEC confirmed that this problem is fixed in RHEL5.5 beta release.

Comment 8 Issue Tracker 2010-02-15 08:31:24 UTC
Event posted on 02-15-2010 05:31pm JST by mfuruta

Tatsukawa-san,

Thank you so much for the verification, I really appreciate it!

> NEC confirmed that the problem is fixed in RHEL5.5 beta.
> I've added Verified:NEC flag to BZ#525350.

Now I'm setting the status to Waiting on Engineering.

Best Regards,
Masaki Furuta

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by mfuruta 
 issue 337387

Comment 9 David Milburn 2010-02-15 18:27:03 UTC
The stex driver was updated to 4.6.0102.4 in -178.el5, which included changes
to the reset code. Thanks for verifying that this issue no longer exists.

*** This bug has been marked as a duplicate of bug 516881 ***
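
Given the build number mentioned in comment 9, a quick check of whether a running RHEL 5 kernel is at or past -178.el5 might look like the sketch below. The function names kernel_build and has_updated_stex are hypothetical, and the check assumes release strings of the form 2.6.18-N.el5 (as reported by "uname -r"); it only compares build numbers and does not inspect the stex driver itself.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, assuming RHEL 5 style release
# strings such as "2.6.18-164.el5" or "2.6.18-164.el5PAE".

# Print the build number that follows "2.6.18-", or nothing if the
# string does not have that form.
kernel_build() {
    printf '%s\n' "$1" | sed -n 's/^2\.6\.18-\([0-9][0-9]*\).*$/\1/p'
}

# Succeed if the build number is 178 or later, the build in which
# comment 9 says the stex driver was updated.
has_updated_stex() {
    build=$(kernel_build "$1")
    [ -n "$build" ] && [ "$build" -ge 178 ]
}
```

For example, has_updated_stex "$(uname -r)" fails on the affected kernel-2.6.18-164.el5 and succeeds on 2.6.18-178.el5.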