Bug 525350

Summary:

stex scsi driver hangs/loops after scsi reset

Product:

Red Hat Enterprise Linux 5

Reporter:

Mark Goodwin <mgoodwin>

Component:

kernel

Assignee:

David Milburn <dmilburn>

Status:

CLOSED DUPLICATE

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.4

CC:

dmilburn, nmurray, qcai, tatsu-ab1

Target Milestone:

Keywords:

Regression

Target Release:

5.5

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2010-02-15 18:27:03 UTC

Bug Depends On:

Bug Blocks:

499522

Attachments:

Description	Flags
customer supplied patch	none

Description Mark Goodwin 2009-09-24 02:09:42 UTC

Description of problem:
Regression introduced in RHEL5.4 GA in the stex SCSI HBA driver.
Issuing a SCSI reset to a Promise SuperTrak EX4650 results in
stex_mu_intr() looping forever, with 'lagging req' messages
in syslog. The customer has developed a patch and has verified
that it fixes the issue. I have asked, in the associated Issue
Tracker (IT) ticket, for more explanation of how the patch fixes
the bug and will post their reply here once I get it.

Version-Release number of selected component (if applicable):
RHEL5.4 GA
 kernel-2.6.18-164.el5
 kernel-PAE-2.6.18-164.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.4 GA, and reboot.

2. Identify the sg device of the reset target with the sg_map command.

  # sg_map

3. Execute the sg_reset command.

  # sg_reset -d /dev/sgX

  Note: X is the sg device number of the reset target.

  sg_reset: starting device reset
  scsi 0:0:1:0:
  sg_reset: completed device reset
  stex(0000:08:00.0): lagging req
  stex(0000:08:00.0): lagging req

  Note: The "stex(0000:08:00.0): lagging req" message is printed repeatedly.

4. Execute a command that accesses the disk.

  # df

  The following message is printed after a while.

  "rejecting I/O to offline device"

Actual results:
The "stex(0000:08:00.0): lagging req" message is printed repeatedly and I/O is not executed.

Expected results:
The "stex(0000:08:00.0): lagging req" message is not printed and I/O is executed.

Business impact:
The customer can no longer access the disk once a reset occurs.

Hardware info:
Machine : Express5800/R110a-1
CPU     : Intel(R) Xeon(R) CPU L3110 3.00GHz x 1
Memory  : 4GB
HBA     : Promise SuperTrak EX4650
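
The check in step 3 of the reproduction steps above can be scripted. The
following is only a minimal sketch: the function name count_lagging_req is
hypothetical, and the pattern assumes the message format shown in the
description. It counts stex 'lagging req' lines in kernel log text read
from standard input.

```shell
#!/bin/sh
# Count "lagging req" messages from the stex driver in kernel log text
# read on standard input. Prints 0 when there are none.
# (Hypothetical helper; not part of the driver or of sg3_utils.)
count_lagging_req() {
    # In a basic regular expression the parentheses are literal.
    # grep -c exits non-zero on zero matches, hence "|| true".
    grep -c 'stex(.*): lagging req' || true
}

# Possible use on an affected host (not run here; X is the sg device
# number of the reset target):
#   sg_reset -d /dev/sgX
#   sleep 5
#   dmesg | count_lagging_req
```

A count that keeps growing between two runs after the reset would match
the behaviour reported here; a count of 0 would match the expected result.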

Comment 1 Mark Goodwin 2009-09-24 02:16:19 UTC

Created attachment 362376
customer supplied patch

Customer-supplied patch: they have verified that it fixes the bug, but there
is no accompanying explanation. I have asked the following:
----
Could we please have an explanation of how the
patch fixes the problem, and of any possible side effects?

E.g. the patch basically deletes some of the stex reset code;
does this now rely on other code to initialize some of the
hba fields such as hba->req_head, hba->req_tail, etc.?

Was the hba->host->host_lock spinlock causing a deadlock in
IRQ context, and thus causing stex_mu_intr() to loop
forever?
----

Comment 6 Kosuke TATSUKAWA 2010-02-10 06:18:10 UTC

In IT#337387
> I've asked engineering whether NEC can reply directly to comment #5 on this BZ to have a chance at making 5.5.
>
>> In order to have a chance at making 5.5, we need commitment from NEC to help test this. Please have them reply directly in the BZ if they are willing to do so.
>
> I would be very glad if you could help us verify the fix on your hardware.
>
> ---- comment #5 on BZ#525350 ----
> It looks like there is no such hardware in house that QE can access. Can you please
> check with the customers if they can help test this once the Beta is out?
>
> BTW, since this is a regression, it should be proposed as a blocker.
> ----

Yes, NEC would be glad to help verify that this problem is fixed in the RHEL5.5 stex driver.
Please wait a little longer while we test the stex driver in the RHEL5.5 beta.

Best regards.

Comment 7 Kosuke TATSUKAWA 2010-02-15 07:39:13 UTC

NEC confirmed that this problem is fixed in the RHEL5.5 beta release.

Comment 8 Issue Tracker 2010-02-15 08:31:24 UTC

Event posted on 02-15-2010 05:31pm JST by mfuruta

Tatsukawa-san,

Thank you so much for the verification, I really appreciate it!

> NEC confirmed that the problem is fixed in RHEL5.5 beta.
> I've added Verified:NEC flag to BZ#525350.

Now I'm setting the status to Waiting on Engineering.

Best Regards,
Masaki Furuta

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by mfuruta 
 issue 337387

Comment 9 David Milburn 2010-02-15 18:27:03 UTC

The stex driver was updated to 4.6.0102.4 in -178.el5, which included changes
to the reset code. Thanks for verifying that this issue no longer exists.

*** This bug has been marked as a duplicate of bug 516881 ***
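
Since the comment above ties the fix to kernel build -178.el5, a host can
be checked roughly as follows. This is a sketch under assumptions: the
function names kernel_build and has_updated_stex are hypothetical, and
comparing only the build number ignores any backport of the reset changes
into an earlier kernel stream.

```shell
#!/bin/sh
# Extract the build number from a RHEL 5 kernel release string,
# e.g. "2.6.18-164.el5" gives 164. Prints nothing if the string
# does not have the expected "<version>-<build>..." shape.
kernel_build() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\).*$/\1/p'
}

# Succeed if the release is at or after build 178, the build in which
# the stex driver was updated according to the comment above.
has_updated_stex() {
    build=$(kernel_build "$1")
    [ -n "$build" ] && [ "$build" -ge 178 ]
}

# Possible use on a live system (not run here):
#   has_updated_stex "$(uname -r)" && echo "updated stex reset code expected"
#   modinfo -F version stex    # driver version reported by the module
```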