Description of problem:
Regression introduced in RHEL5.4GA in the stex SCSI HBA driver.
Issuing a SCSI reset to a Promise SuperTrak EX4650 results in
stex_mu_intr() looping forever, with 'lagging req' messages
in syslog. The customer has developed a patch and has verified
it fixes the issue - I have asked for more explanation of how
the patch fixes the bug in the associated IT and will post their
reply here once I get it.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RHEL5.4 GA, and reboot.
2. Identify the sg device of the reset target with the sg_map command.
3. Execute the sg_reset command.
# sg_reset -d /dev/sgX
Note: X is the sg device number of the reset target.
sg_reset: starting device reset
sg_reset: completed device reset
stex(0000:08:00.0): lagging req
stex(0000:08:00.0): lagging req
Note: the "stex(0000:08:00.0): lagging req" message is logged repeatedly.
4. Run a command that accesses the disk.
The following message is printed after a while:
"rejecting I/O to offline device"
Actual results:
The "stex(0000:08:00.0): lagging req" message is logged repeatedly and I/O is not executed.
Expected results:
The "stex(0000:08:00.0): lagging req" message is not logged and I/O is executed.
The customer can no longer access the disk once a reset occurs.
Machine : Express5800/R110a-1
CPU : Intel(R) Xeon(R) CPU L3110 3.00GHz x 1
Memory : 4GB
HBA : Promise SuperTrak EX4650
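
The check in steps 3 and 4 can be scripted. The following is a minimal sketch and not part of the original report: it only scans log text for the two messages quoted in the steps. The function names count_lagging_req and check_stex_log are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not from the report) for scanning kernel log text.

# Print the number of stex "lagging req" lines found on stdin.
count_lagging_req() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'stex(.*): lagging req' || true
}

# Read log text on stdin; return 1 if either failure message is present.
check_stex_log() {
    log=$(cat)
    lagging=$(printf '%s\n' "$log" | count_lagging_req)
    if [ "$lagging" -gt 0 ]; then
        echo "FAIL: $lagging 'lagging req' message(s) logged"
        return 1
    fi
    if printf '%s\n' "$log" | grep -q 'rejecting I/O to offline device'; then
        echo "FAIL: device was taken offline"
        return 1
    fi
    echo "OK: no stex reset failure messages found"
}
```

After running sg_reset -d /dev/sgX as in step 3, something like "dmesg | check_stex_log" should report whether the messages appeared. Where the messages end up (dmesg, /var/log/messages) depends on the syslog configuration.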
Created attachment 362376
customer supplied patch
Customer supplied patch - they have verified it fixes the bug, but there
is no accompanying explanation. I have asked the following:
Could we please have an explanation of how the
patch fixes the problem, and of any possible side effects?
For example, the patch basically deletes some of the stex reset code;
does this now rely on other code to initialize some of the
hba fields such as hba->req_head, hba->req_tail, etc.?
Was the hba->host->host_lock spinlock causing a deadlock in
the irq context, and thus causing stex_mu_intr() to loop forever?
> I've been asked by engineering whether NEC can reply directly to comment #5 on this BZ, so that the fix has a chance at making 5.5.
>> In order to have a chance at making 5.5, we need commitment from NEC to help test this. Please have them reply directly in the BZ if they are willing to do so.
> I would be very glad if you could help us verify the fix on your hardware.
> ---- comment #5 on BZ#535350 ----
> Looks like there is no such hardware in-house that QE can access. Can you please
> check with the customers if they can help test this once the Beta is out?
> BTW, since this is a regression, it should be proposed as a blocker.
Yes, NEC would be glad to help verify that the RHEL5.5 stex driver will have this problem fixed.
Please wait a little more while we test the stex driver in RHEL5.5 beta.
NEC confirmed that this problem is fixed in RHEL5.5 beta release.
Event posted on 02-15-2010 05:31pm JST by email@example.com
Thank you so much for the verification, I really appreciate it!
> NEC confirmed that the problem is fixed in RHEL5.5 beta.
> I've added Verified:NEC flag to BZ#525350.
Now I'm setting the status to Waiting on Engineering.
Internal Status set to 'Waiting on Engineering'
This event sent from IssueTracker by firstname.lastname@example.org
The stex driver was updated to 4.6.0102.4 in -178.el5, which included changes
to the reset code. Thanks for verifying that this issue no longer exists.
*** This bug has been marked as a duplicate of bug 516881 ***
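
To check whether a given system already carries the updated driver, the kernel build number can be compared with the -178.el5 build named above. This is a minimal sketch, assuming the release string has the usual RHEL5 form 2.6.18-<build>...el5 as printed by uname -r. The function name has_updated_stex is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel release string is an
# el5 build numbered 178 or later (the build named in the comment above).
# Assumes the RHEL5 form "2.6.18-<build>[.<z-stream>].el5[variant]".
has_updated_stex() {
    # Extract the build number that follows "2.6.18-"; empty if no match.
    build=$(printf '%s\n' "$1" |
        sed -n 's/^2\.6\.18-\([0-9][0-9]*\).*el5.*$/\1/p')
    [ -n "$build" ] && [ "$build" -ge 178 ]
}
```

For example, has_updated_stex "$(uname -r)" && echo "updated stex reset code should be present". The version of the installed module can also be read with "modinfo -F version stex" and compared with the 4.6.0102.4 mentioned above.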