Description of problem:
Regression introduced in RHEL5.4GA in the stex scsi HBA driver. Issuing a scsi reset to a Promise SuperTrak EX4650 results in stex_mu_intr() looping forever, with 'lagging req' messages in syslog. The customer has developed a patch and has verified it fixes the issue - I have asked for more explanation of how the patch fixes the bug in the associated IT and will post their reply here once I get it.

Version-Release number of selected component (if applicable):
RHEL5.4 GA
kernel-2.6.18-164.el5
kernel-PAE-2.6.18-164.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.4 GA, and reboot.
2. Confirm the sg device of the reset target with sg_map command.
   # sg_map
3. Execute sg_reset command.
   # sg_reset -d /dev/sgX
   Note: X is the sg device number of the reset target.

   sg_reset: starting device reset
   scsi 0:0:1:0: sg_reset: completed device reset
   stex(0000:08:00.0): lagging req
   stex(0000:08:00.0): lagging req

   Note: "stex(0000:08:00.0): lagging req" message keeps being output.
4. Execute the command with the disk access.
   # df
   The following message is printed out after a while.
   "rejecting I/O to offline device"

Actual results:
"stex(0000:08:00.0): lagging req" message keeps being output and I/O is not executed.

Expected results:
"stex(0000:08:00.0): lagging req" message is not output and I/O is executed.

Business impact:
It becomes impossible for the customer to access the disk if reset occurs.

Hardware info:
Machine : Express5800/R110a-1
CPU     : Intel(R) Xeon(R) CPU L3110 3.00GHz x 1
Memory  : 4GB
HBA     : Promise SuperTrak EX4650
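For anyone re-testing, the reproduction steps above can be roughly scripted as below. This is only a sketch, not something supplied by the customer: the function names count_lagging and check_after_reset and the 5 second wait are assumptions made for illustration, and the only commands used are sg_reset -d (from the steps above), dmesg and grep. It must be run as root on a machine with the affected HBA.

```shell
#!/bin/sh
# Sketch only: helper names and the wait time are illustrative assumptions.

# Count stex "lagging req" lines in kernel log text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_lagging() {
    grep -c 'stex(.*): lagging req' || true
}

# Issue a device reset to the given sg device, then report how many new
# "lagging req" messages appeared in the kernel log.
check_after_reset() {
    dev=$1
    before=$(dmesg | count_lagging)
    sg_reset -d "$dev" || return 1
    sleep 5   # assumed wait; the report only says the message keeps repeating
    after=$(dmesg | count_lagging)
    echo "new lagging req messages: $((after - before))"
}
```

Usage: source the file and run check_after_reset /dev/sgX, where X is the sg device number found with sg_map. A count that keeps growing on repeated checks would match the behaviour described in this report. Note that the count can be skewed if the kernel ring buffer wraps between the two dmesg calls.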
Created attachment 362376
customer supplied patch

Customer supplied patch - they have verified it fixes the bug, but there is no accompanying explanation. I have asked the following:

----
could we please have an explanation of how the patch fixes the problem, and any possible side-effects. e.g. the patch basically deletes some of the stex reset code, does this now rely on other code to initialize some of the hba fields such as hba->req_head, hba->req_tail, etc? Was the hba->host->host_lock spinlock causing a deadlock in the irq context, and thus causing stex_mu_intr() to loop forever?
----
In IT#337387
> I've asked from engineering, if NEC can reply directly to comment #5 on this BZ to have a chance at making 5.5.
>
>> In order to have a chance at making 5.5, we need commitment from NEC to help test this. Please have them reply directly in the BZ if they are willing to do so.
>
> I'm so glad if you could help us verifying the fix on your hardware.
>
> ---- comment #5 on BZ#535350 ----
> Looks like there is no such hardware in house QE can access. Can you please
> check with the customers if they can help test this once the Beta is out?
>
> BTW. Since this is a regression, so should be proposed as a blocker
> ----

Yes, NEC would be glad to help verify that the RHEL5.5 stex driver will have this problem fixed.
Please wait a little more while we test the stex driver in RHEL5.5 beta.

Best regards.
NEC confirmed that this problem is fixed in RHEL5.5 beta release.
Event posted on 02-15-2010 05:31pm JST by mfuruta

Tatsukawa-san,

Thank you so much for the verification, I really appreciate it!

> NEC confirmed that the problem is fixed in RHEL5.5 beta.
> I've added Verified:NEC flag to BZ#525350.

Now I'm setting status to Waiting on Engineering.

Best Regards,
Masaki Furuta

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by mfuruta
issue 337387
The stex driver was updated to 4.6.0102.4 in -178.el5, which included changes to the reset code. Thanks for verifying this issue no longer exists.

*** This bug has been marked as a duplicate of bug 516881 ***