Description of problem:
More than one installation has had silent data corruption with SCSI messages in the log as follows:

Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel: Additional sense: No additional sense information

The code path shows that good_bytes is set to the requested transfer length when processing. I think there is an interaction between a FC driver and the SCSI layer that causes this code path to be executed in an error condition, but because the FC driver didn't supply sense data, the SCSI layer assumes the transfer was okay. I think this is a very dangerous assumption, and these messages should not be encountered in normal activity. Emulex has submitted a kernel patch showing that they have encountered a similar problem: http://kerneltrap.org/mailarchive/linux-scsi/2008/9/12/3272434

Version-Release number of selected component (if applicable):

How reproducible:
Have not been able to reproduce it on demand.

Steps to Reproduce:
1.
2.
3.

Actual results:
Garbage data.

Expected results:
Valid data.

Additional info:
This has been encountered at more than one site. One site replaced hardware and the problem no longer occurred. Another site is still experiencing problems, but the problem is sporadic - it occurs anywhere from once a month to a few instances in a week.
I think you hit the wrong component. scsi-target-utils is for the SCSI target layer/server, which basically makes your box into a SCSI device. That was not involved, right? I am going to assume not and set this to kernel. What kernel did this occur on (uname -a), and what RHEL version was it (cat /etc/redhat-release)? Did you port the patch from here: http://kerneltrap.org/mailarchive/linux-scsi/2008/9/12/3272434 and try it out and verify it worked for the issue you are hitting, or do you want me to port it? Are you comfortable building kernels?
You are right - it's the kernel SCSI layer. I must have been blind, because I couldn't find anything for the kernel when I went through the list.

Linux mdc2a 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21 EST 2007 i686 i686 i386 GNU/Linux

I don't have the Red Hat release info from our customer. I have asked the customer to open a bug directly, but they haven't done it. They want to see it come out in an update, and then they will try it. The problem is not reproducible on demand and has occurred sporadically over 6 months. I've seen another customer with silent data corruption that also had these messages in the log. I originally attempted to address it with the linux-scsi mailing list, but I'm not a SCSI expert and didn't get very far. Then I saw this patch submitted by Emulex with a failing test case, and I'm very certain it would fix the customer problem in that there would be real I/O errors instead of silent data corruption. Of course, the source of the problem would still need to be tracked down, but the possibility of silent data corruption in this code path needs to be addressed.
qlogic driver info from the system.
===================================================
qla2xxx 0000:05:05.0: QLogic Fibre Channel HBA Driver: 8.01.07-k1
  QLogic DELL2342M - ISP2312: PCI-X (100 MHz) @ 0000:05:05.0 hdma-, host#=1, fw=3.03.20 IPX
  Vendor: APPLE    Model: Xserve RAID    Rev: 1.50
  Type: Direct-Access    ANSI SCSI revision: 05
===================================================
Adding Emulex guys per their request.
I've ported the upstream sd.c patch to RHEL 5.3's kernel. I just need to regression test it, then I'll post it here.
Red Hat release and uname -a from the system seeing problems:

Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Linux front01a 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
RH - can we get this into RHEL5.3 instead of RHEL5.4? Laurie Barry Emulex
Created attachment 321969 [details] Fix for NO_SENSE handling in sd.c ported to 2.6.18-120.el5

Sorry about the delay. This is the upstream patch modified to apply to 2.6.18-120.el5. I tested it with an instrumented driver which returns check conditions of NO_SENSE on occasional SCSI commands. 2.6.18-120.el5 sees data corruption without this patch. With the attached patch, the commands are correctly retried without corruption.
Andrius/Tom, Please confirm this is getting into RHEL5.3. Laurie
I believe Tom was hunting down some more info on this... Tom?
Red Hat, What's the latest on this one, I've seen no movement. Laurie
This came in after the 5.3 beta shipped, and it has not seen any test time upstream. I am very reluctant to check this in late to 5.3 because of the possibility of regression. That is, I/O that in fact succeeded despite a NO SENSE, and previously returned success status to the OS, will now fail. We don't have any way to know how common that scenario is in the field.

I polled several storage vendors, including the one who reported this to you. They are not aware of any conditions where they return NO SENSE but fail to complete the I/O. I believe the vendor who reported this to you currently has a workaround in firmware to ensure this as well. I suggest that they keep this workaround in place for 5.3.

We will allow this patch to get some more testing exposure upstream and in the field, and check it in early for 5.4. This just came in too late to take the risk in 5.3, especially since this is a long-standing problem, not a regression specific to 5.3.
Recently learned that the customer is running with the patch on all their systems. They have seen the messages but have not had any data corruption. The test is not conclusive as data corruption wasn't always detected and they haven't run longer yet than they saw between corruptions previously. In other words, if they went 2 months without data corruption previously, they have only gone 1 month without data corruption at this point. But it is promising as the messages have been seen without any data corruption being detected. If I hear that data corruption has been detected while running with the patch, I will immediately update this bug with that information.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Updating PM score.
in kernel-2.6.18-132.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I'm adding info that is directly from our customer, without mangling it in any way. The summary is that they hooked up a Solaris box and eventually saw errors. They still get data corruption, and I'm wondering if they are also seeing a bad retry case. I think the code should be checking the residual before deciding that all the data is valid.

From our (Quantum's) customer:

We now have one instance of file corruption. We've been hitting several nodes with large file transfers and many small creates/deletes, and have corruption in one of the large files. We also have the kernel messages from the time of the corruption. On a Linux node:

Jan 23 12:00:33 linc21a kernel: Info fld=0x0
Jan 23 12:02:21 linc21a kernel: sdb: Current: sense key: No Sense
Jan 23 12:02:21 linc21a kernel: Add. Sense: No additional sense information
Jan 23 12:02:21 linc21a kernel:
Jan 23 12:02:21 linc21a kernel: Info fld=0x0
Jan 23 13:11:46 linc21a kernel: qla2xxx 0000:05:05.0: scsi(1:0:0): Abort command issued -- 1 90fffc0 2002.
Jan 23 13:11:46 linc21a kernel: qla2xxx 0000:05:05.0: scsi(1:0:0): Abort command issued -- 1 90fffc2 2002.

With the SCSI patch, we are seeing the SCSI abort commands being issued. We've been seeing them for a while now, but this week we hit some threshold, and the device was taken offline, forcing us to do a hard reboot:

Feb 4 10:00:30 front01a kernel: lpfc 0000:0d:03.0: 0:0713 SCSI layer issued LUN reset (5, 0) Data: x2002 x1 x10000
Feb 4 10:00:42 front01a kernel: lpfc 0000:0d:03.0: 0:0714 SCSI layer issued Bus Reset Data: x2002
Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined - not ready after error recovery
Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined - not ready after error recovery
Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: SCSI error: return code = 0x00070000
Feb 4 10:01:16 front01a kernel: end_request: I/O error, dev sdh, sector 3642924288

However, there aren't any SENSE KEY messages around this activity, which is unexpected.
We set up a Solaris client, and that is providing us more useful information. We had the same issue with many SENSE messages and a device being taken offline. However, the first SENSE message in the burst is interesting:

Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w600039300001416e,0 (sd22):
Feb 3 15:56:33 sansvr Error for Command: <undecoded cmd 0x8a> Error Level: Retryable
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Requested Block: 4321824768 Error Block: 4321824768
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Vendor: APPLE Serial Number:
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w600039300001416e,0 (sd22):
Feb 3 15:56:33 sansvr Error for Command: <undecoded cmd 0x8a> Error Level: Retryable
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Requested Block: 4321825280 Error Block: 4321825280
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Vendor: APPLE Serial Number:
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Feb 3 15:56:33 sansvr scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

So we have some sense data in the first message: 'Unit Attention', something we never see on the Linux nodes (and I'm not convinced the Linux SCSI layer does the right thing in terms of error handling).
We have more info in the Solaris logs:

Feb 3 17:14:17 sansvr offlining lun=1 (trace=0), target=70f00 (trace=2800004)
Feb 3 17:14:17 sansvr scsi: [ID 243001 kern.info] /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0 (fcp0):
Feb 3 17:14:17 sansvr offlining lun=0 (trace=0), target=70f00 (trace=2800004)
Feb 3 17:14:17 sansvr genunix: [ID 408114 kern.info] /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f3d,2 (sd24) offline
Feb 3 17:14:17 sansvr genunix: [ID 408114 kern.info] /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w60003930000141b6,2 (sd27) offline
Feb 3 17:15:10 sansvr scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f80,0 (sd152):
Feb 3 17:15:10 sansvr SCSI transport failed: reason 'timeout': retrying command
Feb 3 17:17:32 sansvr scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f80,0 (sd152):
Feb 3 17:17:32 sansvr SCSI transport failed: reason 'timeout': giving up
Feb 3 17:17:32 sansvr genunix: [ID 307647 kern.notice] NOTICE: I/O error on file system 'snfs2' operation WRITE inode 0x26fb3fe83 file offset 4289921024 I/O length 65536

So there we have a complete trace of a device being taken offline, and the error eventually floating up to the file system, resulting in a WRITE error. I don't believe we've ever seen this level of error on the Linux box. It's also nice that the Solaris device name is based on the fibre WWN; it makes it easier to know what RAID was involved.

I looked at the logs on the XRAID that was taken offline, and there are no errors. Also, that device was still accessible to other nodes in the cluster. I also looked at the logs in the fibre switch, and again, no errors directly related to this event and this device. However, on one switch there were messages about a port being reset, where an XRAID is attached.
(In reply to comment #22)
> With the SCSI patch we are seeing the SCSI abort commands being
> issued. We've been seeing them for a while now,
> but this week, we hit some threshold, and the device was taken
> offline, forcing us to do a hard reboot:
>
> Feb 4 10:00:30 front01a kernel: lpfc 0000:0d:03.0: 0:0713 SCSI layer
> issued LUN reset (5, 0) Data: x2002 x1 x10000
> Feb 4 10:00:42 front01a kernel: lpfc 0000:0d:03.0: 0:0714 SCSI layer
> issued Bus Reset Data: x2002
> Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -
> not ready after error recovery
> Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -
> not ready after error recovery
> Feb 4 10:01:16 front01a kernel: sd 1:0:5:0: SCSI error: return code =
> 0x00070000
> Feb 4 10:01:16 front01a kernel: end_request: I/O error, dev sdh,
> sector 3642924288
>
> However, there aren't any SENSE KEY messages around this activity,
> which is unexpected.

Do you only see corruption when you see the device offlined messages?

When you see those offlined messages, it means the SCSI layer and driver tried to recover a disk but could not. You get into this situation when a command does not complete within /sys/block/sdb/device/timeout seconds. If this happens, the SCSI layer will ask the driver to abort the running tasks. If that fails, then the SCSI layer asks the driver to reset the logical unit. If that fails, then the driver is asked to reset the bus (in the case of Fibre Channel drivers, they will normally just do a target or LU reset for all the targets or devices on the bus). If that fails, then we reset the host. And finally, if that fails, the devices are offlined and the SCSI layer fails the IO and fails all future IOs until the device is manually onlined again.

So when this happens, the file system or application is going to get failed IO, and writes will not have completed, and the app/fs may not handle this correctly. This lines up with what we see in the Solaris trace, I think.
Previously, the data corruption was seen with the SCSI "no sense data" messages, and no I/O errors made it up to the filesystem. My understanding now is that corruption is still being seen in that case, but I will verify.

I have a suspicion that sometimes we are getting a "successful retry" in that same case statement, which is the source of the "no sense data" messages. I know of a verified case where data corruption was happening in conjunction with messages indicating that retry case statement was executed. This was a different system entirely, but it backs up my opinion that the code should be checking the residual instead of assuming it was 0 based on the codes. In that instance, the source of the error from below the SCSI layer was fixed. Back to this particular instance - I have no indication that retry messages were ever seen, so it's only a theory on my part.

I will ask our customer if he would be willing to let me add him to this bug so that questions can be answered directly. The difficulty they are having is identifying the place errors are being generated. Maybe it is all driven by timeouts. The RAID, switches, and HBAs show no errors in their logs. When I/O errors are returned, the filesystem does handle them and is quite verbose about receiving them.
(In reply to comment #25)
> I will ask our customer if he would be willing to let me add him to this bug so
> that questions can be answered directly. The difficulty they are having is
> identifying the place errors are being generated. Maybe it is all driven by
> timeouts. The raid, switches, hba's show no errors in their logs.

The aborts, LUN reset, and bus reset are from the command timeout I mentioned before: /sys/block/sdX/device/timeout. It is basically a block/SCSI layer timeout on each SCSI command that is sent to the HBA (lpfc in the case of the log) driver.

> When i/o errors are returned the filesystem does handle them and is quite
> verbose about receiving them.

OK, just to make sure I understand you: when the device is offlined, a write might not get executed, so if you try to access the file later, it is going to be bad. So users will report this as corruption.

For your case, though, are you saying that you know the write did not succeed because the device was offlined, and that besides that problem you also get files that we said were successfully written out, but there is corruption in them? Right? How do you know it was written out successfully? Are you doing a write and a sync to make sure it is on disk before moving on to the next file/test? I mean, if you do a cp or write(), they could return OK status, but then the data could be in a file system or buffer cache and not on disk yet.
Emulex guys,

For the resid part of this, will that make a difference with your driver? I glanced over it, and it looked like you only set resid if there is an underrun or overrun detected by lpfc_handle_fcp_err. So should we be hitting this:

    if (resp_info & RESID_UNDER) {
            cmnd->resid = be32_to_cpu(fcprsp->rspResId);
            lpfc_printf_log(phba, KERN_INFO, LOG_FCP,
                            "(%d):0716 FCP Read Underrun, expected %d, "
                            "residual %d Data: x%x x%x x%x\n",
                            (vport ? vport->vpi : 0),
                            be32_to_cpu(fcpcmd->fcpDl), cmnd->resid,
                            fcpi_parm, cmnd->cmnd[0], cmnd->underflow);

in lpfc_scsi.c:lpfc_handle_fcp_err?
Created attachment 333370 [details] check resid when handling no sense failures So do we need this patch then? It checks the resid and only completes good_bytes based on that.
This is "the customer" (and thanks to Laurie for submitting info on our behalf): > > Do you only see corruption when you see the device offlined messsages? No, we have confirmed yesterday that we have file corruption without the device being taken offline. We have NO SENSE errors only, and the node has the SCSI patch applied referenced here: https://bugzilla.redhat.com/attachment.cgi?id=321969 Prior to applying the patch, we had corruption, but devices were never taken offline. > So when this happens the file system or application is going to get failed IO > and writes will not have completed and the app/fs may not handle this > correctly. > > This lines up with what we see in the solaris trace I think. Yes, on Solaris, we have seen errors at the file system layer, something we don't see on the Linux side. We just have NO SENSE messages (and no UNIT ATTENTION as we have on Solaris), and rarely the device is taken offline after some threshold is reached.
Mike,

(In reply to comment #28)
> Created an attachment (id=333370) [details]
> check resid when handling no sense failures
>
> So do we need this patch then? It checks the resid and only completes
> good_bytes based on that.

This patch looks like it applies without my patch (comment #9). Comment #29 from Wayne Salamon states that the failing node has the earlier patch (my patch) applied. So I think your id=333370 patch won't apply to the failing system. I think you're right about lpfc's setting of resid, but it shouldn't matter with my patch applied (in that case, good_bytes is always left at 0 in the NO_SENSE handling code).
Re Comment #22: It really doesn't surprise me that you see different things between Solaris and Linux. The stacks are very different. Timing and recovery steps are different. Connectivity loss is handled differently. LLDDs on Solaris may completely hide the connectivity loss, and the stack may never take something offline - by design. Whereas Linux allows only a 30-second disappearance before the device is considered "unplugged" and we start to tear down/offline the stack - which, as Mike points out in comments #24 and #26, leaves the users of the device in limbo land, which may appear as "corruption" depending on what made it out or not before the teardown/offline. The whole point of offlining, and the manual steps to re-online, was to require the admin to do whatever was necessary to deal with the limbo state of the application or stack layers above the device (granted, the admin may have no clue what to do).

Re Comment #27: We saw conditions both with underrun not set and with it set. The biggest problem was when underrun was set and we did set the residual, but the midlayer ignored it and set good_bytes to everything - thus the patch.

Re Comment #28: I believe it too should be there. However, when I looked at it before, I thought good_bytes was already being calculated by the routine where the patch was applied. Perhaps you caught another code path that missed it.
(In reply to comment #31) > Re Comment #28 > I believe it too should be there. However, when I looked at it before, I > thought good-bytes was already being calculated by the routine where the patch > was applied. Perhaps you caught another code path that missed it. James and Jamie, Yeah, I goofed and thought we had Jamie's patch applied to the source I was looking at. good_bytes is going to be set to zero, and so with the patch we fail the command and scsi_end_request will requeue it.
~~ Attention Partners RHEL 5.4 Partner Alpha Released! ~~ RHEL 5.4 Partner Alpha has been released on partners.redhat.com. There should be a fix present that addresses this particular request. Please test and report back your results here, at your earliest convenience. Our Public Beta release is just around the corner! If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have verified the request functions as expected, please set your Partner ID in the Partner field above to indicate successful test results. Do not flip the bug status to VERIFIED. Further questions can be directed to your Red Hat Partner Manager. Thanks!
Partners, This particular request is of a notably high priority. In order to make the most of this Alpha release, please report back initial test results before the scheduled Beta drop. That way, if you encounter any issues, we can work to get additional corrections in before we launch our Public Beta release. Speak with your Partner Manager for additional dates and information. Thank you for your cooperation in this effort.
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~ RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Is there anything in the 5.4 beta beyond the patch referenced above? We've been running with that patch for months now, and have had corruption with the patch applied. I've installed the 5.4 beta kernel on a node, and we have SENSE KEY errors, but otherwise no problems with the kernel.
Clarification: We've been running a kernel with the first patch (2008-10-30), not the second (2009-02-26). We installed the first patch in November of 2008.
Wayne, The patch that's in the latest snapshot build should be the same as the one you're running, if that was the patch everyone agreed to include in 5.4. However, there have been many additional changes committed to the 5.4 kernel that might have unexpectedly impacted the patch since then. If it would be at all possible to verify that the latest kernel build continues to fix this issue, it would be greatly appreciated. If you are unable to execute an additional test run for this issue, please do me the favour of confirming the exact patch you are running so I can double-check that we're building with the patch you think we are. Thanks.
All of our nodes, but one, are running with the 2008-10-30 patch (applied to the RHEL 2.6.18-53.el5 kernel). One node is running with the RHEL 5.4 beta kernel (2.6.18-155.el5). With the single patch, we still have corruption occasionally, so we've never had a fix. With the 5.4 kernel, we still have SENSE KEY errors, but haven't reproduced corruption yet. However, it often takes a while (weeks even) to reproduce corruption. We'll update the bug report if that happens.
I assume the error you're encountering with the beta kernel, the SENSE KEY error, is unwanted. Has there been a bug filed against this issue? I will go ahead and close this bug out as VERIFIED, based on your comment that the original issue, data corruption, appears to have been resolved. If you do encounter the data corruption again, though, clone this bug and escalate the issue as a regression so it will be fixed as soon as possible. Thanks!
No bug report has been issued for the SENSE KEY warnings because that is most likely the desired behavior if the hardware is misbehaving. We haven't determined if the issue truly is bad hardware, however, or is some misreading of the hardware state by the device driver. In either case, changing the SCSI layer probably is not appropriate. Note that I haven't stated that we consider the problem as fixed. We've only tested one node, and haven't exercised it sufficiently to reproduce corruption; that may take several weeks.
Understood. We will leave this item ON_QA then until you confirm that no corruption has occurred after a long enough period or we'll leave this issue as having been unverifiable if we ship it before you can make this judgement. Thank you for the clarification.
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~ RHEL 5.4 Snapshot 5 is the FINAL snapshot to be released before RC. It has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular issue. Please test and report back your results here, at your earliest convenience. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA. Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Emulex - has this been confirmed as tested on your end?
I just retested and verified that completing SCSI commands with NO_SENSE sense data on 2.6.18-160.el5 does not result in data corruption. The same driver and test produces data corruption with 2.6.18-128.el5. This is tested with an instrumented lpfc driver as I don't have an array that actually returns NO_SENSE available.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html