Bug 468088 - [EMULEX 5.4 bug] scsi messages correlate with silent data corruption, but no i/o errors
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
5.3
i686 Linux
Severity: urgent
: rc
: 5.4
Assigned To: Rob Evers
Red Hat Kernel QE team
: OtherQA
Depends On:
Blocks: 461676 480666 483701 483784
Reported: 2008-10-22 13:55 EDT by Laurie Costello
Modified: 2011-02-09 10:11 EST (History)
13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 480666
Environment:
Last Closed: 2009-09-02 04:11:03 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Fix for NO_SENSE handling in sd.c ported to 2.6.18-120.el5 (961 bytes, patch)
2008-10-30 14:10 EDT, Jamie Wellnitz
no flags Details | Diff
check resid when handling no sense failures (454 bytes, application/octet-stream)
2009-02-26 14:02 EST, Mike Christie
no flags Details

Description Laurie Costello 2008-10-22 13:55:52 EDT
Description of problem: 
More than one installation has had silent data corruption with scsi messages in the log as follows:
Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel:     Additional sense: No additional sense information

The code path shows that good_bytes is set to the requested transfer length when processing.  I think there is an interaction between a FC driver and scsi that causes this code path to be executed in an error condition, but because the FC driver didn't supply sense data, scsi assumes that the transfer was okay.  I think this is a very dangerous assumption, and these messages should not be encountered in normal activity.  Emulex has submitted a kernel patch showing that they have encountered a similar problem.

http://kerneltrap.org/mailarchive/linux-scsi/2008/9/12/3272434







Version-Release number of selected component (if applicable):


How reproducible:
Have not been able to reproduce it on demand.


Steps to Reproduce:
1. 
2.
3.
  
Actual results:
garbage data (silent corruption, with no i/o errors reported)


Expected results:
Valid data, or i/o errors if the transfer failed.


Additional info:
This has been encountered at more than one site.  One site replaced hardware and the problem no longer occurred.  Another site is still experiencing problems, but the problem is sporadic - it occurs anywhere from once a month to a few instances in a week.
Comment 1 Mike Christie 2008-10-22 14:24:05 EDT
I think you hit the wrong component. scsi-target-utils is for the scsi target layer/server, which basically makes your box into a scsi device. That was not involved, right? I am going to assume not and set this to kernel.

What kernel did this occur on (uname -a) and what RHEL version (cat /etc/redhat-release) was it?

Did you port the patch from here:
http://kerneltrap.org/mailarchive/linux-scsi/2008/9/12/3272434
and try it out to verify it fixes the issue you are hitting, or do you want me to port it? Are you comfortable building kernels?
Comment 2 Laurie Costello 2008-10-22 14:48:29 EDT
You are right - it's the kernel scsi layer.  I must have been blind, because I couldn't find anything for the kernel when I went through the list.

Linux mdc2a 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21 EST 2007 i686 i686 i386 GNU/Linux

I don't have the redhat release info from our customer.  I have asked the customer to open a bug directly, but they haven't done it.  They want to see it come out in an update, and then they will try it.  The problem is not reproducible on demand and has occurred sporadically over 6 months.  I've seen another customer with silent data corruption that also had these messages in the log.  I originally attempted to address it with the linux-scsi mailing list, but I'm not a scsi expert and didn't get very far.  Then I saw this patch submitted by Emulex with a failing test case, and I'm very certain it would fix the customer problem in that there would be real i/o errors instead of silent data corruption.  Of course, the source of the problem would still need to be tracked down, but the possibility of silent data corruption in this code path needs to be addressed.
Comment 3 Laurie Costello 2008-10-22 15:08:59 EDT
QLogic driver info from the system.

===================================================
qla2xxx 0000:05:05.0:
 QLogic Fibre Channel HBA Driver: 8.01.07-k1
  QLogic DELL2342M -
  ISP2312: PCI-X (100 MHz) @ 0000:05:05.0 hdma-, host#=1, fw=3.03.20 IPX
  Vendor: APPLE     Model: Xserve RAID       Rev: 1.50
  Type:   Direct-Access                      ANSI SCSI revision: 05
===================================================
Comment 4 Mike Christie 2008-10-22 23:00:25 EDT
Adding Emulex guys per their request.
Comment 5 Jamie Wellnitz 2008-10-23 10:43:30 EDT
I've ported the upstream sd.c patch to RHEL 5.3's kernel. I just need to regression test it, then I'll post it here.
Comment 6 Laurie Costello 2008-10-23 11:52:58 EDT
Redhat release and uname -a from system seeing problems.

Red Hat Enterprise Linux Server release 5.1 (Tikanga)

Linux front01a 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
Comment 7 laurie barry 2008-10-23 15:55:17 EDT
RH - can we get this into RHEL5.3 instead of RHEL5.4?

Laurie Barry
Emulex
Comment 9 Jamie Wellnitz 2008-10-30 14:10:15 EDT
Created attachment 321969 [details]
Fix for NO_SENSE handling in sd.c ported to 2.6.18-120.el5

Sorry about the delay.

This is the upstream patch modified to apply to 2.6.18-120.el5.

I tested it with an instrumented driver which returns check conditions of NO_SENSE on occasional SCSI commands.  2.6.18-120.el5 sees data corruption without this patch.  With the attached patch, the commands are correctly retried without corruption.
Comment 10 laurie barry 2008-11-13 10:40:52 EST
Andrius/Tom,

Please confirm this is getting into RHEL5.3.

Laurie
Comment 11 Andrius Benokraitis 2008-11-13 11:52:24 EST
I believe Tom was hunting down some more info on this... Tom?
Comment 12 laurie barry 2008-12-05 09:01:35 EST
Red Hat,

What's the latest on this one? I've seen no movement.

Laurie
Comment 13 Tom Coughlan 2008-12-05 09:49:54 EST
This came in after the 5.3 beta shipped, and it has not seen any test time upstream. I am very reluctant to check this in late in 5.3 because of the possibility of regression. That is, I/O that in fact succeeded despite a NO SENSE, and previously returned success status to the OS, will now fail. We don't have any way to know how common that scenario is in the field.

I polled several storage vendors, including the one who reported this to you. They are not aware of any conditions where they return NO SENSE but fail to complete the I/O. I believe the vendor who reported this to you currently has a workaround in firmware to ensure this as well. I suggest that they keep this workaround in place for 5.3. We will allow this patch to get some more testing exposure upstream, and in the field, and check it in early to 5.4. This just came in too late to take the risk in 5.3, especially since this is a long-standing problem, not a regression specific to 5.3.
Comment 15 Laurie Costello 2009-01-09 12:49:39 EST
Recently learned that the customer is running with the patch on all their systems.  They have seen the messages but have not had any data corruption.  The test is not conclusive, as data corruption wasn't always detected and they haven't yet run as long as the intervals they previously saw between corruptions.  In other words, if they went 2 months without data corruption previously, they have only gone 1 month without data corruption at this point.  But it is promising, as the messages have been seen without any data corruption being detected.  If I hear that data corruption has been detected while running with the patch, I will immediately update this bug with that information.
Comment 19 RHEL Product and Program Management 2009-01-27 15:41:24 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 20 RHEL Product and Program Management 2009-02-16 10:15:48 EST
Updating PM score.
Comment 21 Don Zickus 2009-02-23 15:02:12 EST
in kernel-2.6.18-132.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 22 Laurie Costello 2009-02-23 15:37:26 EST
I'm adding info that is directly from our customer, without my mangling it in any way.  The summary is that they hooked up a Solaris box and eventually saw errors.
They still get data corruption, and I'm wondering if they are also seeing a bad
retry case.  I think the code should be checking the residual before deciding
that all the data is valid.

From our (Quantum's) customer:
We now have one instance of file corruption. We've been hitting  
several nodes with large file transfers and many small
creates/deletes, and have corruption in one of the large files. We  
also have the kernel messages from the time of the
corruption.


On a Linux node:

Jan 23 12:00:33 linc21a kernel: Info fld=0x0
Jan 23 12:02:21 linc21a kernel: sdb: Current: sense key: No Sense
Jan 23 12:02:21 linc21a kernel:     Add. Sense: No additional sense  
information
Jan 23 12:02:21 linc21a kernel:
Jan 23 12:02:21 linc21a kernel: Info fld=0x0
Jan 23 13:11:46 linc21a kernel: qla2xxx 0000:05:05.0: scsi(1:0:0):  
Abort command issued -- 1 90fffc0 2002.
Jan 23 13:11:46 linc21a kernel: qla2xxx 0000:05:05.0: scsi(1:0:0):  
Abort command issued -- 1 90fffc2 2002.
 

With the SCSI patch we are seeing the SCSI abort commands being  
issued. We've been seeing them for a while now,
but this week, we hit some threshold, and the device was taken  
offline, forcing us to do a hard reboot: 

Feb  4 10:00:30 front01a kernel: lpfc 0000:0d:03.0: 0:0713 SCSI layer  
issued LUN reset (5, 0) Data: x2002 x1 x10000
Feb  4 10:00:42 front01a kernel: lpfc 0000:0d:03.0: 0:0714 SCSI layer  
issued Bus Reset Data: x2002
Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -  
not ready after error recovery
Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -  
not ready after error recovery
Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: SCSI error: return code =  
0x00070000
Feb  4 10:01:16 front01a kernel: end_request: I/O error, dev sdh,  
sector 3642924288 

However, there aren't any SENSE KEY messages around this activity,  
which is unexpected.

 

We set up a Solaris client, and that is providing us more useful  
information. We had the same issue with many SENSE
messages, and a device being taken offline. However, the first SENSE  
message in the burst is interesting:
 

Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.warning] WARNING: / 
pci@0,0/pci8086, 25f8@4/pci1077,137@0/fp@0,0/disk@w600039300001416e,0  
(sd22):
Feb  3 15:56:33 sansvr  Error for Command: <undecoded cmd 0x8a>     
Error Level: Retryable
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Requested  
Block: 4321824768                Error Block: 4321824768
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Vendor:  
APPLE                   Serial Number:
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Sense Key:  
Unit Attention
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    ASC: 0x29  
(power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.warning] WARNING: / 
pci@0,0/pci8086,
25f8@4/pci1077,137@0/fp@0,0/disk@w600039300001416e,0 (sd22):
Feb  3 15:56:33 sansvr  Error for Command: <undecoded cmd 0x8a>     
Error Level:Retryable
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Requested  
Block: 4321825280                Error Block: 4321825280
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Vendor:  
APPLE                   Serial Number:
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    Sense Key: No  
Additional Sense
Feb  3 15:56:33 sansvr scsi: [ID 107833 kern.notice]    ASC: 0x0 (no  
additional sense info), ASCQ: 0x0, FRU: 0x0
 

So we have some sense data in the first message: 'Unit Attention',  
something we never see on the Linux nodes (and I'm
not convinced the Linux SCSI layer does the right thing in terms of  
error handling). We have more info in the Solaris
logs: 

Feb  3 17:14:17 sansvr  offlining lun=1 (trace=0), target=70f00  
(trace=2800004)
Feb  3 17:14:17 sansvr scsi: [ID 243001 kern.info] /pci@0,0/ 
pci8086,25f8@4/pci10
77,137@0/fp@0,0 (fcp0):
Feb  3 17:14:17 sansvr  offlining lun=0 (trace=0), target=70f00  
(trace=2800004)
Feb  3 17:14:17 sansvr genunix: [ID 408114 kern.info] /pci@0,0/ 
pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f3d,2 (sd24)  
offline
Feb  3 17:14:17 sansvr genunix: [ID 408114 kern.info] /pci@0,0/ 
pci8086,25f8@4/pci1077,137@0/fp@0,0/disk@w60003930000141b6,2 (sd27)  
offline
Feb  3 17:15:10 sansvr scsi: [ID 107833 kern.warning] WARNING: / 
pci@0,0/pci8086, 25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f80,0  
(sd152):
Feb  3 17:15:10 sansvr  SCSI transport failed: reason 'timeout':  
retrying command
Feb  3 17:17:32 sansvr scsi: [ID 107833 kern.warning] WARNING: / 
pci@0,0/pci8086,
25f8@4/pci1077,137@0/fp@0,0/disk@w6000393000013f80,0 (sd152):
Feb  3 17:17:32 sansvr  SCSI transport failed: reason 'timeout':  
giving up
Feb  3 17:17:32 sansvr genunix: [ID 307647 kern.notice] NOTICE: I/O  
error on file system 'snfs2' operation WRITE inode 0x26fb3fe83 file  
offset 4289921024 I/O length 65536


So there we have a complete trace of a device being taken offline,  
and the error eventually floating up to the file
system resulting in a WRITE error. I don't believe we've ever seen  
this level of error on the Linux box. It's also
nice that the Solaris device name is based on the fibre WWN; makes it  
easier to know what RAID was involved.


I looked at the logs on the XRAID that was taken offline, and there are  
no errors. Also, that device was still accessible
to other nodes in the cluster. I also looked at the logs in the fibre  
switch, and again, no errors directly related to this
event and this device. However, on one switch there were messages  
about a port being reset, where an XRAID is
attached.
Comment 24 Mike Christie 2009-02-26 12:16:30 EST
(In reply to comment #22)
> With the SCSI patch we are seeing the SCSI abort commands being  
> issued. We've been seeing them for a while now,
> but this week, we hit some threshold, and the device was taken  
> offline, forcing us to do a hard reboot: 
> 
> Feb  4 10:00:30 front01a kernel: lpfc 0000:0d:03.0: 0:0713 SCSI layer  
> issued LUN reset (5, 0) Data: x2002 x1 x10000
> Feb  4 10:00:42 front01a kernel: lpfc 0000:0d:03.0: 0:0714 SCSI layer  
> issued Bus Reset Data: x2002
> Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -  
> not ready after error recovery
> Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: scsi: Device offlined -  
> not ready after error recovery
> Feb  4 10:01:16 front01a kernel: sd 1:0:5:0: SCSI error: return code =  
> 0x00070000
> Feb  4 10:01:16 front01a kernel: end_request: I/O error, dev sdh,  
> sector 3642924288 
> 
> However, there aren't any SENSE KEY messages around this activity,  
> which is unexpected.
> 

Do you only see corruption when you see the device offlined messages?

When you see those offlined messages, it means the scsi layer and driver tried to recover a disk but could not. You get into this situation when a command does not complete within /sys/block/sdb/device/timeout seconds. When that happens, recovery escalates through these steps:

1. The scsi layer asks the driver to abort the running tasks.
2. If that fails, the scsi layer asks the driver to reset the logical unit.
3. If that fails, the driver is asked to reset the bus (fibre channel drivers will normally just do a target or LU reset for all the targets or devices on the bus).
4. If that fails, we reset the host.
5. Finally, if that fails, the devices are offlined and the scsi layer fails the IO and fails all future IOs until the device is manually onlined again.

So when this happens the file system or application is going to get failed IO and writes will not have completed and the app/fs may not handle this correctly.

This lines up with what we see in the solaris trace I think.
Comment 25 Laurie Costello 2009-02-26 13:02:23 EST

Previously, the data corruption was seen with the scsi "no sense data" messages, and no i/o errors made it up to the filesystem.  My understanding now is that corruption is still being seen in that case, but I will verify.  I have a suspicion that sometimes we are getting a "successful retry" in that same case statement, which is the source of the "no sense data" messages.

I know of a verified case where data corruption was happening in conjunction with messages indicating that retry case statement was executed.  This was a different system entirely, but it backs up my opinion that the code should be checking the residual instead of assuming it was 0 based on the codes.  In that instance, the source of the error from below the scsi layer was fixed.

Back to this particular instance - I have no indication that retry messages were ever seen, so it's only a theory on my part.

I will ask our customer if he would be willing to let me add him to this bug so that questions can be answered directly.  The difficulty they are having is identifying the place errors are being generated.  Maybe it is all driven by timeouts.  The raid, switches, hba's show no errors in their logs.

When i/o errors are returned the filesystem does handle them and is quite verbose about receiving them.
Comment 26 Mike Christie 2009-02-26 13:47:56 EST
(In reply to comment #25)
> I will ask our customer if he would be willing to let me add him to this bug so
> that questions can be answered directly.  The difficulty they are having is
> identifying the place errors are being generated.  Maybe it is all driven by
> timeouts.  The raid, switches, hba's show no errors in their logs.
> 

The aborts, lun reset and bus reset are from the command timeout I mentioned before:
/sys/block/sdX/device/timeout

It is basically a block/scsi layer timeout on each scsi command that is sent to the HBA driver (lpfc in the case of the log).


> When i/o errors are returned the filesystem does handle them and is quite
> verbose about receiving them.

Ok just to make sure I understand you.

When the device is offlined, a write might not get executed, so if you try to access the file later, it is going to be bad. So users will report this as corruption. For your case though, are you saying that you know the write did not succeed because the device was offlined, and that besides that problem you also get files that were reported as successfully written but contain corruption, right?

How do you know it was written out successfully? Are you doing a write and a sync to make sure it is on disk before moving on to the next file/test? I mean, if you do a cp or write(), they could return ok status, but then the data could be in a file system or buffer cache and not on disk yet.
Comment 27 Mike Christie 2009-02-26 13:55:29 EST
Emulex guys,

For the resid part of this, will that make a difference with your driver? I glanced over it and it looked like you only set resid if there is underrun or overrun detected by lpfc_handle_fcp_err.

So should we be hitting this:

       if (resp_info & RESID_UNDER) {
                cmnd->resid = be32_to_cpu(fcprsp->rspResId);

                lpfc_printf_log(phba, KERN_INFO, LOG_FCP,
                                "(%d):0716 FCP Read Underrun, expected %d, "
                                "residual %d Data: x%x x%x x%x\n",
                                (vport ? vport->vpi : 0),
                                be32_to_cpu(fcpcmd->fcpDl),
                                cmnd->resid, fcpi_parm, cmnd->cmnd[0],
                                cmnd->underflow);


in lpfc_scsi.c:lpfc_handle_fcp_err?
Comment 28 Mike Christie 2009-02-26 14:02:06 EST
Created attachment 333370 [details]
check resid when handling no sense failures

So do we need this patch then? It checks the resid and only completes good_bytes based on that.
Comment 29 Wayne Salamon 2009-02-26 15:36:18 EST
This is "the customer" (and thanks to Laurie for submitting info on our behalf):

> 
> Do you only see corruption when you see the device offlined messsages?

No, we have confirmed yesterday that we have file corruption without the device being taken offline.
We have NO SENSE errors only, and the node has the SCSI patch applied referenced here:
https://bugzilla.redhat.com/attachment.cgi?id=321969

Prior to applying the patch, we had corruption, but devices were never taken offline.


> So when this happens the file system or application is going to get failed IO
> and writes will not have completed and the app/fs may not handle this
> correctly.
> 
> This lines up with what we see in the solaris trace I think.

Yes, on Solaris, we have seen errors at the file system layer, something we don't see on the Linux side.
We just have NO SENSE messages (and no UNIT ATTENTION as we have on Solaris), and rarely the device
is taken offline after some threshold is reached.
Comment 30 Jamie Wellnitz 2009-02-26 17:10:14 EST
Mike,

(In reply to comment #28)
> Created an attachment (id=333370) [details]
> check resid when handling no sense failures
> 
> So do we need this patch then? It checks the resid and only completes
> good_bytes based on that.

This patch looks like it applies without my patch (comment #9).  Comment #29 from Wayne Salamon states that the failing node has the earlier patch (my patch) applied.

So I think your id=333370 patch won't apply to the failing system.

I think you're right about lpfc's setting of resid, but it shouldn't matter with my patch applied (in that case good_bytes is always left at 0 in the NO_SENSE handling code).
Comment 31 James Smart 2009-03-01 13:03:54 EST
Re Comment #22:
It really doesn't surprise me that you see different things between Solaris and Linux. The stacks are very different. Timing and recovery steps are different. Connectivity loss is handled differently. LLDDs on Solaris may completely hide the connectivity loss, and the stack may never take something offline - by design.  Whereas Linux allows only a 30 second disappearance before the device is considered "unplugged" and we start to teardown/offline the stack - which, as Mike points out in comments #24 and #26, leaves the users of the device in limbo land, which may appear as "corruption" depending on what made it out or not before the teardown/offline.  The whole point of offlining, and the manual steps to re-online, was to require the admin to do whatever was necessary to deal with the limbo state of the application or stack layers above the device (granted, the admin may have no clue what to do).

Re Comment #27
We saw conditions both with underrun not set and with it set. The biggest problem was when underrun was set and we did set the residual, but the midlayer ignored it and set good_bytes to everything - thus the patch.

Re Comment #28
I believe it too should be there. However, when I looked at it before, I thought good_bytes was already being calculated by the routine where the patch was applied. Perhaps you caught another code path that missed it.
Comment 32 Mike Christie 2009-03-02 10:35:33 EST
(In reply to comment #31)
> Re Comment #28
> I believe it too should be there. However, when I looked at it before, I
> thought good-bytes was already being calculated by the routine where the patch
> was applied. Perhaps you caught another code path that missed it.

James and Jamie,

Yeah, I goofed and thought we had Jamie's patch applied to the source I was looking at. good_bytes is going to be set to zero, and so with the patch we fail the command and scsi_end_request will requeue it.
Comment 33 Chris Ward 2009-06-14 19:16:02 EDT
~~ Attention Partners RHEL 5.4 Partner Alpha Released! ~~

RHEL 5.4 Partner Alpha has been released on partners.redhat.com. There should
be a fix present that addresses this particular request. Please test and report back your results here, at your earliest convenience. Our Public Beta release is just around the corner!

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have verified the request functions as expected, please set your Partner ID in the Partner field above to indicate successful test results. Do not flip the bug status to VERIFIED. Further questions can be directed to your Red Hat Partner Manager. Thanks!
Comment 34 Chris Ward 2009-06-14 19:18:18 EDT
Partners, 

This particular request is of notably high priority. In order to make the most of this Alpha release, please report back initial test results before the scheduled Beta drop. That way, if you encounter any issues, we can work to get additional corrections in before we launch our Public Beta release. Speak with your Partner Manager for additional dates and information. Thank you for your cooperation in this effort.
Comment 35 Chris Ward 2009-07-03 14:11:29 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 36 Chris Ward 2009-07-10 15:06:03 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Comment 37 Wayne Salamon 2009-07-15 16:19:54 EDT
Is there anything in the 5.4 beta beyond the patch referenced above? We've been running with that
patch for months now, and have had corruption with the patch applied. I've installed the 5.4 beta
kernel on a node, and we have SENSE KEY errors, but otherwise no problems with the kernel.
Comment 38 Wayne Salamon 2009-07-15 16:25:49 EDT
Clarification: We've been running a kernel with the first patch (2008-10-30), not the second 
(2009-02-26). We installed the first patch in November of 2008.
Comment 39 Chris Ward 2009-07-16 10:48:32 EDT
Wayne,

The patch that's in the latest snapshot build should be the same as the one you're running, if that was the patch everyone agreed to include in 5.4. However, there have been many additional changes committed to the 5.4 kernel that might have unexpectedly impacted the patch since then. If it would be at all possible to verify that the latest kernel build continues to fix this issue, it would be greatly appreciated. If you are unable to execute an additional test run for this issue, please do me the favour and confirm with me the exact patch you are running, so I can double-check that we're building with the patch you think we are. Thanks.
Comment 40 Wayne Salamon 2009-07-16 11:24:48 EDT
All of our nodes, but one, are running with the 2008-10-30 patch (applied to the RHEL 2.6.18-53.el5 kernel). One node is running with the RHEL 5.4 beta kernel (2.6.18-155.el5). With the single patch, we
still have corruption occasionally, so we've never had a fix. With the 5.4 kernel, we still have SENSE KEY
errors, but haven't reproduced corruption yet. However, it often takes a while (weeks even) to reproduce
corruption. We'll update the bug report if that happens.
Comment 41 Chris Ward 2009-07-17 03:52:14 EDT
I assume the error you're encountering with the beta kernel, the SENSE KEY error, is unwanted. Has a bug been filed against this issue?

I will go ahead and close this bug out as VERIFIED, based on your comment that the original issue, data corruption, appears to have been resolved. If you do encounter the data corruption again though, clone this bug and escalate the issue as a regression so it will be fixed as soon as possible. Thanks!
Comment 42 Wayne Salamon 2009-07-17 08:01:31 EDT
No bug report has been issued for the SENSE KEY warnings because that is most likely the desired
behavior if the hardware is misbehaving. We haven't determined if the issue truly is bad hardware,
however, or is some misreading of the hardware state by the device driver. In either case, changing
the SCSI layer probably is not appropriate.

Note that I haven't stated that we consider the problem as fixed. We've only tested one node, and
haven't exercised it sufficiently to reproduce corruption; that may take several weeks.
Comment 43 Chris Ward 2009-07-17 08:15:07 EDT
Understood. We will leave this item ON_QA until you confirm that no corruption has occurred after a long enough period, or we'll leave this issue as having been unverifiable if we ship before you can make this judgement. Thank you for the clarification.
Comment 45 Chris Ward 2009-08-03 11:44:24 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~

RHEL 5.4 Snapshot 5 is the FINAL snapshot to be released before RC. It has been 
released on partners.redhat.com. If you have already reported your test results, 
you can safely ignore this request. Otherwise, please notice that there should be 
a fix available now that addresses this particular issue. Please test and report 
back your results here, at your earliest convenience.

If you encounter any issues while testing Beta, please describe the 
issues you have encountered and set the bug into NEED_INFO. If you 
encounter new issues, please clone this bug to open a new issue and 
request it be reviewed for inclusion in RHEL 5.4 or a later update, if it 
is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA.

Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other 
appropriate customer representative.
Comment 46 Andrius Benokraitis 2009-08-04 12:50:03 EDT
Emulex - has this been confirmed as tested on your end?
Comment 48 Jamie Wellnitz 2009-08-12 11:11:59 EDT
I just retested and verified that completing SCSI commands with NO_SENSE sense data on 2.6.18-160.el5 does not result in data corruption.  The same driver and test produces data corruption with 2.6.18-128.el5.  This is tested with an instrumented lpfc driver as I don't have an array that actually returns NO_SENSE available.
Comment 49 errata-xmlrpc 2009-09-02 04:11:03 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
