Bug 493451

Summary:	Upgrade to update 3 causes SATA resets.
Product:	Red Hat Enterprise Linux 5	Reporter:	Jeremy Rosengren <jeremy>
Component:	kernel	Assignee:	David Milburn <dmilburn>
Status:	CLOSED ERRATA	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	medium	Docs Contact:
Priority:	low
Version:	5.3	CC:	dzickus, jeremy, konishi, orion, redhat
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-02 08:56:33 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jeremy Rosengren 2009-04-01 19:44:56 UTC

Description of problem:

After upgrading to RHEL 5 update 3 (CentOS 5.3), I started seeing SATA bus resets.

Kernel version:  kernel-2.6.18-128.el5
SATA card version:  03:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)
/var/log/messages:

Apr  1 00:11:02 raid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr  1 00:11:02 raid kernel: ata10.00: cmd ca/00:08:bf:04:96/00:00:00:00:00/e9 tag 0 dma 4096 out
Apr  1 00:11:02 raid kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr  1 00:11:02 raid kernel: ata10.00: status: { DRDY }
Apr  1 00:11:02 raid kernel: ata10: hard resetting link
Apr  1 00:11:02 raid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Apr  1 00:11:02 raid kernel: ata10.00: configured for UDMA/100
Apr  1 00:11:02 raid kernel: ata10: EH complete
Apr  1 00:11:02 raid kernel: SCSI device sdi: 160836480 512-byte hdwr sectors (82348 MB)
Apr  1 00:11:02 raid kernel: sdi: Write Protect is off
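
To see which ports are affected and how often, the reset messages in the log can be tallied per port. The following is a minimal sketch, not part of the original report: the function name count_sata_resets is hypothetical, and it assumes the syslog line format shown in the excerpt above.

```shell
#!/bin/sh
# count_sata_resets FILE
# Print "<port> <count>" for every ataN port that logged
# "hard resetting link" in FILE, sorted by port name.
# (Hypothetical helper; assumes the syslog format quoted above.)
count_sata_resets() {
    # Keep only the port name from each matching line, then count duplicates.
    sed -n 's/.*kernel: \(ata[0-9][0-9]*\): hard resetting link.*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be run as `count_sata_resets /var/log/messages`. Comparing the counts before and after a kernel change gives a rough picture of whether the resets have stopped.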


Version-Release number of selected component (if applicable):

kernel-2.6.18-128.el5

How reproducible:

The system is a file server. After observing the behavior, I dropped back to the previous kernel (2.6.18-92.1.22.el5) and no longer saw the problem.

Steps to Reproduce:
1.  Upgrade system to RHEL5 update 3
2.  Observe errors in /var/log/messages
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Milburn 2009-04-07 20:30:01 UTC

Hi,

I think this has been fixed upstream:

commit b0bccb18bc523d1d5060d25958f12438062829a9
Author: Mark Lord <liml>
Date:   Mon Jan 19 18:04:37 2009 -0500

    sata_mv: fix 8-port timeouts on 508x/6081 chips
    
Would you please test the kernel-2.6.18-138.el5.bz493451.1 test kernel?

http://people.redhat.com/dmilburn/

Comment 2 Jeremy Rosengren 2009-04-08 01:35:57 UTC

The test kernel appears to have fixed the issue.  After installing the kernel and rebooting, I ran a RAID resync on a 13-drive RAID6 volume connected to 2 of these cards and didn't see any timeouts.

Will this patch make it into an update kernel soon(ish)?

Thanks,

-- jeremy

Comment 4 David Milburn 2009-04-08 19:12:01 UTC

Jeremy,

The patch is under review; hopefully it will be committed soon. Thanks
for the quick feedback.

Comment 5 RHEL Program Management 2009-04-08 19:19:00 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2009-04-20 17:12:23 UTC

in kernel-2.6.18-140.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
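
To check whether an installed kernel is at or past the build named above, the build number in the kernel release string can be compared with 140. This is a rough sketch with hypothetical function names. It assumes stock RHEL 5 release strings of the form 2.6.18-N...el5 and does not recognize one-off test builds such as the bz493451.1 kernel.

```shell
#!/bin/sh
# el5_build_number RELEASE
# Print the build number that follows "2.6.18-" in a RHEL 5 kernel
# release string, e.g. 128 for "2.6.18-128.el5". Prints nothing if
# the string does not have that form. (Hypothetical helper.)
el5_build_number() {
    echo "$1" | sed -n 's/^2\.6\.18-\([0-9][0-9]*\)\..*el5.*$/\1/p'
}

# has_sata_mv_fix RELEASE
# Succeed if the build number is 140 or later, the first build this
# report names as containing the sata_mv patch.
has_sata_mv_fix() {
    build=$(el5_build_number "$1")
    [ -n "$build" ] && [ "$build" -ge 140 ]
}
```

For example, `has_sata_mv_fix "$(uname -r)" && echo "fix included"`. A build number below 140 only means the patch is absent; it does not by itself mean the machine will see resets.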

Comment 7 Jeremy Rosengren 2009-04-20 17:49:54 UTC

Hi Don,

I don't see a directory at http://people.redhat.com/dzickus/el5 for 140.el5.  Am I just too quick or do you still need to transfer the bits over?

Thanks,

-- jeremy

Comment 9 Don Zickus 2009-04-20 18:11:24 UTC

Doh. Sorry.  -140.el5 should be uploading right now.

Comment 11 errata-xmlrpc 2009-09-02 08:56:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 12 Orion Poplawski 2010-01-13 21:19:45 UTC

I'm seeing similar behavior with 5.4 and 2.6.18-164.10.1.el5 (bug 554872).  Anyone else?

Comment 13 Matt Olson 2010-06-23 18:20:49 UTC

Here's a crude workaround that helped to mask the problem:

In /etc/cron.hourly/disable-write-cache:

#!/bin/sh
# Turn off the on-drive write cache on each array member.
/sbin/hdparm -W 0 /dev/sda
/sbin/hdparm -W 0 /dev/sdb
/sbin/hdparm -W 0 /dev/sdc

This disables hardware write caching on the drives, which are in this case part of a software RAID5 array.  Resets will still happen, although much less frequently.  When a reset does occur, write cache will be re-enabled, hence the cron.hourly script.  

Note that this may carry a performance penalty, or may not be effective, for certain (write-heavy) workloads.  YMMV.
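
For arrays with more members, the same workaround can be written as a loop over a drive list. This is only a sketch of the idea above: the function name disable_write_cache and the HDPARM override variable are assumptions added for illustration, and the drive names must be adjusted to the actual array members.

```shell
#!/bin/sh
# Path to hdparm; can be overridden (e.g. HDPARM=echo) for a dry run.
# (The override variable is an illustrative assumption.)
HDPARM=${HDPARM:-/sbin/hdparm}

# disable_write_cache DEVICE...
# Run "hdparm -W 0" on every device given, continuing past failures.
# Returns non-zero if any invocation failed.
disable_write_cache() {
    status=0
    for dev in "$@"; do
        # -W 0 switches off the drive's write cache.
        "$HDPARM" -W 0 "$dev" || status=1
    done
    return "$status"
}
```

A cron.hourly script would end with a call such as `disable_write_cache /dev/sda /dev/sdb /dev/sdc`, and the file must be executable for cron to run it.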

I tried applying Mark Lord's patch to a 2.6.18 kernel without success.  Looking at the driver, I think it changed significantly between 2.6.18 and 2.6.28, the kernel the patch was written for.