Bug 493451 - Upgrade to update 3 causes SATA resets.
Summary: Upgrade to update 3 causes SATA resets.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: i386
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: David Milburn
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-04-01 19:44 UTC by Jeremy Rosengren
Modified: 2010-06-23 18:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:56:33 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2009:1243 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update | 2009-09-01 08:53:34 UTC

Description Jeremy Rosengren 2009-04-01 19:44:56 UTC
Description of problem:

After upgrading to RHEL 5 update 3 (CentOS 5.3), I started seeing SATA bus resets.

Kernel version:  kernel-2.6.18-128.el5
SATA card version:  03:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)
/var/log/messages:

Apr  1 00:11:02 raid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr  1 00:11:02 raid kernel: ata10.00: cmd ca/00:08:bf:04:96/00:00:00:00:00/e9 tag 0 dma 4096 out
Apr  1 00:11:02 raid kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr  1 00:11:02 raid kernel: ata10.00: status: { DRDY }
Apr  1 00:11:02 raid kernel: ata10: hard resetting link
Apr  1 00:11:02 raid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Apr  1 00:11:02 raid kernel: ata10.00: configured for UDMA/100
Apr  1 00:11:02 raid kernel: ata10: EH complete
Apr  1 00:11:02 raid kernel: SCSI device sdi: 160836480 512-byte hdwr sectors (82348 MB)
Apr  1 00:11:02 raid kernel: sdi: Write Protect is off
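
To gauge how often the resets shown above occur, the log can be tallied per ATA port. The following is only a rough sketch: the function name count_link_resets is hypothetical, and it assumes the kernel messages use the exact "hard resetting link" wording seen in this excerpt.

```shell
# Hypothetical helper: count "hard resetting link" messages per ATA port.
# $1 is the log file to scan; it defaults to /var/log/messages.
count_link_resets() {
    grep -o 'ata[0-9][0-9]*: hard resetting link' "${1:-/var/log/messages}" |
        sed 's/: hard resetting link//' |   # keep only the port name, e.g. ata10
        sort |
        uniq -c                             # one line per port: count, then port
}
```

Running count_link_resets with no argument reads /var/log/messages; passing a rotated log file as the first argument scans that file instead.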


Version-Release number of selected component (if applicable):

kernel-2.6.18-128.el5

How reproducible:

System is a file server - after observing behavior I dropped back to previous kernel (2.6.18-92.1.22.el5) and no longer saw the problem.

Steps to Reproduce:
1.  Upgrade system to RHEL5 update 3
2.  Observe errors in /var/log/messages
Actual results:


Expected results:


Additional info:

Comment 1 David Milburn 2009-04-07 20:30:01 UTC
Hi,

I think this has been fixed upstream

commit b0bccb18bc523d1d5060d25958f12438062829a9
Author: Mark Lord <liml>
Date:   Mon Jan 19 18:04:37 2009 -0500

    sata_mv: fix 8-port timeouts on 508x/6081 chips
    
Would you please test the kernel-2.6.18-138.el5.bz493451.1 test kernel?

http://people.redhat.com/dmilburn/

Comment 2 Jeremy Rosengren 2009-04-08 01:35:57 UTC
The test kernel appears to have fixed the issue.  After installing the kernel and rebooting, I ran a RAID resync on a 13-drive RAID6 volume connected to 2 of these cards and didn't see any timeouts.

Will this patch make it into an update kernel soon(ish)?

Thanks,

-- jeremy

Comment 4 David Milburn 2009-04-08 19:12:01 UTC
Jeremy,

The patch is under review; hopefully it will be committed soon. Thanks
for the quick feedback.

Comment 5 RHEL Program Management 2009-04-08 19:19:00 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2009-04-20 17:12:23 UTC
in kernel-2.6.18-140.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
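
For anyone checking a system against the build named above, here is a minimal sketch that compares a kernel release string with the -140.el5 build number. The function names kernel_build and has_sata_mv_fix are hypothetical, and the check assumes release strings of the form 2.6.18-NNN...el5, as seen in this report. Note that Comment 12 reports similar symptoms on 2.6.18-164.10.1.el5, so a passing check is not a guarantee.

```shell
# Hypothetical helpers, assuming release strings such as "2.6.18-128.el5".

# Print the integer build number that follows "2.6.18-" (empty if no match).
kernel_build() {
    echo "$1" | sed -n 's/^2\.6\.18-\([0-9][0-9]*\).*/\1/p'
}

# Succeed if the given release is build 140 or later, the build in which
# this bug's fix was reported to land.
has_sata_mv_fix() {
    build=$(kernel_build "$1")
    [ -n "$build" ] && [ "$build" -ge 140 ]
}
```

A typical call would be has_sata_mv_fix "$(uname -r)" && echo "kernel is at or past -140.el5".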

Comment 7 Jeremy Rosengren 2009-04-20 17:49:54 UTC
Hi Don,

I don't see a directory at http://people.redhat.com/dzickus/el5 for 140.el5.  Am I just too quick or do you still need to transfer the bits over?

Thanks,

-- jeremy

Comment 9 Don Zickus 2009-04-20 18:11:24 UTC
Doh. Sorry.  -140.el5 should be uploading right now.

Comment 11 errata-xmlrpc 2009-09-02 08:56:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 12 Orion Poplawski 2010-01-13 21:19:45 UTC
I'm seeing something similar with 5.4 and 2.6.18-164.10.1.el5 (bug 554872).  Anyone else?

Comment 13 Matt Olson 2010-06-23 18:20:49 UTC
Here's a crude workaround that helped to mask the problem:

In /etc/cron.hourly/disable-write-cache:

/sbin/hdparm -W 0 /dev/sda
/sbin/hdparm -W 0 /dev/sdb
/sbin/hdparm -W 0 /dev/sdc

This disables hardware write caching on the drives, which in this case are part of a software RAID5 array.  Resets still happen, although much less frequently.  When a reset does occur, the write cache is re-enabled, hence the cron.hourly script.

Note that this may carry a performance penalty and may not be effective for certain (write-heavy) workloads.  YMMV.
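
The three hdparm lines above could also be written as a loop that reports any drive on which the command fails. This is only a sketch of one possible variant, not the script actually used: the function name disable_write_cache and the HDPARM override variable are hypothetical, the latter added so the logic can be tried without touching real disks.

```shell
#!/bin/sh
# Sketch of a variant of /etc/cron.hourly/disable-write-cache.
# HDPARM is a hypothetical override; it defaults to /sbin/hdparm.

disable_write_cache() {
    # Arguments: the block devices whose write cache should be turned off.
    rc=0
    for disk in "$@"; do
        # "hdparm -W 0" disables the drive's write cache, as in the original.
        # Normal output is discarded so that only failures produce output.
        if ! "${HDPARM:-/sbin/hdparm}" -W 0 "$disk" >/dev/null; then
            echo "disable-write-cache: hdparm failed on $disk" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

The cron script would then end with a call such as disable_write_cache /dev/sda /dev/sdb /dev/sdc, matching the drive list above.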

I tried applying Mark Lord's patch to a 2.6.18 kernel without success.  Looking at the driver, I think it has changed significantly, since the patch was written for 2.6.28.

