Bug 493451 - Upgrade to update 3 causes SATA resets.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: i386 Linux
Priority: low  Severity: medium
Assigned To: David Milburn
QA Contact: Red Hat Kernel QE team
Reported: 2009-04-01 15:44 EDT by Jeremy Rosengren
Modified: 2010-06-23 14:20 EDT

Doc Type: Bug Fix
Last Closed: 2009-09-02 04:56:33 EDT


Description Jeremy Rosengren 2009-04-01 15:44:56 EDT
Description of problem:

After upgrading to RHEL 5 update 3 (CentOS 5.3), I started seeing SATA bus resets.

Kernel version:  kernel-2.6.18-128.el5
SATA card version:  03:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)
/var/log/messages:

Apr  1 00:11:02 raid kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr  1 00:11:02 raid kernel: ata10.00: cmd ca/00:08:bf:04:96/00:00:00:00:00/e9 tag 0 dma 4096 out
Apr  1 00:11:02 raid kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr  1 00:11:02 raid kernel: ata10.00: status: { DRDY }
Apr  1 00:11:02 raid kernel: ata10: hard resetting link
Apr  1 00:11:02 raid kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Apr  1 00:11:02 raid kernel: ata10.00: configured for UDMA/100
Apr  1 00:11:02 raid kernel: ata10: EH complete
Apr  1 00:11:02 raid kernel: SCSI device sdi: 160836480 512-byte hdwr sectors (82348 MB)
Apr  1 00:11:02 raid kernel: sdi: Write Protect is off
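
On a busy file server it can help to count these events per port rather than read the log by eye. The following is a minimal sketch and not part of the original report: it assumes syslog lines shaped like the excerpt above, and the function name `summarize_ata_events` is made up for illustration.

```shell
#!/bin/sh
# Sketch: count libata exceptions and hard link resets per ATA port.
# Reads syslog-format lines on stdin. Only the two message shapes seen
# in the excerpt above are recognized; other kernels may word them differently.
summarize_ata_events() {
    awk '
        # Extract the port name (for example "ata10") that follows "kernel: ".
        function port(line) {
            match(line, /kernel: ata[0-9]+/)
            return substr(line, RSTART + 8, RLENGTH - 8)
        }
        /kernel: ata[0-9]+(\.[0-9]+)?: exception Emask/ {
            p = port($0); exc[p]++; seen[p] = 1
        }
        /kernel: ata[0-9]+: hard resetting link/ {
            p = port($0); rst[p]++; seen[p] = 1
        }
        END {
            for (p in seen)
                printf "%s exceptions=%d resets=%d\n", p, exc[p], rst[p]
        }
    ' | sort
}
```

Running `summarize_ata_events < /var/log/messages` prints one line per port; for the excerpt above that line would be `ata10 exceptions=1 resets=1`.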


Version-Release number of selected component (if applicable):

kernel-2.6.18-128.el5

How reproducible:

The system is a file server. After observing this behavior, I dropped back to the previous kernel (2.6.18-92.1.22.el5) and no longer saw the problem.

Steps to Reproduce:
1.  Upgrade system to RHEL5 update 3
2.  Observe errors in /var/log/messages
Actual results:


Expected results:


Additional info:
Comment 1 David Milburn 2009-04-07 16:30:01 EDT
Hi,

I think this has been fixed upstream:

commit b0bccb18bc523d1d5060d25958f12438062829a9
Author: Mark Lord <liml@rtr.ca>
Date:   Mon Jan 19 18:04:37 2009 -0500

    sata_mv: fix 8-port timeouts on 508x/6081 chips
    
Would you please test the kernel-2.6.18-138.el5.bz493451.1 test kernel?

http://people.redhat.com/dmilburn/
Comment 2 Jeremy Rosengren 2009-04-07 21:35:57 EDT
The test kernel appears to have fixed the issue.  After installing the kernel and rebooting, I ran a RAID resync on a 13-drive RAID6 volume connected to 2 of these cards and didn't see any timeouts.

Will this patch make it into an update kernel soon(ish)?

Thanks,

-- jeremy
Comment 4 David Milburn 2009-04-08 15:12:01 EDT
Jeremy,

The patch is under review; hopefully it will be committed soon. Thanks
for the quick feedback.
Comment 5 RHEL Product and Program Management 2009-04-08 15:19:00 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 Don Zickus 2009-04-20 13:12:23 EDT
The fix is in kernel-2.6.18-140.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
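
To check whether a given machine is already running a build at or after -140.el5, the kernel release string can be compared against that build number. This is a rough sketch under stated assumptions: the function name `has_sata_mv_fix` is hypothetical, it only understands RHEL 5 release strings of the form `2.6.18-<build>...`, and it knows nothing about one-off test builds.

```shell
#!/bin/sh
# Sketch: report whether a RHEL 5 kernel release is at or after build 140,
# the build named in this comment as containing the fix.
# Returns 0 if build >= 140, 1 if older, 2 if the string is not recognized.
# Caveat: a test kernel such as 2.6.18-138.el5.bz493451.1 carries the patch
# but would still be reported as older, because only the build number is compared.
has_sata_mv_fix() {
    release=$1                      # for example "2.6.18-128.el5"
    case $release in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    build=${release#2.6.18-}        # "128.el5"
    build=${build%%.*}              # "128"
    case $build in
        ''|*[!0-9]*) return 2 ;;    # not a plain number
    esac
    [ "$build" -ge 140 ]
}
```

A typical call would be `has_sata_mv_fix "$(uname -r)" && echo "build is 140 or later"`. A result of "older" does not by itself mean the machine is affected: the reporter saw no resets on 2.6.18-92.1.22.el5.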
Comment 7 Jeremy Rosengren 2009-04-20 13:49:54 EDT
Hi Don,

I don't see a directory at http://people.redhat.com/dzickus/el5 for 140.el5.  Am I just too quick or do you still need to transfer the bits over?

Thanks,

-- jeremy
Comment 9 Don Zickus 2009-04-20 14:11:24 EDT
Doh. Sorry.  -140.el5 should be uploading right now.
Comment 11 errata-xmlrpc 2009-09-02 04:56:33 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
Comment 12 Orion Poplawski 2010-01-13 16:19:45 EST
I'm seeing something similar with 5.4 and 2.6.18-164.10.1.el5 (bug 554872).  Anyone else?
Comment 13 Matt Olson 2010-06-23 14:20:49 EDT
Here's a crude workaround that helped to mask the problem:

In /etc/cron.hourly/disable-write-cache:

/sbin/hdparm -W 0 /dev/sda
/sbin/hdparm -W 0 /dev/sdb
/sbin/hdparm -W 0 /dev/sdc

This disables hardware write caching on the drives, which are in this case part of a software RAID5 array.  Resets will still happen, although much less frequently.  When a reset does occur, write cache will be re-enabled, hence the cron.hourly script.  

Note that this may carry a performance penalty, and it may not be effective for certain (write-heavy) workloads.  YMMV.
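
The same workaround can be written as a loop so the device list lives in one place. This is only a sketch of the script above, not a tested replacement: the function name `disable_write_cache` and the `HDPARM` override variable are made up for illustration, and the only hdparm invocation used is the `-W 0` form already shown.

```shell
#!/bin/sh
# Sketch: turn off the drive write cache on each device given as an argument.
# HDPARM may be overridden (hypothetical knob, handy for a dry run with "echo").
HDPARM=${HDPARM:-/sbin/hdparm}

disable_write_cache() {
    rc=0
    for dev in "$@"; do
        # Keep going if one drive fails, but remember the failure.
        "$HDPARM" -W 0 "$dev" || rc=1
    done
    return $rc
}

# Example call for the three drives in the script above:
#   disable_write_cache /dev/sda /dev/sdb /dev/sdc
```

Setting `HDPARM=echo` before the call prints the arguments that would be passed instead of touching the drives, which is a cheap way to check the device list first.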

I tried applying Mark Lord's patch to a 2.6.18 kernel without success.  Looking at the driver, I think it has changed significantly, since the patch was designed for 2.6.28.
