Bug 247296

Summary: dmesg error messages for SATA drive.
Product: [Fedora] Fedora
Reporter: Roger Cowles <rojcowles>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: low
Version: 6
CC: jonstanley
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-02-08 04:27:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 427887    
Attachments:
Description                                                                 Flags
dmesg output                                                                none
A later dmesg log file, this time with timeout/retry message on the drive   none
lspci log file                                                              none
lspci -vv log file                                                          none
uname -a output                                                             none

Description Roger Cowles 2007-07-06 18:37:25 UTC
Description of problem:

On my home-built Linux server I see many disk errors relating to the attached
SATA drive, along the lines of:

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x2400000 action 0x0
ata2.00: (BMDMA2 stat 0x750001)
ata2.00: cmd 35/00:c8:59:aa:7d/00:01:02:00:00/e0 tag 0 cdb 0x0 data 233472 out
         res 51/04:47:da:ab:7d/00:00:02:00:00/e0 Emask 0x1 (device error)
ata2.00: configured for UDMA/100
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
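
For anyone reading the excerpt above, here is a rough Python sketch of how the
cmd/res lines and the SErr value might be decoded. It is not taken from the
kernel. It assumes the libata field order is
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
(with status and error in place of command and feature on the res line), which
is at least consistent with the "data 233472 out" byte count in the log. The
bit-name tables and the helper names (parse_taskfile, decode_bits) are my own
and only cover the common bits.

```python
import re

# ATA Status register bits (assumed subset, highest bit first).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}
# ATA Error register bits commonly reported by libata (assumed subset).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
# SATA SError diagnostic bits 16..26 (assumed layout, lowest bit first).
SERR_BITS = {1 << 16: "PHYRdyChg", 1 << 17: "PHYInt", 1 << 18: "CommWake",
             1 << 19: "10B8B", 1 << 20: "Dispar", 1 << 21: "BadCRC",
             1 << 22: "Handshk", 1 << 23: "LinkSeq", 1 << 24: "TrStaTrns",
             1 << 25: "UnrecFIS", 1 << 26: "DevExch"}

FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
TF_RE = re.compile(r"(?:cmd|res)\s+([0-9a-f]{2})/" + FIVE + "/" + FIVE +
                   r"/([0-9a-f]{2})", re.IGNORECASE)


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def parse_taskfile(line):
    """Split a libata 'cmd' or 'res' line into its register fields."""
    m = TF_RE.search(line)
    if m is None:
        raise ValueError("no cmd/res taskfile found in line")
    first = int(m.group(1), 16)  # command (cmd line) or status (res line)
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    _hob_feat, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":"))
    # 48-bit LBA: high-order bytes come from the "hob" group.
    lba = (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24) \
        | (lbah << 16) | (lbam << 8) | lbal
    return {"first": first,
            "second": second,  # feature (cmd line) or error (res line)
            "sectors": (hob_nsect << 8) | nsect,
            "lba": lba,
            "device": int(m.group(4), 16)}
```

Read this way, the cmd line is command 0x35 (WRITE DMA EXT) for 456 sectors,
and the res line has status 0x51 (DRDY, DSC, ERR) with error 0x04 (ABRT). Under
the assumed bit layout, SErr 0x2400000 has the handshake-error and
unrecognized-FIS bits set.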

The server was running FC5, and the problem was severe enough to cause major
timeouts; eventually the disk became corrupt. I replaced the SATA disk with a
brand-new one of the same make and performed a clean install of FC6, and I still
see the same errors in the log.


Version-Release number of selected component (if applicable):
Linux server002 2.6.20-1.2962.fc6 #1 SMP Tue Jun 19 18:24:12 EDT 2007 i686 i686
i386 GNU/Linux
HD : WD2500KS (250 GB SATA2 drive with 16 MB cache)
Controller : SYBA SD-SATA150 RT (Based on Silicon Image, Inc. SiI 3512
[SATALink/SATARaid] Serial ATA Controller (rev 01))
Motherboard : PC Chips M789 with VIA C3 CPU

All running at stock speeds/BIOS settings.

How reproducible:

Very. I see the messages on the current clean FC6 install with a brand-new SATA
drive, and saw worse on the previous FC5 install with a 10-month-old SATA drive
of the same make and type; that was the reason for the Fedora update and the new
drive.

Steps to Reproduce:
1. Start Linux server
2. Copy files to/from mounted partition on the SATA drive
3. Observe exception messages in /var/log/messages
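
To make step 3 less of an eyeball exercise, a small Python sketch along these
lines could tally the libata exception lines per port. The function name and
the regular expression are my own, and the pattern is only assumed to match the
"ataN.NN: exception Emask" wording seen in this report.

```python
import re
from collections import Counter

# Matches lines such as "ata2.00: exception Emask 0x0 SAct 0x0 ..."
EXCEPTION_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_ata_exceptions(lines):
    """Count libata exception lines per port/device label (e.g. 'ata2.00')."""
    counts = Counter()
    for line in lines:
        m = EXCEPTION_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Possible usage (reading the log normally requires root):
#     with open("/var/log/messages", errors="replace") as log:
#         print(count_ata_exceptions(log))
```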
  
Actual results:

ATA errors in the log. With the previous disk I got long timeouts when attempting
to access the disk, especially noticeable when using SMB clients.

Expected results:

Clean access to disk.

Additional info:

I only recall seeing this after upgrading to the 2.6.20 series kernel, on both
the original FC5 install and now the FC6 install. Having replaced both the hard
disk and the Linux version, I don't *think* it's a hardware problem, and Googling
the error turns up some chatter about SATA drives and the newer versions of
libata/the Linux kernel. Having said that, the timeouts I saw with the
FC5/original drive don't appear to happen (yet) with the FC6/new drive, so these
ATA messages may be harmless and I may really have had a bad drive.
Unfortunately WD's diagnostic tool failed to boot on the server box when I tried
to check it out, so I don't know whether the original drive is OK.

Comment 1 Roger Cowles 2007-07-06 18:37:25 UTC
Created attachment 158684 [details]
dmesg output

Comment 2 Roger Cowles 2007-07-07 18:06:19 UTC
Created attachment 158723 [details]
A later dmesg log file, this time with timeout/retry message on the drive

I started seeing drive reset messages that make the SATA drive unavailable for
periods of time. This was after I noticed that copying backup data from an
attached USB HD to the SATA drive had slowed from ~700 MB a minute down to
~70 MB a minute or thereabouts (a very unscientific measurement: I had a while
loop running df -k . on the SATA drive with a sleep 60 to give a minute's pause).
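
For reference, the df-based measurement described above could be approximated
with a Python sketch like the following. It is not the loop that was actually
run; the function names are made up for illustration, and it assumes that
growth in used space on the target filesystem is a fair proxy for copy
throughput.

```python
import shutil
import time


def rate_mb_per_min(used_before, used_after, seconds):
    """Convert growth in used bytes over `seconds` into MB per minute."""
    return (used_after - used_before) / (1024 * 1024) / (seconds / 60.0)


def watch_usage(path, interval=60, samples=5):
    """Print the approximate write rate on `path` once per interval."""
    previous = shutil.disk_usage(path).used
    for _ in range(samples):
        time.sleep(interval)
        current = shutil.disk_usage(path).used
        print("%.1f MB/min" % rate_mb_per_min(previous, current, interval))
        previous = current
```

Calling watch_usage(".") from a directory on the SATA drive would roughly
mirror the df -k . loop with a 60 second sleep.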

Comment 3 Roger Cowles 2007-07-08 17:47:29 UTC
Created attachment 158735 [details]
lspci log file

Comment 4 Roger Cowles 2007-07-08 17:47:51 UTC
Created attachment 158736 [details]
lspci -vv log file

Comment 5 Roger Cowles 2007-07-08 17:48:08 UTC
Created attachment 158737 [details]
uname -a output

Comment 6 Roger Cowles 2007-08-29 12:03:10 UTC
I bought a Promise Technology, Inc. PDC40775 (SATA 300 TX2plus) (rev 02)
to replace the Silicon Image, Inc. SiI 3512 Syba controller (working on the
principle that I had a bad controller card), but the SATA errors persisted.

Comment 7 Roger Cowles 2007-08-29 12:07:22 UTC
I downloaded a 2.6.16.9 kernel from kernel.org, built it using the FC6
config-2.6.18-1.2798.fc6 configuration, and installed it as my default kernel.
The disk errors have vanished and I can now reliably access either SATA drive,
locally on the server or from Samba shares.

Not sure how viable a workaround this is for most people though.

Comment 8 Roger Cowles 2007-08-29 12:09:19 UTC
I'd also tried kernels 2.6.18, 2.6.20 and 2.6.22 and all of these showed the
above disk errors.

Comment 9 Chuck Ebbert 2007-08-29 16:40:01 UTC
(In reply to comment #8)
> I'd also tried kernels 2.6.18, 2.6.20 and 2.6.22 and all of these showed the
> above disk errors.

Were the latest FC6 2.6.22 kernels tried, or just the vanilla ones?


Comment 10 Roger Cowles 2007-08-30 16:26:30 UTC
I tried an intermediate FC6 2.6.22 kernel (2.6.22.1-32.fc6) from updates-testing
during the rebase from 2.6.20, and then 2.6.22.2-42.fc6 from the standard updates
area. The 2.6.16.9 kernel is the only vanilla + FC6 config kernel I've tried so
far.

Comment 11 Roger Cowles 2007-08-30 16:32:23 UTC
Having said all that, I later got a ton of disk errors while copying some data
back from a backup onto one of the two WD SATA drives under the 2.6.16.9 kernel,
so it looks like I jumped to a conclusion too early :(

I'm planning on hooking these drives up to my Win XP box and using the
Windows-based WD diagnostic tools to check out both of my drives, as it may be
possible that both were either duff when I got them or went duff while they were
in my server. If it turns out that they are bad (what are the odds? Maybe not
that high...) I'll RMA them, as they're both under warranty, and start over with
new drives. Maybe I can then see whether it's a kernel driver issue between
2.6.16 and 2.6.{18,20,22} or whether it was bogus hardware all along.

Comment 12 Jon Stanley 2008-01-08 01:54:22 UTC
(This is a mass-update to all current FC6 kernel bugs in NEW state)

Hello,

I'm reviewing this bug list as part of the kernel bug triage project, an attempt
to isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug, however this version of Fedora is no longer
maintained.

Please attempt to reproduce this bug with a current version of Fedora (presently
Fedora 8). If the bug no longer exists, please close the bug or I'll do so in a
few days if there is no further information lodged.

Thanks for using Fedora!

Comment 13 Jon Stanley 2008-02-08 04:27:30 UTC
Per the previous comment in this bug, I am closing it as INSUFFICIENT_DATA,
since no information has been lodged for over 30 days.

Please re-open this bug or file a new one if you can provide the requested data,
and thanks for filing the original report!