Bug 470623

Summary: problems with Western Digital Caviar hard drive -- possibly kernel-related?
Product: [Fedora] Fedora Reporter: D. Wagner <daw-redhatbugzilla>
Component: kernelAssignee: Peter Martuccelli <peterm>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 9CC: jshowalter, kernel-maint, peterd, peterm, quintela
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-07-14 14:32:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
The output from lspci -vv, when booted into a 2.6.26.5-45.fc9.x86_64 kernel
none
The output from dmesg, when booted into a 2.6.26.5-45.fc9.x86_64 kernel none

Description D. Wagner 2008-11-08 01:03:07 UTC
Description of problem:

Shortly after installing kernel-2.6.26.6-79.fc9.x86_64 and rebooting, I noticed errors related to my hard drive in my system logs.  I'm not sure whether this is a software issue or a hardware issue.  The machine and hard drive are only about 6 months old.

Do you have any suggestions for how to troubleshoot this?  Is there any reason to believe this might be a software issue?

I'll share as much information as I have in case it is a software issue or an issue with the kernel:

The first time it happened (afternoon/evening of Nov 6), I started to see hard drive errors (e.g., I couldn't access most of the filesystem, except for a few files which I suspect were cached in RAM), and my system froze up not long after.  I was forced to power-cycle the machine.  After power-cycling it, my BIOS diagnostics reported an error with the hard drive when I ran the Dell Diagnostics from the BIOS.  Eventually I managed to reboot again into the 2.6.26.6-79 kernel, and everything seemed OK for a few hours, but I got the messages again; that time I didn't wait for the system to freeze/crash, but I forced a reboot.

I didn't notice anything like that with the previous kernel, kernel-2.6.25-14.fc9.x86_64.

Here are the messages from /var/log/messages:

Nov  3 23:26:16 senfl ntpd[2016]: kernel time sync status change 4001
Nov  3 23:34:48 senfl ntpd[2016]: kernel time sync status change 0001
Nov  3 23:35:47 senfl kernel: ata1.00: exception Emask 0x42 SAct 0xffff SErr 0x1880c00 action 0x6 frozen
Nov  3 23:35:47 senfl kernel: ata1: SError: { Proto HostInt 10B8B LinkSeq TrStaTrns }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/10:00:23:35:fb/00:00:04:00:00/40 tag 0 ncq 8192 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/50:08:63:d6:fb/00:00:04:00:00/40 tag 1 ncq 40960 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/18:10:bb:d6:fb/01:00:04:00:00/40 tag 2 ncq 143360 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:18:ab:19:fc/00:00:04:00:00/40 tag 3 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/40:20:73:28:fc/01:00:04:00:00/40 tag 4 ncq 163840 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:28:0b:27:02/00:00:05:00:00/40 tag 5 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:30:b3:3d:02/00:00:05:00:00/40 tag 6 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:38:73:da:02/00:00:05:00:00/40 tag 7 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/28:40:7b:da:02/00:00:05:00:00/40 tag 8 ncq 20480 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/c8:48:93:db:02/02:00:05:00:00/40 tag 9 ncq 364544 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:50:ab:58:89/00:00:19:00:00/40 tag 10 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:58:93:fe:8b/00:00:19:00:00/40 tag 11 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/10:60:1b:51:8c/00:00:19:00:00/40 tag 12 ncq 8192 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:68:eb:34:04/00:00:05:00:00/40 tag 13 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:70:3b:d8:82/00:00:19:00:00/40 tag 14 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1.00: cmd 61/08:78:b3:58:89/00:00:19:00:00/40 tag 15 ncq 4096 out
Nov  3 23:35:47 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  3 23:35:47 senfl kernel: ata1.00: status: { DRDY }
Nov  3 23:35:47 senfl kernel: ata1: hard resetting link
Nov  3 23:35:52 senfl kernel: ata1: link is slow to respond, please be patient (ready=0)
Nov  3 23:35:57 senfl kernel: ata1: softreset failed (device not ready)
Nov  3 23:35:57 senfl kernel: ata1: hard resetting link
Nov  3 23:36:04 senfl kernel: ata1: link is slow to respond, please be patient (ready=0)
Nov  3 23:36:07 senfl kernel: ata1: softreset failed (device not ready)
Nov  3 23:36:07 senfl kernel: ata1: hard resetting link
Nov  3 23:36:13 senfl kernel: ata1: link is slow to respond, please be patient (ready=0)
Nov  3 23:36:42 senfl kernel: ata1: softreset failed (device not ready)
Nov  3 23:36:42 senfl kernel: ata1: limiting SATA link speed to 1.5 Gbps
Nov  3 23:36:42 senfl kernel: ata1: hard resetting link
Nov  3 23:36:43 senfl kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov  3 23:36:43 senfl kernel: ata1.00: configured for UDMA/133
Nov  3 23:36:43 senfl kernel: ata1: EH complete
Nov  3 23:36:43 senfl kernel: sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Nov  3 23:36:43 senfl kernel: sd 0:0:0:0: [sda] Write Protect is off
Nov  3 23:36:43 senfl kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

[...]

Nov  7 15:33:58 senfl kernel: npviewer.bin[9017]: segfault at 13 ip 146ddb7 sp f
fc1bd34 error 4 in libflashplayer.so[dc9000+951000]
Nov  7 15:44:57 senfl ntpd[2017]: kernel time sync status change 4001
Nov  7 16:02:02 senfl ntpd[2017]: kernel time sync status change 0001
Nov  7 16:06:42 senfl kernel: ata1.00: exception Emask 0x42 SAct 0x7 SErr 0x1880c00 action 0x6 frozen
Nov  7 16:06:42 senfl kernel: ata1: SError: { Proto HostInt 10B8B LinkSeq TrStaTrns }
Nov  7 16:06:42 senfl kernel: ata1.00: cmd 60/00:00:1b:8e:3c/01:00:02:00:00/40 tag 0 ncq 131072 in
Nov  7 16:06:42 senfl kernel:         res 40/00:04:23:02:e7/00:00:38:00:00/40 Emask 0x46 (timeout)
Nov  7 16:06:42 senfl kernel: ata1.00: status: { DRDY }
Nov  7 16:06:42 senfl kernel: ata1.00: cmd 60/00:08:1b:8f:3c/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov  7 16:06:42 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  7 16:06:42 senfl kernel: ata1.00: status: { DRDY }
Nov  7 16:06:42 senfl kernel: ata1.00: cmd 60/10:10:4b:dc:6a/00:00:04:00:00/40 tag 2 ncq 8192 in
Nov  7 16:06:42 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x46 (timeout)
Nov  7 16:06:42 senfl kernel: ata1.00: status: { DRDY }
Nov  7 16:06:42 senfl kernel: ata1: hard resetting link
Nov  7 16:06:43 senfl kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov  7 16:06:43 senfl kernel: ata1.00: configured for UDMA/133
Nov  7 16:06:43 senfl kernel: ata1: EH complete
Nov  7 16:06:43 senfl kernel: sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Nov  7 16:06:43 senfl kernel: sd 0:0:0:0: [sda] Write Protect is off
Nov  7 16:06:43 senfl kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Unfortunately I don't have the error messages from the afternoon/evening of Nov 6th (shortly after installing and rebooting into kernel-2.6.26.6-79.fc9.x86_64).  I suspect they never got committed to disk due to the hard drive errors.

I only noticed the Nov 3 errors today.  From my logs it looks like I was running 2.6.26.5-45.fc9.x86_64 when I got the Nov 3 errors, and I've been running 2.6.26.5-45.fc9.x86_64 since Oct 14, so I'm not sure whether this is related to a specific kernel version.


How reproducible:

I don't know how reproducible this is.  I've seen it twice within 24 hours, which makes me think it is at least somewhat reproducible.  I've now booted into kernel-debug-2.6.26.6-79.fc9.x86_64 and will see if I get the same error messages.


Additional info:

I believe I have a 500GB Western Digital WDC WD5000 hard drive: at least, I'm guessing based on the messages from dmesg, such as the following:

ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: WDC WD5000AAKS-75YGA0, 12.01C02, max UDMA/133
ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133

scsi 0:0:0:0: Direct-Access     ATA      WDC WD5000AAKS-7 12.0 PQ: 0 ANSI: 5

SMART doesn't report any errors:

# smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1632         -
# 2  Short offline       Completed without error       00%         0         -

# smartctl -l error /dev/sda   
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

Comment 1 D. Wagner 2008-11-08 01:04:02 UTC
Here's my Smolt profile, in case it's helpful:

http://www.smolts.org/client/show/?uuid=pub_4a9c20b6-c6a4-4952-a5c5-6cf822c5e3f7

Comment 2 Chuck Ebbert 2008-11-09 22:46:41 UTC
Looks like a cable problem? You could try to force the link to 1.5Gbps and see if that helps.

Comment 3 Alan Cox 2008-11-10 13:57:48 UTC
"After power-cycling it, my BIOS diagnostics reported an error with the hard drive when I ran the Dell
Diagnostics from the BIOS"

Looks to me like the drive went gaga and fell off the bus. Could be cabling or power supply noise or heat, or indeed just a dying drive although you'd usually expect smart data then. The rest of the trace then looks sensible. The drive fell off the bus and refused to come back. That would explain why the diagnostics considered it was bust and it reappeared on a power cycle.

Does the dell diagnostics for the disk report a failure or success when run from a clean power on ?

Comment 4 D. Wagner 2008-11-27 02:43:22 UTC
I have some new information, which suggests to me that is at least partially related to which version of the kernel I'm running.

To rule out hard disk or cable failure, I bought a new hard drive (Samsung HD753LJ 750GB) and a new SATA II cable (rated for 3 Gbps).  I tried various combinations of which hard drive (WD vs Samsung), which cable, and which SATA port it was plugged into on my motherboard -- none of that made any difference.  I disconnected the old hard drive, plugged the new hard drive and new cable into a different SATA port on my motherboard, and I can continue to replicate the system freezes.

Moreover, after some experimentation, I discovered that the system freezes occur on kernel 2.6.26.6-79 but not on 2.6.26.5-45.

In particular, I discovered that I could reliably reproduce the system freeze with about an hour or so of heavy activity to the hard disk.  I did the following in a directory on the main partition of the hard drive:
  while true; do echo -n "."; dd if=/dev/zero of=zeros count=20M > /dev/null 2>/dev/null; sync; rm zeros; sync; done
and after about an hour with the more recent kernels, I get the SATA errors and a system freeze.

I tested with five versions of the Linux kernel that I happen to have installed on this system.  If the system went > 4 hours (with the above script running) without a system freeze, then I labelled the system as "stable"; otherwise I labelled it as "unstable".
  
The chart below shows the results:

2.6.25-14.fc9.x86_64: stable (without nvidia)
2.6.26.5-45.fc9.x86_64: stable (without nvidia)
2.6.26.6-79.fc9.x86_64: unstable, system freezes after ~ 1 hour (without nvidia)
2.6.27.5-37.fc9.x86_64: unstable, system freezes after ~ 1 hour (with nvidia)
2.6.27.5-41.fc9.x86_64: unstable, system freezes after ~ 1 hour (with nvidia)

Partway through the testing, it remembered that I'm using Nvidia's proprietary drivers and it occurred to me that this could be contributing.  So, I removed the proprietary drivers for all of the test runs after this occurred to me (namely, 2.6.25-14, 2.6.25.5-45, 2.6.26.6-79), just in case it had any relation.  However, this didn't appear make any difference.

All of the above testing is with a brand new SATA II cable and a new Samsung hard drive, so I don't think it's related to the cable.

On the older kernels, I don't seem to experience any problems, despite many hours of continuous load on the hard drive, and I don't get any SATA errors in my system logs.  One exception: during one of my tests with 2.6.25-14, after about 2.5 hours of heavy load, I got this message in /var/log/messages:

Nov 25 15:17:59 senfl kernel: ata4.00: exception Emask 0x40 SAct 0x1 SErr 0x880800 action 0x6 frozen
Nov 25 15:17:59 senfl kernel: ata4: SError: { HostInt 10B8B LinkSeq }
Nov 25 15:17:59 senfl kernel: ata4.00: cmd 61/08:00:48:b6:db/00:00:22:00:00/40 tag 0 ncq 4096 out
Nov 25 15:17:59 senfl kernel: ata4.00: status: { DRDY }
Nov 25 15:17:59 senfl kernel: ata4: hard resetting link
Nov 25 15:18:06 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:09 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:09 senfl kernel: ata4: hard resetting link
Nov 25 15:18:16 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:19 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:19 senfl kernel: ata4: hard resetting link
Nov 25 15:18:26 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:54 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:54 senfl kernel: ata4: limiting SATA link speed to 1.5 Gbps
Nov 25 15:18:54 senfl kernel: ata4: hard resetting link
Nov 25 15:18:56 senfl kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 25 15:18:56 senfl kernel: ata4.00: configured for UDMA/133
Nov 25 15:18:56 senfl kernel: ata4: EH complete

but I didn't experience any of the other symptoms.  The system froze for a little while (it felt like 10 seconds or so?), then returned and continued working correctly for another 4-5 hours of heavy load with no other issues, so I figured that everything was fine.  [Note that on this test run, and every other, the SATA link is automatically discovered at 3 Gbps.]

In comparison, on the newer kernels, I consistently got problems.  In each case, after an hour or so of pounding the hard disk, the system would freeze (for 10-20 seconds or so?), then when it came back, all attempts to access the filesystem would return I/O errors.  Attempts to run almost any command would return an error.  Basically, in each of those cases the symptoms seemed consistent with the disk dropping off the SATA bus, as Alan says, allowing me to access only whatever data happened to already be cached (I'm guessing here).  Most of the time I also got SATA errors in /var/log/messages.  In a few cases the system froze before I could capture SATA errors /var/log/messages.  (Running "dmesg" didn't help either: with no filesystem, I couldn't even run basic commands like "dmesg" or "sync".)  Here's a sample of error messages that did get captured to /var/log/messages (to me it looks basically the same as the stuff I reported earlier with the WD disk):

Nov 26 05:54:45 senfl kernel: ata4.00: exception Emask 0x40 SAct 0x7 SErr 0x880800 action 0x6 frozen
Nov 26 05:54:45 senfl kernel: ata4: SError: { HostInt 10B8B LinkSeq }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:00:40:f6:14/00:00:09:00:00/40 tag 0 ncq 4096 out
Nov 26 05:54:45 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:08:60:54:15/00:00:09:00:00/40 tag 1 ncq 4096 out
Nov 26 05:54:45 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:10:90:8a:23/00:00:09:00:00/40 tag 2 ncq 4096 out
Nov 26 05:54:45 senfl kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4: hard resetting link
Nov 26 05:54:45 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 26 05:54:45 senfl kernel: ata4.00: configured for UDMA/133
Nov 26 05:54:45 senfl kernel: ata4: EH complete
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] Write Protect is off
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

or

Nov 25 10:59:40 senfl kernel: ata4.00: exception Emask 0x50 SAct 0x7f1ff SErr 0x
480900 action 0x6 frozen
Nov 25 10:59:40 senfl kernel: ata4.00: irq_stat 0x08000000, interface fatal erro
r
Nov 25 10:59:40 senfl kernel: ata4: SError: { UnrecovData HostInt 10B8B Handshk 
}
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:00:48:bf:98/04:00:09:00:00/40 t
ag 0 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Em
ask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:08:48:cf:98/04:00:09:00:00/40 t
ag 1 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Em
ask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:10:48:d3:98/04:00:09:00:00/40 t
ag 2 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:18:50:d7:98/04:00:09:00:00/40 tag 3 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:20:50:db:98/04:00:09:00:00/40 tag 4 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:28:50:df:98/04:00:09:00:00/40 tag 5 ncq 524288 out
Nov 25 10:59:40 senfl kernel:         res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
...
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4: hard resetting link
Nov 25 10:59:46 senfl kernel: ata4: link is slow to respond, please be patient (
ready=0)
Nov 25 10:59:50 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 10:59:50 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 
300)
Nov 25 10:59:55 senfl kernel: ata4.00: qc timeout (cmd 0xec)
Nov 25 10:59:55 senfl kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0
x4)
Nov 25 10:59:55 senfl kernel: ata4.00: revalidation failed (errno=-5)
Nov 25 10:59:55 senfl kernel: ata4: hard resetting link
Nov 25 10:59:56 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 
300)
Nov 25 10:59:56 senfl kernel: ata4.00: configured for UDMA/133
Nov 25 10:59:56 senfl kernel: ata4: EH complete
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sec
tors (750156 MB)
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] Write Protect is off
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache
: enabled, doesn't support DPO or FUA
...

Some speculations/random notes (these are my musings; they may be totally wrong or represent confusion on my part):

* For some reason, the older kernels don't seem to trigger SATA errors nearly as often (only in one case did I see SATA errors with the older kernels, whereas with newer kernels I found it easy to trigger SATA errors).

* In the one case where I did see a SATA error on the older kernels, the way the kernel responded seems to be different: the older kernel dropped the SATA link back to 1.5 Gbps, where as the newer kernels didn't fall back to 1.5 Gbps (I don't know if this means anything or not).

Unfortunately I don't know how to force the link to 1.5 Gbps at boot time to see if that helps.  My BIOS doesn't seem to have an option for that.  I'm guessing there are some jumper settings on my drive that I can set to force it to 1.5 Gbps, but I don't have any jumpers laying around and the drive didn't come with jumpers (sigh).

I tried hard power-cycling the machine (shutting down, removing the power cable, then plugging it back in and booting back up) and then running Dell Diagnostics.  Alan, you are right: Dell Diagnostics does give a clean bill of health at that point.  I can't remember what I did the prior time I ran Dell Diagnostics but I suspect I didn't do a hard power-cycle of the machine, so your explanation makes a lot of sense.

Let me know if there is any further information I can gather, or anything else I can try, to help narrow this down further.

And thank you to both of you for your help.  I know there's no way I'd get this level of support if I was using some other OS (Windows, cough cough) -- so it's all the more impressive to see such attention from developers when I'm using a free OS.  I really appreciate your help!

Comment 5 D. Wagner 2008-11-28 00:00:07 UTC
Created attachment 324927 [details]
The output from lspci -vv, when booted into a 2.6.26.5-45.fc9.x86_64 kernel

(In case it is relevant, lspci -vv reports that my machine has an "Intel Corporation 82801 SATA RAID Controller".)

Comment 6 D. Wagner 2008-11-28 00:00:46 UTC
Created attachment 324928 [details]
The output from dmesg, when booted into a 2.6.26.5-45.fc9.x86_64 kernel

Comment 7 D. Wagner 2009-01-03 05:10:19 UTC
Any further suggestions?   Should I report this on the Linux kernel bugzilla?

Some additional information:

I've been doing a whole bunch of testing with different kernel versions.  This sure seems kernel-related.  With some kernels (primarily older kernels), everything works fine, but with some newer kernels, I get SATA errors and system freezes.

With kernel RPMs from Fedora, the system is stable with 2.6.25-14.fc9.x86_64 and 2.6.26.5-45.fc9.x86_64 kernel RPMs.  With 2.6.26.6-79.fc9.x86_64, 2.6.27.5-37.fc9.x86_64, and 2.6.27.5-41.fc9.x86_64 kernel RPMs, the system is unstable: I get SATA errors and the system freezes with filesystem I/O errors after a few hours of heavy load.

I've downloaded a number of kernels from kernel.org and tried to configure them following a similar configuration to that used by a known-stable Fedora kernel RPM [1].  With 2.6.26.5, 2.6.26.6, 2.6.27.7 the system is utterly stable.  (In some cases I get SATA errors but the disk always recovers, and I don't get I/O errors or system freezes.)  I've tested 2.6.27.7 for over a week with the hard disk 100% loaded full-time throughout with no problems.  With 2.6.28-rc7, the system is unstable: I tested three boots, and 2 of them froze after 6-13 hours (couldn't tell if there were any SATA errors because I wasn't present and the screen had blanked by the time I returned); one time, I got a whole mess of SATA errors but the system brought the SATA link back at 1.5 Gbps and after that the system seemed fine for days.

I've also tried the 2.6.28-rc7 kernel from kernel.org in two other configurations, with mixed results.  First, I took the default config and tested 3 boots: with 2 of them, I had SATA errors and system freezes within 10 minutes; with 1 of them, I had no SATA errors for over a day.  Second, I tried to configure it myself with as close as I could get to a bare-minimum config based on my knowledge of what hardware I've got on this machine.  I tested 1 boot, and I had no SATA errors and no problems for over a day; I'm still testing this.  The inconsistent results here are puzzling and make me think I need to gather more data here.  I'll report back when I gather more data.

Experimental methodology: In all cases, when I report a time period, the hard disk is fully loaded over that entire time period (with a dd if=/dev/zero of=somefile wrapped in an infinite-while loop running in one window and a find / -print > /dev/null wrapped in an infinite-while loop running in another window).  When I reboot, I generally try shut down the machine, power it off, disconnect the power cord, and wait 10-15 seconds before plugging the power cord back in and booting (though I can't guarantee I've always remembered to do this in every single test cycle).

Footnote [1]: Here's how I configured the kernel from kernel.org based on Fedora RPMs.  I took the /boot/config file from the Fedora 2.6.26.5-45.fc9.x86_64 kernel RPM, which the testing above suggests is stable.  I copy this Fedora config file over the .config file in the kernel source for the kernel from kernel.org, run make menuconfig, make mrproper, make all, etc.

Any ideas?  Can you think of anything I ought to be trying to help troubleshoot this?

Comment 8 D. Wagner 2009-01-17 22:46:49 UTC
Further information: It looks like forcing the SATA link to 1.5Gbps _might_ make the problem go away.  I was able to jumper my WD drive to force it to 1.5Gbps and I haven't been able to trigger SATA errors with it ever since.  This is not yet definitive (I'd prefer to test it for at least a week of continuous load before concluding that it has fixed the problem) but it's looking good so far.

However, I can't figure out how to force my Samsung drive down to 1.5Gbps.  It does not seem to have any jumper setting to force use of 1.5Gbps.  My Dell BIOS does not have a BIOS option for this.  Do you know how to force my SATA link to 1.5Gbps?

Reading the kernel docs suggested that the way to do this might be to create a file /etc/modprobe.d/local containing:
  options libata force=1.5Gbps
However my Samsung SATA drive still gets loaded at 3Gbps.  I'm not sure whether I've got the syntax right, or whether this option is actually being applied, and I can't see any way to check what options were applied to a loaded module.

I'm worried I might be in a catch-22 situation.  Is this file even consulted before loading the libata module?  It seems like there might be a cyclic dependency here: since I'm using SATA drives, presumably the kernel has to load the libata module before it can look up modprobe.d in the filesystem, hence it may be too late to apply the force=1.5Gbps option by the time the filesystem is loaded.

Some other things I've tried or considered trying:
 * I can't set libata.force=1.5Gbps as a kernel argument because on Fedora kernels is built as a module rather than built into the kernel.
 * I can't run "modprobe libata force=1.5Gbps" by hand because I've got SATA drives, so the libata module is automatically loaded before I get shell access.
 * I could build a kernel by hand and then build libata into the kernel rather than as a module, but this is certainly less convenient than using a Fedora-packaged kernel.  I could do this for debugging purposes only, I suppose.

Any suggestions?

Comment 9 D. Wagner 2009-01-23 20:55:58 UTC
OK, I managed to force the SATA link to 1.5Gbps by rebuilding initrd with the libata force=1.5Gbps option.  That seems to have fixed the problems and made my system reliable again.

So I have a workaround that works for my system.  However, for the benefit of other users, should some kind of blacklist be added, for this SATA controller or somesuch, so that other users don't have to go through this?  Should I report this on the Linux kernel bugzilla, too?

Comment 11 jshowalter 2009-02-03 20:05:27 UTC
I encountered this problem with two SATA drives, a Western Digital (model WDC WD1600JS-COM) and a Seagate Barracuda (model ST3160811AS). Both drives worked fine with OpenSUSE 10.2 and Kubuntu 8.04. But when installing Fedora 9, the following reports were repeated several times to the splash-screen while the install kernel booted:

ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SRST failed (errno=-16)
ata1: SATA link up 3.0 Gbps (SStatus 113 SControl 300)
ata1: link online but device misclassified, retrying

The install proceeded, but could only see my 80GB PATA drive.

Google led me to this thread and setting the drive jumper to 1.5Gbps fixed the problem for both drives.

My questions are:

1. Would all SATA drives have this problem with Fedora 9 on this system?

2. Which is the best solution: using the drive jumper or setting the option?

3. Where/how do I set the option to "force=1.5Gbps"?

Comment 12 Chuck Ebbert 2009-02-03 20:54:50 UTC
(In reply to comment #11)
> 
> 2. Which is the best solution: using the drive jumper or setting the option?
>

The jumper is always the best option because it won't revert on a reinstall.
On the other hand, if the next update works at full speed you'd still be running at the slower one.

> 3. Where/how do I set the option to "force=1.5Gbps"?

Create a file in /etc/modprobe.d and put this in it:

options libata force=1.5Gbps

Then rebuild the initrd with mkinitrd.

Comment 13 Bug Zapper 2009-06-10 03:12:38 UTC
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 14 Bug Zapper 2009-07-14 14:32:37 UTC
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.