Bug 470623
| Summary: | problems with Western Digital Caviar hard drive -- possibly kernel-related? |
|---|---|
| Product: | [Fedora] Fedora |
| Reporter: | D. Wagner <daw-redhatbugzilla> |
| Component: | kernel |
| Assignee: | Peter Martuccelli <peterm> |
| Status: | CLOSED WONTFIX |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium |
| Priority: | medium |
| Version: | 9 |
| CC: | jshowalter, kernel-maint, peterd, peterm, quintela |
| Hardware: | x86_64 |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2009-07-14 14:32:37 UTC |
Description
D. Wagner
2008-11-08 01:03:07 UTC
Here's my Smolt profile, in case it's helpful: http://www.smolts.org/client/show/?uuid=pub_4a9c20b6-c6a4-4952-a5c5-6cf822c5e3f7

Looks like a cable problem? You could try to force the link to 1.5Gbps and see if that helps.

"After power-cycling it, my BIOS diagnostics reported an error with the hard drive when I ran the Dell Diagnostics from the BIOS"

Looks to me like the drive went gaga and fell off the bus. Could be cabling, power supply noise, or heat -- or indeed just a dying drive, although you'd usually expect SMART data then. The rest of the trace looks sensible: the drive fell off the bus and refused to come back. That would explain why the diagnostics considered it bust and why it reappeared on a power cycle. Does the Dell diagnostics for the disk report a failure or success when run from a clean power-on?

I have some new information, which suggests to me that this is at least partially related to which version of the kernel I'm running. To rule out hard disk or cable failure, I bought a new hard drive (Samsung HD753LJ 750GB) and a new SATA II cable (rated for 3 Gbps). I tried various combinations of which hard drive (WD vs. Samsung), which cable, and which SATA port it was plugged into on my motherboard -- none of that made any difference. I disconnected the old hard drive, plugged the new hard drive and new cable into a different SATA port on my motherboard, and I can still replicate the system freezes.

Moreover, after some experimentation, I discovered that the system freezes occur on kernel 2.6.26.6-79 but not on 2.6.26.5-45. In particular, I discovered that I could reliably reproduce the system freeze with about an hour or so of heavy activity to the hard disk. I did the following in a directory on the main partition of the hard drive:

while true; do echo -n "."; dd if=/dev/zero of=zeros count=20M > /dev/null 2>/dev/null; sync; rm zeros; sync; done

and after about an hour with the more recent kernels, I get the SATA errors and a system freeze.
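A parameterized version of that stress loop can be written as below. This is a sketch, not from the report: the function name, the bounded pass count, and the small block count are illustrative (the original used `count=20M`, roughly 10 GiB per pass, inside an infinite loop).

```shell
#!/bin/sh
# stress_pass: write COUNT 512-byte blocks of zeros to FILE, flush to
# disk, then delete the file -- one pass of the reporter's stress loop.
# The function name and parameters are illustrative, not from the report.
stress_pass() {
    file=$1
    count=$2
    dd if=/dev/zero of="$file" count="$count" >/dev/null 2>&1
    sync
    rm -f "$file"
    sync
}

# A few bounded passes; the original report ran this forever with
# count=20M to trigger the SATA errors after about an hour.
for i in 1 2 3; do
    printf "."
    stress_pass ./zeros 64
done
echo
```

Bounding the loop makes it easier to interleave with log checks between passes; the original infinite form is recovered with `while true; do ...; done`.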
I tested with five versions of the Linux kernel that I happen to have installed on this system. If the system went > 4 hours (with the above script running) without a system freeze, then I labelled the system as "stable"; otherwise I labelled it as "unstable". The chart below shows the results:

* 2.6.25-14.fc9.x86_64: stable (without nvidia)
* 2.6.26.5-45.fc9.x86_64: stable (without nvidia)
* 2.6.26.6-79.fc9.x86_64: unstable, system freezes after ~ 1 hour (without nvidia)
* 2.6.27.5-37.fc9.x86_64: unstable, system freezes after ~ 1 hour (with nvidia)
* 2.6.27.5-41.fc9.x86_64: unstable, system freezes after ~ 1 hour (with nvidia)

Partway through the testing, I remembered that I'm using Nvidia's proprietary drivers, and it occurred to me that this could be contributing. So, I removed the proprietary drivers for all of the test runs after this occurred to me (namely, 2.6.25-14, 2.6.26.5-45, 2.6.26.6-79), just in case it had any relation. However, this didn't appear to make any difference.

All of the above testing is with a brand new SATA II cable and a new Samsung hard drive, so I don't think it's related to the cable. On the older kernels, I don't seem to experience any problems, despite many hours of continuous load on the hard drive, and I don't get any SATA errors in my system logs.
One exception: during one of my tests with 2.6.25-14, after about 2.5 hours of heavy load, I got this message in /var/log/messages:

Nov 25 15:17:59 senfl kernel: ata4.00: exception Emask 0x40 SAct 0x1 SErr 0x880800 action 0x6 frozen
Nov 25 15:17:59 senfl kernel: ata4: SError: { HostInt 10B8B LinkSeq }
Nov 25 15:17:59 senfl kernel: ata4.00: cmd 61/08:00:48:b6:db/00:00:22:00:00/40 tag 0 ncq 4096 out
Nov 25 15:17:59 senfl kernel: ata4.00: status: { DRDY }
Nov 25 15:17:59 senfl kernel: ata4: hard resetting link
Nov 25 15:18:06 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:09 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:09 senfl kernel: ata4: hard resetting link
Nov 25 15:18:16 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:19 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:19 senfl kernel: ata4: hard resetting link
Nov 25 15:18:26 senfl kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Nov 25 15:18:54 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 15:18:54 senfl kernel: ata4: limiting SATA link speed to 1.5 Gbps
Nov 25 15:18:54 senfl kernel: ata4: hard resetting link
Nov 25 15:18:56 senfl kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 25 15:18:56 senfl kernel: ata4.00: configured for UDMA/133
Nov 25 15:18:56 senfl kernel: ata4: EH complete

but I didn't experience any of the other symptoms. The system froze for a little while (it felt like 10 seconds or so?), then returned and continued working correctly for another 4-5 hours of heavy load with no other issues, so I figured that everything was fine. [Note that on this test run, and every other, the SATA link is automatically discovered at 3 Gbps.] In comparison, on the newer kernels, I consistently got problems.
In each case, after an hour or so of pounding the hard disk, the system would freeze (for 10-20 seconds or so?), then when it came back, all attempts to access the filesystem would return I/O errors. Attempts to run almost any command would return an error. Basically, in each of those cases the symptoms seemed consistent with the disk dropping off the SATA bus, as Alan says, allowing me to access only whatever data happened to already be cached (I'm guessing here). Most of the time I also got SATA errors in /var/log/messages. In a few cases the system froze before I could capture SATA errors in /var/log/messages. (Running "dmesg" didn't help either: with no filesystem, I couldn't even run basic commands like "dmesg" or "sync".)

Here's a sample of error messages that did get captured to /var/log/messages (to me it looks basically the same as the stuff I reported earlier with the WD disk):

Nov 26 05:54:45 senfl kernel: ata4.00: exception Emask 0x40 SAct 0x7 SErr 0x880800 action 0x6 frozen
Nov 26 05:54:45 senfl kernel: ata4: SError: { HostInt 10B8B LinkSeq }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:00:40:f6:14/00:00:09:00:00/40 tag 0 ncq 4096 out
Nov 26 05:54:45 senfl kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:08:60:54:15/00:00:09:00:00/40 tag 1 ncq 4096 out
Nov 26 05:54:45 senfl kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4.00: cmd 61/08:10:90:8a:23/00:00:09:00:00/40 tag 2 ncq 4096 out
Nov 26 05:54:45 senfl kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
Nov 26 05:54:45 senfl kernel: ata4.00: status: { DRDY }
Nov 26 05:54:45 senfl kernel: ata4: hard resetting link
Nov 26 05:54:45 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 26 05:54:45 senfl kernel: ata4.00: configured for UDMA/133
Nov 26 05:54:45 senfl kernel: ata4: EH complete
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] Write Protect is off
Nov 26 05:54:45 senfl kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

or

Nov 25 10:59:40 senfl kernel: ata4.00: exception Emask 0x50 SAct 0x7f1ff SErr 0x480900 action 0x6 frozen
Nov 25 10:59:40 senfl kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Nov 25 10:59:40 senfl kernel: ata4: SError: { UnrecovData HostInt 10B8B Handshk }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:00:48:bf:98/04:00:09:00:00/40 tag 0 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:08:48:cf:98/04:00:09:00:00/40 tag 1 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:10:48:d3:98/04:00:09:00:00/40 tag 2 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:18:50:d7:98/04:00:09:00:00/40 tag 3 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:20:50:db:98/04:00:09:00:00/40 tag 4 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4.00: cmd 61/00:28:50:df:98/04:00:09:00:00/40 tag 5 ncq 524288 out
Nov 25 10:59:40 senfl kernel: res 40/00:44:50:eb:98/00:00:09:00:00/40 Emask 0x50 (ATA bus error)
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
...
Nov 25 10:59:40 senfl kernel: ata4.00: status: { DRDY }
Nov 25 10:59:40 senfl kernel: ata4: hard resetting link
Nov 25 10:59:46 senfl kernel: ata4: link is slow to respond, please be patient (ready=0)
Nov 25 10:59:50 senfl kernel: ata4: softreset failed (device not ready)
Nov 25 10:59:50 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 25 10:59:55 senfl kernel: ata4.00: qc timeout (cmd 0xec)
Nov 25 10:59:55 senfl kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 25 10:59:55 senfl kernel: ata4.00: revalidation failed (errno=-5)
Nov 25 10:59:55 senfl kernel: ata4: hard resetting link
Nov 25 10:59:56 senfl kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 25 10:59:56 senfl kernel: ata4.00: configured for UDMA/133
Nov 25 10:59:56 senfl kernel: ata4: EH complete
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] Write Protect is off
Nov 25 10:59:56 senfl kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
...

Some speculations/random notes (these are my musings; they may be totally wrong or represent confusion on my part):

* For some reason, the older kernels don't seem to trigger SATA errors nearly as often (only in one case did I see SATA errors with the older kernels, whereas with newer kernels I found it easy to trigger SATA errors).
* In the one case where I did see a SATA error on the older kernels, the way the kernel responded seems to be different: the older kernel dropped the SATA link back to 1.5 Gbps, whereas the newer kernels didn't fall back to 1.5 Gbps (I don't know if this means anything or not).

Unfortunately I don't know how to force the link to 1.5 Gbps at boot time to see if that helps.
My BIOS doesn't seem to have an option for that. I'm guessing there are some jumper settings on my drive that I can set to force it to 1.5 Gbps, but I don't have any jumpers lying around, and the drive didn't come with jumpers (sigh).

I tried hard power-cycling the machine (shutting down, removing the power cable, then plugging it back in and booting back up) and then running Dell Diagnostics. Alan, you are right: Dell Diagnostics does give a clean bill of health at that point. I can't remember what I did the prior time I ran Dell Diagnostics, but I suspect I didn't do a hard power-cycle of the machine, so your explanation makes a lot of sense.

Let me know if there is any further information I can gather, or anything else I can try, to help narrow this down further. And thank you to both of you for your help. I know there's no way I'd get this level of support if I were using some other OS (Windows, cough cough) -- so it's all the more impressive to see such attention from developers when I'm using a free OS. I really appreciate your help!

Created attachment 324927 [details]
The output from lspci -vv, when booted into a 2.6.26.5-45.fc9.x86_64 kernel
(In case it is relevant, lspci -vv reports that my machine has an "Intel Corporation 82801 SATA RAID Controller".)
Created attachment 324928 [details]
The output from dmesg, when booted into a 2.6.26.5-45.fc9.x86_64 kernel
Any further suggestions? Should I report this on the Linux kernel bugzilla?

Some additional information: I've been doing a whole bunch of testing with different kernel versions. This sure seems kernel-related. With some kernels (primarily older kernels), everything works fine, but with some newer kernels, I get SATA errors and system freezes.

With kernel RPMs from Fedora, the system is stable with the 2.6.25-14.fc9.x86_64 and 2.6.26.5-45.fc9.x86_64 kernel RPMs. With the 2.6.26.6-79.fc9.x86_64, 2.6.27.5-37.fc9.x86_64, and 2.6.27.5-41.fc9.x86_64 kernel RPMs, the system is unstable: I get SATA errors and the system freezes with filesystem I/O errors after a few hours of heavy load.

I've downloaded a number of kernels from kernel.org and tried to configure them following a similar configuration to that used by a known-stable Fedora kernel RPM [1]. With 2.6.26.5, 2.6.26.6, and 2.6.27.7, the system is utterly stable. (In some cases I get SATA errors, but the disk always recovers, and I don't get I/O errors or system freezes.) I've tested 2.6.27.7 for over a week with the hard disk 100% loaded full-time throughout with no problems. With 2.6.28-rc7, the system is unstable: I tested three boots, and 2 of them froze after 6-13 hours (couldn't tell if there were any SATA errors because I wasn't present and the screen had blanked by the time I returned); one time, I got a whole mess of SATA errors, but the system brought the SATA link back at 1.5 Gbps, and after that the system seemed fine for days.

I've also tried the 2.6.28-rc7 kernel from kernel.org in two other configurations, with mixed results. First, I took the default config and tested 3 boots: with 2 of them, I had SATA errors and system freezes within 10 minutes; with 1 of them, I had no SATA errors for over a day. Second, I tried to configure it myself with as close as I could get to a bare-minimum config based on my knowledge of what hardware I've got on this machine.
I tested 1 boot, and I had no SATA errors and no problems for over a day; I'm still testing this. The inconsistent results here are puzzling and make me think I need to gather more data. I'll report back when I do.

Experimental methodology: In all cases, when I report a time period, the hard disk is fully loaded over that entire time period (with a dd if=/dev/zero of=somefile wrapped in an infinite while loop running in one window, and a find / -print > /dev/null wrapped in an infinite while loop running in another window). When I reboot, I generally try to shut down the machine, power it off, disconnect the power cord, and wait 10-15 seconds before plugging the power cord back in and booting (though I can't guarantee I've always remembered to do this in every single test cycle).

Footnote [1]: Here's how I configured the kernel from kernel.org based on Fedora RPMs. I took the /boot/config file from the Fedora 2.6.26.5-45.fc9.x86_64 kernel RPM, which the testing above suggests is stable. I copy this Fedora config file over the .config file in the kernel source for the kernel from kernel.org, then run make menuconfig, make mrproper, make all, etc.

Any ideas? Can you think of anything I ought to be trying to help troubleshoot this?

Further information: It looks like forcing the SATA link to 1.5Gbps _might_ make the problem go away. I was able to jumper my WD drive to force it to 1.5Gbps, and I haven't been able to trigger SATA errors with it ever since. This is not yet definitive (I'd prefer to test it for at least a week of continuous load before concluding that it has fixed the problem), but it's looking good so far.

However, I can't figure out how to force my Samsung drive down to 1.5Gbps. It does not seem to have any jumper setting to force use of 1.5Gbps. My Dell BIOS does not have a BIOS option for this. Do you know how to force my SATA link to 1.5Gbps?
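The workflow in footnote [1] can be sketched as follows. This is a sketch, not the reporter's exact commands: `make oldconfig` is substituted for the interactive `make menuconfig`, and note that `make mrproper` deletes .config, so it has to run *before* the config is copied in.

```shell
# Sketch of footnote [1]: seed a vanilla kernel.org source tree with the
# config from a known-stable Fedora kernel, then build. Version strings
# are the ones from this report; adjust for the tree under test.
cd linux-2.6.26.5
make mrproper                                   # clean tree (wipes .config!)
cp /boot/config-2.6.26.5-45.fc9.x86_64 .config  # seed from the Fedora RPM
make oldconfig                                  # or `make menuconfig` to review
make all
```

Running `mrproper` before copying the config is the safe ordering; the order listed in the footnote would discard the copied config.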
Reading the kernel docs suggested that the way to do this might be to create a file /etc/modprobe.d/local containing:

options libata force=1.5Gbps

However, my Samsung SATA drive still comes up at 3Gbps. I'm not sure whether I've got the syntax right, or whether this option is actually being applied, and I can't see any way to check what options were applied to a loaded module. I'm worried I might be in a catch-22 situation. Is this file even consulted before loading the libata module? It seems like there might be a cyclic dependency here: since I'm using SATA drives, presumably the kernel has to load the libata module before it can look up modprobe.d in the filesystem, so it may be too late to apply the force=1.5Gbps option by the time the filesystem is loaded.

Some other things I've tried or considered trying:

* I can't set libata.force=1.5Gbps as a kernel argument, because on Fedora kernels libata is built as a module rather than built into the kernel.
* I can't run "modprobe libata force=1.5Gbps" by hand, because I've got SATA drives, so the libata module is automatically loaded before I get shell access.
* I could build a kernel by hand and build libata into the kernel rather than as a module, but this is certainly less convenient than using a Fedora-packaged kernel. I could do this for debugging purposes only, I suppose.

Any suggestions?

OK, I managed to force the SATA link to 1.5Gbps by rebuilding the initrd with the libata force=1.5Gbps option. That seems to have fixed the problems and made my system reliable again. So I have a workaround that works for my system. However, for the benefit of other users, should some kind of blacklist be added for this SATA controller or somesuch, so that other users don't have to go through this? Should I report this on the Linux kernel bugzilla, too?

I encountered this problem with two SATA drives, a Western Digital (model WDC WD1600JS-COM) and a Seagate Barracuda (model ST3160811AS).
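One way to check whether the forced speed actually took effect is to look at libata's probe messages in the kernel log, since it prints the negotiated speed for each link. The helper below is a small sketch (the function name is mine, not part of any tool):

```shell
# link_speeds: summarize negotiated SATA link speeds from kernel-log text
# on stdin. A sketch for verifying the force=1.5Gbps workaround applied;
# the function name is illustrative, not an existing utility.
link_speeds() {
    grep -o 'SATA link up [0-9.]* Gbps' | sort | uniq -c
}

# Typical use: count how many links came up at each speed.
#   dmesg | link_speeds
```

If the workaround applied, the output should show links at 1.5 Gbps rather than 3.0 Gbps after the next boot.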
Both drives worked fine with OpenSUSE 10.2 and Kubuntu 8.04. But when installing Fedora 9, the following reports were repeated several times to the splash screen while the install kernel booted:

ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SRST failed (errno=-16)
ata1: SATA link up 3.0 Gbps (SStatus 113 SControl 300)
ata1: link online but device misclassified, retrying

The install proceeded, but could only see my 80GB PATA drive. Google led me to this thread, and setting the drive jumper to 1.5Gbps fixed the problem for both drives. My questions are:

1. Would all SATA drives have this problem with Fedora 9 on this system?
2. Which is the best solution: using the drive jumper or setting the option?
3. Where/how do I set the option to "force=1.5Gbps"?

(In reply to comment #11)

> 2. Which is the best solution: using the drive jumper or setting the option?

The jumper is always the best option because it won't revert on a reinstall. On the other hand, if the next update works at full speed, you'd still be running at the slower one.

> 3. Where/how do I set the option to "force=1.5Gbps"?

Create a file in /etc/modprobe.d and put this in it:

options libata force=1.5Gbps

Then rebuild the initrd with mkinitrd.

This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now, Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life.
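Those two steps, as a concrete sequence, might look like the sketch below. This assumes Fedora 9-era tooling and requires root; the file name `local` is just the convention used earlier in this bug (any file under /etc/modprobe.d works), and later Fedora releases build the initramfs with dracut rather than mkinitrd.

```shell
# Sketch of the suggested workaround (run as root on a Fedora 9-era
# system). Appends the libata option, then regenerates the initrd for
# the running kernel so the option is in place before libata loads.
echo 'options libata force=1.5Gbps' >> /etc/modprobe.d/local

KVER=$(uname -r)
mkinitrd -f "/boot/initrd-${KVER}.img" "$KVER"
```

After rebooting, the kernel log should show the affected links coming up at 1.5 Gbps instead of 3.0 Gbps.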
If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.