Bug 465838 - IDE (PATA+SATA) drive stops working after a few minutes
Summary: IDE (PATA+SATA) drive stops working after a few minutes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Peter Martuccelli
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-10-06 17:22 UTC by Johan Eenfeldt
Modified: 2009-12-18 06:31 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-12-18 06:31:07 UTC
Type: ---
Embargoed:


Attachments
Boot dmesg of 2.6.27-0.392.rc8.git7.fc10.x86_64 (bad) (38.84 KB, text/plain)
2008-10-06 17:23 UTC, Johan Eenfeldt
Boot dmesg of 2.6.27-0.354.rc7.git3.fc10.x86_64 (good) (45.65 KB, text/plain)
2008-10-06 17:24 UTC, Johan Eenfeldt
Boot dmesg of 2.6.27-0.382.rc8.git4.fc10.x86_64 (bad) (36.57 KB, text/plain)
2008-10-06 17:25 UTC, Johan Eenfeldt
Some dmesg ata errors (4.10 KB, text/plain)
2008-10-06 17:26 UTC, Johan Eenfeldt
dmesg of 2.6.28.4-51.fc10.x86_64.debug (bad-sata) (84.39 KB, application/octet-stream)
2008-10-29 03:39 UTC, Johan Eenfeldt

Description Johan Eenfeldt 2008-10-06 17:22:32 UTC
Description of problem:
After a few minutes (varies from <1 to ~10 minutes) the IDE (PATA) drive stops responding completely. dmesg shows timeouts and retries. Processes go into D state when doing anything that requires disk activity.
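
One way to confirm which processes are stuck is to list the tasks in uninterruptible sleep (state D). What follows is a minimal sketch, assuming a Linux /proc filesystem; the helper names parse_stat_line and d_state_tasks are hypothetical and not part of any tool mentioned in this report.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck  # not a Linux-style /proc, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

Once the disk has stopped responding, a script like this only runs if the interpreter is already cached in memory, so it is best started before the lockup.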

Version-Release number of selected component (if applicable):
2.6.27-0.392.rc8.git7.fc10.x86_64 bad
2.6.27-0.391.rc8.git7.fc10.x86_64 bad
2.6.27-0.382.rc8.git4.fc10.x86_64 bad
2.6.27-0.354.rc7.git3.fc10.x86_64 good

How reproducible:
100%

Steps to Reproduce:
I cannot find a pattern. It happens with or without the GUI, disk activity, or CPU pressure, after a seemingly random number of minutes (<1 to ~10 minutes).
  
Actual results:
A non-working system.

Expected results:
A working system.

Additional info:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 1e 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.01: qc timeout (cmd 0x27)
ata1.01: failed to read native max address (err_mask=0x4)
ata1.01: HPA support seems broken, skipping HPA handling
ata1.01: revalidation failed (errno=-5)
ata1: soft resetting link
ata1: nv_mode_filter: 0x1f01f&0x1f01f->0x1f01f, BIOS=0x1f000 (0xc5c60000) ACPI=0x1f01f (30:20:0x15)
ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc5c60000) ACPI=0x3f01f (30:20:0x15)
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 1e 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: nv_mode_filter: 0x1f01f&0x1f01f->0x1f01f, BIOS=0x1f000 (0xc5c60000) ACPI=0x1f01f (30:20:0x15)
ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc5c60000) ACPI=0x3f01f (30:20:0x15)
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete

Repeats with UDMA/44, UDMA/33, PIO4, PIO3, PIO0. Again and again. See file for a few more.
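
The step-down through transfer modes can be pulled out of a saved dmesg with a small filter. This is only a sketch: the function name mode_history is hypothetical, and the sample text is taken from the log excerpt above.

```python
import re

# Matches e.g. "ata1.00: configured for UDMA/66", with or without a
# leading "[  12.345678]" timestamp.
CONFIGURED = re.compile(r"(ata\d+\.\d+): configured for (\S+)")


def mode_history(dmesg_text):
    """Map each device (e.g. 'ata1.00') to the modes it was configured for, in order."""
    history = {}
    for line in dmesg_text.splitlines():
        m = CONFIGURED.search(line)
        if m:
            history.setdefault(m.group(1), []).append(m.group(2))
    return history


sample = """\
ata1: soft resetting link
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete
"""

if __name__ == "__main__":
    for device, modes in mode_history(sample).items():
        print(device, " -> ".join(modes))
```

Run against the full attachment, a device whose list keeps getting slower on every reset is the one the error handler is repeatedly downgrading.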

It is a bit hard to capture after a while, as almost everything goes into D state once it does something that requires disk access. Messages more or less identical to the above keep repeating.

Some slightly different stuff after a while, copied by hand to another computer:

SR0: cdrom (IOCTL) ERROR, COMMAND: GET EVENT STATUS NOTIFICATION 4A 01 00 00 10 00 00 00 08 00

...

sr 0:0:0:0: ioctl_internal_command return code = 8000002
   : Sense Key : Aborted Command [current] [descriptor]
   : Add. Sense: No additional sense information

...

sd 0:0:1:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        00 00 00 00
sd 0:0:1:0: [sda] Add. Sense: No additional sense information
end_request: I/O error, dev sda, sector 519537

Comment 1 Johan Eenfeldt 2008-10-06 17:23:25 UTC
Created attachment 319573 [details]
Boot dmesg of 2.6.27-0.392.rc8.git7.fc10.x86_64 (bad)

Comment 2 Johan Eenfeldt 2008-10-06 17:24:15 UTC
Created attachment 319574 [details]
Boot dmesg of 2.6.27-0.354.rc7.git3.fc10.x86_64 (good)

Comment 3 Johan Eenfeldt 2008-10-06 17:25:33 UTC
Created attachment 319575 [details]
Boot dmesg of 2.6.27-0.382.rc8.git4.fc10.x86_64 (bad)

Comment 4 Johan Eenfeldt 2008-10-06 17:26:54 UTC
Created attachment 319576 [details]
Some dmesg ata errors

Comment 5 Johan Eenfeldt 2008-10-06 21:10:12 UTC
Tried a few more kernels in between, and unfortunately (?) it seems the difference between a working and a non-working kernel is whether it includes debug code (with debug code = no bug).

This includes latest kernel-debug (2.6.27-0.392.rc8.git7.fc10.x86_64.debug) which seems to be working, where the non-debug version bugs out within minutes.

Comment 6 Alan Cox 2008-10-08 18:41:15 UTC
Looks like another stuck DRQ case - good news if so, as I'm currently testing kernel changes to do DRQ data draining
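
For readers following the DRQ discussion, the first byte of the "res" field in the libata exception dump is the ATA status register, and the second is the error register. Here is a minimal sketch that decodes it; the function name decode_res is hypothetical, and only the commonly reported status bits are listed.

```python
import re

# ATA status register bits that matter here.
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

RES = re.compile(r"res ([0-9a-fA-F]{2})/([0-9a-fA-F]{2}):")


def decode_res(line):
    """Return (status, error, [flag names]) for a libata 'res ...' line, or None."""
    m = RES.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    error = int(m.group(2), 16)
    flags = [name for bit, name in STATUS_BITS if status & bit]
    return status, error, flags


if __name__ == "__main__":
    line = "res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)"
    print(decode_res(line))
```

For the dump in the description the status byte is 0x40, which is why the kernel prints "status: { DRDY }". That is the register value captured when the command timed out, so on its own it neither confirms nor rules out a stuck DRQ earlier in the transfer.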

Comment 7 Johan Eenfeldt 2008-10-08 20:20:54 UTC
Ok. Please advise if you need any further information or if there is anything that needs testing.

Comment 8 Johan Eenfeldt 2008-10-18 10:00:39 UTC
2.6.27-3.fc10.x86_64 -- bad (3 min uptime)
2.6.27-3.fc10.x86_64.debug -- good
2.6.27.2-23.rc1.fc10.x86_64 -- bad (7 min uptime)
2.6.27.2-23.rc1.fc10.x86_64.debug -- good

Some older kernels
2.6.26.6-79.fc9.x86_64 -- bad (4 min uptime)
2.6.25.14-108.fc9.x86_64 -- bad (2 min uptime)
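
The results above can be tabulated to make the debug versus non-debug split explicit. This is a sketch only: summarize is a hypothetical helper, and the input is the list from this comment copied as text.

```python
import re

# "<kernel> -- good" or "<kernel> -- bad (N min uptime)"
RESULT = re.compile(r"^(\S+)\s+--\s+(good|bad)(?:\s+\((\d+) min uptime\))?")


def summarize(report):
    """Parse result lines into dicts with kernel, debug flag, verdict and uptime."""
    rows = []
    for line in report.splitlines():
        m = RESULT.match(line.strip())
        if not m:
            continue  # skip headings and blank lines
        kernel, verdict, minutes = m.groups()
        rows.append({
            "kernel": kernel,
            "debug": kernel.endswith(".debug"),
            "verdict": verdict,
            "minutes": int(minutes) if minutes else None,
        })
    return rows


report = """\
2.6.27-3.fc10.x86_64 -- bad (3 min uptime)
2.6.27-3.fc10.x86_64.debug -- good
2.6.27.2-23.rc1.fc10.x86_64 -- bad (7 min uptime)
2.6.27.2-23.rc1.fc10.x86_64.debug -- good

Some older kernels
2.6.26.6-79.fc9.x86_64 -- bad (4 min uptime)
2.6.25.14-108.fc9.x86_64 -- bad (2 min uptime)
"""

rows = summarize(report)

if __name__ == "__main__":
    for debug in (False, True):
        verdicts = sorted({r["verdict"] for r in rows if r["debug"] == debug})
        print("debug" if debug else "non-debug", verdicts)
```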

Comment 9 Alan Cox 2008-10-21 08:25:32 UTC
If it's predictably the case that only the debug kernels work after multiple tests (and I assume you've been running working debug kernels for a few days now?), that points outside the ATA layer. It could, I suppose, be timing, but it sounds almost like a compiler bug

Comment 10 Johan Eenfeldt 2008-10-27 12:30:54 UTC
(sorry for the delay)

I have run with debug kernels for a number of hours without problems (the machine otherwise runs Windows from a SATA drive).

I've tried with a minimal .config kernel 2.6.28.rc2 latest git -- same result, though it survived bonnie++ and took all of 21 minutes before locking up. Same config with the debug options from fedora enabled seems to be working (1h+) though I'll test it some more.

I'll look into trying a different compiler. The rc2 test was with gcc Red Hat 4.3.2-6 and Ubuntu 4.3.2-1ubuntu11 in a distcc setup.

Could it be hardware related? The HD in question is oldish -- rest of machine is new. Still, it seems 100% stable with those debug options turned on.

Is there anything I can do to find out what is happening here? I can patch the kernel easily enough, or look into using kgdb, but I really have no idea what to look for.

For the record:
2.6.28.4-47.rc3.fc10.x86_64 -- bad (3 min uptime)

Comment 11 Johan Eenfeldt 2008-10-27 15:55:01 UTC
non-debug 2.6.28.rc2 kernel compiled with gcc Red Hat 3.4.6-9 locked up after ~4 minutes uptime.

debug 2.6.28.rc2 was still ok after 4h+ of hard testing.

Comment 12 Johan Eenfeldt 2008-10-29 03:37:46 UTC
New lockup, with a complete break in pattern:
1. 2.6.27.4-51.fc10.x86_64.debug (all debug kernels so far had worked)
2. SATA (sata_nv) dmraid (raid-0 nvidia ntfs ro) instead of main PATA-IDE ext3 (which kept working)

Lockup was for ~15-20 minutes(?), then worked again for ~8 minutes, then locked up again.

I cannot seem to get main drive to lock up like that with this kernel.

I cannot tell if this is new to this kernel, I only recently set this up. (The dmraid did not activate out of the box). It locked up within minutes on this kernel after working for a few hours on a 28.rc2-git thing.

Comment 13 Johan Eenfeldt 2008-10-29 03:39:32 UTC
Created attachment 321742 [details]
dmesg of 2.6.28.4-51.fc10.x86_64.debug (bad-sata)

Comment 14 Bug Zapper 2008-11-26 03:36:40 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Joe Ogulin 2008-11-29 01:41:35 UTC
I have had a similar problem since Fedora 8, as have a few others.  Please see bug 440408 as well.  These appear to be the same thing.

Comment 16 Jeff White 2008-12-30 11:33:19 UTC
I have also experienced this problem since Fedora 8. I'm pleased it is getting some attention. The only way to get it working again is a reboot.

Since Fedora 10 and the D-Bus issue, I now get this message from the GUI:

Unable to mount location
Cannot invoke CheckForMedia on HAL: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Comment 17 scott mcmahan 2009-01-22 18:56:21 UTC
I am having the same problem. I have built a machine around an Asus P5N-EM motherboard. I am running a 32-bit kernel, not 64-bit.

Originally, I had an IDE boot drive with Fedora 10 installed on it, and a spare SATA drive for server disk space.

I had this install of F10 completely updated with the latest "yum update" but can't access that drive now to see what kernel version it had - it was whatever version was current this week (ca. Jan 20 - I worked on this problem off and on all week).

The machine would run for a few minutes, and then the disk light would come on and stay on. The machine still ran, but disk I/O quit working. Anything in memory (cached, I guess) would still work - I could open xterm, and read files like "messages" that were not yet written to disk. But new commands that were not in cache would not work (said "Input/output error." at the shell prompt), and I could not ctl-alt-F8 to the text console and log in as root.

Suspecting a hardware problem, I spent a lot of time running diagnostics like smartctl and booting from Hiren's boot disk and running Seagate and Maxtor utilities. Every single disk diagnostic comes back clean. The problem is not in the IDE drive itself, or if it is, it's a problem that diagnostics can't find.

Today, I unplugged the IDE drive, and put a base install of F10 on the SATA drive which has 2.6.27.5-117.fc10.i686. The machine no longer locks up while it is running. But it prints this message every 10-20 seconds:

  ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
  ata5.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
           cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
           res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
  ata5.01: status: { DRDY }
  ata5: soft resetting link
  ata5: nv_mode_filter: 0x1&0x1f01f->0x1, BIOS=0x1f000 (0xc50000) ACPI=0x1f01f (600:30:0x1c)
  ata5.01: configured for PIO0
  ata5: EH complete
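
The hex masks in the nv_mode_filter lines can be read as sets of transfer modes. What follows is a sketch under an assumption: the bit layout (7 PIO bits, then 5 MWDMA bits, then 8 UDMA bits) is taken from libata's transfer-mode mask definitions and is not stated anywhere in this report. The function name decode_xfer_mask is hypothetical.

```python
# Assumed libata xfer mask layout: PIO in bits 0-6, MWDMA in bits 7-11,
# UDMA in bits 12-19.
FIELDS = (
    ("PIO", 0, 7),
    ("MWDMA", 7, 5),
    ("UDMA", 12, 8),
)


def decode_xfer_mask(mask):
    """Return the list of mode names whose bits are set in mask."""
    modes = []
    for prefix, shift, bits in FIELDS:
        for n in range(bits):
            if mask & (1 << (shift + n)):
                modes.append("%s%d" % (prefix, n))
    return modes


if __name__ == "__main__":
    # "0x1&0x1f01f->0x1" reads like requested mask, limit, and their AND.
    requested, limit = 0x1, 0x1F01F
    print("requested:", decode_xfer_mask(requested))
    print("limit:    ", decode_xfer_mask(limit))
    print("result:   ", decode_xfer_mask(requested & limit))
```

Under that layout, 0x1f01f allows PIO0-4 and UDMA0-4 (UDMA4 is UDMA/66), 0x3f01f adds UDMA5 (UDMA/100), and 0x1 leaves only PIO0, which matches the "configured for" lines printed after each filter message in this thread.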

This machine was built back in November/December, and worked fine for a while - but I never had time to finish doing anything with it. This week, I booted and ran the "yum update" and it was the first time I noticed these problems. However, they could have been there all along, but it certainly didn't lock up as it was running.

Comment 18 scott mcmahan 2009-01-27 17:09:28 UTC
I have installed CentOS 5.2 on this machine, and have no ATA errors like this.

What bothers me is this is the third SATA bug I've encountered where a working system breaks for no apparent reason because of an upgrade. One was fixed, and the other two are open. I have a Blu-Ray burner that is a brick because of one of these errors. All three of these are situations where Fedora worked fine on the hardware, but an upgrade broke the existing system. The first one was a year or two ago, and was eventually fixed. But these two are open and inactive. I can't live with this any more, so this is the end of the line with Fedora for me. I have to run Fedora for some IBM software I develop with, but will try to see if I can get it to run on SuSE, or CentOS. I've used Red Hat since 5.2, but can't deal with these broken systems any longer.

I could understand if Fedora had bleeding-edge new stuff that wasn't working. I don't have any problems with that. But I have problems with existing, working code suddenly breaking to the point the systems aren't usable.

Comment 20 Martin Tack 2009-04-05 18:49:49 UTC
Hi all,

I'm having the same trouble as described in comments 14, 16, 17 and 18.

Regarding comment 12: I know for "sure" that the main disk never gets locked that way. In my experience, the disk which contains '/' never gets involved in this. I say for "sure" because I shuffle disks, file systems and partitioning schemes around a lot.

In my case, the "normal" disk setup is:

sda, sdb, sdc = built-in SATA HDs
ata5 = DVD-RW (Aopen)

'/' is on sdb; /tmp, /var/tmp, /var/spool, /var/cache/yum, /usr, /usr/lib64, /usr/share and some other subdirectories are each in their own partition, divided over the 3 disks. All of this is for flexibility, performance (by using parallelism), etc.

For now, since the update of 2009/02/23, disks sda and sdc are no longer locking up as in comment 17, but ata5 (the DVD drive) does.

Because I also use other disks and file systems from time to time, I'm "nearly sure" there is no hardware or file system issue.

Also noteworthy is that "something, somewhere" is polling via D-Bus all of the time, which seriously slows down other system functions. It seems D-Bus is occupied by this problem, but I can't find a clue.

OS = F10 x86_64, all on ext4 except /boot, latest update yesterday.

I also have a machine built around an Asus P5VDC-MX motherboard.

Is there a solution somehow?

Thanks a lot

martin

Comment 21 Stanislaw Gruszka 2009-04-17 09:14:39 UTC
The only ATA-related change between 2.6.27-0.392.rc8.git7.fc10 and 2.6.27-0.354.rc7.git3.fc10 is the sata_nv hardreset commit:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commitdiff;h=4c1eb90a0908c0c60db2169dce08fb672e7582f1

This commit is known to cause problems, which were reported in two places:

http://bugzilla.kernel.org/show_bug.cgi?id=12176
http://bugzilla.kernel.org/show_bug.cgi?id=11195

And fixed in two further commits:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commitdiff;h=2da462eba7e5b585d54c17d76c6a662e4fbb3c32

So the bug should be fixed in the newest Fedora kernel updates (2.6.27.21 based). Johan, could you confirm that?
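
A quick way to check whether a given `uname -r` string is at or past that base version is to compare the upstream part numerically. This is a sketch: the helper names are hypothetical, and the check only looks at the version number, so it cannot prove that a particular Fedora build actually carries the two fix commits.

```python
FIXED_BASE = (2, 6, 27, 21)  # "2.6.27.21 based", per the comment above


def upstream_version(release):
    """'2.6.27.25-170.2.72.fc10.x86_64' -> (2, 6, 27, 25)"""
    base = release.split("-", 1)[0]  # drop the Fedora release suffix
    return tuple(int(part) for part in base.split("."))


def has_fix_base(release):
    """True if the upstream version is at least FIXED_BASE."""
    return upstream_version(release) >= FIXED_BASE


if __name__ == "__main__":
    for release in (
        "2.6.27-0.392.rc8.git7.fc10.x86_64",
        "2.6.27.5-117.fc10.i686",
        "2.6.27.25-170.2.72.fc10.x86_64",
        "2.6.29.6-217.2.3.fc11.x86_64",
    ):
        print(release, has_fix_base(release))
```

The sample strings are kernel releases quoted elsewhere in this bug. Version strings that are not in `uname -r` form (for example "2.6.28.rc2") are outside what this sketch handles.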

BTW: Johan, your dmesg contains a lot of messages like this:

attempt to access beyond end of device
sdc: rw=0, want=625160072, limit=312581808
Buffer I/O error on device sdc1, logical block 78144752
attempt to access beyond end of device
sdc: rw=0, want=625160072, limit=312581808
Buffer I/O error on device sdc1, logical block 78144752
attempt to access beyond end of device

This is a serious problem which can cause data corruption. Something may be wrong in the software working on top of the ATA devices (filesystem, device mapper), or maybe in ATA itself, or it may be a hardware problem (memory corruption, chipsets, etc.).
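
The numbers in those messages can be put in perspective with a little arithmetic. In this sketch, want and limit are 512-byte sector counts as printed by the kernel. The 4 KiB block size and the partition start at sector 2048 are assumptions chosen because they reproduce the printed values; they are not taken from the dmesg.

```python
SECTOR = 512  # bytes; the block layer counts in 512-byte sectors

want = 625160072         # end sector of the rejected request
limit = 312581808        # size of sdc in sectors
logical_block = 78144752  # block number reported for sdc1

device_gb = limit * SECTOR / 1e9
ratio = want / limit

# Assumptions (hypothetical): 4 KiB buffers, i.e. 8 sectors per block, and
# sdc1 starting at sector 2048.
sectors_per_block = 8
partition_start = 2048
request_end = partition_start + (logical_block + 1) * sectors_per_block

if __name__ == "__main__":
    print("sdc size: %.1f GB" % device_gb)
    print("want / limit: %.4f" % ratio)
    print("reconstructed request end:", request_end)
```

The request ends at almost exactly twice the size of the disk. One possible reading, not confirmed in this thread, is that the partition table on sdc describes a two-disk striped set (comment 12 mentions an nvidia RAID-0 under dmraid) and is being read against the raw member disk.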

Comment 22 Jeff White 2009-07-12 19:05:03 UTC
When will this problem get a resolution?

Comment 23 Stanislaw Gruszka 2009-07-14 17:52:46 UTC
(In reply to comment #22)
> When will this problem get a resolution?  

As the base kernel version for Fedora 10 and 11 is now 2.6.29, I believe this problem is already solved. Jeff, can you reproduce this issue on Fedora 10 or 11?

Comment 24 Jeff White 2009-07-15 18:27:13 UTC
I checked my version; it is:
uname -r
2.6.27.25-170.2.72.fc10.x86_64

I will upgrade the kernel and report back.


Regards
Jeff

Comment 25 Jeff White 2009-08-15 18:10:03 UTC
I have upgraded to F11:
$ uname -r
2.6.29.6-217.2.3.fc11.x86_64
The CD-ROM and DVD drives appear to work for a longer period of time before locking up, but they still lock up. In the past I would get a D-Bus error. Now I get no error at all on eject or rescan. I cannot eject the device manually from outside. Tell me what logs or traces I can provide to help resolve this issue. I'm attempting to reboot to capture a screenshot of the working scenario.



regards
Jeff

Comment 26 Jeff White 2009-08-16 04:20:12 UTC
It appears that after a Firefox file download, when I choose "Open Containing Folder", any .iso files present are automatically mounted (about 15 ISOs in my case). Then mplayer immediately runs as well. This kills the CD-ROM/DVD devices.


Regards
Jeff

Comment 27 Bug Zapper 2009-11-18 07:56:09 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 28 Bug Zapper 2009-12-18 06:31:07 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

