Red Hat Bugzilla – Bug 397521
Disk goes offline and refuses to come back
Last modified: 2008-08-02 19:40:36 EDT
Description of problem:
Kernel reports exceptions about devices connected to the IDE controller (hard
drive, DVD-RW) being frozen soon after booting. Exploring directories on the
hard drive shows some, but not all, contents while the kernel reports "read
failed" errors. The DVD drive works for a short period of time, but eventually
fails to read and cannot be mounted. After a random period of time, the system
becomes unresponsive and eventually freezes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot system.
2. Mount and explore offending file system while watching the dmesg output.
3. Wait for system to freeze.
Actual results:
Errors reported about devices connected to IDE controller and eventually the
system freezes.
Expected results:
No errors and no freezing.
Additional info:
I first saw these results when I moved from FC6 to F7 back in late September,
but I was only using the DVD drive at that point and didn't know there was a
similar problem with the hard drive. I moved to F8 soon after and the problems
continued. Initially I thought it was a bad DVD drive and replaced the drive
and IDE cable. I've made sure that the master/slave settings are correct for
the devices and have tested each device independently on the controller to make
sure it wasn't one of the devices.
I haven't been able to capture any panic message, but the screen will freeze for
30 seconds or so, return for a few minutes, and then freeze completely.
The devices worked correctly on earlier versions of FC6 and still work under
Created attachment 267841
dmesg and lspci -vvv output
This appears to be the result of the tickless timer that was recently added to
the kernel. This is fixed with kernel options:
clocksource=acpi_pm nohz=off highres=off
I assume this is still an issue and that this is just a workaround, but I'm not
sure how to proceed with this report.
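For anyone trying the same workaround, it helps to confirm that the options actually reached the running kernel. Below is a minimal sketch, assuming a Linux system that exposes /proc/cmdline and the clocksource sysfs file. The helper names (parse_cmdline, missing_options) are made up for illustration, and "highres" is the spelling the kernel documents for the high-resolution timer switch. The parser splits on whitespace only, so quoted values containing spaces are not handled.

```python
from pathlib import Path

# Options taken from the workaround in this report.
WANTED = {"clocksource": "acpi_pm", "nohz": "off", "highres": "off"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    opts = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        opts[name] = value
    return opts


def missing_options(cmdline: str, wanted: dict = WANTED) -> dict:
    """Return the wanted options that are absent or set to another value."""
    opts = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if opts.get(k) != v}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        print("missing or different:", missing_options(proc.read_text()))
    # Standard sysfs location of the active clocksource on Linux.
    cs = Path("/sys/devices/system/clocksource/clocksource0/current_clocksource")
    if cs.exists():
        print("active clocksource:", cs.read_text().strip())
```

An empty "missing or different" result means all three options were passed at boot; a misspelled option name would show up here as missing.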
Created attachment 274401
Kernel messages for Nov 30 with failures.
I guess I jumped the gun on that diagnosis. I'm still experiencing the failure
of the IDE devices, but more sporadically. For a short period after I boot, the
devices work fine, but errors slowly start to appear about failed reads:
FAT: Directory bread(block 1404516) failed
With the changed clocksource, the system doesn't freeze (or hasn't yet).
If you just do "nohz=off", is that sufficient, or do you need both that and
acpi_pm selected?
I spent a few days testing different combinations of "nohz=off" and
"clocksource=acpi_pm" and the following is what I observed:
"nohz=off" alone: The system boots and I am able to use the IDE devices, but
after some period of time (usually about 30 minutes to an hour) device errors
appear in the log and the devices become unusable.
"clocksource=acpi_pm" alone: The system boots up and within a few minutes
device errors occur and the devices become unusable.
"nohz=off" and "clocksource=acpi_pm" together: This seems to be the most stable.
I've been able to run for many hours without device errors, but they do
eventually show up.
These are just my personal observations as the amount of time it takes before
the errors appear seems to vary significantly.
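Because the time before errors appear varies so much, one way to compare boot-option combinations less subjectively is to record when the first failure line shows up in a saved kernel log. The following is only a rough sketch: it matches the "FAT: Directory bread(block N) failed" message quoted earlier in this report, and it assumes the log may carry the kernel's "[seconds.microseconds]" timestamp prefix, which is only present when printk timestamps are enabled. The function name first_failure and the log file argument are hypothetical.

```python
import re
import sys

# Matches the failure message quoted in this report. The optional leading
# "[seconds]" timestamp is an assumption about how the log was captured.
FAIL_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"FAT: Directory bread\(block (?P<block>\d+)\) failed"
)


def first_failure(lines):
    """Return (seconds_since_boot_or_None, block) for the first failure line.

    Returns None when no line matches.
    """
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            ts = m.group("ts")
            return (float(ts) if ts else None, int(m.group("block")))
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python first_failure.py saved_dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(first_failure(fh))
```

Running this against a log saved from each boot configuration would give a seconds-since-boot figure per run, which is easier to compare than "a few minutes" versus "many hours".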
I noticed another issue during this time that makes me think it's more than just
an issue with the IDE devices: my external USB drive also fails with these
errors at the same time the DVD and IDE hard drive fail.
Any ideas, Ingo?
I recently decided to install the debug kernel to see if I could get any more
information about this problem and I ran into something interesting.
If I run with the debug kernel, I don't have any errors at all and the devices
function normally. However, if I boot to the standard kernel the problems occur.
Is there any significant difference between the standard kernel and the debug
kernel, other than optimization and debugging information?
The following bug reports are potentially related:
I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.
I am CC'ing myself to this bug and will try and assist you in resolving it if I can.
There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?
If the problem no longer exists then please close this bug or I'll do so in a
few weeks if there is no additional information lodged.
I recently downloaded the 2.6.24 vanilla kernel and this problem appears to be
fixed. I don't know what changed, but I've been running for over a week now
without any errors.
Okay, thanks for testing. There should be a 2.6.24 kernel in the Fedora
repositories sometime in the next week or so (check updates-testing), so if you
could test with that it would be greatly appreciated. Then if everything's okay
we can close out this bug.
Closing. Fixed in kernel 126.96.36.199-12.