Description of problem:
Soon after booting, the kernel reports exceptions about devices connected to the IDE controller (hard drive, DVD-RW) being frozen. Exploring directories on the hard drive shows some, but not all, of the contents while the kernel reports read failures. The DVD drive works for a short period of time, but eventually fails to read and cannot be mounted. After a random period of time, the system becomes unresponsive and eventually freezes.

Version-Release number of selected component (if applicable):
Kernel 2.6.23.1-49.fc8

How reproducible:
Fully reproducible.

Steps to Reproduce:
1. Boot system.
2. Mount and explore the offending file system while watching the dmesg output.
3. Wait for system to freeze.

Actual results:
Errors are reported about devices connected to the IDE controller and eventually the system freezes.

Expected results:
No errors and no freezing.

Additional info:
I first saw these results when I moved from FC6 to F7 back in late September, but I was only using the DVD drive at that point and didn't know there was a similar problem with the hard drive. I moved to F8 soon after and the problems continued. Initially I thought it was a bad DVD drive and replaced the drive and IDE cable. I've made sure that the master/slave settings are correct for the devices and have tested each device independently on the controller to rule out a single faulty device. I haven't been able to capture any panic message, but the screen will freeze for 30 seconds or so, return for a few minutes, and then freeze completely. The devices worked correctly on earlier versions of FC6 and still work under Windows.
Created attachment 267841: dmesg and lspci -vvv output
This appears to be the result of the tickless timer that was recently added to the kernel. The problem goes away with the kernel options: clocksource=acpi_pm nohz=off highres=off. I assume this is still an issue and that this is just a workaround, but I'm not sure how to proceed with this report.
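For anyone trying the same workaround, it can help to confirm that the options actually reached the running kernel. The following is only a minimal sketch: the helper names are hypothetical, and the third option is written as highres, the spelling used in the kernel's parameter documentation.

```python
# Sketch only: helper names are made up for illustration.
# Expected workaround options, taken from the comment above.
WORKAROUND = {"clocksource": "acpi_pm", "nohz": "off", "highres": "off"}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def missing_workaround_options(cmdline, expected=WORKAROUND):
    """Return the expected options that are absent or set to another value."""
    options = parse_cmdline(cmdline)
    return {name: value for name, value in expected.items()
            if options.get(name) != value}
```

On a running system, pass in the contents of /proc/cmdline; an empty result means all three options were picked up at boot.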
Created attachment 274401: Kernel messages for Nov 30 with failures.
I guess I jumped the gun on that diagnosis. I'm still experiencing the failure of the IDE devices, but more sporadically. For a short period after I boot, the devices work fine, but errors slowly start to appear about failed reads: FAT: Directory bread(block 1404516) failed With the changed clocksource, the system doesn't freeze (or hasn't yet).
If you just use "nohz=off", is that sufficient, or do you need both that and acpi_pm selected?
I spent a few days testing different combinations of "nohz=off" and "clocksource=acpi_pm" and the following is what I observed:

"nohz=off" alone: The system boots and I am able to use the IDE devices, but after some period of time (usually about 30 minutes to an hour) device errors appear in the log and the devices become unusable.

"clocksource=acpi_pm" alone: The system boots and within a few minutes device errors occur and the devices become unusable.

"nohz=off" and "clocksource=acpi_pm" together: This seems to be the most stable. I've been able to run for many hours without device errors, but they do eventually show up.

These are just my personal observations, as the amount of time it takes before the errors appear seems to vary significantly. I also noticed something during this time that makes me think it's more than just an issue with the IDE devices: my external USB drive fails with these errors at the same time the DVD and IDE hard drive fail.
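Since the time until the first error varies so much between boots, it may be easier to compare option combinations by pulling that time out of the kernel log instead of estimating it. This is a rough sketch, assuming kernel log timestamps are enabled. The function name is hypothetical and the match strings are guesses: "bread(" comes from the FAT message quoted earlier, and "frozen" from the exceptions described in the original report.

```python
import re

# Matches timestamped kernel log lines of the form "[  123.456789] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_error_time(lines, patterns=("bread(", "frozen")):
    """Return (seconds_since_boot, message) for the first line whose message
    contains any of the patterns, or None if no line matches."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and any(p in match.group(2) for p in patterns):
            return float(match.group(1)), match.group(2)
    return None
```

Feeding it the lines of a saved dmesg capture from each boot gives one number per configuration, which should make the three cases above easier to compare.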
Any ideas, Ingo?
I recently decided to install the debug kernel to see if I could get any more information about this problem and I ran into something interesting. If I run with the debug kernel, I don't have any errors at all and the devices function normally. However, if I boot to the standard kernel the problems occur regularly. Is there any significant difference between the standard kernel and the debug kernel, other than optimization and debugging information?
The following bug reports are potentially related: https://bugzilla.redhat.com/show_bug.cgi?id=411001 https://bugzilla.redhat.com/show_bug.cgi?id=397191 https://bugzilla.redhat.com/show_bug.cgi?id=250349
Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the Fedora kernel. http://fedoraproject.org/wiki/KernelBugTriage I am CC'ing myself to this bug and will try and assist you in resolving it if I can. There hasn't been much activity on this bug for a while. Could you tell me if you are still having problems with the latest kernel? If the problem no longer exists then please close this bug or I'll do so in a few weeks if there is no additional information lodged.
I recently downloaded the 2.6.24 vanilla kernel and this problem appears to be fixed. I don't know what changed, but I've been running for over a week now without any errors.
Hi Daniel, Okay, thanks for testing. There should be a 2.6.24 kernel in the Fedora repositories sometime in the next week or so (check updates-testing), so if you could test with that it would be greatly appreciated. Then if everything's okay we can close out this bug.
Closing. Fixed in kernel 2.6.24.3-12.