397521 – Disk goes offline and refuses to come back

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-11-24 00:42 UTC by Daniel C Weeks
Modified:	2008-08-02 23:40 UTC
CC List:	2 users
Fixed In Version:	2.6.24.3-12
Clone Of:
Environment:
Last Closed:	2008-03-10 05:29:48 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg and lspci -vvv output (51.05 KB, text/plain) 2007-11-24 00:42 UTC, Daniel C Weeks
Kernel messages for Nov 30 with failures. (651.03 KB, text/plain) 2007-11-30 22:00 UTC, Daniel C Weeks

Description Daniel C Weeks 2007-11-24 00:42:58 UTC

Description of problem:
The kernel reports exceptions about devices connected to the IDE controller
(hard drive, DVD-RW) being frozen soon after booting.  Exploring directories on
the hard drive shows some, but not all, of the contents while the kernel reports
read-failed errors.  The DVD drive works for a short period of time, but
eventually fails to read and cannot be mounted.  After a random period of time,
the system becomes unresponsive and eventually freezes.

Version-Release number of selected component (if applicable):
Kernel 2.6.23.1-49.fc8

How reproducible:
Fully reproducible.

Steps to Reproduce:
1. Boot system.
2. Mount and explore offending file system while watching the dmesg output.
3. Wait for system to freeze.
  
Actual results:
Errors reported about devices connected to IDE controller and eventually the
system freezes.

Expected results:
No errors and no freezing.

Additional info:
I first saw these results when I moved from FC6 to F7 back in late September,
but I was only using the DVD drive at that point and didn't know there was a
similar problem with the hard drive.  I moved to F8 soon after and the problems
continued.  Initially I thought it was a bad DVD drive and replaced the drive
and IDE cable.  I've made sure that the master/slave settings are correct for
the devices and have tested each device independently on the controller to make
sure it wasn't one of the devices.  


I haven't been able to capture any panic message, but the screen will freeze for
30 seconds or so, return for a few minutes, and then freeze completely.

The devices worked correctly on earlier versions of FC6 and still work under
Windows.

Comment 1 Daniel C Weeks 2007-11-24 00:42:58 UTC

Created attachment 267841
dmesg and lspci -vvv output

Comment 2 Daniel C Weeks 2007-11-27 06:06:10 UTC

This appears to be the result of the tickless timer that was recently added to
the kernel.  This is fixed with kernel options: 

clocksource=acpi_pm nohz=off highres=off

I assume this is still an issue and that this is just a workaround, but I'm not
sure how to proceed with this report.
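
For anyone trying the same workaround, the following is a minimal sketch (not
part of the original report) for confirming that the options actually reached
the running kernel.  It assumes the standard Linux /proc/cmdline file and the
sysfs clocksource entry are readable; the function names are made up for this
example, and the high-resolution timer option is written with the kernel's
spelling, highres=off.  Quoted parameters containing spaces are not handled.

```python
from pathlib import Path

# Options from the workaround in this comment, using the kernel's spelling
# "highres" for the high-resolution timer switch.
WORKAROUND = {"clocksource": "acpi_pm", "nohz": "off", "highres": "off"}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without "=" (for example "ro") map to None.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_workaround(cmdline, wanted=WORKAROUND):
    """Return the wanted options that are absent or set to another value."""
    params = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if params.get(k) != v}


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("missing or different:", missing_workaround(cmdline))
    print("active clocksource:", read_text_or_none(
        "/sys/devices/system/clocksource/clocksource0/current_clocksource"))
```

A misspelled option shows up as missing here, because the check compares the
exact names rather than looking for anything similar.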

Comment 3 Daniel C Weeks 2007-11-30 22:00:05 UTC

Created attachment 274401
Kernel messages for Nov 30 with failures.

Comment 4 Daniel C Weeks 2007-11-30 22:01:04 UTC

I guess I jumped the gun on that diagnosis.  I'm still experiencing the failure
of the IDE devices, but more sporadically.  For a short period after I boot, the
devices work fine, but errors slowly start to appear about failed reads:

FAT: Directory bread(block 1404516) failed

With the changed clocksource, the system doesn't freeze (or hasn't yet).
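
Since the errors appear gradually, it may help to count them in a saved kernel
log.  The following is a rough sketch (not part of the original report) that
matches the message quoted above; the function name is invented for this
example, and in the sample text only the first line comes from this report,
while the other lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the failed-read message quoted in this comment and captures the
# block number.
BREAD_FAILED = re.compile(r"FAT: Directory bread\(block (\d+)\) failed")


def failed_blocks(log_text):
    """Count failed directory reads per block number in a kernel log."""
    return Counter(int(m.group(1)) for m in BREAD_FAILED.finditer(log_text))


# Only the first line is taken from this report; the rest are invented.
sample = """\
FAT: Directory bread(block 1404516) failed
FAT: Directory bread(block 1404517) failed
FAT: Directory bread(block 1404516) failed
"""

if __name__ == "__main__":
    for block, count in sorted(failed_blocks(sample).items()):
        print(block, count)
```

Running the same function over logs from boots with different kernel options
would give a simple way to compare how quickly the failures build up.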

Comment 5 Alan Cox 2007-11-30 22:23:24 UTC

If you boot with just "nohz=off", is that sufficient, or do you need both that
and acpi_pm selected?

Comment 6 Daniel C Weeks 2007-12-03 15:20:43 UTC

I spent a few days testing different combinations of "nohz=off" and
"clocksource=acpi_pm" and the following is what I observed:

"nohz=off" alone:  The system boots and I am able to use the IDE devices, but
after some period of time (usually about 30 minutes to an hour) device errors
appear in the log and the devices become unusable.

"clocksource=acpi_pm" alone:  The system boots up and within a few minutes
device errors occur and the devices become unusable.

"nohz=off" and "clocksource=acpi_pm" together:  This seems to be the most
stable.  I've been able to run for many hours without device errors, but they do
eventually show up.

These are just my personal observations as the amount of time it takes before
the errors appear seems to vary significantly.

During this time I noticed something else that makes me think this is more than
just an issue with the IDE devices: my external USB drive also fails with these
errors at the same time the DVD and IDE hard drive fail.

Comment 7 Alan Cox 2007-12-03 17:30:45 UTC

Any ideas, Ingo?

Comment 8 Daniel C Weeks 2007-12-17 17:48:11 UTC

I recently decided to install the debug kernel to see if I could get any more
information about this problem and I ran into something interesting.

If I run with the debug kernel, I don't have any errors at all and the devices
function normally.  However, if I boot to the standard kernel the problems occur
regularly. 

Is there any significant difference between the standard kernel and the debug
kernel, other than optimization and debugging information?

Comment 9 Daniel C Weeks 2007-12-17 23:07:16 UTC

The following bug reports are potentially related:

https://bugzilla.redhat.com/show_bug.cgi?id=411001
https://bugzilla.redhat.com/show_bug.cgi?id=397191
https://bugzilla.redhat.com/show_bug.cgi?id=250349

Comment 10 Christopher Brown 2008-02-14 00:33:04 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few weeks if there is no additional information lodged.

Comment 11 Daniel C Weeks 2008-02-16 22:20:40 UTC

I recently downloaded the 2.6.24 vanilla kernel and this problem appears to be
fixed.  I don't know what changed, but I've been running for over a week now
without any errors.

Comment 12 Christopher Brown 2008-02-17 16:51:01 UTC

Hi Daniel,

Okay, thanks for testing. There should be a 2.6.24 kernel in the Fedora
repositories sometime in the next week or so (check updates-testing), so if you
could test with that it would be greatly appreciated. Then, if everything's
okay, we can close out this bug.

Comment 13 Daniel C Weeks 2008-03-10 05:29:48 UTC

Closing.  Fixed in kernel 2.6.24.3-12.
