186839 – Kernel ATA Errors Prevent Usage After Install :(

Summary: Kernel ATA Errors Prevent Usage After Install :(

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	186841 (view as bug list)
Depends On:
Blocks:

Reported:	2006-03-27 00:45 UTC by Peter Gordon
Modified:	2007-11-30 22:11 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-09-27 02:46:32 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg from 2.6.16-1.2111_FC5 kernel (39.29 KB, text/plain) 2006-05-15 19:16 UTC, Andrew W. Buchanan
lspci (9.15 KB, text/plain) 2006-05-15 19:17 UTC, Andrew W. Buchanan

Description Peter Gordon 2006-03-27 00:45:10 UTC

Description of problem:
After installing FC5 anew (not an upgrade from FC4), it boots up nicely, but a
few minutes after logging in and playing with Pirut, everything slows down
immensely, and a few seconds later the machine locks up (I am able to type,
move the mouse, and click things, but get no response either within X11 or in
the terminal). Booting from Knoppix, I checked my logs and found many instances
of the following error, logged at short intervals (each a few minutes apart):

ata1: command 0x35 timeout, stat 0x50 host_stat

A Google search turned up a thread on the LKML about LibPATA code issues in
2.6.15.4, in which the poster resolved the problem by finding out that his disk
was dying. Fortunately, this does not seem to be the case for me, as this disk
(Western Digital Raptor, model WD740GD-41FLC2) works just fine in Fedora Core 4
after a reinstallation and full update. This is on an ABIT VT7 motherboard (VIA
PT880-8237 chipset with a VT6420 SATA RAID Controller; not using RAID).



Version-Release number of selected component (if applicable):
kernel-2.6.15-1.2054_FC5

How reproducible:
Every time.

Steps to Reproduce:
1. Install FC5.
2. Reboot.
3. Try to login and do things.
  
Actual results:
The ata1 errors in my kernel logs as mentioned and a general system crash, as
outlined above.

Expected results:
Expected FC5 to work well on it, as FC4 does.

Additional info:
I re-installed FC4 and am using that for the time being. The following is the
relevant part of my kernel log from FC4:

SCSI subsystem initialized
libata version 1.20 loaded.
sata_via 0000:00:0f.0: version 1.1
ACPI: PCI Interrupt 0000:00:0f.0[B] -> Link [LNKB] -> GSI 11 (level, low) -> IRQ 11
sata_via 0000:00:0f.0: routed to hard irq line 11
ata1: SATA max UDMA/133 cmd 0xB400 ctl 0xB802 bmdma 0xC400 irq 11
ata1: SATA link up 1.5 Gbps (SStatus 113)
ata1: dev 0 cfg 49:2f00 82:74eb 83:7f63 84:4003 85:74e9 86:3c43 87:4003 88:407f
ata1: dev 0 ATA-6, max UDMA/133, 145226112 sectors: LBA48
ata1: dev 0 configured for UDMA/133
scsi0 : sata_via
  Vendor: ATA       Model: WDC WD740GD-41FL  Rev: 31.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 >
sd 0:0:0:0: Attached scsi disk sda

Comment 1 Rahul Sundaram 2006-03-27 01:58:04 UTC

*** Bug 186841 has been marked as a duplicate of this bug. ***

Comment 2 Peter Gordon 2006-03-30 04:20:06 UTC

Kernel 2.6.16-1.2069_FC4 has similar troubles on Fedora Core 4. I can't
verify that they are exactly the same, as no "ata1: ..." message is left in the
logs, but the slowdown-then-hardlock-soon-after-boot symptom is the same.

Comment 3 Peter Gordon 2006-04-20 03:38:38 UTC

As an addendum, if I boot with 'acpi=off pci=routeirq', my troubles with SATA
appear to vanish (uptime of almost 20 minutes so far with 2.6.16-1.2069_FC4).

Comment 4 Peter Gordon 2006-04-20 03:40:29 UTC

Another addendum: if I boot without those options and without 'quiet rhgb', then
I get the exact same "ata1: command 0x35 timeout... " error messages on the
console.

Comment 5 Peter Gordon 2006-04-21 14:43:11 UTC

I'm rather happy to report that booting with 'noapic acpi=off pci=routeirq'
appears to be a nice workaround in FC5, both for the installed default kernel
and the updated 2.6.16-1.2096_FC5 kernel.

Comment 6 Peter Gordon 2006-04-23 23:13:07 UTC

So after a few days of playing around with it, it appears that the 'noapic'
option is the proper workaround for this issue. 
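
For anyone comparing notes on which of the options from the comments above are actually in effect, here is a minimal Python sketch. It assumes the kernel command line is readable from /proc/cmdline; the helper names (active_workarounds, read_cmdline) are made up for illustration and are not part of any Fedora tool.

```python
# Hypothetical helpers: report which workaround options from this bug
# are present on a kernel command line.

WORKAROUND_OPTIONS = ("noapic", "acpi=off", "pci=routeirq")


def active_workarounds(cmdline, options=WORKAROUND_OPTIONS):
    """Return the options that appear as whole tokens in cmdline."""
    tokens = cmdline.split()
    return [opt for opt in options if opt in tokens]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (assumes a Linux /proc)."""
    with open(path) as f:
        return f.read().strip()
```

Calling active_workarounds(read_cmdline()) on the affected machine lists the matching options; an empty list means none of the three is active for the current boot.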

I don't remember local APIC being enabled by default on uniprocessor x86 builds
prior to this (though my Prescott says that it supports APIC). I'm not sure now
whether this is a hardware or a software bug, as booting with 'lapic' in FC4
kernel builds seemed to work just fine; or was this option silently ignored in
those builds?

Thanks.

Comment 7 Andrew W. Buchanan 2006-05-15 19:14:50 UTC

I'm having this issue as well, although the noapic workaround doesn't seem to 
be an option for me. If I try that I can't even boot.

After a random amount of time working (usually half an hour or less), the
disks stop responding, then the UI some time after that. If I'm logged in from
another box I can run dmesg and see the following lines (which never make it
to the log files, of course):

ata2: command 0x35 timeout, stat 0x50 host_stat 0x24
ata1: command 0x35 timeout, stat 0x50 host_stat 0x24

I ran Hitachi's disk check software against the hard drives and they both came 
back clean. I'm also not having any trouble with them when I boot under 
Windows, so I'm pretty sure the disks are good.

I ran memcheck to look for memory problems, but things look good there too.

The problem has occurred with every FC5 SMP kernel through the current one,
2.6.16-1.2111_FC5. The problem does not seem to occur when I boot a non-SMP
kernel, but obviously I'd like to use my other processor. I will attach an
lspci -v and dmesg dump.

Comment 8 Andrew W. Buchanan 2006-05-15 19:16:10 UTC

Created attachment 129107 [details]
dmesg from 2.6.16-1.2111_FC5 kernel

Comment 9 Andrew W. Buchanan 2006-05-15 19:17:33 UTC

Created attachment 129108 [details]
lspci

Comment 10 Peter Gordon 2006-09-18 17:04:54 UTC

I'm quite thrilled to report that this appears to be fixed in the updated
Rawhide kernels for me. My box has successfully stayed on and responsive for a
couple of days so far, without the noapic workaround active.

Andrew: Do you have a Rawhide/FC6 install that you could test this on? If so,
does the kernel on it work for you? Thanks.

Comment 11 Peter Gordon 2006-09-27 02:46:32 UTC

I'll mark this as RESOLVED/RAWHIDE, as the Development kernels have yet to give
me any more of these errors. (Yay!)

Andrew: If you still experience this bug, please feel free to reopen with more
details or clone it as a new bug, etc. as needed. Thanks.
