Bug 132697 - sata_sil driver locks kernel with sector write error message
Summary: sata_sil driver locks kernel with sector write error message
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jeff Garzik
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-09-15 21:55 UTC by Luke Ross
Modified: 2013-07-03 02:21 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2005-08-17 09:31:05 UTC
Type: ---
Embargoed:



Description Luke Ross 2004-09-15 21:55:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7)
Gecko/20040808 Firefox/0.9.3

Description of problem:
This problem appeared after upgrading from kernel-2.6.5-1.358. After
about 5-15 minutes of use following bootup, a message warning of a
sector write error was printed to the console, and the machine then
locked up so hard that even Ctrl-Alt-Del no longer worked. The hard
disk light came on about 5 seconds after the lockup and stayed on.

This occurs on both SMP and UP kernels, and disappears after
downgrading again. The disks have been thoroughly checked on both this
machine and another machine with the manufacturer's diagnostics disk.
The more disk activity there is, the quicker it seems to lock up.
It affects both the downloadable x86_64 kernels and rebuilds of the SRPMS.

Version-Release number of selected component (if applicable):
kernel-2.6.8-1.521

How reproducible:
Always

Additional info:

Machine is a dual Opteron on a Tyan K8W motherboard. The SATA chipset
is a Silicon Image 3114 with two Seagate 80 GB drives set up as a RAID1
md array.

Comment 1 Thomas Cameron 2004-10-19 03:51:46 UTC
I am seeing the same issue on a Shuttle SN85G4 (FN85 motherboard) with
a single AMD64 x86_64 3000+ processor.  The SATA controller is a
Silicon Image Serial ATARaid Controller [ CMD/Sil 3512 ].

Comment 2 Thornton Prime 2004-11-07 15:56:41 UTC
I am having similar problems consistently on my SiI 3114 with Maxtor
SATA drives. I don't have a md configuration, just lvm across the disks.

I will post kernel errors once I can get the messages logged to serial.

I have duplicated the problem on FC3-test3-x86_64.

Comment 3 Luke Ross 2004-11-08 11:56:50 UTC
I've changed the version so it appears on the FC3 radar (I hope!).

Comment 4 Luke Ross 2004-11-09 20:27:00 UTC
Right, I've sat down with a serial cable and captured the following
from the serial console:

Loading sd_mod.ko module
Loading libata.ko module
Loading sata_sil.ko module
ata1: dev 0 ATA, max UDMA/133, 156301488 sectors: lba48
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: dev 0 ATA, max UDMA/133, 156301488 sectors: lba48
ata2: dev 0 configured for UDMA/100 
scsi1 : sata_sil 
ata3: no device found (phy stat 00000000) 
scsi2 : sata_sil 
ata4: no device found (phy stat 00000000) 
scsi3 : sata_sil 
  Vendor: ATA       Model: ST380013AS        Rev: 3.18  
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB) 
SCSI device sda: drive cache: write back  
sda: sda1 sda2
  Vendor: ATA       Model: ST380013AS        Rev: 3.18
  Type:   Direct-Access                      ANSI SCSI revision: 05
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
  Vendor: ATA       Model: ST380013AS        Rev: 3.18
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2
Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0

...

ata1: command 0x35 timeout, stat 0xd8 host_stat 0x61
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 05 d9 b4 ef 00 01 98 00
Current sda: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sda, sector 98153711
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87

Is bug #13291 a possible dupe? This sata_sil problem has also been
mentioned in a number of places, including the LKML, but with no
answer, and there is no bug for it on the kernel bugzilla.
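
The timeout lines in the log above can be unpacked by hand. The following is a minimal Python sketch, not anything taken from the kernel: the function names are hypothetical, and it assumes that the nine bytes printed after "Write (10)" are CDB bytes 1 through 9, with the opcode byte omitted from the printout. That assumption is supported by the decoded LBA matching the sector number in the end_request line. The status bit names follow the standard ATA status register layout.

```python
# Standard ATA status register bits, most significant first.
ATA_STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR"),
]


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


def decode_write10_tail(hex_bytes):
    """Decode the bytes logged after 'Write (10)'.

    Assumption: the log omits the opcode, so the input is CDB bytes 1-9:
    flags, 4-byte big-endian LBA, group number, 2-byte big-endian
    transfer length, control.
    """
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 9:
        raise ValueError("expected 9 bytes (CDB bytes 1-9)")
    return {
        "lba": int.from_bytes(raw[1:5], "big"),
        "blocks": int.from_bytes(raw[6:8], "big"),
    }


if __name__ == "__main__":
    print(decode_write10_tail("00 05 d9 b4 ef 00 01 98 00"))
    print(decode_ata_status(0xD8))
```

Decoded this way, the failing request is a 408-block write starting at LBA 98153711, and status 0xd8 has BSY set. While BSY is set the other status bits are not meaningful, so the repeated "abnormal status 0xD8" lines suggest the drive never left the busy state.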

Comment 5 Luke Ross 2004-11-09 20:28:35 UTC
Sorry, bug #132910 is the possible dupe.

Comment 6 Jeff Garzik 2005-03-25 07:37:33 UTC
FYI, http://lkml.org/lkml/2005/3/25/33

Comment 7 Luke Ross 2005-03-30 08:57:05 UTC
I tried the patch in comment 6, but it isn't working for me. I applied the patch
to kernel-2.6.10-1.771_FC2, but managed to lock the machine after ten minutes
with the same error message.

Comment 8 Dave Jones 2005-07-15 19:57:32 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 9 Luke Ross 2005-07-17 21:04:17 UTC
kernel-2.6.12-1.1372_FC3 fails to boot here. On PCI probe at kernel startup  
there are repeated errors about allocating PCI resources, and the kernel  
panics during initrd. I'll do a full log via serial in the next few days and 
let you know what I find. 

Comment 10 Trevor Cordes 2005-07-18 00:21:44 UTC
Are you sure you aren't just getting hit with bug #163437?  It's new in 1372. 
If you're using SMP, I bet it's 163437.


Comment 11 Luke Ross 2005-07-18 17:53:01 UTC
Yes, my kernel panic issues were down to that bug.

I have been giving the UP kernel a whirl today, and it seems stable so far. It
says "ata1(0): applying Seagate errata fix" at startup, which it hasn't done
before. I assume this is regarding the mod15 problem? If so, is this likely
to be the thing that was giving me the lockup?

Comment 12 Trevor Cordes 2005-07-18 20:20:07 UTC
Is there a bugzilla for this "mod15 problem"?  I haven't heard of that one yet.
Which kernel version gave you that output?


Comment 13 Luke Ross 2005-07-26 11:39:16 UTC
The mod15 problem doesn't have a bugzilla entry, but googling turns up some
stuff. It was "fixed" a while back. The basic problem is that certain drives
(based on the Seagate PHY) throw write errors when writing frames where
sector_count > 1 && sector_count % 15 == 1. The kernel code has a blacklist of
known-bad drives and works around the problem for them, at the expense of very
poor performance with these drives. However, I'm a bit confused, as I believe
2.6.5 had a blacklist which my drive isn't in. Why does blacklisting it in
2.6.12 help, when it works without the fix under 2.6.5? It seems like there's
more than one bug going on that the blacklist fixes or masks.
  
I was using 2.6.12-1.1372_FC3 to get the message in comment 11.    
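
The condition described above is easy to state in code. The sketch below is illustrative Python, not the kernel's implementation: `hits_mod15_pattern` and `split_request` are hypothetical names, the predicate is copied from the description in this comment, and the 15-sector cap is an assumed form of workaround, chosen because no chunk of 15 sectors or fewer can satisfy the predicate.

```python
def hits_mod15_pattern(sector_count):
    """True if a write of this many sectors matches the pattern described
    in this comment: sector_count > 1 and sector_count % 15 == 1."""
    return sector_count > 1 and sector_count % 15 == 1


def split_request(sector_count, max_sectors=15):
    """Split one write into chunks of at most max_sectors sectors.

    Assumed workaround: with max_sectors <= 15, the smallest matching
    size (16) can never appear as a chunk.
    """
    if sector_count < 0 or max_sectors < 1:
        raise ValueError("sector_count must be >= 0 and max_sectors >= 1")
    chunks = []
    remaining = sector_count
    while remaining > 0:
        chunk = min(remaining, max_sectors)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


if __name__ == "__main__":
    for count in (15, 16, 31, 256):
        chunks = split_request(count)
        print(count, hits_mod15_pattern(count), chunks,
              any(hits_mod15_pattern(c) for c in chunks))
```

A cap like this turns every large write into many small commands, which would be consistent with the poor performance mentioned above.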

Comment 14 Dave Jones 2005-07-30 00:41:22 UTC
After installing today's mkinitrd update, if you remove and then reinstall the
smp kernel, it should work again.

As the UP kernel sounds like it now works for you, I'll bet the SMP version is
fine too, and we can close this bug. Please test and let me know.

Thanks.

Comment 15 Trevor Cordes 2005-08-01 17:33:25 UTC
Careful... I don't think anyone here has reported that the sata_sil bug is
solved yet, as of 1372.  The topic got shifted a bit to the 1372 SMP kernel boot
problem, which is a completely separate problem.


Comment 16 Luke Ross 2005-08-02 12:34:32 UTC
Regarding comment 15, 1372 seemed to fix this problem on a UP kernel (comment
11). As this problem seems to be independent of UP/SMP (comment 1), it seems
likely that 1372smp should work as well. However, the mkinitrd issue meant it
couldn't be tested.

I've installed 1372smp with the new mkinitrd and so far so good. I'll put it
properly through its paces in the next couple of days, but the signs
so far are positive.

Comment 17 Luke Ross 2005-08-17 09:31:05 UTC
1372smp works for me. I've tested it under heavy write load and have also been
using it daily for a couple of weeks, and it hasn't (touch wood) died at all.

Comment 18 Trevor Cordes 2005-08-17 10:37:12 UTC
Now we wait for FC5t1 for a bare-metal (no PATA) installable version!

