From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7) Gecko/20040808 Firefox/0.9.3

Description of problem:
This problem appeared after upgrading from kernel-2.6.5-1.358. After about 5-15 minutes of use after bootup, a message warning of a sector write error is issued to the console, and then the machine locks up so that Ctrl-Alt-Del no longer works. The hard disk light comes on about 5 seconds after the machine locks and stays on. This occurs on both SMP and UP kernels, and disappears after downgrading again. The disks have been thoroughly checked on both this machine and another machine with the manufacturer's diagnostics disk. The more disk activity there is, the quicker it seems to lock up. Affects both the downloadable x86_64 kernels and rebuilds of the SRPMS.

Version-Release number of selected component (if applicable):
kernel-2.6.8-1.521

How reproducible:
Always

Additional info:
Machine is a dual Opteron on a Tyan K8W motherboard. The SATA chipset is a Silicon Image 3114 with 2x Seagate 80GB drives set up as a RAID1 md array.
I am seeing the same issue on a Shuttle SN85G4 (FN85 motherboard) with a single AMD64 x86_64 3000+ processor. The SATA controller is a Silicon Image Serial ATARaid Controller [ CMD/Sil 3512 ].
I am consistently having similar problems on my SiI 3114 with Maxtor SATA drives. I don't have an md configuration, just LVM across the disks. I will post kernel errors once I can get the messages logged to serial. I have duplicated the problem on FC3-test3-x86_64.
I've changed the version so it appears on the FC3 radar (I hope!).
Right I've sat down with a serial cable and got the following out of the serial console:

Loading sd_mod.ko module
Loading libata.ko module
Loading sata_sil.ko module
ata1: dev 0 ATA, max UDMA/133, 156301488 sectors: lba48
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: dev 0 ATA, max UDMA/133, 156301488 sectors: lba48
ata2: dev 0 configured for UDMA/100
scsi1 : sata_sil
ata3: no device found (phy stat 00000000)
scsi2 : sata_sil
ata4: no device found (phy stat 00000000)
scsi3 : sata_sil
  Vendor: ATA Model: ST380013AS Rev: 3.18
  Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2
  Vendor: ATA Model: ST380013AS Rev: 3.18
  Type: Direct-Access ANSI SCSI revision: 05
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
  Vendor: ATA Model: ST380013AS Rev: 3.18
  Type: Direct-Access ANSI SCSI revision: 05
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2
Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0
...
ata1: command 0x35 timeout, stat 0xd8 host_stat 0x61
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 05 d9 b4 ef 00 01 98 00
Current sda: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sda, sector 98153711
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87
ATA: abnormal status 0xD8 on port 0xFFFFFF000005CC87

Bug #13291 is a possible dupe? There's also been mention of this problem with sata_sil in a number of places, including the LKML, but no answer and no bug for this on kernel bugzilla.
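For anyone reading the serial log in this comment, the failing write can be cross-checked by hand. The following is a minimal sketch, not anything taken from the kernel: it assumes the nine bytes printed after "CDB: Write (10)" are the CDB with its opcode byte omitted, and it uses the traditional ATA status register bit names. The function names are made up for illustration.

```python
# Sketch: decode the failing Write(10) CDB and the ATA status byte from the log.
# Assumption: the bytes shown after "CDB: Write (10)" are the CDB without its opcode.

cdb_tail = bytes.fromhex("00 05 d9 b4 ef 00 01 98 00")


def decode_write10_tail(tail):
    """Return (lba, sector_count) from the 9 bytes following the Write(10) opcode."""
    lba = int.from_bytes(tail[1:5], "big")    # 4-byte big-endian logical block address
    count = int.from_bytes(tail[6:8], "big")  # 2-byte big-endian transfer length
    return lba, count


# Traditional ATA status register bit names, from bit 7 down to bit 0.
STATUS_BITS = ["BSY", "DRDY", "DF", "DSC", "DRQ", "CORR", "IDX", "ERR"]


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for i, name in enumerate(STATUS_BITS) if status & (0x80 >> i)]


lba, count = decode_write10_tail(cdb_tail)
print(lba, count)           # start sector and length of the failed write
print(decode_status(0xD8))  # bits set in the "abnormal status 0xD8"
```

The decoded start address is 98153711, the same sector reported by the end_request line, and the write covered 408 sectors. Status 0xD8 decodes to BSY, DRDY, DSC and DRQ all set, i.e. the device still reports busy.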
Sorry bug #132910 is the possible dupe.
FYI, http://lkml.org/lkml/2005/3/25/33
I tried the patch in comment 6, but it doesn't work for me. I applied the patch to kernel-2.6.10-1.771_FC2, but managed to lock the machine after ten minutes with the same error message.
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.
kernel-2.6.12-1.1372_FC3 fails to boot here. On PCI probe at kernel startup there are repeated errors about allocating PCI resources, and the kernel panics during initrd. I'll do a full log via serial in the next few days and let you know what I find.
Are you sure you aren't just getting hit with bug #163437? It's new in 1372. If you're using SMP, I bet it's 163437.
Yes, my kernel panic issues were down to that bug. I have been giving the UP kernel a whirl today, and it seems stable so far. It says "ata1(0): applying Seagate errata fix" at startup, which it hasn't done before - I assume this is regarding the mod15 problem? If so, is this likely to be the thing that was giving me the lockup?
Is there a bugzilla for this "mod15 problem"? I haven't heard of that one yet. Which kernel version gave you that output?
The mod15 problem doesn't have a bugzilla entry, but googling turns up some information. It was "fixed" a while back. The basic problem is that certain drives (based on the Seagate PHY) throw write errors when writing frames where sector_count > 1 && sector_count % 15 == 1. There is a blacklist of known-bad drives in the kernel code; the workaround applied to blacklisted drives comes at the expense of very poor performance with them. However, I'm a bit confused, as I believe 2.6.5 had a blacklist, which my drive isn't in. Why does blacklisting it in 2.6.12 help, when it works without the fix under 2.6.5? It seems like there's more than one bug going on that the blacklist fixes or masks. I was using 2.6.12-1.1372_FC3 when I got the message in comment 11.
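To make the mod15 condition concrete, here is a small sketch in Python. The predicate is the condition exactly as stated in the comment above. The splitter is only an illustration of one way to avoid the triggering lengths, by capping every write at 15 sectors; that cap and both function names are assumptions for this example, not a quote of the kernel's workaround.

```python
# Sketch of the mod15 condition described above, plus one hypothetical way to avoid it.


def is_mod15_trigger(sector_count):
    """True for write lengths that match the condition quoted in the comment."""
    return sector_count > 1 and sector_count % 15 == 1


def split_write(start_sector, sector_count, max_sectors=15):
    """Split one write into (start, length) chunks of at most max_sectors each.

    With max_sectors=15 no chunk can be longer than 15 sectors, so no chunk
    can satisfy is_mod15_trigger (the smallest triggering length is 16).
    """
    chunks = []
    while sector_count > 0:
        n = min(sector_count, max_sectors)
        chunks.append((start_sector, n))
        start_sector += n
        sector_count -= n
    return chunks


# Write lengths below 50 sectors that match the condition.
print([n for n in range(1, 50) if is_mod15_trigger(n)])

# A 31-sector write (a triggering length) broken into safe chunks.
print(split_write(0, 31))
```

Capping writes this small means many more commands per megabyte written, which is consistent with the poor performance mentioned for blacklisted drives.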
After installing today's mkinitrd update, if you remove and then reinstall the SMP kernel, it should work again. As the UP kernel sounds like it now works for you, I'll bet the SMP version is fine too, and we can close this bug. Please test and let me know. Thanks.
Careful... I don't think anyone here has reported that the sata_sil bug is solved yet, as of 1372. The topic got shifted a bit to the 1372 SMP kernel boot problem, which is a completely separate problem.
Regarding comment 15, 1372 seemed to fix this problem on a UP kernel (comment 11). As this problem seems to be independent of UP/SMP (comment 1), it seems likely that 1372smp should work as well. However, the mkinitrd issue meant it couldn't be tested. I've installed 1372smp with the new mkinitrd and so far so good. I'll put it through its paces properly in the next couple of days, but the signs so far are positive.
1372smp works for me. I've tried testing it under heavy write, and also using it daily for a couple of weeks, and it hasn't (touch wood) died at all.
Now we wait for FC5t1 for a bare-metal (no PATA) installable version!