181310 – sata_promise command time out kills all disks on SMP

Summary: sata_promise command time out kills all disks on SMP

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	FCMETA_SATA 182618

Reported:	2006-02-13 13:12 UTC by Alexandre Oliva
Modified:	2015-01-04 22:25 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-11-24 23:07:33 UTC
Type:	---
Embargoed:
Dependent Products:

Description Alexandre Oliva 2006-02-13 13:12:12 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.1) Gecko/20060210 Fedora/1.5.0.1-3 Firefox/1.5.0.1

Description of problem:
It seems that, under heavy disk/network activity, sata_promise-controlled HDs may time out, which kills all disks under that controller until a reboot.

Hardware is Asus A8V Deluxe MoBo, Athlon64X2 processor, two Maxtor HDs connected to the Promise controller built into the motherboard.

This might be related to bug 180063, and it appears to match exactly the description at http://lists.debian.org/debian-kernel/2006/01/msg00317.html

I have both disks in software RAID 1.  Whenever the system fails, the last thing I can see in /var/log/messages is the first line below (since / is on those disks, I am surprised it even makes it; that was probably at a time when sda had accidentally been taken out of the RAID set for /):

ata3: command timeout
ata3: status=0x50 { DriveReady SeekComplete }
ata4: command timeout
ata4: status=0x50 { DriveReady SeekComplete }
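
The names in braces are the kernel's decoding of the ATA status register. As a minimal sketch of that decoding, assuming the standard ATA status bit layout: the function name `decode_ata_status` and every label other than the two printed in the log above are illustrative choices for this example, not taken from the driver.

```python
# Standard ATA status register bits, most significant first.
# The labels for 0x40 and 0x10 match the log output above; the other
# labels are illustrative names chosen for this sketch.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0x50))
```

A status of 0x50 has neither the busy bit nor the error bit set, so at the moment the timeout fired the drive was reporting itself as ready rather than as failed.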

Oddly, two other nearly identical boxes, just with 4 SATA disks (two on the VIA controller, two on the Promise controller), do not exhibit the problem.  The most significant difference is that the RAID 1 sets are across controllers, i.e., each disk on the VIA controller is mirrored by a disk on the Promise controller.  I suppose this might make the failure just less likely to happen, as opposed to impossible, but after a few days of trying, I still haven't been able to trigger the problem there.  Another similar box with a single disk on the VIA controller has not triggered the problem either.

This box with 2 disks on the Promise controller appears to become stable if booted with maxcpus=1, but it's a shame to be unable to use the second CPU :-(
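
For anyone checking whether a box was actually booted with this workaround, here is a small sketch that parses a kernel command line such as the contents of /proc/cmdline. The helper name `parse_kernel_cmdline` and the sample string (including `root=/dev/md0`) are hypothetical, made up for illustration.

```python
def parse_kernel_cmdline(text):
    """Split a kernel command line into a dict.

    Options of the form key=value map to their value; bare flags
    map to True.
    """
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


# Hypothetical sample; on a running system the same string can be
# read from /proc/cmdline.
sample = "ro root=/dev/md0 maxcpus=1"
options = parse_kernel_cmdline(sample)
print("maxcpus:", options.get("maxcpus", "not set"))
```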

Version-Release number of selected component (if applicable):
kernel-2.6.1[45]-*

How reproducible:
Sometimes

Steps to Reproduce:
1. Copy lots of stuff over the network or local disk, or just install daily rawhide updates.

Actual Results:  Both disks cease to respond, and the only way out is to reboot

Expected Results:  Continuous operation with the 2 cores enabled :-)

Additional info:

Comment 1 Joe Fenton 2006-03-25 23:20:57 UTC

I have the same problem, but with slightly different hardware. It still involves
dual AMD64 CPUs and is a SATA problem; however, I'm using a Master2-FAR dual
Opteron mobo where the SATA is driven by the VT8237. The drives are identical
Hitachi 80G drives, but are not set up as RAID. Heavy disk/network activity -
ESPECIALLY bittorrent - causes the SATA drives to go out to lunch. Just using
the drive normally, it may take many hours for the problem to occur. Using the
drive as a target of a usenet reader may take three to four hours for the
problem to occur, but using dozens of connections on several files in bittorrent
can make it occur in minutes.

The message I got last was: 
ata2: command 0xb0 timeout, stat 0x50 host_stat 0x0
ata2: command 0x35 timeout, stat 0x50 host_stat 0x4

That was with the kernel 2.6.15-1.2054 that comes with FC5 release.
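
Timeout lines in this format can be picked apart mechanically. The sketch below assumes that `host_stat` is the bus-master IDE (BMDMA) status register, which is an assumption about the driver rather than something stated in this report; the opcode names come from the ATA command set, and the function name `parse_timeout` is illustrative.

```python
import re

# ATA opcode names; only the opcodes that appear in this report.
ATA_COMMANDS = {
    0x35: "WRITE DMA EXT",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xEA: "FLUSH CACHE EXT",
}

# Assumption: host_stat is the bus-master IDE status register.
BMDMA_BITS = [(0x01, "active"), (0x02, "error"), (0x04, "interrupt")]

TIMEOUT_RE = re.compile(
    r"ata(\d+): command (0x[0-9a-fA-F]+) timeout, "
    r"stat (0x[0-9a-fA-F]+) host_stat (0x[0-9a-fA-F]+)"
)


def parse_timeout(line):
    """Parse one timeout line; return None if it does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    port = int(match.group(1))
    cmd, stat, host = (int(g, 16) for g in match.groups()[1:])
    return {
        "port": port,
        "command": ATA_COMMANDS.get(cmd, hex(cmd)),
        "stat": stat,
        "host_flags": [name for mask, name in BMDMA_BITS if host & mask],
    }


for line in (
    "ata2: command 0xb0 timeout, stat 0x50 host_stat 0x0",
    "ata2: command 0x35 timeout, stat 0x50 host_stat 0x4",
):
    print(parse_timeout(line))
```

Under that assumption, a `host_stat` of 0x4 would mean the controller had its interrupt bit set with no DMA transfer still active when the command timed out.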

Comment 2 Kostas Georgiou 2006-03-27 09:27:33 UTC

I have almost the same setup (Asus A8V-Deluxe, A64, two Maxtor HDs connected to
the Promise controller) and the box is stable. So it does seem that the problem
is related to SMP.

And I was just about to order 20 A8V-Deluxe, A64X2, 4 SATA disks for work today :(

Comment 3 Boris Leidner 2006-04-06 21:11:59 UTC

Same problem:

AMD64 X2 4200+, ABit AV7, 160GB Maxtor SATA

dmesg shortly before filesystem is gone:
ata1: command 0x35 timeout, stat 0x50 host_stat 0x4

I noticed jumpy, slow-moving behaviour of my USB mouse. Right after that issue
the SATA problem begins. That must be related somehow. Interrupts getting messed up?

Comment 4 Andy Campbell 2006-04-22 16:10:24 UTC

Same problem:

AMD64 X2 4600+, Abit AV8 3rd-Eye

No real pattern to when it occurs, anywhere from 10 minutes to a number of hours.
The machine becomes VERY slow, and if you switch to a console screen you see ...

ata2: command 0x35 timeout, stat 0x50 host_stat 0x4

Adding noapic to my boot options seems to stop it hanging.  The FC4 installation
on the same disk works fine.

Comment 5 Peter Bieringer 2006-05-09 13:04:58 UTC

Perhaps related: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=183138#c5
System: Athlon 700 MHz / MSI board
Will try with APM + noapic instead of ACPI now.

Comment 6 Peter Bieringer 2006-05-12 13:50:17 UTC

noapic and acpi=off don't help; perhaps it's an Athlon-specific issue. The Promise
SATA 150 TX4 has the same problem when used with this CPU/board combination. Both
controllers (SATA 150/SATA II 300) work fine on 2 Intel CPU based hosts
(PIII-933 and PII-350).

BTW: the MSI board is a K7 PRO

Comment 7 Richard Ziegler 2006-09-09 20:42:37 UTC

Same problem, but I'm using an Intel Core 2 Duo, and the motherboard has sata_nv
and sata_sil24.  I see the problem with drives on both controllers.  The only
difference is that the sata_sil24 controller does a reset (but the errors
continue). Otherwise the errors are exactly as described by the others in this bz.  

Setting maxcpus=1 has fixed it for me so far as well.  (THANKS Alexandre!!)

Asus P5N32-SLI SE Deluxe
4 Seagate 320G SATA-II HDs (2 in RAID1 config, 2 LVM)

Comment 8 Richard Ziegler 2006-09-13 04:42:51 UTC

I just tried the latest kernel in FC5 updates testing, and it did not solve this
problem for me.  I will try FC6 Test 3 when it is released in a few days.  If I
am indeed hitting this same defect, should I expect this to clear up with FC6t3?

Comment 9 Marek Kassur 2006-09-17 17:39:04 UTC

In my case it started a couple of weeks ago, first on my Asus P4P800 Deluxe
motherboard with P4 3.2GHz HT (Intel based: ata_piix):
ata1: command 0xc8 timeout, stat 0x50 host_stat 0x21
Then I replaced the suspect disk with a new one, but that didn't solve the
problem, so I changed the motherboard to an ASRock P4V88+ (VIA based) and the PSU (to be
sure); that didn't solve the problem either.
ata1: command 0xea timeout, stat 0x50 host_stat 0x0
ata2: command 0x35 timeout, stat 0x50 host_stat 0x4
ata1: command 0x35 timeout, stat 0x50 host_stat 0x4
Additionally, USB ports stopped detecting new devices, the USB mouse became jumpy,
and the USB keyboard lost characters or added some when I typed.

Downgrading to kernel-smp-2.6.16-1.2069_FC4 from kernel-smp-2.6.17-1.2142_FC4
helped with the USB problems, but not with the SATA timeouts.

Fedora Core 4
P4 3.2GHz HT, 2GB RAM
Asus P4P800 Deluxe ICH5 (ata_piix) or ASRock P4V88+ (sata_via)
2 x 160GB SATA in RAID1 (/dev/md0) mounted as root

Comment 10 Dave Jones 2006-10-17 00:41:45 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 11 Dave Jones 2006-11-24 23:07:33 UTC

This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.
