Bug 463034 - [sata_nv swncq] kernel 2.6.26.3-29 raid errors: "md: super_written gets error=-5, uptodate=0"
[sata_nv swncq] kernel 2.6.26.3-29 raid errors: "md: super_written gets error...
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
9
x86_64 Linux
medium Severity urgent
: ---
: ---
Assigned To: Jeff Garzik
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-09-20 19:04 EDT by Rolf Fokkens
Modified: 2013-07-02 22:35 EDT (History)
6 users (show)

See Also:
Fixed In Version: 2.6.26.6-79.fc9
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-23 12:39:11 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
output of lspci, dmidecode and dmesg (17.05 KB, application/octet-stream)
2008-09-20 19:04 EDT, Rolf Fokkens
no flags Details
dmesg of 2.6.25.14-108 kernel (52.24 KB, text/plain)
2008-09-22 15:43 EDT, Rolf Fokkens
no flags Details
dmesg of 2.6.26.3-29 kernel (47.91 KB, text/plain)
2008-09-22 15:43 EDT, Rolf Fokkens
no flags Details
output of lspci -vv (30.57 KB, text/plain)
2008-09-25 08:36 EDT, Chuck Ebbert
no flags Details
Output of hdparm -i for /dev/sd[a-c] (6.87 KB, text/plain)
2008-10-14 13:09 EDT, Rolf Fokkens
no flags Details
Output of smartctl -a /dev/sd[a-c] (14.31 KB, text/plain)
2008-10-14 13:10 EDT, Rolf Fokkens
no flags Details
dmesg of 2.6.26.5-45 kernel (56.49 KB, application/octet-stream)
2008-10-20 14:35 EDT, Rolf Fokkens
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 11751 None None None Never

  None (edit)
Description Rolf Fokkens 2008-09-20 19:04:59 EDT
Created attachment 317294 [details]
output of lspci, dmidecode and dmesg

Description of problem:
Since upgrading to kernel 2.6.26.3-29 apparently my raid5 array breaks quite reproducably. In this particular case after several installs I/O errors show up from any of the SATA disks (sda2, sdb2 and sdc2) each on block 312576374. After two errors the raid5 array is broken and I have to reinstall.

Version-Release number of selected component (if applicable): kernel-2.6.26.3-29.fc9.x86_64

How reproducible: I thing 100%

Steps to Reproduce:
1. Install kernel-2.6.26.3-29.fc9.x86_64 on a raid5 system
2. Use it and just wait for a while
3. Within a few hours 2 of 3 disks fail on an I/O error and the raid5 array is broken
  
Actual results:
"Some I/O errors" on any of sda2, sdb2 and sdc2. Each on block 312576574". My own evidence being destroyed in these crases I can only refer to a similar situation:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0709.1/2242.html

Expected results:
great and reliable raid5 performance

Additional info:
I downgraded to kernel kernel-2.6.25.14-108.fc9.x86_64 and haven't had trouble since.
Comment 1 Rolf Fokkens 2008-09-21 04:42:04 EDT
See also: http://lkml.org/lkml/2007/9/1/95

This part:

> [   63.420000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [   63.420000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> cdb 0x0 data 4096 out
> [   63.420000]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
> 0x4 (timeout)
> [   63.420000] ata2.00: status: {DRDY }
> [   63.420000] ata2: hard resetting link
> [   65.720000] ata2: softreset failed (port not ready)
> [   65.720000] ata2: reset failed (errno=-5), retrying in 8 secs
> [   73.420000] ata2: hard resetting link
> [   75.720000] ata2: softreset failed (port not ready)
> [   75.720000] ata2: reset failed (errno=-5), retrying in 8 secs
> [   83.420000] ata2: hard resetting link
> [   85.720000] ata2: softreset failed (port not ready)
> [   85.720000] ata2: reset failed (errno=-5), retrying in 33 secs
> [  118.420000] ata2: limiting SATA link speed to 1.5 Gbps
> [  118.420000] ata2: hard resetting link
> [  120.720000] ata2: softreset failed (port not ready)
> [  120.720000] ata2: reset failed, giving up
> [  120.720000] ata2.00: disabled
> [  120.720000] ata2: EH complete
> [  120.720000] sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET
> driverbyte=DRIVER_OK,SUGGEST_OK
> [  120.720000] end_request: I/O error, dev sdb, sector 625137161
> [  120.720000] md: super_written gets error=-5, uptodate=0

Though I don't recall the attempts to reset the link several times, I think that the drive was kicked out of the array immediately and unnoticed by me.
Comment 2 Rolf Fokkens 2008-09-21 04:49:19 EDT
also: http://lkml.org/lkml/2007/9/13/49

Again it seems sata_nv related.

I have another system with an M2N-MX SE Plus board based om MCP61 that survived the kernel update without issues. That system however runs raid1. I downgraded the kernel as a precaution though
Comment 3 Chuck Ebbert 2008-09-22 13:27:05 EDT
There is an upstream patch that disables hardreset for som nvidia chipsets but this doesn't look like the same problem. In the upstream error report the drives never even get detected.
Comment 4 Rolf Fokkens 2008-09-22 14:07:36 EDT
I reinstalled everything, and did an installation on a RAID1 array, which allowed me to have a /var/log/messages after a potential failure.

The problem showed up on RAID1 as well! This is the relevant /var/log/messages parts:

Sep 21 18:58:35 localhost kernel: imklog 3.18.1, log source = /proc/kmsg started.
Sep 21 18:58:35 localhost kernel: Initializing cgroup subsys cpuset
Sep 21 18:58:35 localhost kernel: Linux version 2.6.26.3-29.fc9.x86_64 (mockbuild@) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #1 SMP Wed Sep 3 03:16:37 EDT 2008
Sep 21 18:58:35 localhost kernel: Command line: ro root=/dev/HOME07T/ROOT
...
Sep 21 23:30:26 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 21 23:30:26 localhost kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 21 23:30:26 localhost kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 21 23:30:26 localhost kernel: ata1.00: status: { DRDY }
Sep 21 23:30:26 localhost kernel: ata1: hard resetting link
Sep 21 23:30:27 localhost kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 21 23:30:27 localhost kernel: ata1.00: configured for UDMA/133
Sep 21 23:30:27 localhost kernel: end_request: I/O error, dev sda, sector 312576468
Sep 21 23:30:27 localhost kernel: md: super_written gets error=-5, uptodate=0
Sep 21 23:30:27 localhost kernel: raid1: Disk failure on sda5, disabling device.
Sep 21 23:30:27 localhost kernel: raid1: Operation continuing on 2 devices.
Sep 21 23:30:27 localhost kernel: ata1: EH complete
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 21 23:30:27 localhost kernel: RAID1 conf printout:
Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3
Sep 21 23:30:27 localhost kernel: disk 0, wo:1, o:0, dev:sda5
Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5
Sep 21 23:30:27 localhost kernel: RAID1 conf printout:
Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3
Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5
Sep 21 23:31:16 localhost kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 21 23:31:16 localhost kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 21 23:31:16 localhost kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 21 23:31:16 localhost kernel: ata3.00: status: { DRDY }
Sep 21 23:31:16 localhost kernel: ata3: hard resetting link
Sep 21 23:31:16 localhost kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 21 23:31:16 localhost kernel: ata3.00: configured for UDMA/133
Sep 21 23:31:16 localhost kernel: end_request: I/O error, dev sdc, sector 312576468
Sep 21 23:31:16 localhost kernel: md: super_written gets error=-5, uptodate=0
Sep 21 23:31:16 localhost kernel: raid1: Disk failure on sdc5, disabling device.
Sep 21 23:31:16 localhost kernel: raid1: Operation continuing on 1 devices.
Sep 21 23:31:16 localhost kernel: ata3: EH complete
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB)
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write Protect is off
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 21 23:31:16 localhost kernel: RAID1 conf printout:
Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3
Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:31:16 localhost kernel: disk 2, wo:1, o:0, dev:sdc5
Sep 21 23:31:16 localhost kernel: RAID1 conf printout:
Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3
Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5

Within a 1 minute two drives get kicked from the array, both apparently tryin toe access the same block.

fdisk output (for all disks the same):

Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00032e19

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          62      497983+  fd  Linux raid autodetect
/dev/sda2              63       10261    81923467+  fd  Linux raid autodetect
/dev/sda3           10262       10325      514080   fd  Linux raid autodetect
/dev/sda4           10326       19457    73352790    5  Extended
/dev/sda5           10326       19457    73352758+  fd  Linux raid autodetect
Comment 5 Rolf Fokkens 2008-09-22 15:43:03 EDT
Created attachment 317413 [details]
dmesg of 2.6.25.14-108 kernel
Comment 6 Rolf Fokkens 2008-09-22 15:43:45 EDT
Created attachment 317414 [details]
dmesg of 2.6.26.3-29 kernel
Comment 7 Rolf Fokkens 2008-09-22 15:46:41 EDT
A potential interesting difference between de sata_nv SATA drivers of the 2.6.25 and 2.6.26 kernel:

T:[rolf@home07 tmp]$ grep NCQ msg-2.6.25.14-108.fc9.x86_64 
Sep 22 19:42:46 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
Sep 22 19:42:46 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
Sep 22 19:42:46 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
T:[rolf@home07 tmp]$ grep NCQ msg-2.6.26.3-29.fc9.x86_64 
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.0: Using SWNCQ mode
Sep 22 21:29:10 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.1: Using SWNCQ mode
Sep 22 21:29:10 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.2: Using SWNCQ mode
T:[rolf@home07 tmp]$ 

See the attachements for full details
Comment 8 Rolf Fokkens 2008-09-22 16:16:30 EDT
Google "SWNCQ" and "SATA_NV" results in hits which describe familiar problems.
Comment 9 Chuck Ebbert 2008-09-25 08:36:33 EDT
Created attachment 317683 [details]
output of lspci -vv

Please don't attach tar files.
Comment 10 Chuck Ebbert 2008-09-25 08:41:05 EDT
You can disable SWNCQ by editing /etc/modprobe.conf and adding these lines, then rebuilding the initrd:

  options scsi_hostadapter1 swncq=0 
  options sata_nv swncq=0 

If you don't have an alias line for scsi_hostadapter1 you can omit the first line.
Comment 11 Rolf Fokkens 2008-09-25 18:21:32 EDT
I'm already running with a modified modprobe.conf and initrd, things seem fine now. The system ran OK however for a day before that without a hitch, so I'll use swncq=0 for a few days and then I'll report the status again.
Comment 12 Rolf Fokkens 2008-10-02 12:45:47 EDT
Things are still fine with swncq=0...
Comment 13 Chuck Ebbert 2008-10-13 16:26:02 EDT
disabled swncq by default in 2.6.26.6-69
Comment 14 Chuck Ebbert 2008-10-14 04:14:06 EDT
Can you post the output of "hdparm -I" and "smartctl -a" commands against the drive(s) having the problem?
Comment 15 Holger Lubitz 2008-10-14 06:13:08 EDT
The "hits which describe familiar problems" seem to be based on pre-mainline versions of the swncq patch. The issue Jon Ivar Rykkelid had (your links in description and comment #2) didn't even seem to involve swncq.

As you note yourself that another system doesn't show the issue, may I suggest checking and maybe replacing your cabling? Asus is said to have been shipping some less than perfect SATA cables in the past. ata2 turning up as 3.0 Gbps in your 2.6.25 dmesg and only 1.5 Gbps in your 2.6.26 dmesg seems to support that theory as well.

swncq was off by default when it first hit mainline and that default was changed when it was believed to be safe. If it can cause data corruption, disabling it again should hit mainline immediately (together with a big red warning advising against it - also the people at nvidia should be looking into this). But as it is, this seems premature to me.
Comment 16 Rolf Fokkens 2008-10-14 13:09:24 EDT
Created attachment 320329 [details]
Output of hdparm -i for /dev/sd[a-c]
Comment 17 Rolf Fokkens 2008-10-14 13:10:35 EDT
Created attachment 320330 [details]
Output of smartctl -a /dev/sd[a-c]
Comment 18 Rolf Fokkens 2008-10-14 13:23:26 EDT
As noted in Comment #2: The other system w/o the problems is MCP61 based. The system with the problems is MCP55 based. The former has default sqncq=0 while the latter has default sqncq=1 (as coded in the driver).

This explains the difference between the systems with respect to the bug.

For the moment I'm just happy with a stable system by setting swncq=0. Making my system unstable may result in a time consuming reinstall of it, which I'm not enjoying very much.

However...if I stay the only person reporting this problem, it may be a problem that's specifically related to my system. In that case I can try to do experiments (by your request) to provide y'all with more info.
Comment 19 Fedora Update System 2008-10-14 15:53:24 EDT
kernel-2.6.26.6-46.fc8 has been submitted as an update for Fedora 8.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-46.fc8
Comment 20 Fedora Update System 2008-10-14 16:01:29 EDT
kernel-2.6.26.6-71.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-71.fc9
Comment 21 Chuck Ebbert 2008-10-20 10:23:48 EDT
I didn't notice the SATA link speed changed, and there have been a few reports of higher link speeds causing problems. FWIW I always set the jumpers on new drives to force the speed to 1.5Gbit/sec to avoid possible problems with marginal cables...
Comment 22 Rolf Fokkens 2008-10-20 13:50:39 EDT
I don't know what you mean by the mentioned link speed changes. The only thing I see is that all links operate at 3.0 Gbps for the Samsung HD160JJ disks. Only the links speed for the Samsung SH-S203D DVD RW drive is limited to 1.5 Gbps.

I only mention a speed kickback to 1.5 Gbps in comment #1, which is a similar but not the same problem.
Comment 23 Holger Lubitz 2008-10-20 14:14:28 EDT
Then look again.

From your "dmesg of 2.6.25.14-108 kernel":

Sep 22 19:42:46 home07 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 22 19:42:46 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100

From your "dmesg of 2.6.26.3-29 kernel":

Sep 22 21:29:10 home07 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 22 21:29:10 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100

If you are saying that you are seeing 3.0 Gbps again in your _current_ dmesg on the machine, it would very much indicate a cable problem with at least this specific drive.
Comment 24 Rolf Fokkens 2008-10-20 14:35:03 EDT
Created attachment 320898 [details]
dmesg of 2.6.26.5-45 kernel
Comment 25 Rolf Fokkens 2008-10-20 14:44:42 EDT
OK, got it. ata2 runs at 1.5 Gbps from startup.

Attached my current kernel's dmesg. It uses 3.0 Gbps indeed and it runs flawlessly. It runs without NCQ however (SWNCQ=0).

If things run just fine at 3.0 Gbps and there's trouble at 1.5 Gbps I'm not sure about what's the cause and what's the effect.
Comment 26 Fedora Update System 2008-10-20 16:29:09 EDT
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8929
Comment 27 Chuck Ebbert 2008-10-22 01:07:48 EDT
Upstream developers think this could be a power problem that is happening now because NCQ makes the drive use more power. Can you test with SWNCQ enabled and connect the drives to different power connectors (or a separate power supply?)
Comment 28 Fedora Update System 2008-10-23 12:38:36 EDT
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 29 Rolf Fokkens 2008-10-25 07:04:18 EDT
Regarding the potential power probem I replaced all MOLEX-SATA adapter cables with new ones (now none of them splits 1 MOLEX connecter to 2 SATA power connector). This is the result for the 2.6.26.3-29.fc9.x86_64 kernel:

Oct 25 12:52:43 home07 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 25 12:52:43 home07 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Oct 25 12:52:43 home07 kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 25 12:52:43 home07 kernel: ata3.00: status: { DRDY }
Oct 25 12:52:43 home07 kernel: ata3: hard resetting link
Oct 25 12:52:43 home07 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 25 12:52:43 home07 kernel: ata3.00: configured for UDMA/133
Oct 25 12:52:43 home07 kernel: end_request: I/O error, dev sdc, sector 312576468
Oct 25 12:52:43 home07 kernel: md: super_written gets error=-5, uptodate=0
Oct 25 12:52:43 home07 kernel: raid1: Disk failure on sdc5, disabling device.
Oct 25 12:52:43 home07 kernel: raid1: Operation continuing on 2 devices.
Oct 25 12:52:43 home07 kernel: ata3: EH complete
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB)
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 25 12:52:43 home07 kernel: RAID1 conf printout:
Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3
Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5
Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5
Oct 25 12:52:43 home07 kernel: disk 2, wo:1, o:0, dev:sdc5
Oct 25 12:52:43 home07 kernel: RAID1 conf printout:
Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3
Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5
Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5

This happened during a "yum update" which resulted in lots of writes.

Testing with an additional power supply has two practical issues:
* I don't have another power supply
* How do I connect an additional power supply?
Comment 30 Thomas Hilber 2009-07-27 14:10:31 EDT
Thanks to all posters above. I've got exactly the same issue as Rolf Fokkens on my Asus M2NPV-VM (MCP51 based).

My box first was running for over 2.5 years without any problems (kernel 2.6.20 i686) . 

Since I upgraded to kernel 2.6.26 x86_64 (debian lenny) my RAID-5 begun to fail reproducible.
All disks failed repeatedly at the *same* high sector number 152344255 (near partition end).

I fortunately could  always rebuild the RAID before a second disk failed at the same time. So I
never lost any data.

Mostly a few hours after reboot I get:

Jul 19 13:00:08 roja kernel: [ 3953.892925] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 19 13:00:08 roja kernel: [ 3953.892936] ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 19 13:00:08 roja kernel: [ 3953.892938]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 19 13:00:08 roja kernel: [ 3953.892940] ata2.00: status: { DRDY }
Jul 19 13:00:08 roja kernel: [ 3953.892948] ata2: hard resetting link
Jul 19 13:00:08 roja kernel: [ 3954.501492] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 19 13:00:08 roja kernel: [ 3954.517725] ata2.00: configured for UDMA/133
Jul 19 13:00:08 roja kernel: [ 3954.517740] end_request: I/O error, dev sdb, sector 152344255
Jul 19 13:00:08 roja kernel: [ 3954.517745] md: super_written gets error=-5, uptodate=0
Jul 19 13:00:08 roja mdadm[4975]: Fail event detected on md device /dev/md0, component device /dev/sdb1
Jul 19 13:00:08 roja kernel: [ 3954.517749] raid5: Disk failure on sdb1, disabling device.
Jul 19 13:00:08 roja kernel: [ 3954.517750] raid5: Operation continuing on 2 devices.

no problems since 8 days 24/7 operation any more after disabling swncq. Nobody needs to
tell me this is no driver bug:)

BTW:
Another MCP67 based system never showed the problem. WIth exactly the same debian version.
Comment 31 Thomas Hilber 2009-07-27 15:50:21 EDT
just to mention it:

kernel 2.6.30 shows same behavior: only after disabling swncq it's running without disk failures on my MCP51 system. 

again with 2.6.30: the MCP67 appears to be immune against default swncq setting
Comment 32 Jordi Ferrer Plana 2009-11-27 12:10:29 EST
The same problem in 2.6.30-gentoo-r5 (Gentoo Linux):

Nov 27 12:43:50 llebre kernel: ata2: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Nov 27 12:43:50 llebre kernel: ata2: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Nov 27 12:43:50 llebre kernel: dhfis 0x1 dmafis 0x0 sdbfis 0x0
Nov 27 12:43:50 llebre kernel: ata2: ATA_REG 0x40 ERR_REG 0x0
Nov 27 12:43:50 llebre kernel: ata2: tag : dhfis dmafis sdbfis sacitve
Nov 27 12:43:50 llebre kernel: ata2: tag 0x0: 1 0 0 1
Nov 27 12:43:50 llebre kernel: ata2.00: exception Emask 0x10 SAct 0x1 SErr 0x1810000 action 0xe frozen
Nov 27 12:43:50 llebre kernel: ata2.00: hot unplug
Nov 27 12:43:50 llebre kernel: ata2: SError: { PHYRdyChg LinkSeq TrStaTrns }
Nov 27 12:43:50 llebre kernel: ata2.00: cmd 60/08:00:7f:86:40/00:00:0b:00:00/40 tag 0 ncq 4096 in
Nov 27 12:43:50 llebre kernel: res 40/00:00:7f:86:40/00:00:7f:86:40/40 Emask 0x10 (ATA bus error)
Nov 27 12:43:50 llebre kernel: ata2.00: status: { DRDY }
Nov 27 12:43:50 llebre kernel: ata2: hard resetting link
Nov 27 12:43:50 llebre kernel: ata2: link is slow to respond, please be patient (ready=0)
Nov 27 12:43:50 llebre kernel: ata2: SRST failed (errno=-16)
Nov 27 12:43:50 llebre kernel: ata2: hard resetting link
Nov 27 12:43:50 llebre kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 27 12:43:50 llebre kernel: ata2.00: configured for UDMA/133
Nov 27 12:43:50 llebre kernel: ata2: EH complete


And solved it by disabling swncq.

However I realized that if I plug the disk drives to different controllers the system seams also stable.

I understand that I have 3 MCP55 controllers in my PCI:
# lspci |grep MCP55|grep IDE
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)

Initially I had a software RAID (mirror using two drives) with swncq=1 and the two disk drives plugged at the same controller i.e. 00:05.0:

# dir /sys/bus/scsi/devices/|grep "5\.0"
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 2:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0/2:0:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 3:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0/3:0:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host2 -> ../../../devices/pci0000:00/0000:00:05.0/host2
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host3 -> ../../../devices/pci0000:00/0000:00:05.0/host3
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target2:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target3:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0

The problem appeared after upgrading the kernel at some version (when swncq became by default 1 instead of 0). The RAID became degraded since one of the two disks had the above "hardresetting" problems. Since the disk seamed OK after the errors (i.e. "EH Complete" in the log file) I tried to rebuild the RAID mirror and it failed again. I couldn't add it to the software array anymore.

Before rebooting the server and disabling swncq I detached the failed disk (hotunplug) and I hotplugged it in the other controller (i.e. 00:05.1), I re-scaned the scsi bus, then the drive was detected ok. I joined it to the RAID again without any problem and the system was stable.

I do not know if this problem is due to a "power" problem, a software driver problem or a harware chipset problem, but I hope this information could help sata_nv developers in order to find out the problem.

Note You need to log in before you can comment on or make changes to this bug.