Bug 463034

Summary: [sata_nv swncq] kernel 2.6.26.3-29 raid errors: "md: super_written gets error=-5, uptodate=0"
Product: Fedora
Version: 9
Component: kernel
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: medium
Status: CLOSED ERRATA
Keywords: Reopened
Reporter: Rolf Fokkens <rolf>
Assignee: Jeff Garzik <jgarzik>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bugzilla.redhat, holger.lubitz, kernel-maint, nemesis, netfox, peterm, sellis
Fixed In Version: 2.6.26.6-79.fc9
Doc Type: Bug Fix
Last Closed: 2021-05-04 11:42:19 UTC
Attachments:
- output of lspci, dmidecode and dmesg
- dmesg of 2.6.25.14-108 kernel
- dmesg of 2.6.26.3-29 kernel
- output of lspci -vv
- Output of hdparm -i for /dev/sd[a-c]
- Output of smartctl -a /dev/sd[a-c]
- dmesg of 2.6.26.5-45 kernel

Description Rolf Fokkens 2008-09-20 23:04:59 UTC
Created attachment 317294 [details]
output of lspci, dmidecode and dmesg

Description of problem:
Since upgrading to kernel 2.6.26.3-29, my raid5 array apparently breaks quite reproducibly. In this particular case, after several installs, I/O errors show up from any of the SATA disks (sda2, sdb2 and sdc2), each on block 312576374. After two errors the raid5 array is broken and I have to reinstall.

Version-Release number of selected component (if applicable): kernel-2.6.26.3-29.fc9.x86_64

How reproducible: I think 100%

Steps to Reproduce:
1. Install kernel-2.6.26.3-29.fc9.x86_64 on a raid5 system
2. Use it and just wait for a while
3. Within a few hours 2 of 3 disks fail on an I/O error and the raid5 array is broken
  
Actual results:
"Some I/O errors" on any of sda2, sdb2 and sdc2. Each on block 312576574". My own evidence being destroyed in these crases I can only refer to a similar situation:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0709.1/2242.html

Expected results:
great and reliable raid5 performance

Additional info:
I downgraded to kernel kernel-2.6.25.14-108.fc9.x86_64 and haven't had trouble since.
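
For anyone trying to reproduce this, the array state can be checked without destroying the evidence; a generic sketch (device names such as /dev/md0 depend on the local setup):

  cat /proc/mdstat             # quick overview of all md arrays and failed (F) members
  mdadm --detail /dev/md0      # per-array state, including which members are marked faulty
  dmesg | grep -E 'ata|md:'    # libata exceptions and md superblock write errors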

Comment 1 Rolf Fokkens 2008-09-21 08:42:04 UTC
See also: http://lkml.org/lkml/2007/9/1/95

This part:

> [   63.420000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [   63.420000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> cdb 0x0 data 4096 out
> [   63.420000]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
> 0x4 (timeout)
> [   63.420000] ata2.00: status: {DRDY }
> [   63.420000] ata2: hard resetting link
> [   65.720000] ata2: softreset failed (port not ready)
> [   65.720000] ata2: reset failed (errno=-5), retrying in 8 secs
> [   73.420000] ata2: hard resetting link
> [   75.720000] ata2: softreset failed (port not ready)
> [   75.720000] ata2: reset failed (errno=-5), retrying in 8 secs
> [   83.420000] ata2: hard resetting link
> [   85.720000] ata2: softreset failed (port not ready)
> [   85.720000] ata2: reset failed (errno=-5), retrying in 33 secs
> [  118.420000] ata2: limiting SATA link speed to 1.5 Gbps
> [  118.420000] ata2: hard resetting link
> [  120.720000] ata2: softreset failed (port not ready)
> [  120.720000] ata2: reset failed, giving up
> [  120.720000] ata2.00: disabled
> [  120.720000] ata2: EH complete
> [  120.720000] sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET
> driverbyte=DRIVER_OK,SUGGEST_OK
> [  120.720000] end_request: I/O error, dev sdb, sector 625137161
> [  120.720000] md: super_written gets error=-5, uptodate=0

Though I don't recall the link being reset several times like that, I think the drive was kicked out of the array immediately, without me noticing.

Comment 2 Rolf Fokkens 2008-09-21 08:49:19 UTC
also: http://lkml.org/lkml/2007/9/13/49

Again it seems sata_nv related.

I have another system with an M2N-MX SE Plus board based on MCP61 that survived the kernel update without issues. That system however runs raid1. I downgraded the kernel as a precaution though.

Comment 3 Chuck Ebbert 2008-09-22 17:27:05 UTC
There is an upstream patch that disables hardreset for some nvidia chipsets but this doesn't look like the same problem. In the upstream error report the drives never even get detected.

Comment 4 Rolf Fokkens 2008-09-22 18:07:36 UTC
I reinstalled everything and did an installation on a RAID1 array, which allows me to still have a /var/log/messages after a potential failure.

The problem showed up on RAID1 as well! These are the relevant parts of /var/log/messages:

Sep 21 18:58:35 localhost kernel: imklog 3.18.1, log source = /proc/kmsg started.
Sep 21 18:58:35 localhost kernel: Initializing cgroup subsys cpuset
Sep 21 18:58:35 localhost kernel: Linux version 2.6.26.3-29.fc9.x86_64 (mockbuild@) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #1 SMP Wed Sep 3 03:16:37 EDT 2008
Sep 21 18:58:35 localhost kernel: Command line: ro root=/dev/HOME07T/ROOT
...
Sep 21 23:30:26 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 21 23:30:26 localhost kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 21 23:30:26 localhost kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 21 23:30:26 localhost kernel: ata1.00: status: { DRDY }
Sep 21 23:30:26 localhost kernel: ata1: hard resetting link
Sep 21 23:30:27 localhost kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 21 23:30:27 localhost kernel: ata1.00: configured for UDMA/133
Sep 21 23:30:27 localhost kernel: end_request: I/O error, dev sda, sector 312576468
Sep 21 23:30:27 localhost kernel: md: super_written gets error=-5, uptodate=0
Sep 21 23:30:27 localhost kernel: raid1: Disk failure on sda5, disabling device.
Sep 21 23:30:27 localhost kernel: raid1: Operation continuing on 2 devices.
Sep 21 23:30:27 localhost kernel: ata1: EH complete
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 21 23:30:27 localhost kernel: RAID1 conf printout:
Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3
Sep 21 23:30:27 localhost kernel: disk 0, wo:1, o:0, dev:sda5
Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5
Sep 21 23:30:27 localhost kernel: RAID1 conf printout:
Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3
Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5
Sep 21 23:31:16 localhost kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 21 23:31:16 localhost kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 21 23:31:16 localhost kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 21 23:31:16 localhost kernel: ata3.00: status: { DRDY }
Sep 21 23:31:16 localhost kernel: ata3: hard resetting link
Sep 21 23:31:16 localhost kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 21 23:31:16 localhost kernel: ata3.00: configured for UDMA/133
Sep 21 23:31:16 localhost kernel: end_request: I/O error, dev sdc, sector 312576468
Sep 21 23:31:16 localhost kernel: md: super_written gets error=-5, uptodate=0
Sep 21 23:31:16 localhost kernel: raid1: Disk failure on sdc5, disabling device.
Sep 21 23:31:16 localhost kernel: raid1: Operation continuing on 1 devices.
Sep 21 23:31:16 localhost kernel: ata3: EH complete
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB)
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write Protect is off
Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 21 23:31:16 localhost kernel: RAID1 conf printout:
Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3
Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5
Sep 21 23:31:16 localhost kernel: disk 2, wo:1, o:0, dev:sdc5
Sep 21 23:31:16 localhost kernel: RAID1 conf printout:
Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3
Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5

Within one minute, two drives get kicked from the array, both apparently trying to access the same block.

fdisk output (for all disks the same):

Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00032e19

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          62      497983+  fd  Linux raid autodetect
/dev/sda2              63       10261    81923467+  fd  Linux raid autodetect
/dev/sda3           10262       10325      514080   fd  Linux raid autodetect
/dev/sda4           10326       19457    73352790    5  Extended
/dev/sda5           10326       19457    73352758+  fd  Linux raid autodetect

Comment 5 Rolf Fokkens 2008-09-22 19:43:03 UTC
Created attachment 317413 [details]
dmesg of 2.6.25.14-108 kernel

Comment 6 Rolf Fokkens 2008-09-22 19:43:45 UTC
Created attachment 317414 [details]
dmesg of 2.6.26.3-29 kernel

Comment 7 Rolf Fokkens 2008-09-22 19:46:41 UTC
A potentially interesting difference between the sata_nv SATA drivers of the 2.6.25 and 2.6.26 kernels:

T:[rolf@home07 tmp]$ grep NCQ msg-2.6.25.14-108.fc9.x86_64 
Sep 22 19:42:46 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
Sep 22 19:42:46 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
Sep 22 19:42:46 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32)
T:[rolf@home07 tmp]$ grep NCQ msg-2.6.26.3-29.fc9.x86_64 
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.0: Using SWNCQ mode
Sep 22 21:29:10 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.1: Using SWNCQ mode
Sep 22 21:29:10 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32)
Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.2: Using SWNCQ mode
T:[rolf@home07 tmp]$ 

See the attachments for full details.

Comment 8 Rolf Fokkens 2008-09-22 20:16:30 UTC
Google "SWNCQ" and "SATA_NV" results in hits which describe familiar problems.

Comment 9 Chuck Ebbert 2008-09-25 12:36:33 UTC
Created attachment 317683 [details]
output of lspci -vv

Please don't attach tar files.

Comment 10 Chuck Ebbert 2008-09-25 12:41:05 UTC
You can disable SWNCQ by editing /etc/modprobe.conf and adding these lines, then rebuilding the initrd:

  options scsi_hostadapter1 swncq=0 
  options sata_nv swncq=0 

If you don't have an alias line for scsi_hostadapter1 you can omit the first line.
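
After changing /etc/modprobe.conf the initrd has to be regenerated so the new options are picked up at boot; on Fedora 9 that would typically look roughly like this (a sketch, the kernel version string must match the installed kernel):

  cp /boot/initrd-2.6.26.3-29.fc9.x86_64.img /boot/initrd-2.6.26.3-29.fc9.x86_64.img.bak
  mkinitrd -f /boot/initrd-2.6.26.3-29.fc9.x86_64.img 2.6.26.3-29.fc9.x86_64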

Comment 11 Rolf Fokkens 2008-09-25 22:21:32 UTC
I'm already running with a modified modprobe.conf and initrd; things seem fine now. However, the system had also run OK for a day before that without a hitch, so I'll use swncq=0 for a few days and then report the status again.

Comment 12 Rolf Fokkens 2008-10-02 16:45:47 UTC
Things are still fine with swncq=0...

Comment 13 Chuck Ebbert 2008-10-13 20:26:02 UTC
disabled swncq by default in 2.6.26.6-69

Comment 14 Chuck Ebbert 2008-10-14 08:14:06 UTC
Can you post the output of "hdparm -I" and "smartctl -a" commands against the drive(s) having the problem?

Comment 15 Holger Lubitz 2008-10-14 10:13:08 UTC
The "hits which describe familiar problems" seem to be based on pre-mainline versions of the swncq patch. The issue Jon Ivar Rykkelid had (your links in description and comment #2) didn't even seem to involve swncq.

Since you note yourself that another system doesn't show the issue, may I suggest checking and maybe replacing your cabling? Asus is said to have shipped some less-than-perfect SATA cables in the past. ata2 turning up as 3.0 Gbps in your 2.6.25 dmesg and only 1.5 Gbps in your 2.6.26 dmesg seems to support that theory as well.

swncq was off by default when it first hit mainline and that default was changed when it was believed to be safe. If it can cause data corruption, disabling it again should hit mainline immediately (together with a big red warning advising against it - also the people at nvidia should be looking into this). But as it is, this seems premature to me.

Comment 16 Rolf Fokkens 2008-10-14 17:09:24 UTC
Created attachment 320329 [details]
Output of hdparm -i for /dev/sd[a-c]

Comment 17 Rolf Fokkens 2008-10-14 17:10:35 UTC
Created attachment 320330 [details]
Output of smartctl -a /dev/sd[a-c]

Comment 18 Rolf Fokkens 2008-10-14 17:23:26 UTC
As noted in Comment #2: The other system w/o the problems is MCP61 based. The system with the problems is MCP55 based. The former has default swncq=0 while the latter has default swncq=1 (as coded in the driver).

This explains the difference between the systems with respect to the bug.

For the moment I'm just happy with a stable system by setting swncq=0. Making my system unstable again may result in a time-consuming reinstall, which I don't enjoy very much.

However... if I stay the only person reporting this problem, it may be a problem that's specific to my system. In that case I can try to do experiments (on your request) to provide y'all with more info.
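
For reference, the swncq value the driver actually ended up with can be checked at runtime via sysfs (a sketch; this assumes the swncq module parameter is exported read-only through sysfs):

  cat /sys/module/sata_nv/parameters/swncq    # 1 = SWNCQ enabled, 0 = disabled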

Comment 19 Fedora Update System 2008-10-14 19:53:24 UTC
kernel-2.6.26.6-46.fc8 has been submitted as an update for Fedora 8.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-46.fc8

Comment 20 Fedora Update System 2008-10-14 20:01:29 UTC
kernel-2.6.26.6-71.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-71.fc9

Comment 21 Chuck Ebbert 2008-10-20 14:23:48 UTC
I didn't notice that the SATA link speed had changed, and there have been a few reports of higher link speeds causing problems. FWIW, I always set the jumpers on new drives to force the speed to 1.5 Gbit/sec to avoid possible problems with marginal cables...

Comment 22 Rolf Fokkens 2008-10-20 17:50:39 UTC
I don't know what you mean by the mentioned link speed changes. The only thing I see is that all links operate at 3.0 Gbps for the Samsung HD160JJ disks. Only the link speed for the Samsung SH-S203D DVD RW drive is limited to 1.5 Gbps.

I only mention a speed fallback to 1.5 Gbps in comment #1, which is a similar but not the same problem.

Comment 23 Holger Lubitz 2008-10-20 18:14:28 UTC
Then look again.

From your "dmesg of 2.6.25.14-108 kernel":

Sep 22 19:42:46 home07 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 22 19:42:46 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100

From your "dmesg of 2.6.26.3-29 kernel":

Sep 22 21:29:10 home07 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 22 21:29:10 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100

If you are saying that you are seeing 3.0 Gbps again in your _current_ dmesg on the machine, it would very much indicate a cable problem with at least this specific drive.

Comment 24 Rolf Fokkens 2008-10-20 18:35:03 UTC
Created attachment 320898 [details]
dmesg of 2.6.26.5-45 kernel

Comment 25 Rolf Fokkens 2008-10-20 18:44:42 UTC
OK, got it. ata2 runs at 1.5 Gbps from startup.

Attached is my current kernel's dmesg. It does indeed use 3.0 Gbps and it runs flawlessly. It runs without NCQ however (swncq=0).

If things run just fine at 3.0 Gbps and there's trouble at 1.5 Gbps, I'm not sure what's the cause and what's the effect.

Comment 26 Fedora Update System 2008-10-20 20:29:09 UTC
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8929

Comment 27 Chuck Ebbert 2008-10-22 05:07:48 UTC
Upstream developers think this could be a power problem that is showing up now because NCQ makes the drive use more power. Can you test with SWNCQ enabled and the drives connected to different power connectors (or a separate power supply)?

Comment 28 Fedora Update System 2008-10-23 16:38:36 UTC
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 29 Rolf Fokkens 2008-10-25 11:04:18 UTC
Regarding the potential power problem, I replaced all MOLEX-SATA adapter cables with new ones (now none of them splits one MOLEX connector into two SATA power connectors). This is the result with the 2.6.26.3-29.fc9.x86_64 kernel:

Oct 25 12:52:43 home07 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 25 12:52:43 home07 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Oct 25 12:52:43 home07 kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 25 12:52:43 home07 kernel: ata3.00: status: { DRDY }
Oct 25 12:52:43 home07 kernel: ata3: hard resetting link
Oct 25 12:52:43 home07 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 25 12:52:43 home07 kernel: ata3.00: configured for UDMA/133
Oct 25 12:52:43 home07 kernel: end_request: I/O error, dev sdc, sector 312576468
Oct 25 12:52:43 home07 kernel: md: super_written gets error=-5, uptodate=0
Oct 25 12:52:43 home07 kernel: raid1: Disk failure on sdc5, disabling device.
Oct 25 12:52:43 home07 kernel: raid1: Operation continuing on 2 devices.
Oct 25 12:52:43 home07 kernel: ata3: EH complete
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB)
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 25 12:52:43 home07 kernel: RAID1 conf printout:
Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3
Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5
Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5
Oct 25 12:52:43 home07 kernel: disk 2, wo:1, o:0, dev:sdc5
Oct 25 12:52:43 home07 kernel: RAID1 conf printout:
Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3
Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5
Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5

This happened during a "yum update" which resulted in lots of writes.

Testing with an additional power supply has two practical issues:
* I don't have another power supply
* How do I connect an additional power supply?

Comment 30 Thomas Hilber 2009-07-27 18:10:31 UTC
Thanks to all posters above. I've got exactly the same issue as Rolf Fokkens on my Asus M2NPV-VM (MCP51 based).

My box had first been running for over 2.5 years without any problems (kernel 2.6.20, i686).

Since I upgraded to kernel 2.6.26 x86_64 (Debian lenny), my RAID-5 began to fail reproducibly.
All disks failed repeatedly at the *same* high sector number 152344255 (near the partition end).

Fortunately I could always rebuild the RAID before a second disk failed at the same time, so I
never lost any data.
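
Rebuilding after such a failure typically means removing the kicked member and re-adding it, roughly (a sketch; the array and partition names here match the log below):

  mdadm /dev/md0 --remove /dev/sdb1   # drop the member that was marked faulty
  mdadm /dev/md0 --add /dev/sdb1      # add it back and let the array resync
  cat /proc/mdstat                    # watch the resync progress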

Usually a few hours after a reboot I get:

Jul 19 13:00:08 roja kernel: [ 3953.892925] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 19 13:00:08 roja kernel: [ 3953.892936] ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 19 13:00:08 roja kernel: [ 3953.892938]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 19 13:00:08 roja kernel: [ 3953.892940] ata2.00: status: { DRDY }
Jul 19 13:00:08 roja kernel: [ 3953.892948] ata2: hard resetting link
Jul 19 13:00:08 roja kernel: [ 3954.501492] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 19 13:00:08 roja kernel: [ 3954.517725] ata2.00: configured for UDMA/133
Jul 19 13:00:08 roja kernel: [ 3954.517740] end_request: I/O error, dev sdb, sector 152344255
Jul 19 13:00:08 roja kernel: [ 3954.517745] md: super_written gets error=-5, uptodate=0
Jul 19 13:00:08 roja mdadm[4975]: Fail event detected on md device /dev/md0, component device /dev/sdb1
Jul 19 13:00:08 roja kernel: [ 3954.517749] raid5: Disk failure on sdb1, disabling device.
Jul 19 13:00:08 roja kernel: [ 3954.517750] raid5: Operation continuing on 2 devices.

No more problems after 8 days of 24/7 operation since disabling swncq. Nobody needs to
tell me this is not a driver bug :)

BTW:
Another MCP67-based system never showed the problem, with exactly the same Debian version.

Comment 31 Thomas Hilber 2009-07-27 19:50:21 UTC
just to mention it:

Kernel 2.6.30 shows the same behavior: only after disabling swncq does it run without disk failures on my MCP51 system.

Again with 2.6.30: the MCP67 appears to be immune to the default swncq setting.

Comment 32 Jordi Ferrer Plana 2009-11-27 17:10:29 UTC
The same problem in 2.6.30-gentoo-r5 (Gentoo Linux):

Nov 27 12:43:50 llebre kernel: ata2: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Nov 27 12:43:50 llebre kernel: ata2: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Nov 27 12:43:50 llebre kernel: dhfis 0x1 dmafis 0x0 sdbfis 0x0
Nov 27 12:43:50 llebre kernel: ata2: ATA_REG 0x40 ERR_REG 0x0
Nov 27 12:43:50 llebre kernel: ata2: tag : dhfis dmafis sdbfis sacitve
Nov 27 12:43:50 llebre kernel: ata2: tag 0x0: 1 0 0 1
Nov 27 12:43:50 llebre kernel: ata2.00: exception Emask 0x10 SAct 0x1 SErr 0x1810000 action 0xe frozen
Nov 27 12:43:50 llebre kernel: ata2.00: hot unplug
Nov 27 12:43:50 llebre kernel: ata2: SError: { PHYRdyChg LinkSeq TrStaTrns }
Nov 27 12:43:50 llebre kernel: ata2.00: cmd 60/08:00:7f:86:40/00:00:0b:00:00/40 tag 0 ncq 4096 in
Nov 27 12:43:50 llebre kernel: res 40/00:00:7f:86:40/00:00:7f:86:40/40 Emask 0x10 (ATA bus error)
Nov 27 12:43:50 llebre kernel: ata2.00: status: { DRDY }
Nov 27 12:43:50 llebre kernel: ata2: hard resetting link
Nov 27 12:43:50 llebre kernel: ata2: link is slow to respond, please be patient (ready=0)
Nov 27 12:43:50 llebre kernel: ata2: SRST failed (errno=-16)
Nov 27 12:43:50 llebre kernel: ata2: hard resetting link
Nov 27 12:43:50 llebre kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 27 12:43:50 llebre kernel: ata2.00: configured for UDMA/133
Nov 27 12:43:50 llebre kernel: ata2: EH complete


And solved it by disabling swncq.

However I realized that if I plug the disk drives into different controllers, the system also seems stable.

I understand that I have 3 MCP55 controllers on my PCI bus:
# lspci |grep MCP55|grep IDE
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)

Initially I had a software RAID (mirror using two drives) with swncq=1 and the two disk drives plugged into the same controller, i.e. 00:05.0:

# dir /sys/bus/scsi/devices/|grep "5\.0"
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 2:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0/2:0:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 3:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0/3:0:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host2 -> ../../../devices/pci0000:00/0000:00:05.0/host2
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host3 -> ../../../devices/pci0000:00/0000:00:05.0/host3
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target2:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0
lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target3:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0

The problem appeared after upgrading the kernel to some version where swncq became 1 by default instead of 0. The RAID became degraded since one of the two disks had the above "hard resetting" problems. Since the disk seemed OK after the errors (i.e. "EH complete" in the log file), I tried to rebuild the RAID mirror and it failed again. I couldn't add it to the software array anymore.

Before rebooting the server and disabling swncq, I detached the failed disk (hot-unplug), hot-plugged it into the other controller (i.e. 00:05.1) and re-scanned the SCSI bus, after which the drive was detected OK. I joined it to the RAID again without any problem and the system was stable.
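
A hot-unplug and rescan like that is usually done through sysfs, roughly (a sketch; the device and host numbers depend on which controller the drive is plugged into):

  echo 1 > /sys/block/sdb/device/delete            # detach the failed disk from the SCSI layer
  echo '- - -' > /sys/class/scsi_host/host4/scan   # rescan the controller for newly attached devices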

I do not know whether this is due to a "power" problem, a software driver problem or a hardware chipset problem, but I hope this information can help the sata_nv developers track down the problem.

Comment 33 Steven Ellis 2021-01-04 10:40:00 UTC
I'm seeing the same issue in an M2NPV-VM motherboard with a new generation WD30EFRX drive on the NForce sata_nv SATA ports even with sata_nv.swncq=0 

Jan 04 23:29:57 kvm.server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 04 23:29:57 kvm.server kernel: ata3.00: failed command: WRITE DMA EXT
Jan 04 23:29:57 kvm.server kernel: ata3.00: cmd 35/00:00:f0:9f:7e/00:04:b4:00:00/e0 tag 0 dma 524288 out
                                                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 04 23:29:57 kvm.server kernel: ata3.00: status: { DRDY }

System is currently running RHEL 7.8

Comment 34 Steven Ellis 2021-01-04 10:42:11 UTC
Also confirmed that NCQ is already off on the device:

cat /sys/block/sda/device/queue_depth
1
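
For completeness, NCQ can also be toggled per device at runtime through the same sysfs file; a sketch (a depth of 1 effectively disables NCQ, a larger depth such as 31 re-enables it if the driver allows it):

  echo 1 > /sys/block/sda/device/queue_depth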

Comment 35 Dominik 'Rathann' Mierzejewski 2021-05-04 11:42:19 UTC
(In reply to Steven Ellis from comment #33)
> I'm seeing the same issue in an M2NPV-VM motherboard with a new generation
> WD30EFRX drive on the NForce sata_nv SATA ports even with sata_nv.swncq=0 
> 
> Jan 04 23:29:57 kvm.server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> Jan 04 23:29:57 kvm.server kernel: ata3.00: failed command: WRITE DMA EXT
> Jan 04 23:29:57 kvm.server kernel: ata3.00: cmd 35/00:00:f0:9f:7e/00:04:b4:00:00/e0 tag 0 dma 524288 out
>                                                            res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Jan 04 23:29:57 kvm.server kernel: ata3.00: status: { DRDY }
> 
> System is currently running RHEL 7.8

You might want to file a new bug against RHEL, then. This is an old Fedora 9 bug report that should stay closed.