Bug 463034
Summary: | [sata_nv swncq] kernel 2.6.26.3-29 raid errors: "md: super_written gets error=-5, uptodate=0" | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Rolf Fokkens <rolf> | ||||||||||||||||
Component: | kernel | Assignee: | Jeff Garzik <jgarzik> | ||||||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||||||||
Severity: | urgent | Docs Contact: | |||||||||||||||||
Priority: | medium | ||||||||||||||||||
Version: | 9 | CC: | bugzilla.redhat, holger.lubitz, kernel-maint, nemesis, netfox, peterm, sellis | ||||||||||||||||
Target Milestone: | --- | Keywords: | Reopened | ||||||||||||||||
Target Release: | --- | ||||||||||||||||||
Hardware: | x86_64 | ||||||||||||||||||
OS: | Linux | ||||||||||||||||||
Whiteboard: | |||||||||||||||||||
Fixed In Version: | 2.6.26.6-79.fc9 | Doc Type: | Bug Fix | ||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||
Last Closed: | 2021-05-04 11:42:19 UTC | Type: | --- | ||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
Embargoed: | |||||||||||||||||||
Attachments: |
|
Description
Rolf Fokkens
2008-09-20 23:04:59 UTC
See also: http://lkml.org/lkml/2007/9/1/95 This part: > [ 63.420000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen > [ 63.420000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 > cdb 0x0 data 4096 out > [ 63.420000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask > 0x4 (timeout) > [ 63.420000] ata2.00: status: {DRDY } > [ 63.420000] ata2: hard resetting link > [ 65.720000] ata2: softreset failed (port not ready) > [ 65.720000] ata2: reset failed (errno=-5), retrying in 8 secs > [ 73.420000] ata2: hard resetting link > [ 75.720000] ata2: softreset failed (port not ready) > [ 75.720000] ata2: reset failed (errno=-5), retrying in 8 secs > [ 83.420000] ata2: hard resetting link > [ 85.720000] ata2: softreset failed (port not ready) > [ 85.720000] ata2: reset failed (errno=-5), retrying in 33 secs > [ 118.420000] ata2: limiting SATA link speed to 1.5 Gbps > [ 118.420000] ata2: hard resetting link > [ 120.720000] ata2: softreset failed (port not ready) > [ 120.720000] ata2: reset failed, giving up > [ 120.720000] ata2.00: disabled > [ 120.720000] ata2: EH complete > [ 120.720000] sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET > driverbyte=DRIVER_OK,SUGGEST_OK > [ 120.720000] end_request: I/O error, dev sdb, sector 625137161 > [ 120.720000] md: super_written gets error=-5, uptodate=0 Though I don't recall the attempts to reset the link several times, I think that the drive was kicked out of the array immediately and unnoticed by me. also: http://lkml.org/lkml/2007/9/13/49 Again it seems sata_nv related. I have another system with an M2N-MX SE Plus board based om MCP61 that survived the kernel update without issues. That system however runs raid1. I downgraded the kernel as a precaution though There is an upstream patch that disables hardreset for som nvidia chipsets but this doesn't look like the same problem. In the upstream error report the drives never even get detected. I reinstalled everything, and did an installation on a RAID1 array, which allowed me to have a /var/log/messages after a potential failure. The problem showed up on RAID1 as well! This is the relevant /var/log/messages parts: Sep 21 18:58:35 localhost kernel: imklog 3.18.1, log source = /proc/kmsg started. Sep 21 18:58:35 localhost kernel: Initializing cgroup subsys cpuset Sep 21 18:58:35 localhost kernel: Linux version 2.6.26.3-29.fc9.x86_64 (mockbuild@) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #1 SMP Wed Sep 3 03:16:37 EDT 2008 Sep 21 18:58:35 localhost kernel: Command line: ro root=/dev/HOME07T/ROOT ... Sep 21 23:30:26 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Sep 21 23:30:26 localhost kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Sep 21 23:30:26 localhost kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Sep 21 23:30:26 localhost kernel: ata1.00: status: { DRDY } Sep 21 23:30:26 localhost kernel: ata1: hard resetting link Sep 21 23:30:27 localhost kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Sep 21 23:30:27 localhost kernel: ata1.00: configured for UDMA/133 Sep 21 23:30:27 localhost kernel: end_request: I/O error, dev sda, sector 312576468 Sep 21 23:30:27 localhost kernel: md: super_written gets error=-5, uptodate=0 Sep 21 23:30:27 localhost kernel: raid1: Disk failure on sda5, disabling device. Sep 21 23:30:27 localhost kernel: raid1: Operation continuing on 2 devices. Sep 21 23:30:27 localhost kernel: ata1: EH complete Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB) Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off Sep 21 23:30:27 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Sep 21 23:30:27 localhost kernel: RAID1 conf printout: Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3 Sep 21 23:30:27 localhost kernel: disk 0, wo:1, o:0, dev:sda5 Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5 Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5 Sep 21 23:30:27 localhost kernel: RAID1 conf printout: Sep 21 23:30:27 localhost kernel: --- wd:2 rd:3 Sep 21 23:30:27 localhost kernel: disk 1, wo:0, o:1, dev:sdb5 Sep 21 23:30:27 localhost kernel: disk 2, wo:0, o:1, dev:sdc5 Sep 21 23:31:16 localhost kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Sep 21 23:31:16 localhost kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Sep 21 23:31:16 localhost kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Sep 21 23:31:16 localhost kernel: ata3.00: status: { DRDY } Sep 21 23:31:16 localhost kernel: ata3: hard resetting link Sep 21 23:31:16 localhost kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Sep 21 23:31:16 localhost kernel: ata3.00: configured for UDMA/133 Sep 21 23:31:16 localhost kernel: end_request: I/O error, dev sdc, sector 312576468 Sep 21 23:31:16 localhost kernel: md: super_written gets error=-5, uptodate=0 Sep 21 23:31:16 localhost kernel: raid1: Disk failure on sdc5, disabling device. Sep 21 23:31:16 localhost kernel: raid1: Operation continuing on 1 devices. Sep 21 23:31:16 localhost kernel: ata3: EH complete Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB) Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write Protect is off Sep 21 23:31:16 localhost kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Sep 21 23:31:16 localhost kernel: RAID1 conf printout: Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3 Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5 Sep 21 23:31:16 localhost kernel: disk 2, wo:1, o:0, dev:sdc5 Sep 21 23:31:16 localhost kernel: RAID1 conf printout: Sep 21 23:31:16 localhost kernel: --- wd:1 rd:3 Sep 21 23:31:16 localhost kernel: disk 1, wo:0, o:1, dev:sdb5 Within a 1 minute two drives get kicked from the array, both apparently tryin toe access the same block. fdisk output (for all disks the same): Disk /dev/sda: 160.0 GB, 160041885696 bytes 255 heads, 63 sectors/track, 19457 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk identifier: 0x00032e19 Device Boot Start End Blocks Id System /dev/sda1 * 1 62 497983+ fd Linux raid autodetect /dev/sda2 63 10261 81923467+ fd Linux raid autodetect /dev/sda3 10262 10325 514080 fd Linux raid autodetect /dev/sda4 10326 19457 73352790 5 Extended /dev/sda5 10326 19457 73352758+ fd Linux raid autodetect Created attachment 317413 [details]
dmesg of 2.6.25.14-108 kernel
Created attachment 317414 [details]
dmesg of 2.6.26.3-29 kernel
A potential interesting difference between de sata_nv SATA drivers of the 2.6.25 and 2.6.26 kernel: T:[rolf@home07 tmp]$ grep NCQ msg-2.6.25.14-108.fc9.x86_64 Sep 22 19:42:46 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32) Sep 22 19:42:46 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32) Sep 22 19:42:46 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 0/32) T:[rolf@home07 tmp]$ grep NCQ msg-2.6.26.3-29.fc9.x86_64 Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.0: Using SWNCQ mode Sep 22 21:29:10 home07 kernel: ata1.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32) Sep 22 21:29:10 home07 kernel: ata2.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32) Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.1: Using SWNCQ mode Sep 22 21:29:10 home07 kernel: ata3.00: 312581808 sectors, multi 1: LBA48 NCQ (depth 31/32) Sep 22 21:29:10 home07 kernel: sata_nv 0000:00:05.2: Using SWNCQ mode T:[rolf@home07 tmp]$ See the attachements for full details Google "SWNCQ" and "SATA_NV" results in hits which describe familiar problems. Created attachment 317683 [details]
output of lspci -vv
Please don't attach tar files.
You can disable SWNCQ by editing /etc/modprobe.conf and adding these lines, then rebuilding the initrd: options scsi_hostadapter1 swncq=0 options sata_nv swncq=0 If you don't have an alias line for scsi_hostadapter1 you can omit the first line. I'm already running with a modified modprobe.conf and initrd, things seem fine now. The system ran OK however for a day before that without a hitch, so I'll use swncq=0 for a few days and then I'll report the status again. Things are still fine with swncq=0... disabled swncq by default in 2.6.26.6-69 Can you post the output of "hdparm -I" and "smartctl -a" commands against the drive(s) having the problem? The "hits which describe familiar problems" seem to be based on pre-mainline versions of the swncq patch. The issue Jon Ivar Rykkelid had (your links in description and comment #2) didn't even seem to involve swncq. As you note yourself that another system doesn't show the issue, may I suggest checking and maybe replacing your cabling? Asus is said to have been shipping some less than perfect SATA cables in the past. ata2 turning up as 3.0 Gbps in your 2.6.25 dmesg and only 1.5 Gbps in your 2.6.26 dmesg seems to support that theory as well. swncq was off by default when it first hit mainline and that default was changed when it was believed to be safe. If it can cause data corruption, disabling it again should hit mainline immediately (together with a big red warning advising against it - also the people at nvidia should be looking into this). But as it is, this seems premature to me. Created attachment 320329 [details]
Output of hdparm -i for /dev/sd[a-c]
Created attachment 320330 [details]
Output of smartctl -a /dev/sd[a-c]
As noted in Comment #2: The other system w/o the problems is MCP61 based. The system with the problems is MCP55 based. The former has default sqncq=0 while the latter has default sqncq=1 (as coded in the driver). This explains the difference between the systems with respect to the bug. For the moment I'm just happy with a stable system by setting swncq=0. Making my system unstable may result in a time consuming reinstall of it, which I'm not enjoying very much. However...if I stay the only person reporting this problem, it may be a problem that's specifically related to my system. In that case I can try to do experiments (by your request) to provide y'all with more info. kernel-2.6.26.6-46.fc8 has been submitted as an update for Fedora 8. http://admin.fedoraproject.org/updates/kernel-2.6.26.6-46.fc8 kernel-2.6.26.6-71.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.26.6-71.fc9 I didn't notice the SATA link speed changed, and there have been a few reports of higher link speeds causing problems. FWIW I always set the jumpers on new drives to force the speed to 1.5Gbit/sec to avoid possible problems with marginal cables... I don't know what you mean by the mentioned link speed changes. The only thing I see is that all links operate at 3.0 Gbps for the Samsung HD160JJ disks. Only the links speed for the Samsung SH-S203D DVD RW drive is limited to 1.5 Gbps. I only mention a speed kickback to 1.5 Gbps in comment #1, which is a similar but not the same problem. Then look again. From your "dmesg of 2.6.25.14-108 kernel": Sep 22 19:42:46 home07 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Sep 22 19:42:46 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100 From your "dmesg of 2.6.26.3-29 kernel": Sep 22 21:29:10 home07 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Sep 22 21:29:10 home07 kernel: ata2.00: ATA-7: SAMSUNG HD160JJ, WU100-33, max UDMA/100 If you are saying that you are seeing 3.0 Gbps again in your _current_ dmesg on the machine, it would very much indicate a cable problem with at least this specific drive. Created attachment 320898 [details]
dmesg of 2.6.26.5-45 kernel
OK, got it. ata2 runs at 1.5 Gbps from startup. Attached my current kernel's dmesg. It uses 3.0 Gbps indeed and it runs flawlessly. It runs without NCQ however (SWNCQ=0). If things run just fine at 3.0 Gbps and there's trouble at 1.5 Gbps I'm not sure about what's the cause and what's the effect. kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8929 Upstream developers think this could be a power problem that is happening now because NCQ makes the drive use more power. Can you test with SWNCQ enabled and connect the drives to different power connectors (or a separate power supply?) kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report. Regarding the potential power probem I replaced all MOLEX-SATA adapter cables with new ones (now none of them splits 1 MOLEX connecter to 2 SATA power connector). This is the result for the 2.6.26.3-29.fc9.x86_64 kernel: Oct 25 12:52:43 home07 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Oct 25 12:52:43 home07 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Oct 25 12:52:43 home07 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 25 12:52:43 home07 kernel: ata3.00: status: { DRDY } Oct 25 12:52:43 home07 kernel: ata3: hard resetting link Oct 25 12:52:43 home07 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Oct 25 12:52:43 home07 kernel: ata3.00: configured for UDMA/133 Oct 25 12:52:43 home07 kernel: end_request: I/O error, dev sdc, sector 312576468 Oct 25 12:52:43 home07 kernel: md: super_written gets error=-5, uptodate=0 Oct 25 12:52:43 home07 kernel: raid1: Disk failure on sdc5, disabling device. Oct 25 12:52:43 home07 kernel: raid1: Operation continuing on 2 devices. Oct 25 12:52:43 home07 kernel: ata3: EH complete Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB) Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write Protect is off Oct 25 12:52:43 home07 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Oct 25 12:52:43 home07 kernel: RAID1 conf printout: Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3 Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5 Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5 Oct 25 12:52:43 home07 kernel: disk 2, wo:1, o:0, dev:sdc5 Oct 25 12:52:43 home07 kernel: RAID1 conf printout: Oct 25 12:52:43 home07 kernel: --- wd:2 rd:3 Oct 25 12:52:43 home07 kernel: disk 0, wo:0, o:1, dev:sda5 Oct 25 12:52:43 home07 kernel: disk 1, wo:0, o:1, dev:sdb5 This happened during a "yum update" which resulted in lots of writes. Testing with an additional power supply has two practical issues: * I don't have another power supply * How do I connect an additional power supply? Thanks to all posters above. I've got exactly the same issue as Rolf Fokkens on my Asus M2NPV-VM (MCP51 based). My box first was running for over 2.5 years without any problems (kernel 2.6.20 i686) . Since I upgraded to kernel 2.6.26 x86_64 (debian lenny) my RAID-5 begun to fail reproducible. All disks failed repeatedly at the *same* high sector number 152344255 (near partition end). I fortunately could always rebuild the RAID before a second disk failed at the same time. So I never lost any data. Mostly a few hours after reboot I get: Jul 19 13:00:08 roja kernel: [ 3953.892925] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Jul 19 13:00:08 roja kernel: [ 3953.892936] ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Jul 19 13:00:08 roja kernel: [ 3953.892938] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Jul 19 13:00:08 roja kernel: [ 3953.892940] ata2.00: status: { DRDY } Jul 19 13:00:08 roja kernel: [ 3953.892948] ata2: hard resetting link Jul 19 13:00:08 roja kernel: [ 3954.501492] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Jul 19 13:00:08 roja kernel: [ 3954.517725] ata2.00: configured for UDMA/133 Jul 19 13:00:08 roja kernel: [ 3954.517740] end_request: I/O error, dev sdb, sector 152344255 Jul 19 13:00:08 roja kernel: [ 3954.517745] md: super_written gets error=-5, uptodate=0 Jul 19 13:00:08 roja mdadm[4975]: Fail event detected on md device /dev/md0, component device /dev/sdb1 Jul 19 13:00:08 roja kernel: [ 3954.517749] raid5: Disk failure on sdb1, disabling device. Jul 19 13:00:08 roja kernel: [ 3954.517750] raid5: Operation continuing on 2 devices. no problems since 8 days 24/7 operation any more after disabling swncq. Nobody needs to tell me this is no driver bug:) BTW: Another MCP67 based system never showed the problem. WIth exactly the same debian version. just to mention it: kernel 2.6.30 shows same behavior: only after disabling swncq it's running without disk failures on my MCP51 system. again with 2.6.30: the MCP67 appears to be immune against default swncq setting The same problem in 2.6.30-gentoo-r5 (Gentoo Linux): Nov 27 12:43:50 llebre kernel: ata2: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1 Nov 27 12:43:50 llebre kernel: ata2: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0 Nov 27 12:43:50 llebre kernel: dhfis 0x1 dmafis 0x0 sdbfis 0x0 Nov 27 12:43:50 llebre kernel: ata2: ATA_REG 0x40 ERR_REG 0x0 Nov 27 12:43:50 llebre kernel: ata2: tag : dhfis dmafis sdbfis sacitve Nov 27 12:43:50 llebre kernel: ata2: tag 0x0: 1 0 0 1 Nov 27 12:43:50 llebre kernel: ata2.00: exception Emask 0x10 SAct 0x1 SErr 0x1810000 action 0xe frozen Nov 27 12:43:50 llebre kernel: ata2.00: hot unplug Nov 27 12:43:50 llebre kernel: ata2: SError: { PHYRdyChg LinkSeq TrStaTrns } Nov 27 12:43:50 llebre kernel: ata2.00: cmd 60/08:00:7f:86:40/00:00:0b:00:00/40 tag 0 ncq 4096 in Nov 27 12:43:50 llebre kernel: res 40/00:00:7f:86:40/00:00:7f:86:40/40 Emask 0x10 (ATA bus error) Nov 27 12:43:50 llebre kernel: ata2.00: status: { DRDY } Nov 27 12:43:50 llebre kernel: ata2: hard resetting link Nov 27 12:43:50 llebre kernel: ata2: link is slow to respond, please be patient (ready=0) Nov 27 12:43:50 llebre kernel: ata2: SRST failed (errno=-16) Nov 27 12:43:50 llebre kernel: ata2: hard resetting link Nov 27 12:43:50 llebre kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Nov 27 12:43:50 llebre kernel: ata2.00: configured for UDMA/133 Nov 27 12:43:50 llebre kernel: ata2: EH complete And solved it by disabling swncq. However I realized that if I plug the disk drives to different controllers the system seams also stable. I understand that I have 3 MCP55 controllers in my PCI: # lspci |grep MCP55|grep IDE 00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3) 00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3) 00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3) Initially I had a software RAID (mirror using two drives) with swncq=1 and the two disk drives plugged at the same controller i.e. 00:05.0: # dir /sys/bus/scsi/devices/|grep "5\.0" lrwxrwxrwx 1 root root 0 2009-11-27 17:45 2:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0/2:0:0:0 lrwxrwxrwx 1 root root 0 2009-11-27 17:45 3:0:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0/3:0:0:0 lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host2 -> ../../../devices/pci0000:00/0000:00:05.0/host2 lrwxrwxrwx 1 root root 0 2009-11-27 17:45 host3 -> ../../../devices/pci0000:00/0000:00:05.0/host3 lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target2:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host2/target2:0:0 lrwxrwxrwx 1 root root 0 2009-11-27 17:45 target3:0:0 -> ../../../devices/pci0000:00/0000:00:05.0/host3/target3:0:0 The problem appeared after upgrading the kernel at some version (when swncq became by default 1 instead of 0). The RAID became degraded since one of the two disks had the above "hardresetting" problems. Since the disk seamed OK after the errors (i.e. "EH Complete" in the log file) I tried to rebuild the RAID mirror and it failed again. I couldn't add it to the software array anymore. Before rebooting the server and disabling swncq I detached the failed disk (hotunplug) and I hotplugged it in the other controller (i.e. 00:05.1), I re-scaned the scsi bus, then the drive was detected ok. I joined it to the RAID again without any problem and the system was stable. I do not know if this problem is due to a "power" problem, a software driver problem or a harware chipset problem, but I hope this information could help sata_nv developers in order to find out the problem. I'm seeing the same issue in an M2NPV-VM motherboard with a new generation WD30EFRX drive on the NForce sata_nv SATA ports even with sata_nv.swncq=0 Jan 04 23:29:57 kvm.server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Jan 04 23:29:57 kvm.server kernel: ata3.00: failed command: WRITE DMA EXT Jan 04 23:29:57 kvm.server kernel: ata3.00: cmd 35/00:00:f0:9f:7e/00:04:b4:00:00/e0 tag 0 dma 524288 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Jan 04 23:29:57 kvm.server kernel: ata3.00: status: { DRDY } System is currently running RHEL 7.8 Also confirmed ncq is off on the device already cat /sys/block/sda/device/queue_depth 1 (In reply to Steven Ellis from comment #33) > I'm seeing the same issue in an M2NPV-VM motherboard with a new generation > WD30EFRX drive on the NForce sata_nv SATA ports even with sata_nv.swncq=0 > > Jan 04 23:29:57 kvm.server kernel: ata3.00: exception Emask 0x0 SAct 0x0 > SErr 0x0 action 0x6 frozen > Jan 04 23:29:57 kvm.server kernel: ata3.00: failed command: WRITE DMA EXT > Jan 04 23:29:57 kvm.server kernel: ata3.00: cmd > 35/00:00:f0:9f:7e/00:04:b4:00:00/e0 tag 0 dma 524288 out > res > 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > Jan 04 23:29:57 kvm.server kernel: ata3.00: status: { DRDY } > > System is currently running RHEL 7.8 You might want to file a new bug against RHEL, then. This is an old Fedora 9 bug report that should stay closed. |