Description of problem: System had been running 4.8 with the Marvell mv_sata driver version 3.6.2 for quite a while without much trouble. Just updated to 5.4 and using the sata_mv driver. I'm seeing the following messages periodically: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata10.00: cmd c8/00:08:87:d0:21/00:00:00:00:00/e4 tag 0 dma 4096 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata10.00: status: { DRDY } ata10: hard resetting link ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata10.00: configured for UDMA/133 ata10: EH complete SCSI device sds: 3907029168 512-byte hdwr sectors (2000399 MB) sds: Write Protect is off sds: Mode Sense: 00 3a 00 00 SCSI device sds: drive cache: write back seen with 4 different devices so far: ata21.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata16.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen So far, no real trouble, but worried. I've just added a new raid 5 array which is now rebuilding, and perhaps the trigger for some of the errors. System has 3 SATA controllers: 02:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09) 03:02.0 RAID bus controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller 03:03.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller (rev 03) ata1 -ata8 are on 0000:02:01.0 ata9 -ata16 are on 0000:03:02.0 ata17-ata24 are on 0000:03:03.0 See also Fedora bug #549981 Version-Release number of selected component (if applicable): 2.6.18-164.10.1.el5.centos.plusPAE Some more errors: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata14.00: cmd ca/00:10:1f:f6:24/00:00:00:00:00/e8 tag 0 dma 8192 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata14.00: cmd ca/00:18:57:d9:39/00:00:00:00:00/e8 tag 0 dma 12288 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
I've yet to see an error on disks on the 6081 controller, just the 5081. Similar bug reported fixed for the 6081 (bug 493451), but maybe it wasn't really fixed for the 5081?
For BZ 493451, using the 6081 8-port, the problem was accesses to upper 4 ports were causing timeouts for the lower 4 ports. The fix was a 1 line code change to mv_start_dma, it wasn't really specific to 6081, it should help 5081 as well. Looking at BZ 549981, it looks like they are actually using the ahci driver. Would you please attach your full dmesg output and the output from "lspci -vvxx"?
Created attachment 384705 [details] dmesg boot This the boot up dmesg output.
Created attachment 384706 [details] dmesg running showing error
Created attachment 384707 [details] lspci -vvxx
I've booted kernel 2.6.33 and cannot reproduce the problem with that kernel. That appears to have sata_mv 1.28 versus 1.24 in the EL5 kernel.
Thanks for the update, would you be able to boot some RHEL5 test kernels? I have updated the sata_mv driver from 1.24 to 1.25, it may help to test driver versions incrementally up to 1.28 to see if we can narrow down on a specific driver change. http://people.redhat.com/dmilburn/.bz554872/
I think so. It's a production server so I have limited times for reboots, but I can work through them slowly. Could I trouble you for a i686 non-PAE kernel and kernel-headers packages?
Sure, I went ahead and built some test kernel rpms that update the sata_mv driver to v1.26 as well. http://people.redhat.com/dmilburn/.bz554872/ This one updates the sata_mv driver to v1.25 kernel-2.6.18-191.el5.bz554872.1.i686.rpm And this one updates the sata_mv driver to v1.26 kernel-2.6.18-191.el5.bz554872v126.1.126.i686.rpm At the moment I don't have a platform available that uses sata_mv, so I haven't tested these kernels. If possible, please boot them on a non-production system first. I have carefully backported the changes but they are untested, so please be careful.
I've built a test machine with: 02:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller (rev 03) Built a raid 5 array with 8 ST3750640AS drives. Could not reproduce the errors with stock kernel - I'm thinking it may also be drive related. Tried 2.6.18-191.el5.bz554872v126.1.126, but it fails with: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300) BUG: warning at drivers/ata/libata-sff.c:1108/ata_sff_hsm_move() (Not tainted) [<f88e2d32>] ata_sff_hsm_move+0x3b/0x6e9 [libata] [<c042d681>] del_timer_sync+0xa/0x14 [<c061cd58>] schedule_timeout+0x78/0x8c [<c042cd9d>] process_timeout+0x0/0x5 [<f88e3f4b>] ata_pio_task+0xba/0xc7 [libata] [<c0433266>] run_workqueue+0x78/0xb5 [<f88e3e91>] ata_pio_task+0x0/0xc7 [libata] [<c0433b1a>] worker_thread+0xd9/0x10b [<c041f7a7>] default_wake_function+0x0/0xc [<c0433a41>] worker_thread+0x0/0x10b [<c0435f33>] kthread+0xc0/0xed [<c0435e73>] kthread+0x0/0xed [<c0405c53>] kernel_thread_helper+0x7/0x10 ======================= ata5.00: failed to read native max address (err_mask=0x100) ata5.00: HPA support seems broken, skipping HPA handling ata5: failed to recover some devices, retrying in 5 secs and it doesn't find any of the drives. I booted 2.6.18-191.el5.bz554872.1 and pulled a drive and replaced with a WD15EADS drive. Array rebuilt fine. However on reboot I got: Vendor: ATA Model: WDC WD15EADS-00R Rev: 01.0 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB) sdb: Write Protect is off sdb: Mode Sense: 00 3a 00 00 SCSI device sdb: drive cache: write back SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB) sdb: Write Protect is off sdb: Mode Sense: 00 3a 00 00 SCSI device sdb: drive cache: write back sdb:<3>ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata6.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata6.00: status: { DRDY } ata6: hard resetting link ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata6.00: configured for UDMA/133 ata6: EH complete sdb1 sd 7:0:0:0: Attached scsi disk sdb Drive did eventually come up. Is this a problem with these "green" drives going to sleep? Although I'm also seeing this with WD2002FYPS drives - which are "enterprise" drives.
Still running 2.6.18-191.el5.bz554872.1. Pulled another ST3750640AS and replaced with a WD20EADS. On starting rebuild getting: Mar 12 15:11:33 test kernel: md: using 128k window, over a total of 732571904 blocks. Mar 12 15:11:40 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:11:40 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:11:40 test kernel: ata6: SError: { UnrecovData } Mar 12 15:11:40 test kernel: ata6.00: cmd c8/00:f8:4f:c8:04/00:00:00:00:00/e0 tag 0 dma 126976 in Mar 12 15:11:40 test kernel: res 51/0c:e7:55:c8:04/00:00:00:00:00/e0 Emask 0x10 (ATA bus error) Mar 12 15:11:40 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:11:40 test kernel: ata6.00: error: { ABRT } Mar 12 15:11:40 test kernel: ata6: hard resetting link Mar 12 15:11:41 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:11:41 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:11:41 test kernel: ata6: EH complete Mar 12 15:11:46 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB) Mar 12 15:11:46 test kernel: sdb: Write Protect is off Mar 12 15:11:46 test kernel: SCSI device sdb: drive cache: write back Mar 12 15:14:15 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:14:15 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:14:15 test kernel: ata6: SError: { UnrecovData } Mar 12 15:14:15 test kernel: ata6.00: cmd 25/00:48:f7:5a:b4/00:02:00:00:00/e0 tag 0 dma 299008 in Mar 12 15:14:15 test kernel: res 51/0c:d7:60:5c:b4/0c:00:00:00:00/e0 Emask 0x10 (ATA bus error) Mar 12 15:14:15 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:14:15 test kernel: ata6.00: error: { ABRT } Mar 12 15:14:15 test kernel: ata6: hard resetting link Mar 12 15:14:16 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:14:16 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:14:16 test kernel: ata6: EH complete Mar 12 15:14:17 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB) Mar 12 15:14:17 test kernel: sdb: Write Protect is off Mar 12 15:14:17 test kernel: SCSI device sdb: drive cache: write back Mar 12 15:14:50 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:14:50 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:14:50 test kernel: ata6: SError: { UnrecovData } Mar 12 15:14:50 test kernel: ata6.00: cmd 25/00:28:17:b8:dd/00:02:00:00:00/e0 tag 0 dma 282624 in Mar 12 15:14:50 test kernel: res 51/0c:e7:4a:b8:dd/0c:01:00:00:00/e0 Emask 0x10 (ATA bus error) Mar 12 15:14:50 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:14:50 test kernel: ata6.00: error: { ABRT } Mar 12 15:14:50 test kernel: ata6: hard resetting link Mar 12 15:14:50 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:14:51 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:14:51 test kernel: ata6: EH complete Mar 12 15:14:52 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB) Mar 12 15:14:52 test kernel: sdb: Write Protect is off Mar 12 15:14:52 test kernel: SCSI device sdb: drive cache: write back Mar 12 15:15:47 test kernel: ata6: limiting SATA link speed to 1.5 Gbps Mar 12 15:15:47 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:15:47 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:15:47 test kernel: ata6: SError: { UnrecovData } Mar 12 15:15:47 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:15:47 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:15:47 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:15:47 test kernel: ata6.00: error: { ABRT } Mar 12 15:15:47 test kernel: ata6: hard resetting link Mar 12 15:15:47 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:15:47 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:15:47 test kernel: ata6: EH complete Mar 12 15:15:50 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:15:50 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:15:50 test kernel: ata6: SError: { UnrecovData } Mar 12 15:15:50 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:15:50 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:15:50 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:15:50 test kernel: ata6.00: error: { ABRT } Mar 12 15:15:50 test kernel: ata6: hard resetting link Mar 12 15:15:51 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:15:51 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:15:51 test kernel: ata6: EH complete Mar 12 15:15:54 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:15:54 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:15:54 test kernel: ata6: SError: { UnrecovData } Mar 12 15:15:54 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:15:54 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:15:54 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:15:54 test kernel: ata6.00: error: { ABRT } Mar 12 15:15:54 test kernel: ata6: hard resetting link Mar 12 15:15:54 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:15:54 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:15:54 test kernel: ata6: EH complete Mar 12 15:15:57 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:15:57 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:15:57 test kernel: ata6: SError: { UnrecovData } Mar 12 15:15:57 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:15:57 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:15:57 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:15:57 test kernel: ata6.00: error: { ABRT } Mar 12 15:15:57 test kernel: ata6: hard resetting link Mar 12 15:15:57 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:15:57 test kernel: ata6.00: configured for UDMA/133 Mar 12 15:15:57 test kernel: ata6: EH complete Mar 12 15:16:00 test kernel: ata6.00: limiting speed to UDMA/100:PIO4 Mar 12 15:16:00 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:16:00 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:16:00 test kernel: ata6: SError: { UnrecovData } Mar 12 15:16:00 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:16:00 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:16:00 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:16:00 test kernel: ata6.00: error: { ABRT } Mar 12 15:16:00 test kernel: ata6: hard resetting link Mar 12 15:16:01 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:16:01 test kernel: ata6.00: configured for UDMA/100 Mar 12 15:16:01 test kernel: ata6: EH complete Mar 12 15:16:04 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6 Mar 12 15:16:04 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable Mar 12 15:16:04 test kernel: ata6: SError: { UnrecovData } Mar 12 15:16:04 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in Mar 12 15:16:04 test kernel: res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error) Mar 12 15:16:04 test kernel: ata6.00: status: { DRDY ERR } Mar 12 15:16:04 test kernel: ata6.00: error: { ABRT } Mar 12 15:16:04 test kernel: ata6: hard resetting link Mar 12 15:16:04 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Mar 12 15:16:04 test kernel: ata6.00: configured for UDMA/100 Mar 12 15:16:04 test kernel: sd 7:0:0:0: SCSI error: return code = 0x08000002 Mar 12 15:16:04 test kernel: sdb: Current [descriptor]: sense key: Aborted Command Mar 12 15:16:04 test kernel: Add. Sense: No additional sense information Mar 12 15:16:04 test kernel: Mar 12 15:16:04 test kernel: Descriptor sense data with sense descriptors (in hex): Mar 12 15:16:04 test kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 Mar 12 15:16:04 test kernel: 01 2a 25 3f Mar 12 15:16:04 test kernel: end_request: I/O error, dev sdb, sector 19539263 Mar 12 15:16:04 test kernel: raid5:md2: read error not correctable (sector 19539200 on sdb1). Mar 12 15:16:04 test kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 6 devices Final failure coincided with running smartctl -a /dev/sdb.
After more testing, appears that /dev/sdb is truly bad. Will replace.
The SError register is reporting Unrecoverable Data error, this is normally associated with hardware (could even be cabling or power supply).
Okay, replaced sdb and no trouble re-creating the array on the test system. I've booted 2.6.18-191.el5.bz554872.1 on my production system and started an md check on an array that I've been having hiccups on. So far so good with no disk messages. I'll keep monitoring. Some differences in kernel messages: < Linux version 2.6.18-164.11.1.el5 --- > Linux version 2.6.18-191.el5.bz554872.1 < sata_mv 0000:02:01.0: version 1.24 --- > sata_mv 0000:02:01.0: version 1.25 391,392d389 < ata1.00: max_sectors limited to 256 for NCQ < ata1.00: max_sectors limited to 256 for NCQ 397,398d393 < ata2.00: max_sectors limited to 256 for NCQ < ata2.00: max_sectors limited to 256 for NCQ 404,405d398 < ata4.00: max_sectors limited to 256 for NCQ < ata4.00: max_sectors limited to 256 for NCQ 410,411d402 < ata5.00: max_sectors limited to 256 for NCQ < ata5.00: max_sectors limited to 256 for NCQ 416,417d406 < ata6.00: max_sectors limited to 256 for NCQ < ata6.00: max_sectors limited to 256 for NCQ 423,424d411 < ata8.00: max_sectors limited to 256 for NCQ < ata8.00: max_sectors limited to 256 for NCQ 647,648d633 < ata17.00: max_sectors limited to 256 for NCQ < ata17.00: max_sectors limited to 256 for NCQ 653,654d637 < ata18.00: max_sectors limited to 256 for NCQ < ata18.00: max_sectors limited to 256 for NCQ 659,660d641 < ata19.00: max_sectors limited to 256 for NCQ < ata19.00: max_sectors limited to 256 for NCQ 666,667d646 < ata21.00: max_sectors limited to 256 for NCQ < ata21.00: max_sectors limited to 256 for NCQ 672,673d650 < ata22.00: max_sectors limited to 256 for NCQ < ata22.00: max_sectors limited to 256 for NCQ 679,680d655 < ata24.00: max_sectors limited to 256 for NCQ < ata24.00: max_sectors limited to 256 for NCQ So perhaps something changed with NCQ (or just no longer output diagnostics).
Just to follow up that I have not had any more trouble with 2.6.18-191.el5.bz554872.1
Thanks for the update, would you mind trying 1 more test kernel? Looking back at your original dmesg output, your were seeing timeouts on command 0xc8 (Read DMA). The most significant change between v1.24 and v1.25 is commit 4bdee6c5103696a2729d3db2f235d202191788e4 Author: Tejun Heo <htejun> Date: Wed Aug 13 20:24:16 2008 +0900 sata_mv: don't issue two DMA commands concurrently sata_mv allowed issuing two DMA commands concurrently which the hardware allows. Unfortunately, libata core layer isn't ready for this yet and spews ugly warning message and malfunctions on this. Don't allow concurrent DMA commands for now. If possible, I would like to see if we could narrow the fix down to this change.
Can do.
Please test kernel-2.6.18-194.el5.bz554872_2.1 http://people.redhat.com/dmilburn/.bz554872/
With kernel-2.6.18-194.el5.bz554872_2.1 I seem to be seeing the errors again: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata14.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata14.00: status: { DRDY } ata14: hard resetting link ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata14.00: configured for UDMA/133 ata14: EH complete SCSI device sdk: 3907029168 512-byte hdwr sectors (2000399 MB) sdk: Write Protect is off sdk: Mode Sense: 00 3a 00 00 SCSI device sdk: drive cache: write back I also see these again at boot for the drives on the MV88SX6081 controllers (which aren't having problems): > ata1.00: max_sectors limited to 256 for NCQ > ata1.00: max_sectors limited to 256 for NCQ
Ok, sorry about that, let me take another look, I think the NCQ messages you are referring to got removed upstream with this commit which I included in the earlier test version. commit cd12e1f7a2c28917c89d65c0d4a52d3919b4c125 Author: Mark Lord <liml> Date: Mon Jan 19 18:06:28 2009 -0500 sata_mv: remove bogus nsect restriction Remove unneeded nsect restriction from GenII NCQ path, and improve comments to explain why this is not a problem.
If possible, please test kernel-2.6.18-194.el5.bz554872_dma.1.i686.rpm, I have included a couple of IRQ fixes as well as the DMA fix that went into v 1.25 commit 5d0fb2e730e2085021cf5c8b6d14983e92aea75b Author: Thomas Reitmayr <treitmayr> Date: Sat Jan 24 20:24:58 2009 +0100 sata_mv: Properly initialize main irq mask commit cae6edc3b5a536119374a5439d9b253cb26f05e1 Author: Mark Lord <liml> Date: Mon Jan 19 18:05:42 2009 -0500 sata_mv: don't read hc_irq_cause Also, I included the patch mentioned in Comment#20 to remove bogus nsect restriction, thanks.
Sorry, here is the link: http://people.redhat.com/dmilburn/.bz554872/
2.6.18-194.el5.bz554872_dma.1 appears to be working fine so far. Completed an array rebuild fine.
Ok, thanks for the update, I will work on getting these patches reviewed for RHEL5.
Finally saw a frozen message with 2.6.18-194.el5.bz554872_dma.1, tied to a smartd health check: Apr 26 23:49:34 alexandria kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Apr 26 23:49:34 alexandria kernel: ata8.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 Apr 26 23:49:34 alexandria kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Apr 26 23:49:34 alexandria kernel: ata8.00: status: { DRDY } Apr 26 23:49:34 alexandria kernel: ata8: hard resetting link Apr 26 23:49:40 alexandria kernel: ata8: link is slow to respond, please be patient (ready=0) Apr 26 23:49:42 alexandria kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 26 23:49:42 alexandria kernel: ata8.00: configured for UDMA/133 Apr 26 23:49:42 alexandria kernel: sd 7:0:0:0: timing out command, waited 7s Apr 26 23:49:42 alexandria kernel: ata8: EH complete Apr 26 23:49:42 alexandria kernel: SCSI device sdf: 3907029168 512-byte hdwr sectors (2000399 MB) Apr 26 23:49:42 alexandria kernel: sdf: Write Protect is off Apr 26 23:49:42 alexandria kernel: SCSI device sdf: drive cache: write back Apr 26 23:49:42 alexandria smartd[3636]: Device: /dev/sdf, not capable of SMART self-check Apr 26 23:49:42 alexandria smartd[3636]: Sending warning via mail to root ... Apr 26 23:49:42 alexandria smartd[3636]: Warning via mail to root: successful
Ok, and you don't think you saw those failures with this earlier test kernel (which was an upgrade to v1.25)? kernel-2.6.18-191.el5.bz554872.1.i686.rpm I was hoping to isolate some specific fixes, I thought the -194.el5.bz554872_dma.1 had all the relevent fixes between v1.24 and v1.25.
I did not see this with kernel-2.6.18-191.el5.bz554872.1.i686.rpm, but I've only seen it once so the stars may need to be aligned just right to trigger it. Hopefully this won't muddy up this bug report too much, but I'm loading up another array and just got: ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F> ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F> ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F> ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: NCQ disabled due to excessive errors ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F> ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 res 51/04:00:3f:74:e0/00:00:00:00:00/a0 Emask 0x1 (device error) ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } ata7.00: failed to read native max address (err_mask=0x1) ata7.00: HPA support seems broken, skipping HPA handling ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 res 61/04:ff:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link EXT4-fs (md3): mounted filesystem with ordered data mode SELinux: initialized (dev md3, type ext4), uses xattr ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7: limiting SATA link speed to 1.5 Gbps ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata7.00: configured for UDMA/133 (device error ignored) ata7: EH complete ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7: hard resetting link ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata7.00: configured for UDMA/133 (device error ignored) sd 6:0:0:0: SCSI error: return code = 0x08000002 sdw: Current [descriptor]: sense key: Aborted Command Add. Sense: No additional sense information Descriptor sense data with sense descriptors (in hex): 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 e8 e0 74 3f end_request: I/O error, dev sdw, sector 3907023935 ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00 SCSI device sdw: drive cache: write back ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 res 61/04:ff:00:00:00/04:00:e8:00:00/a0 Emask 0x1 (device error) ata7.00: status: { DRDY DF ERR } ata7.00: error: { ABRT } ata7.00: configured for UDMA/133 (device error ignored) raid5: Disk failure on sdw1, disabling device. Operation continuing on 5 devices ata7: EH complete SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB) sdw: Write Protect is off sdw: Mode Sense: 00 3a 00 00
ata7 is on one of the MV88SX6081 8-port SATA II controllers, so shouldn't be related to the 5081 issues.
Ok, let me look over the driver changes again between v1.24 and v1.28.
Looking at the beginning of the errors in Comment#27 ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F> ata7.00: status: { DRDY ERR } ata7.00: error: { ABRT } This shows that edma_err_cause has bit 7 (Gen II/IIE self-disable) and bit 2 (device error), pp_flags have bit 0 (DMA engine enabled) and bit 1 (setup for NCQ). It does look like the failure occurred on a NCQ command. I am looking over the changes between v1.24 and v1.28, I am not sure how much runtime you were able to get on the 2.6.33 kernel, do you think you performed the same tests on 2.6.33 as these test kernels?
This is probably completely unrelated to the original issue (6081 vs 5081). I did not have much time on the 2.6.33 kernels, but I at least did an array rebuild. This error seemed to occur just after array creation while creating the filesystem, which is stuff I did not do with 2.6.33. I have not seen this error again, currently running 2.6.18-194.el5.bz554872_dma.1.
David - How does 2.6.18-194.3.1.el5 compare with 2.6.18-194.el5.bz554872_dma.1? Wondering if I should move to that or not. Thanks.
Hi, the 2.6.18-194.3.1.el5 kernel should not help, it will be using v1.24. The 2.6.18-194.el5.bz554872_dma.1 kernel includes these commits on top v1.24 commit 4bdee6c5103696a2729d3db2f235d202191788e4 sata_mv: don't issue two DMA commands concurrently commit cae6edc3b5a536119374a5439d9b253cb26f05e1 sata_mv: don't read hc_irq_cause commit cd12e1f7a2c28917c89d65c0d4a52d3919b4c125 sata_mv: remove bogus nsect restriction commit 5d0fb2e730e2085021cf5c8b6d14983e92aea75b sata_mv: Properly initialize main irq mask Probably, the best thing, if possible, is to get more runtime on 2.6.33 or .34 so we can be sure the timeouts are not reproducible on the upstream kernel. Do you have a system you can try this on? Thanks.
I have only seen one exception message since May 1 (and it recovered fine from it), so this kernel (dma.1) is a *lot* better than the old EL5 kernel. I do have test machine but have been unable to really reproduce the error reliably there. I'm reluctant to run .33 or .34 on the production machine. I'll try again and see if I can reproduce on the test machine.
I set up my test machine with stock kernel-2.6.18-194.3.1.el5 and was able to trigger the DRDY messages by running smartcl -H in tandem with raid5 array checks. Then booted kernel-2.6.18-191.el5.bz554872.1 and have not been able to reproduce after running for 4 days. It may not be 100%, but it's damn close I can't reliably reproduce to test.
Ok, thanks for the update, I will submit for review an update to the sata_mv driver including some of these fixes.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-215.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
(In reply to comment #48) > in kernel-2.6.18-215.el5 > You can download this test kernel from http://people.redhat.com/jwilson/el5 I confirm that this problem is not present in kernel 2.6.18-222.el5 from above. (Problem was present making machine unusable in kernel 2.6.18-194.11.3.el5 8-port MV88SX5081). Thank you for fixing it. Please hurry the fix into the security-updates kernels. K.O.
FWIW - Running -222. It is much better, but I do appear to get this once a day when smartd runs a check: Sep 21 20:42:01 alexandria kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Sep 21 20:42:01 alexandria kernel: ata9.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 Sep 21 20:42:01 alexandria kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Sep 21 20:42:01 alexandria kernel: ata9.00: status: { DRDY } Sep 21 20:42:01 alexandria kernel: ata9: hard resetting link Sep 21 20:42:01 alexandria kernel: ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Sep 21 20:42:08 alexandria kernel: ata9.00: revalidation failed (errno=-2) Sep 21 20:42:08 alexandria kernel: ata9: failed to recover some devices, retrying in 5 secs Sep 21 20:42:13 alexandria kernel: ata9: hard resetting link Sep 21 20:42:13 alexandria kernel: ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Sep 21 20:42:14 alexandria kernel: ata9.00: configured for UDMA/133 Sep 21 20:42:14 alexandria kernel: sd 8:0:0:0: timing out command, waited 7s Sep 21 20:42:14 alexandria kernel: ata9: EH complete Sep 21 20:42:14 alexandria kernel: SCSI device sdi: 2930277168 512-byte hdwr sectors (1500302 MB) Sep 21 20:42:14 alexandria kernel: sdi: Write Protect is off Sep 21 20:42:14 alexandria kernel: SCSI device sdi: drive cache: write back Sep 21 20:42:14 alexandria smartd[3903]: Device: /dev/sdi, not capable of SMART self-check I do think it has something to do with this particular drive though, as it started when I put a new drive in that slot. Sep 21 08:41:14 alexandria kernel: ata9.00: ATA-8: WDC WD15EADS-00R6B0, 01.00A01, max UDMA/133
(In reply to comment #50) > (In reply to comment #48) > > in kernel-2.6.18-215.el5 > > You can download this test kernel from http://people.redhat.com/jwilson/el5 > I confirm that this problem is not present in kernel 2.6.18-222.el5 from above. I confirm that the -222 kernel gives instant root using the CVE-2010-3081 exlpoit. I confirm that the sata_mv "frozen" problem is back in the -194.11.4 kernel issued as fix the CVE-2010-3081. Looks like the choice is between a kernel with working disks and a kernel with a working root password. K.O.
(In reply to comment #52) > (In reply to comment #50) > > (In reply to comment #48) > > > in kernel-2.6.18-215.el5 > > > You can download this test kernel from http://people.redhat.com/jwilson/el5 Good news. I confirm that -223 kernel has working sata_mv and seems to be immune to CVS-2010-3081. K.O.
Thanks for the updates, if you have more problems with sata_mv please open a new BZ and we can work it from there.
Latest kernel update 2.6.18-194.17.4.el5 still has the old non-working sata_mv drivers. What's the ETA for working drivers to show up in the kernel updates? Does "ON_QA" mean that they are waiting for feedback from all three people using marvell sata cards? K.O.
These patches are slated to be in the RHEL5.6 GA kernel, they've not been proposed for backporting into a RHEL5.5 update kernel. The "ON_QA" status means that a kernel with these patches is in the hands of Red Hat's QA team, awaiting testing.
Code reviewed and patchs found and applied into kernel-2.6.18-236.el5 Set as Sanity Only.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html