Bug 554872 - Periodic ata exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen messages
Summary: Periodic ata exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen messages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: David Milburn
QA Contact: Gris Ge
URL:
Whiteboard:
Depends On:
Blocks: 677384
TreeView+ depends on / blocked
 
Reported: 2010-01-12 21:18 UTC by Orion Poplawski
Modified: 2011-02-14 15:06 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 20:59:27 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
dmesg boot (42.30 KB, text/plain)
2010-01-15 20:47 UTC, Orion Poplawski
no flags Details
dmesg running showing error (120.66 KB, text/plain)
2010-01-15 20:47 UTC, Orion Poplawski
no flags Details
lspci -vvxx (18.58 KB, text/plain)
2010-01-15 20:48 UTC, Orion Poplawski
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 15056 0 None None None Never
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Orion Poplawski 2010-01-12 21:18:45 UTC
Description of problem:

System had been running 4.8 with the Marvell mv_sata driver version 3.6.2 for quite a while without much trouble.  Just updated to 5.4 and using the sata_mv driver.

I'm seeing the following messages periodically:

ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.00: cmd c8/00:08:87:d0:21/00:00:00:00:00/e4 tag 0 dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata10.00: status: { DRDY }
ata10: hard resetting link
ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata10.00: configured for UDMA/133
ata10: EH complete
SCSI device sds: 3907029168 512-byte hdwr sectors (2000399 MB)
sds: Write Protect is off
sds: Mode Sense: 00 3a 00 00
SCSI device sds: drive cache: write back

seen with 4 different devices so far:

ata21.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata16.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

So far, no real trouble, but worried.

I've just added a new raid 5 array which is now rebuilding, and perhaps the trigger for some of the errors.

System has 3 SATA controllers:
02:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)
03:02.0 RAID bus controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller
03:03.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller (rev 03)

ata1 -ata8  are on 0000:02:01.0
ata9 -ata16 are on 0000:03:02.0
ata17-ata24 are on 0000:03:03.0

See also Fedora bug #549981

Version-Release number of selected component (if applicable):
2.6.18-164.10.1.el5.centos.plusPAE

Some more errors:
ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata14.00: cmd ca/00:10:1f:f6:24/00:00:00:00:00/e8 tag 0 dma 8192 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata14.00: cmd ca/00:18:57:d9:39/00:00:00:00:00/e8 tag 0 dma 12288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Comment 1 Orion Poplawski 2010-01-13 21:26:21 UTC
I've yet to see an error on disks on the 6081 controller, just the 5081.  Similar bug reported fixed for the 6081 (bug 493451), but maybe it wasn't really fixed for the 5081?

Comment 2 David Milburn 2010-01-15 20:40:38 UTC
For BZ 493451, using the 6081 8-port, the problem was accesses to upper
4 ports were causing timeouts for the lower 4 ports. The fix was a 1 line
code change to mv_start_dma, it wasn't really specific to 6081, it should
help 5081 as well.

Looking at BZ 549981, it looks like they are actually using the ahci
driver.

Would you please attach your full dmesg output and the output from "lspci -vvxx"?

Comment 3 Orion Poplawski 2010-01-15 20:47:20 UTC
Created attachment 384705 [details]
dmesg boot

This the boot up dmesg output.

Comment 4 Orion Poplawski 2010-01-15 20:47:58 UTC
Created attachment 384706 [details]
dmesg running showing error

Comment 5 Orion Poplawski 2010-01-15 20:48:31 UTC
Created attachment 384707 [details]
lspci -vvxx

Comment 6 Orion Poplawski 2010-03-10 15:39:37 UTC
I've booted kernel 2.6.33 and cannot reproduce the problem with that kernel.  That appears to have sata_mv 1.28 versus 1.24 in the EL5 kernel.

Comment 7 David Milburn 2010-03-10 22:20:25 UTC
Thanks for the update, would you be able to boot some RHEL5 test kernels? I
have updated the sata_mv driver from 1.24 to 1.25, it may help to test 
driver versions incrementally up to 1.28 to see if we can narrow down on a 
specific driver change.

http://people.redhat.com/dmilburn/.bz554872/

Comment 8 Orion Poplawski 2010-03-10 22:30:34 UTC
I think so.  It's a production server so I have limited times for reboots, but I can work through them slowly.   Could I trouble you for a i686 non-PAE kernel and kernel-headers packages?

Comment 9 David Milburn 2010-03-11 14:41:58 UTC
Sure, I went ahead and built some test kernel rpms that update the sata_mv
driver to v1.26 as well.

http://people.redhat.com/dmilburn/.bz554872/

This one updates the sata_mv driver to v1.25
kernel-2.6.18-191.el5.bz554872.1.i686.rpm

And this one updates the sata_mv driver to v1.26
kernel-2.6.18-191.el5.bz554872v126.1.126.i686.rpm

At the moment I don't have a platform available that uses sata_mv, so I haven't
tested these kernels. If possible, please boot them on a non-production system
first. I have carefully backported the changes but they are untested, so please
be careful.

Comment 10 Orion Poplawski 2010-03-12 22:06:00 UTC
I've built a test machine with:

02:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX5081 8-port SATA I PCI-X Controller (rev 03)

Built a raid 5 array with 8 ST3750640AS drives.  Could not reproduce the errors with stock kernel - I'm thinking it may also be drive related.

Tried 2.6.18-191.el5.bz554872v126.1.126, but it fails with:

ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
BUG: warning at drivers/ata/libata-sff.c:1108/ata_sff_hsm_move() (Not tainted)
 [<f88e2d32>] ata_sff_hsm_move+0x3b/0x6e9 [libata]
 [<c042d681>] del_timer_sync+0xa/0x14
 [<c061cd58>] schedule_timeout+0x78/0x8c
 [<c042cd9d>] process_timeout+0x0/0x5
 [<f88e3f4b>] ata_pio_task+0xba/0xc7 [libata]
 [<c0433266>] run_workqueue+0x78/0xb5
 [<f88e3e91>] ata_pio_task+0x0/0xc7 [libata]
 [<c0433b1a>] worker_thread+0xd9/0x10b
 [<c041f7a7>] default_wake_function+0x0/0xc
 [<c0433a41>] worker_thread+0x0/0x10b
 [<c0435f33>] kthread+0xc0/0xed
 [<c0435e73>] kthread+0x0/0xed
 [<c0405c53>] kernel_thread_helper+0x7/0x10
 =======================
ata5.00: failed to read native max address (err_mask=0x100)
ata5.00: HPA support seems broken, skipping HPA handling
ata5: failed to recover some devices, retrying in 5 secs

and it doesn't find any of the drives.

I booted 2.6.18-191.el5.bz554872.1 and pulled a drive and replaced with a WD15EADS drive.  Array rebuilt fine.  However on reboot I got:

  Vendor: ATA       Model: WDC WD15EADS-00R  Rev: 01.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
 sdb:<3>ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata6.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata6.00: configured for UDMA/133
ata6: EH complete
 sdb1
sd 7:0:0:0: Attached scsi disk sdb

Drive did eventually come up.  Is this a problem with these "green" drives going to sleep?  Although I'm also seeing this with WD2002FYPS drives - which are "enterprise" drives.

Comment 11 Orion Poplawski 2010-03-12 22:18:18 UTC
Still running 2.6.18-191.el5.bz554872.1.  Pulled another  ST3750640AS and replaced with a WD20EADS.  On starting rebuild getting:

Mar 12 15:11:33 test kernel: md: using 128k window, over a total of 732571904 blocks.
Mar 12 15:11:40 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:11:40 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:11:40 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:11:40 test kernel: ata6.00: cmd c8/00:f8:4f:c8:04/00:00:00:00:00/e0 tag 0 dma 126976 in
Mar 12 15:11:40 test kernel:          res 51/0c:e7:55:c8:04/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Mar 12 15:11:40 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:11:40 test kernel: ata6.00: error: { ABRT }
Mar 12 15:11:40 test kernel: ata6: hard resetting link
Mar 12 15:11:41 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:11:41 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:11:41 test kernel: ata6: EH complete
Mar 12 15:11:46 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
Mar 12 15:11:46 test kernel: sdb: Write Protect is off
Mar 12 15:11:46 test kernel: SCSI device sdb: drive cache: write back
Mar 12 15:14:15 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:14:15 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:14:15 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:14:15 test kernel: ata6.00: cmd 25/00:48:f7:5a:b4/00:02:00:00:00/e0 tag 0 dma 299008 in
Mar 12 15:14:15 test kernel:          res 51/0c:d7:60:5c:b4/0c:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Mar 12 15:14:15 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:14:15 test kernel: ata6.00: error: { ABRT }
Mar 12 15:14:15 test kernel: ata6: hard resetting link
Mar 12 15:14:16 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:14:16 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:14:16 test kernel: ata6: EH complete
Mar 12 15:14:17 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
Mar 12 15:14:17 test kernel: sdb: Write Protect is off
Mar 12 15:14:17 test kernel: SCSI device sdb: drive cache: write back
Mar 12 15:14:50 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:14:50 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:14:50 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:14:50 test kernel: ata6.00: cmd 25/00:28:17:b8:dd/00:02:00:00:00/e0 tag 0 dma 282624 in
Mar 12 15:14:50 test kernel:          res 51/0c:e7:4a:b8:dd/0c:01:00:00:00/e0 Emask 0x10 (ATA bus error)
Mar 12 15:14:50 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:14:50 test kernel: ata6.00: error: { ABRT }
Mar 12 15:14:50 test kernel: ata6: hard resetting link
Mar 12 15:14:50 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:14:51 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:14:51 test kernel: ata6: EH complete
Mar 12 15:14:52 test kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
Mar 12 15:14:52 test kernel: sdb: Write Protect is off
Mar 12 15:14:52 test kernel: SCSI device sdb: drive cache: write back
Mar 12 15:15:47 test kernel: ata6: limiting SATA link speed to 1.5 Gbps
Mar 12 15:15:47 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:15:47 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:15:47 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:15:47 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:15:47 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:15:47 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:15:47 test kernel: ata6.00: error: { ABRT }
Mar 12 15:15:47 test kernel: ata6: hard resetting link
Mar 12 15:15:47 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:15:47 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:15:47 test kernel: ata6: EH complete
Mar 12 15:15:50 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:15:50 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:15:50 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:15:50 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:15:50 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:15:50 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:15:50 test kernel: ata6.00: error: { ABRT }
Mar 12 15:15:50 test kernel: ata6: hard resetting link
Mar 12 15:15:51 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:15:51 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:15:51 test kernel: ata6: EH complete
Mar 12 15:15:54 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:15:54 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:15:54 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:15:54 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:15:54 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:15:54 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:15:54 test kernel: ata6.00: error: { ABRT }
Mar 12 15:15:54 test kernel: ata6: hard resetting link
Mar 12 15:15:54 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:15:54 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:15:54 test kernel: ata6: EH complete
Mar 12 15:15:57 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:15:57 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:15:57 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:15:57 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:15:57 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:15:57 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:15:57 test kernel: ata6.00: error: { ABRT }
Mar 12 15:15:57 test kernel: ata6: hard resetting link
Mar 12 15:15:57 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:15:57 test kernel: ata6.00: configured for UDMA/133
Mar 12 15:15:57 test kernel: ata6: EH complete
Mar 12 15:16:00 test kernel: ata6.00: limiting speed to UDMA/100:PIO4
Mar 12 15:16:00 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:16:00 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:16:00 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:16:00 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:16:00 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:16:00 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:16:00 test kernel: ata6.00: error: { ABRT }
Mar 12 15:16:00 test kernel: ata6: hard resetting link
Mar 12 15:16:01 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:16:01 test kernel: ata6.00: configured for UDMA/100
Mar 12 15:16:01 test kernel: ata6: EH complete
Mar 12 15:16:04 test kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x6
Mar 12 15:16:04 test kernel: ata6.00: edma_err_cause=00000104 pp_flags=00000001, dev error, EDMA self-disable
Mar 12 15:16:04 test kernel: ata6: SError: { UnrecovData }
Mar 12 15:16:04 test kernel: ata6.00: cmd c8/00:00:3f:25:2a/00:00:00:00:00/e1 tag 0 dma 131072 in
Mar 12 15:16:04 test kernel:          res 51/0c:ef:3f:25:2a/0c:01:00:00:00/e1 Emask 0x10 (ATA bus error)
Mar 12 15:16:04 test kernel: ata6.00: status: { DRDY ERR }
Mar 12 15:16:04 test kernel: ata6.00: error: { ABRT }
Mar 12 15:16:04 test kernel: ata6: hard resetting link
Mar 12 15:16:04 test kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 15:16:04 test kernel: ata6.00: configured for UDMA/100
Mar 12 15:16:04 test kernel: sd 7:0:0:0: SCSI error: return code = 0x08000002
Mar 12 15:16:04 test kernel: sdb: Current [descriptor]: sense key: Aborted Command
Mar 12 15:16:04 test kernel:     Add. Sense: No additional sense information
Mar 12 15:16:04 test kernel: 
Mar 12 15:16:04 test kernel: Descriptor sense data with sense descriptors (in hex):
Mar 12 15:16:04 test kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Mar 12 15:16:04 test kernel:         01 2a 25 3f 
Mar 12 15:16:04 test kernel: end_request: I/O error, dev sdb, sector 19539263
Mar 12 15:16:04 test kernel: raid5:md2: read error not correctable (sector 19539200 on sdb1).
Mar 12 15:16:04 test kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 6 devices


Final failure coincided with running smartctl -a /dev/sdb.

Comment 12 Orion Poplawski 2010-03-15 15:44:08 UTC
After more testing, appears that /dev/sdb is truly bad.  Will replace.

Comment 13 David Milburn 2010-03-15 17:13:55 UTC
The SError register is reporting Unrecoverable Data error, this is normally
associated with hardware (could even be cabling or power supply).

Comment 14 Orion Poplawski 2010-03-16 17:17:37 UTC
Okay, replaced sdb and no trouble re-creating the array on the test system.

I've booted 2.6.18-191.el5.bz554872.1 on my production system and started an md check on an array that I've been having hiccups on.  So far so good with no disk messages.  I'll keep monitoring.

Some differences in kernel messages:

< Linux version 2.6.18-164.11.1.el5
---
> Linux version 2.6.18-191.el5.bz554872.1

< sata_mv 0000:02:01.0: version 1.24
---
> sata_mv 0000:02:01.0: version 1.25
391,392d389
< ata1.00: max_sectors limited to 256 for NCQ
< ata1.00: max_sectors limited to 256 for NCQ
397,398d393
< ata2.00: max_sectors limited to 256 for NCQ
< ata2.00: max_sectors limited to 256 for NCQ
404,405d398
< ata4.00: max_sectors limited to 256 for NCQ
< ata4.00: max_sectors limited to 256 for NCQ
410,411d402
< ata5.00: max_sectors limited to 256 for NCQ
< ata5.00: max_sectors limited to 256 for NCQ
416,417d406
< ata6.00: max_sectors limited to 256 for NCQ
< ata6.00: max_sectors limited to 256 for NCQ
423,424d411
< ata8.00: max_sectors limited to 256 for NCQ
< ata8.00: max_sectors limited to 256 for NCQ
647,648d633
< ata17.00: max_sectors limited to 256 for NCQ
< ata17.00: max_sectors limited to 256 for NCQ
653,654d637
< ata18.00: max_sectors limited to 256 for NCQ
< ata18.00: max_sectors limited to 256 for NCQ
659,660d641
< ata19.00: max_sectors limited to 256 for NCQ
< ata19.00: max_sectors limited to 256 for NCQ
666,667d646
< ata21.00: max_sectors limited to 256 for NCQ
< ata21.00: max_sectors limited to 256 for NCQ
672,673d650
< ata22.00: max_sectors limited to 256 for NCQ
< ata22.00: max_sectors limited to 256 for NCQ
679,680d655
< ata24.00: max_sectors limited to 256 for NCQ
< ata24.00: max_sectors limited to 256 for NCQ

So perhaps something changed with NCQ (or just no longer output diagnostics).

Comment 15 Orion Poplawski 2010-04-01 14:38:55 UTC
Just to follow up that I have not had any more trouble with 2.6.18-191.el5.bz554872.1

Comment 16 David Milburn 2010-04-01 19:24:29 UTC
Thanks for the update, would you mind trying 1 more test kernel? Looking back
at your original dmesg output, your were seeing timeouts on command 0xc8 (Read DMA). The most significant change between v1.24 and v1.25 is

commit 4bdee6c5103696a2729d3db2f235d202191788e4
Author: Tejun Heo <htejun>
Date:   Wed Aug 13 20:24:16 2008 +0900

    sata_mv: don't issue two DMA commands concurrently

    sata_mv allowed issuing two DMA commands concurrently which the
    hardware allows.  Unfortunately, libata core layer isn't ready for
    this yet and spews ugly warning message and malfunctions on this.
    Don't allow concurrent DMA commands for now.

If possible, I would like to see if we could narrow the fix down to this
change.

Comment 17 Orion Poplawski 2010-04-01 19:28:22 UTC
Can do.

Comment 18 David Milburn 2010-04-02 00:16:42 UTC
Please test kernel-2.6.18-194.el5.bz554872_2.1

http://people.redhat.com/dmilburn/.bz554872/

Comment 19 Orion Poplawski 2010-04-06 20:36:06 UTC
With kernel-2.6.18-194.el5.bz554872_2.1 I seem to be seeing the errors again:

ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata14.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata14.00: status: { DRDY }
ata14: hard resetting link
ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata14.00: configured for UDMA/133
ata14: EH complete
SCSI device sdk: 3907029168 512-byte hdwr sectors (2000399 MB)
sdk: Write Protect is off
sdk: Mode Sense: 00 3a 00 00
SCSI device sdk: drive cache: write back

I also see these again at boot for the drives on the MV88SX6081 controllers (which aren't having problems):

> ata1.00: max_sectors limited to 256 for NCQ
> ata1.00: max_sectors limited to 256 for NCQ

Comment 20 David Milburn 2010-04-06 20:51:18 UTC
Ok, sorry about that, let me take another look, I think the NCQ messages
you are referring to got removed upstream with this commit which I included
in the earlier test version.

commit cd12e1f7a2c28917c89d65c0d4a52d3919b4c125
Author: Mark Lord <liml>
Date:   Mon Jan 19 18:06:28 2009 -0500

    sata_mv: remove bogus nsect restriction

    Remove unneeded nsect restriction from GenII NCQ path,
    and improve comments to explain why this is not a problem.

Comment 21 David Milburn 2010-04-08 19:40:48 UTC
If possible, please test kernel-2.6.18-194.el5.bz554872_dma.1.i686.rpm, I have
included a couple of IRQ fixes as well as the DMA fix that went into v 1.25

commit 5d0fb2e730e2085021cf5c8b6d14983e92aea75b
Author: Thomas Reitmayr <treitmayr>
Date:   Sat Jan 24 20:24:58 2009 +0100

    sata_mv: Properly initialize main irq mask

commit cae6edc3b5a536119374a5439d9b253cb26f05e1
Author: Mark Lord <liml>
Date:   Mon Jan 19 18:05:42 2009 -0500

    sata_mv: don't read hc_irq_cause

Also, I included the patch mentioned in Comment#20 to remove bogus nsect
restriction, thanks.

Comment 22 David Milburn 2010-04-08 19:41:36 UTC
Sorry, here is the link:

http://people.redhat.com/dmilburn/.bz554872/

Comment 23 Orion Poplawski 2010-04-22 14:56:50 UTC
2.6.18-194.el5.bz554872_dma.1 appears to be working fine so far.  Completed an array rebuild fine.

Comment 24 David Milburn 2010-04-23 17:51:41 UTC
Ok, thanks for the update, I will work on getting these patches reviewed
for RHEL5.

Comment 25 Orion Poplawski 2010-04-27 14:38:50 UTC
Finally saw a frozen message with 2.6.18-194.el5.bz554872_dma.1, tied to a smartd health check:

Apr 26 23:49:34 alexandria kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 26 23:49:34 alexandria kernel: ata8.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Apr 26 23:49:34 alexandria kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 26 23:49:34 alexandria kernel: ata8.00: status: { DRDY }
Apr 26 23:49:34 alexandria kernel: ata8: hard resetting link

Apr 26 23:49:40 alexandria kernel: ata8: link is slow to respond, please be patient (ready=0)

Apr 26 23:49:42 alexandria kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 26 23:49:42 alexandria kernel: ata8.00: configured for UDMA/133
Apr 26 23:49:42 alexandria kernel: sd 7:0:0:0: timing out command, waited 7s
Apr 26 23:49:42 alexandria kernel: ata8: EH complete
Apr 26 23:49:42 alexandria kernel: SCSI device sdf: 3907029168 512-byte hdwr sectors (2000399 MB)
Apr 26 23:49:42 alexandria kernel: sdf: Write Protect is off
Apr 26 23:49:42 alexandria kernel: SCSI device sdf: drive cache: write back
Apr 26 23:49:42 alexandria smartd[3636]: Device: /dev/sdf, not capable of SMART self-check 
Apr 26 23:49:42 alexandria smartd[3636]: Sending warning via mail to root ... 
Apr 26 23:49:42 alexandria smartd[3636]: Warning via mail to root: successful

Comment 26 David Milburn 2010-04-29 15:20:39 UTC
Ok, and you don't think you saw those failures with this earlier test kernel
(which was an upgrade to v1.25)?

kernel-2.6.18-191.el5.bz554872.1.i686.rpm

I was hoping to isolate some specific fixes, I thought the -194.el5.bz554872_dma.1 had all the relevent fixes between v1.24 and v1.25.

Comment 27 Orion Poplawski 2010-04-29 15:35:37 UTC
I did not see this with kernel-2.6.18-191.el5.bz554872.1.i686.rpm, but I've only seen it once so the stars may need to be aligned just right to trigger it.

Hopefully this won't muddy up this bug report too much, but I'm loading up another array and just got:

ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable
ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in
         res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F>
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable
ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in
         res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F>
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable
ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in
         res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F>
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: NCQ disabled due to excessive errors
ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable
ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in
         res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F>
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:00:3f:74:e0/00:00:00:00:00/a0 Emask 0x1 (device error)
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }
ata7.00: failed to read native max address (err_mask=0x1)
ata7.00: HPA support seems broken, skipping HPA handling
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 61/04:ff:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
EXT4-fs (md3): mounted filesystem with ordered data mode
SELinux: initialized (dev md3, type ext4), uses xattr
ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7: limiting SATA link speed to 1.5 Gbps
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: configured for UDMA/133 (device error ignored)
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata7.00: cmd 35/00:08:3f:74:e0/00:00:e8:00:00/e0 tag 0 dma 4096 out
         res 61/04:08:3f:74:e0/04:00:e8:00:00/e0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: configured for UDMA/133 (device error ignored)
sd 6:0:0:0: SCSI error: return code = 0x08000002
sdw: Current [descriptor]: sense key: Aborted Command
    Add. Sense: No additional sense information

Descriptor sense data with sense descriptors (in hex):
        72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
        e8 e0 74 3f 
end_request: I/O error, dev sdw, sector 3907023935
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00
SCSI device sdw: drive cache: write back
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 61/04:ff:00:00:00/04:00:e8:00:00/a0 Emask 0x1 (device error)
ata7.00: status: { DRDY DF ERR }
ata7.00: error: { ABRT }
ata7.00: configured for UDMA/133 (device error ignored)
raid5: Disk failure on sdw1, disabling device. Operation continuing on 5 devices
ata7: EH complete
SCSI device sdw: 3907029168 512-byte hdwr sectors (2000399 MB)
sdw: Write Protect is off
sdw: Mode Sense: 00 3a 00 00

Comment 28 Orion Poplawski 2010-04-29 15:39:01 UTC
ata7 is on one of the MV88SX6081 8-port SATA II controllers, so shouldn't be related to the 5081 issues.

Comment 29 David Milburn 2010-04-29 15:48:35 UTC
Ok, let me look over the driver changes again between v1.24 and v1.28.

Comment 30 David Milburn 2010-05-05 22:39:16 UTC
Looking at the beginning of the errors in Comment#27

ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
ata7.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA
self-disable
ata7.00: cmd 60/80:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 65536 in
         res 41/04:00:3f:00:00/00:00:00:00:00/40 Emask 0x401 (device error) <F>
ata7.00: status: { DRDY ERR }
ata7.00: error: { ABRT }

This shows that edma_err_cause has bit 7 (Gen II/IIE self-disable) and bit
2 (device error), pp_flags have bit 0 (DMA engine enabled) and bit 1 (setup
for NCQ).

It does look like the failure occurred on a NCQ command. 

I am looking over the changes between v1.24 and v1.28, I am not
sure how much runtime you were able to get on the 2.6.33 kernel, do you think
you performed the same tests on 2.6.33 as these test kernels?

Comment 31 Orion Poplawski 2010-05-05 22:49:35 UTC
This is probably completely unrelated to the original issue (6081 vs 5081).  I did not have much time on the 2.6.33 kernels, but I at least did an array rebuild.  This error seemed to occur just after array creation while creating the filesystem, which is stuff I did not do with 2.6.33.  I have not seen this error again, currently running 2.6.18-194.el5.bz554872_dma.1.

Comment 32 Orion Poplawski 2010-05-18 16:13:26 UTC
David -  How does 2.6.18-194.3.1.el5 compare with 2.6.18-194.el5.bz554872_dma.1?  Wondering if I should move to that or not.  Thanks.

Comment 33 David Milburn 2010-05-18 20:01:00 UTC
Hi, the 2.6.18-194.3.1.el5 kernel should not help, it will be using v1.24.

The 2.6.18-194.el5.bz554872_dma.1 kernel includes these commits on top v1.24

commit 4bdee6c5103696a2729d3db2f235d202191788e4
    sata_mv: don't issue two DMA commands concurrently

commit cae6edc3b5a536119374a5439d9b253cb26f05e1
    sata_mv: don't read hc_irq_cause

commit cd12e1f7a2c28917c89d65c0d4a52d3919b4c125
    sata_mv: remove bogus nsect restriction

commit 5d0fb2e730e2085021cf5c8b6d14983e92aea75b
    sata_mv: Properly initialize main irq mask

Probably, the best thing, if possible, is to get more runtime on 2.6.33 or .34
so we can be sure the timeouts are not reproducible on the upstream kernel. Do
you have a system you can try this on? Thanks.

Comment 34 Orion Poplawski 2010-05-18 20:30:03 UTC
I have only seen one exception message since May 1 (and it recovered fine from it), so this kernel (dma.1) is a *lot* better than the old EL5 kernel.  I do have test machine but have been unable to really reproduce the error reliably there.  I'm reluctant to run .33 or .34 on the production machine.  I'll try again and see if I can reproduce on the test machine.

Comment 35 Orion Poplawski 2010-05-24 14:51:54 UTC
I set up my test machine with stock kernel-2.6.18-194.3.1.el5 and was able to trigger the DRDY messages by running smartcl -H in tandem with raid5 array checks.  Then booted kernel-2.6.18-191.el5.bz554872.1 and have not been able to reproduce after running for 4 days.  It may not be 100%, but it's damn close I can't reliably reproduce to test.

Comment 36 David Milburn 2010-05-26 21:27:18 UTC
Ok, thanks for the update, I will submit for review an update to the sata_mv
driver including some of these fixes.

Comment 38 RHEL Program Management 2010-08-19 13:09:15 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 48 Jarod Wilson 2010-09-03 19:05:42 UTC
in kernel-2.6.18-215.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 50 Konstantin Olchanski 2010-09-17 23:08:21 UTC
(In reply to comment #48)
> in kernel-2.6.18-215.el5
> You can download this test kernel from http://people.redhat.com/jwilson/el5

I confirm that this problem is not present in kernel 2.6.18-222.el5 from above.

(Problem was present making machine unusable in kernel 2.6.18-194.11.3.el5 8-port MV88SX5081).

Thank you for fixing it. Please hurry the fix into the security-updates kernels.

K.O.

Comment 51 Orion Poplawski 2010-09-22 22:19:46 UTC
FWIW - Running -222. It is much better, but I do appear to get this once a day when smartd runs a check:

Sep 21 20:42:01 alexandria kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 21 20:42:01 alexandria kernel: ata9.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Sep 21 20:42:01 alexandria kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 21 20:42:01 alexandria kernel: ata9.00: status: { DRDY }
Sep 21 20:42:01 alexandria kernel: ata9: hard resetting link
Sep 21 20:42:01 alexandria kernel: ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 21 20:42:08 alexandria kernel: ata9.00: revalidation failed (errno=-2)
Sep 21 20:42:08 alexandria kernel: ata9: failed to recover some devices, retrying in 5 secs
Sep 21 20:42:13 alexandria kernel: ata9: hard resetting link
Sep 21 20:42:13 alexandria kernel: ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 21 20:42:14 alexandria kernel: ata9.00: configured for UDMA/133
Sep 21 20:42:14 alexandria kernel: sd 8:0:0:0: timing out command, waited 7s
Sep 21 20:42:14 alexandria kernel: ata9: EH complete
Sep 21 20:42:14 alexandria kernel: SCSI device sdi: 2930277168 512-byte hdwr sectors (1500302 MB)
Sep 21 20:42:14 alexandria kernel: sdi: Write Protect is off
Sep 21 20:42:14 alexandria kernel: SCSI device sdi: drive cache: write back
Sep 21 20:42:14 alexandria smartd[3903]: Device: /dev/sdi, not capable of SMART self-check 

I do think it has something to do with this particular drive though, as it started when I put a new drive in that slot.

Sep 21 08:41:14 alexandria kernel: ata9.00: ATA-8: WDC WD15EADS-00R6B0, 01.00A01, max UDMA/133

Comment 52 Konstantin Olchanski 2010-09-22 22:47:37 UTC
(In reply to comment #50)
> (In reply to comment #48)
> > in kernel-2.6.18-215.el5
> > You can download this test kernel from http://people.redhat.com/jwilson/el5
> I confirm that this problem is not present in kernel 2.6.18-222.el5 from above.


I confirm that the -222 kernel gives instant root using the CVE-2010-3081 exlpoit.
I confirm that the sata_mv "frozen" problem is back in the -194.11.4 kernel issued as fix the CVE-2010-3081.

Looks like the choice is between a kernel with working disks and a kernel with a working root password.


K.O.

Comment 53 Konstantin Olchanski 2010-09-23 00:17:11 UTC
(In reply to comment #52)
> (In reply to comment #50)
> > (In reply to comment #48)
> > > in kernel-2.6.18-215.el5
> > > You can download this test kernel from http://people.redhat.com/jwilson/el5


Good news. I confirm that -223 kernel has working sata_mv and seems to be immune to CVS-2010-3081.


K.O.

Comment 54 David Milburn 2010-09-23 13:29:13 UTC
Thanks for the updates, if you have more problems with sata_mv please open
a new BZ and we can work it from there.

Comment 55 Konstantin Olchanski 2010-10-27 18:59:45 UTC
Latest kernel update 2.6.18-194.17.4.el5 still has the old non-working sata_mv drivers. What's the ETA for working drivers to show up in the kernel updates? Does "ON_QA" mean that they are waiting for feedback from all three people using marvell sata cards?
K.O.

Comment 56 Jarod Wilson 2010-10-27 21:23:20 UTC
These patches are slated to be in the RHEL5.6 GA kernel, they've not been proposed for backporting into a RHEL5.5 update kernel. The "ON_QA" status means that a kernel with these patches is in the hands of Red Hat's QA team, awaiting testing.

Comment 58 Gris Ge 2010-12-29 06:11:04 UTC
Code reviewed and patchs found and applied into kernel-2.6.18-236.el5

Set as Sanity Only.

Comment 60 errata-xmlrpc 2011-01-13 20:59:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html


Note You need to log in before you can comment on or make changes to this bug.