Bug 404851

Summary: [RHEL5.3][Kernel] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
Product: Red Hat Enterprise Linux 5 Reporter: Jeff Burke <jburke>
Component: kernel    Assignee: David Milburn <dmilburn>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.3    CC: dmilburn, duck, dzickus, jgarzik, jss, mgahagan, peterm, syeghiay
Target Milestone: rc   
Target Release: ---   
Hardware: i386   
OS: Linux   
URL: http://rhts.lab.boston.redhat.com/testlogs/11377/37289/294536/sys.log
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 19:47:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 448732, 483701    
Attachments:
Description                                                                 Flags
lspci -xvvv output                                                          none
dmesg, kernel 2.6.30, sda2 frozen, reset, then disabled, some call traces   none

Description Jeff Burke 2007-11-29 19:02:49 UTC
Description of problem:
 SATA drives go offline. I have seen this happen in several ways: during install,
and after install while running tests.

Version-Release number of selected component (if applicable):
 2.6.18-53.el5 

How reproducible:
 Very intermittent. I have only seen this on a few hosts:
hp-xw4550-01.rhts.boston.redhat.com, ixen-01.rhts.boston.redhat.com,
ixen-02.rhts.boston.redhat.com

Steps to Reproduce:
1. Not sure. It happens both during install and after install while running
random tests.
  
Actual results:
ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
ata1.00: cmd 60/20:00:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 16384 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: limiting SATA link speed to 1.5 Gbps
ata1.00: limiting speed to UDMA/100:PIO3
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 0
Buffer I/O error on device sda, logical block 0
Buffer I/O error on device sda, logical block 1
Buffer I/O error on device sda, logical block 2
Buffer I/O error on device sda, logical block 3
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 0
Buffer I/O error on device sda, logical block 0
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 156301480
Buffer I/O error on device sda, logical block 19537685
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 156301480
Buffer I/O error on device sda, logical block 19537685
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 156301480
Buffer I/O error on device sda, logical block 19537685
sd 0:0:0:0: SCSI error: return code = 0x00040000

Expected results:
 The drives should not go offline.

Additional info:
 I have seen this both with the RHEL5.1 kernel and the RT kernel 2.6.21 (base).
I don't believe this is a hardware issue, because the same hosts had run other
RHTS jobs before and are running other RHTS jobs now just fine. I have never
seen this issue with RHEL4.

Comment 1 Peter Martuccelli 2007-11-30 19:26:38 UTC
Without a more solid reproducer this is going to be difficult to track down.
Maybe Jeff will have some additional input regarding the SATA recovery modes
leading up to the drive being disabled.

Comment 2 David Milburn 2008-08-20 13:58:09 UTC
The error/recovery handling will be updated in RHEL5.3; moving this to 5.4
to see if it remains reproducible.

Comment 3 RHEL Program Management 2009-02-16 15:26:27 UTC
Updating PM score.

Comment 4 John 2009-06-30 21:33:13 UTC
I can reproduce this problem easily on an HP DC7900 Small Form Factor PC running CentOS 5.3, with 4 GB RAM and 2x 500 GB drives in a RAID1 array.

Reproducible with the standard CentOS kernel 2.6.18-128 and with the latest vanilla kernel 2.6.30.

The problem persists with PAE disabled; with the scheduler set to preemptible, voluntary, or even no preemption; with smartd enabled and disabled; with ACPI on and off; and with and without the noapic and nolapic boot options.

I cannot get this problem to go away... so if you need someone to reproduce this problem, I can do it. Easily.

The HP DC7900 uses the Intel Q45 chipset with the ICH10 controller. Intel has already released a Windows patch for the Intel Matrix Storage Manager (iastor.sys), which handles this ICH10 chipset; that patch fixed random BSOD issues, but I'm not sure whether it is related.

So far, when it has occurred, this error has had three outcomes:
1) Nothing bad happens; the link gets reset and the drive continues.
2) The drive gets reset, the journal gets aborted, and the device is remounted read-only. fsck on reboot does not do much besides create a new journal. Unmounting and then running a manual fsck fixes some errors, then later dmesg shows attempts to access past the end of the device, and a further fsck shows massive corruption on the software RAID array.
3) One of the two drives in the array is disabled and the RAID continues to operate in degraded mode on one drive. After a reboot, recovery of the second drive in the array with "mdadm /dev/md1 --re-add /dev/sda2" succeeds.

Steps to reproduce:

1) Install a CentOS 5.3/RHEL 5.3 system on a software RAID (mirrored) array. Create a mirrored data partition too, so you can thrash the data partition and hopefully cause corruption there rather than on the root partition.
2) Install stress, bonnie++, dbench, and iozone (maybe get the RPMs from EPEL or RPMForge).
3) Create a small script to thrash CPU and disks simultaneously, e.g.:
#!/bin/sh
# megastress.sh - thrash CPU and disks simultaneously
bonniepp () {
    su survey1 -c "bonnie++ -d /data/bonnie/ -p3 > /home/survey1/bonnie0.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie1.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie2.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie3.log" &
}
iozone -a &  # how to specify the filesystem to test on?
stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1800s &
dbench -D /data/bonnie -t 1800 6 &
bonniepp

4) Run megastress.sh as root.
5) In another terminal, run:
while [ 1 ] ; do dmesg | grep frozen ; sleep 5 ; done
6) Wait 5-10 minutes and the grep will start returning lines saying your ata1 device has frozen. Then terminate stress, bonnie++, dbench, and iozone before they do any more damage to your filesystem.
7) Run dmesg and see what damage was done.

Comment 5 John 2009-06-30 21:38:14 UTC
I should add that I am looking after a cluster of six HP DC7900 nodes.

The problem is reproducible on all of them, so this is not bad hardware.

I am keen to run any tests, recompile the kernel with patches or changed options, and do anything anyone can suggest to help nail down this pig of a bug.

Would be grateful for any suggestions, Thanks!
John

P.S. I will attach some dmesg output below.

Comment 6 David Milburn 2009-06-30 22:03:44 UTC
Would you please attach the output of "lspci -xvvv" along with your dmesg
output?

Comment 7 John 2009-06-30 23:03:21 UTC
Created attachment 350030 [details]
lspci -xvvv output

This is lspci -xvvv output from an HP DC7900 Small Form Factor PC.
Intel Q45 (Express?) chipset, with ICH10.
BIOS = v1.16

Comment 8 John 2009-06-30 23:12:03 UTC
Created attachment 350031 [details]
dmesg, kernel 2.6.30, sda2 frozen, reset, then disabled, some call traces

There are some interesting call traces in this dmesg, e.g.:

INFO: task dbench:7585 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbench        D c04be555  1820  7585   7553
 ef275e10 00200082 ef275de8 c04be555 eb280400 ef275df4 c04be555 00000000
 98757161 000001aa ef1902e8 f7086d20 ef190070 ef1902e8 c2c1f100 00000001
 c091b100 ee4b7740 00000000 ef275de4 008bd314 00000000 c044a4fe f64f8b50
Call Trace:
 [<c04be555>] ? __find_get_block+0x85/0x190
 [<c04be555>] ? __find_get_block+0x85/0x190
 [<c044a4fe>] ? getnstimeofday+0x4e/0x110
 [<c070d1c8>] schedule+0x8/0x20
 [<c070d2c8>] io_schedule+0x48/0x80
 [<c0478155>] sync_page+0x35/0x40
 [<c070d5e5>] __wait_on_bit+0x45/0x70
 [<c0478120>] ? sync_page+0x0/0x40
 [<c04782d5>] wait_on_page_bit+0x55/0x60
 [<c0442b70>] ? wake_bit_function+0x0/0x60
 [<c0480f3e>] truncate_inode_pages_range+0x1ee/0x310
 [<c04c8513>] ? inotify_inode_is_dead+0x73/0x80
 [<f80e0a30>] ? ext3_delete_inode+0x0/0xd0 [ext3]
 [<c048107a>] truncate_inode_pages+0x1a/0x20
 [<f80e0a46>] ext3_delete_inode+0x16/0xd0 [ext3]
 [<f80e0a30>] ? ext3_delete_inode+0x0/0xd0 [ext3]
 [<c04b1ada>] generic_delete_inode+0x7a/0x140
 [<c04b1ccf>] generic_drop_inode+0x12f/0x1b0
 [<c04b1507>] iput+0x47/0x50
 [<c04aac1f>] do_unlinkat+0xbf/0x140
 [<c046039f>] ? audit_syscall_entry+0x14f/0x1d0
 [<c04aacb0>] sys_unlink+0x10/0x20
 [<c0402d84>] sysenter_do_call+0x12/0x22

Then dmesg finishes off with:

ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata1.00: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg 10B8B DevExch }
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:8c:f8:52:23/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete
end_request: I/O error, dev sda, sector 61641165
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sda2, disabling device.
raid1: Operation continuing on 1 devices.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sdb2

Comment 9 John 2009-06-30 23:48:19 UTC
I also have the bit below in my /var/log/messages, from after finishing the stress test, rebuilding the array, and rebooting into 2.6.30. It says there was an IRQ on line 10 and "nobody cared", i.e. it wasn't handled. I haven't seen this before, so I'm not sure what to think. Perhaps the IRQ wasn't handled because we were still booting up and the RAID devices had not been fully detected yet? Or maybe this indicates something is going wrong on IRQ 10, which is shared between the SATA controller and some other devices. Dmesg shows IRQ 10 is being shared with some PCI, EHCI, and UHCI devices as well as the ahci/ata devices...
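
If it helps, a quick way to see exactly which handlers currently share IRQ 10 is to look at /proc/interrupts rather than dmesg (just a convenience sketch; the counts and device names will of course differ per machine):

grep ' 10:' /proc/interrupts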

Jun 30 13:02:43 node105 kernel: ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part 
Jun 30 13:02:43 node105 kernel: irq 10: nobody cared (try booting with the "irqpoll" option)
Jun 30 13:02:43 node105 kernel: Pid: 430, comm: insmod Not tainted 2.6.30-LADS #3
Jun 30 13:02:43 node105 kernel: Call Trace:
Jun 30 13:02:43 node105 kernel:  [<c0464f57>] __report_bad_irq+0x27/0x90
Jun 30 13:02:43 node105 kernel:  [<c0463d9a>] ? handle_IRQ_event+0x4a/0x130
Jun 30 13:02:43 node105 kernel:  [<c04650fa>] note_interrupt+0x13a/0x180
Jun 30 13:02:43 node105 kernel:  [<c0465bf3>] handle_level_irq+0xa3/0xc0
Jun 30 13:02:43 node105 kernel:  [<c0465b50>] ? handle_level_irq+0x0/0xc0
Jun 30 13:02:43 node105 kernel:  <IRQ>  [<c04044c6>] ? do_IRQ+0x46/0xb0
Jun 30 13:02:43 node105 kernel:  [<c0403429>] ? common_interrupt+0x29/0x30
Jun 30 13:02:43 node105 kernel:  [<c04348de>] ? __do_softirq+0x3e/0x180
Jun 30 13:02:43 node105 kernel:  [<c04348a0>] ? __do_softirq+0x0/0x180
Jun 30 13:02:43 node105 kernel:  <IRQ>  [<c043488b>] ? irq_exit+0x2b/0x40
Jun 30 13:02:43 node105 kernel:  [<c04044cf>] ? do_IRQ+0x4f/0xb0
Jun 30 13:02:43 node105 kernel:  [<c0403429>] ? common_interrupt+0x29/0x30
Jun 30 13:02:43 node105 kernel:  [<c049b9dd>] ? __kmalloc+0xed/0x150
Jun 30 13:02:43 node105 kernel:  [<c0605760>] ? dmam_coherent_release+0x0/0x80
Jun 30 13:02:43 node105 kernel:  [<c0603f0f>] ? devres_alloc+0x1f/0x60
Jun 30 13:02:43 node105 kernel:  [<c060593b>] ? dmam_alloc_coherent+0x2b/0x100
Jun 30 13:02:43 node105 kernel:  [<c0603d65>] ? devm_kzalloc+0x55/0x70
Jun 30 13:02:43 node105 kernel:  [<f81c5259>] ? ahci_port_start+0x59/0xc0 [ahci]
Jun 30 13:02:43 node105 kernel:  [<f8186dad>] ? ata_host_start+0xbd/0x190 [libata]
Jun 30 13:02:43 node105 kernel:  [<f818771b>] ? ata_host_activate+0x1b/0xe0 [libata]
Jun 30 13:02:43 node105 kernel:  [<f81c6230>] ? ahci_interrupt+0x0/0x4f0 [ahci]
Jun 30 13:02:43 node105 kernel:  [<f81c5e8b>] ? ahci_init_one+0x8fb/0xca0 [ahci]
Jun 30 13:02:43 node105 kernel:  [<c0582d4e>] ? local_pci_probe+0xe/0x10
Jun 30 13:02:43 node105 kernel:  [<c058358c>] ? pci_device_probe+0x5c/0x80
Jun 30 13:02:43 node105 kernel:  [<c0601789>] ? driver_probe_device+0x69/0x140
Jun 30 13:02:43 node105 kernel:  [<c0582ee4>] ? pci_match_device+0xa4/0xb0
Jun 30 13:02:43 node105 kernel:  [<c06018d5>] ? __driver_attach+0x75/0x80
Jun 30 13:02:43 node105 kernel:  [<c0600d43>] ? bus_for_each_dev+0x43/0x70
Jun 30 13:02:43 node105 kernel:  [<c0582e10>] ? pci_device_shutdown+0x0/0x30
Jun 30 13:02:43 node105 kernel:  [<c0601629>] ? driver_attach+0x19/0x20
Jun 30 13:02:43 node105 kernel:  [<c0601860>] ? __driver_attach+0x0/0x80
Jun 30 13:02:43 node105 kernel:  [<c0601352>] ? bus_add_driver+0xc2/0x280
Jun 30 13:02:43 node105 kernel:  [<c0582e10>] ? pci_device_shutdown+0x0/0x30
Jun 30 13:02:43 node105 kernel:  [<c0601bba>] ? driver_register+0x5a/0x130
Jun 30 13:02:43 node105 kernel:  [<f81ce000>] ? ahci_init+0x0/0x19 [ahci]
Jun 30 13:02:43 node105 kernel:  [<c05837c0>] ? __pci_register_driver+0x40/0xa0
Jun 30 13:02:43 node105 kernel:  [<f81ce000>] ? ahci_init+0x0/0x19 [ahci]
Jun 30 13:02:43 node105 kernel:  [<f81ce017>] ? ahci_init+0x17/0x19 [ahci]
Jun 30 13:02:43 node105 kernel:  [<c040112a>] ? do_one_initcall+0x2a/0x150
Jun 30 13:02:43 node105 kernel:  [<c046b548>] ? tracepoint_module_notify+0x28/0x30
Jun 30 13:02:43 node105 kernel:  [<c0710e17>] ? notifier_call_chain+0x37/0x70
Jun 30 13:02:43 node105 kernel:  [<c044702d>] ? __blocking_notifier_call_chain+0x4d/0x60
Jun 30 13:02:43 node105 kernel:  [<c0455a13>] ? sys_init_module+0x83/0x1b0
Jun 30 13:02:43 node105 kernel:  [<c049f9d0>] ? sys_close+0x60/0xa0
Jun 30 13:02:43 node105 kernel:  [<c0402e45>] ? syscall_call+0x7/0xb
Jun 30 13:02:43 node105 kernel: handlers:
Jun 30 13:02:43 node105 kernel: [<c062e510>] (usb_hcd_irq+0x0/0x70)
Jun 30 13:02:43 node105 last message repeated 2 times
Jun 30 13:02:43 node105 kernel: Disabling IRQ #10

The odd thing is, despite the message:
Jun 30 13:02:43 node105 kernel: Disabling IRQ #10

the messages which followed soon after still seem to show the ahci devices on IRQ 10:
Jun 30 13:02:43 node105 kernel: ata1: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126100 irq 10
Jun 30 13:02:43 node105 kernel: ata2: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126180 irq 10
Jun 30 13:02:43 node105 kernel: ata3: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 10
Jun 30 13:02:43 node105 kernel: ata4: DUMMY
Jun 30 13:02:43 node105 kernel: ata5: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126300 irq 10

Now over to the experts... where do I go from here?!

Ta for any suggestions!

Comment 10 John 2009-07-01 00:32:39 UTC
Sorry, the problem in the message above, i.e.:
Jun 30 13:02:43 node105 kernel: irq 10: nobody cared
etc., was caused by booting into the kernel with acpi=off noapic nolapic.

The problem with the "frozen" devices remains, regardless of boot options...

So I would still appreciate any advice on what to investigate next. We're trying to use these nodes to collect and process GBs of hydrographic (lidar) data up here in Alaska, and until this is resolved, we're having to stop the nodes holding our data from doing any processing.

As soon as we give those nodes any CPU work to do, we risk corrupting and losing our data. So this bug is a major pain in the buttock region. 

I've also tried updating the BIOS on one of these nodes to v1.19, released 5 or 6 days ago, but the problem persists...

I do have a node to myself on which I am happy to test anything and everything that anyone suggests...

thanks again...

Comment 11 David Milburn 2009-07-01 20:36:14 UTC
Looking at the error message,

ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata1.00: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg 10B8B DevExch }
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:8c:f8:52:23/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }

According to the SATA Error register (SErr=0x4090000), the following errors
are occurring between the controller and the device, resulting in the ATA
bus error:

PHYRdyChg - internal PHY error; the communication link was lost.
10B8B - the host controller detected a 10b-to-8b decoding error.
DevExch - the device presence status bit has changed.
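
As a quick cross-check (the bit positions below are from the standard SError layout, not from this log), those three names correspond to bits 16, 19, and 26, and a throwaway shell snippet confirms they are exactly the bits set in 0x4090000:

for bit in 16 19 26; do
    printf 'bit %d: %d\n' "$bit" "$(( (0x4090000 >> $bit) & 1 ))"
done

Each line prints 1, and masking those three bits out of 0x4090000 leaves 0, so no other error bits are set.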

This usually points to an electrical problem, though I do realize you said
it is reproducible on 6 different systems.

I assume they are all using this same SATA drive:

WDC WD5000AAKS-6

If possible try to reproduce the problem using a different drive.

Another good test would be turning off NCQ before trying to reproduce the
problem:

# echo 1 > /sys/block/sdX/device/queue_depth

You could also try forcing the link down to 1.5 Gbps by including this option in
/etc/modprobe.conf and rebuilding the initrd:

options libata force=1.5Gbps
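
For reference, on a RHEL5-style setup the whole sequence would look roughly like this (only a sketch; the mkinitrd invocation and the sdX names are assumptions about your systems):

# disable NCQ on both disks for the current boot
echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth

# force the SATA link to 1.5 Gbps at module load time
echo "options libata force=1.5Gbps" >> /etc/modprobe.conf
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)

and then a reboot for the libata option to take effect.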

Comment 13 John 2009-07-03 16:27:39 UTC
Sorry, Wednesday was my day off, so I had to wait until yesterday to try the suggestions. Late on Tuesday, though, I went into the BIOS and thought I might have fixed the problem by disabling a whole bunch of devices, including two USB controllers the BIOS had put on IRQ 10, which seemed to be the one also in use by the SATA controller, at least when booting with noapic and nolapic. Anyway, disabling these devices did not help; the problem still occurred :(

So, yesterday I first tried David's suggestion to force SATA down with "options libata force=1.5Gbps". It didn't help.

But then... reverting to 3 Gbps and turning off NCQ... SUCCESS!!!

I have now turned all the devices back on in the BIOS, reverted to 3 Gbps, and turned NCQ off. I've done 10 hours of stress testing on 2 nodes, and I can no longer reproduce this error! There is no doubt whatsoever that turning off NCQ has fixed the problem.
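
One note: the queue_depth setting does not survive a reboot, so to keep NCQ off across reboots something like the following in /etc/rc.local should work (a sketch; sda/sdb are the drive names on these particular nodes):

# re-disable NCQ on every boot
for d in sda sdb; do
    echo 1 > /sys/block/$d/device/queue_depth
done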

I also did some googling and found this, which says to add WDC WD5000AAKS-0 to the NCQ blacklist:
http://www.transcoding.org/cgi-bin/wiki?action=browse&id=Western_Digital_NCQ&revision=26
and this:
http://bugzilla.kernel.org/show_bug.cgi?id=8627

So, apparently Western Digital don't have a clue how to do NCQ... I think these drives should be permanently blacklisted in the kernel's libata.c; with NCQ on they are an absolute menace.

Thanks again for the help David, it is hugely appreciated. Please accept my apologies for posting in here, as it seems my issue was a bit different from the original poster's.

If there is anything you would like me to try, in order to work out exactly what NCQ-related stuff this drive is doing wrong so we can take the results to WD and tell them to fix it, please let me know.

Comment 15 David Milburn 2009-09-02 19:47:19 UTC
Since we made significant changes to the error/recovery code in RHEL5 U3 and
haven't seen the original problem, and John's problem (Comment #13) is
specific to his Western Digital drive, I am closing this BZ. John, if you
want to pursue the NCQ issue, please open a separate BZ. Ideally, we would need
to reproduce on an upstream kernel, and if possible, blacklist the particular
drive upstream and then backport to RHEL5.