Bug 404851
| Summary: | [RHEL5.3][Kernel] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen |
| --- | --- |
| Product: | Red Hat Enterprise Linux 5 |
| Component: | kernel |
| Version: | 5.3 |
| Hardware: | i386 |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | Jeff Burke <jburke> |
| Assignee: | David Milburn <dmilburn> |
| QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| CC: | dmilburn, duck, dzickus, jgarzik, jss, mgahagan, peterm, syeghiay |
| Target Milestone: | rc |
| Target Release: | --- |
| URL: | http://rhts.lab.boston.redhat.com/testlogs/11377/37289/294536/sys.log |
| Doc Type: | Bug Fix |
| Last Closed: | 2009-09-02 19:47:19 UTC |
| Bug Blocks: | 448732, 483701 |
Description
Jeff Burke, 2007-11-29 19:02:49 UTC
Without more of a solid reproducer, this is going to be difficult to track down. Maybe Jeff will have some additional input regarding the SATA recovery modes leading up to the drive being disabled. The error/recovery handling will be updated in RHEL 5.3; moving this to 5.4 to see if it remains reproducible. Updating PM score.

I can reproduce this problem easily on an HP DC7900 Small Form Factor PC running CentOS 5.3: 4 GB RAM, 2x 500 GB drives in a RAID1 array. It is reproducible with the standard CentOS kernel 2.6.18-128 and with the latest vanilla kernel 2.6.30. The problem persists with PAE disabled; with the preemptive, voluntary, and no-preemption scheduler settings; with smartd enabled and disabled; with ACPI on and off; and with and without the noapic and nolapic boot options. I cannot get this problem to go away... so if you need someone to reproduce this problem, I can do it. Easily.

The HP DC7900 uses the Intel Q45 chipset with the ICH10 controller, and it has already had a patch released for Windows (well, for the Intel Matrix Storage Manager, iastor.sys, which handles this ICH10 chipset). That patch fixed random BSOD issues; I am not sure whether it is related.

So far, when it has occurred, this error has had 3 outcomes:

1) Nothing bad happens; the link gets reset and the drive continues.
2) The drive gets reset, the journal gets aborted, and the device is remounted read-only. fsck on reboot does not do much besides create a new journal. Unmounting and then running fsck manually fixes some errors, but later dmesg shows attempts to access past the end of the device, and a further fsck shows massive corruption on the software RAID array.
3) One of the 2 drives in the array is disabled and the RAID continues to operate in degraded mode on 1 drive. After a reboot, recovery of the 2nd drive in the array with "mdadm /dev/md1 --re-add /dev/sda2" succeeds.

Steps to reproduce:

1) Install a CentOS 5.3/RHEL 5.3 system on a software RAID (mirrored) array. Create a mirrored data partition too, so you can thrash the data partition and hopefully cause corruption on that partition rather than on the root partition.
2) Install stress, bonnie++, dbench, and iozone (maybe get RPMs from EPEL or RPMForge).
3) Create a small script to thrash the CPU and disks simultaneously, e.g.:

```shell
#!/bin/sh
# megastress.sh
bonniepp() {
    su survey1 -c "bonnie++ -d /data/bonnie/ -p3 > /home/survey1/bonnie0.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie1.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie2.log" &
    su survey1 -c "bonnie++ -d /data/bonnie/ -y s -x 15 > /data/bonnie3.log" &
}
iozone -a &   # how to specify the filesystem to test on?
stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1800s &
dbench -D /data/bonnie -t 1800 6 &
bonniepp
```

4) Run megastress.sh as root.
5) In another terminal, run:

```shell
while [ 1 ] ; do dmesg | grep frozen ; sleep 5 ; done
```

6) Wait 5-10 minutes and you will see the grep start returning lines saying your ata1 device has frozen. Then you need to terminate stress, bonniepp, dbench, and iozone before they do any more damage to your filesystem...
7) Run dmesg and see what damage was done.

I should add that I am looking after a cluster of 6 HP DC7900 nodes. The problem is reproducible on all of them, so this is not bad hardware. I am keen to run any tests, recompile the kernel with patches or changed options, and do anything that anyone can suggest to help nail down this pig of a bug. Would be grateful for any suggestions. Thanks! John

PS: I will attach some dmesg output below.

Would you please attach the output of "lspci -xvvv" along with your dmesg output?

Created attachment 350030 [details]
lspci -xvvv output
This is lspci -xvvv output from an HP DC7900 Small Form Factor PC.
Intel Q45 (Express?) chipset, with ICH10.
BIOS v1.16
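Outcome (3) in the reproduction report above leaves the mirror running degraded until the dropped half is re-added with "mdadm /dev/md1 --re-add /dev/sda2". For spotting that state from a monitoring script, here is a minimal sketch that greps /proc/mdstat for a missing mirror member; the sample file below is invented to mimic the [_U] status a dropped sda2 produces, and on a live system you would point the function at /proc/mdstat itself.

```shell
#!/bin/sh
# Sketch: flag degraded md arrays by looking for a "_" in the member
# status bitmap (e.g. [_U] when the first mirror half is missing).
check_degraded() {
    grep '\[[U_]*_[U_]*\]' "$1"
}

# Invented sample modeled on this report; use /proc/mdstat on a real system.
cat > /tmp/mdstat.sample <<'EOF'
md1 : active raid1 sdb2[1]
      488287552 blocks [2/1] [_U]
EOF

check_degraded /tmp/mdstat.sample   # prints the degraded array's status line
```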
Created attachment 350031 [details]
dmesg, kernel 2.6.30, sda2 frozen, reset, then disabled, some call traces
There are some interesting call traces in this dmesg, e.g.:
```
INFO: task dbench:7585 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbench D c04be555 1820 7585 7553
ef275e10 00200082 ef275de8 c04be555 eb280400 ef275df4 c04be555 00000000
98757161 000001aa ef1902e8 f7086d20 ef190070 ef1902e8 c2c1f100 00000001
c091b100 ee4b7740 00000000 ef275de4 008bd314 00000000 c044a4fe f64f8b50
Call Trace:
[<c04be555>] ? __find_get_block+0x85/0x190
[<c04be555>] ? __find_get_block+0x85/0x190
[<c044a4fe>] ? getnstimeofday+0x4e/0x110
[<c070d1c8>] schedule+0x8/0x20
[<c070d2c8>] io_schedule+0x48/0x80
[<c0478155>] sync_page+0x35/0x40
[<c070d5e5>] __wait_on_bit+0x45/0x70
[<c0478120>] ? sync_page+0x0/0x40
[<c04782d5>] wait_on_page_bit+0x55/0x60
[<c0442b70>] ? wake_bit_function+0x0/0x60
[<c0480f3e>] truncate_inode_pages_range+0x1ee/0x310
[<c04c8513>] ? inotify_inode_is_dead+0x73/0x80
[<f80e0a30>] ? ext3_delete_inode+0x0/0xd0 [ext3]
[<c048107a>] truncate_inode_pages+0x1a/0x20
[<f80e0a46>] ext3_delete_inode+0x16/0xd0 [ext3]
[<f80e0a30>] ? ext3_delete_inode+0x0/0xd0 [ext3]
[<c04b1ada>] generic_delete_inode+0x7a/0x140
[<c04b1ccf>] generic_drop_inode+0x12f/0x1b0
[<c04b1507>] iput+0x47/0x50
[<c04aac1f>] do_unlinkat+0xbf/0x140
[<c046039f>] ? audit_syscall_entry+0x14f/0x1d0
[<c04aacb0>] sys_unlink+0x10/0x20
[<c0402d84>] sysenter_do_call+0x12/0x22
```
Then dmesg finishes off with:
```
ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata1.00: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg 10B8B DevExch }
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:8c:f8:52:23/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete
end_request: I/O error, dev sda, sector 61641165
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sda2, disabling device.
raid1: Operation continuing on 1 devices.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sdb2
```
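The SError value in the log above (SErr 0x4090000) can be decoded mechanically: each flag the kernel prints in "SError: { PHYRdyChg 10B8B DevExch }" corresponds to one bit of the SATA SError register. A small sketch, mapping only the three bits seen in this report (bit positions per the SError register's diagnostics field):

```shell
#!/bin/sh
# Sketch: decode the SError bits seen in this report.
#   bit 16 (0x0010000) = PHYRdyChg  (PHY ready state changed)
#   bit 19 (0x0080000) = 10B8B      (10b/8b decode error)
#   bit 26 (0x4000000) = DevExch    (device presence changed)
decode_serr() {
    v=$(( $1 ))
    out=""
    [ $(( v & 0x0010000 )) -ne 0 ] && out="$out PHYRdyChg"
    [ $(( v & 0x0080000 )) -ne 0 ] && out="$out 10B8B"
    [ $(( v & 0x4000000 )) -ne 0 ] && out="$out DevExch"
    echo "{$out }"
}

decode_serr 0x4090000   # matches the kernel's { PHYRdyChg 10B8B DevExch }
```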
I also have this bit below in my /var/log/messages, after finishing the stress test, rebuilding the array, and rebooting into 2.6.30. It says there was an IRQ on line 10 and "nobody cared", i.e. it wasn't handled. I haven't seen this before, so I am not sure what to think. Perhaps the IRQ wasn't handled because we were still booting up and the RAID devices had not been fully detected yet? Or maybe this indicates something is going wrong on IRQ 10, which is shared between the SATA controller and some other devices. dmesg shows IRQ 10 is being shared with some PCI, EHCI, and UHCI devices, as well as the ahci/ata devices...

```
Jun 30 13:02:43 node105 kernel: ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part
Jun 30 13:02:43 node105 kernel: irq 10: nobody cared (try booting with the "irqpoll" option)
Jun 30 13:02:43 node105 kernel: Pid: 430, comm: insmod Not tainted 2.6.30-LADS #3
Jun 30 13:02:43 node105 kernel: Call Trace:
Jun 30 13:02:43 node105 kernel: [<c0464f57>] __report_bad_irq+0x27/0x90
Jun 30 13:02:43 node105 kernel: [<c0463d9a>] ? handle_IRQ_event+0x4a/0x130
Jun 30 13:02:43 node105 kernel: [<c04650fa>] note_interrupt+0x13a/0x180
Jun 30 13:02:43 node105 kernel: [<c0465bf3>] handle_level_irq+0xa3/0xc0
Jun 30 13:02:43 node105 kernel: [<c0465b50>] ? handle_level_irq+0x0/0xc0
Jun 30 13:02:43 node105 kernel: <IRQ> [<c04044c6>] ? do_IRQ+0x46/0xb0
Jun 30 13:02:43 node105 kernel: [<c0403429>] ? common_interrupt+0x29/0x30
Jun 30 13:02:43 node105 kernel: [<c04348de>] ? __do_softirq+0x3e/0x180
Jun 30 13:02:43 node105 kernel: [<c04348a0>] ? __do_softirq+0x0/0x180
Jun 30 13:02:43 node105 kernel: <IRQ> [<c043488b>] ? irq_exit+0x2b/0x40
Jun 30 13:02:43 node105 kernel: [<c04044cf>] ? do_IRQ+0x4f/0xb0
Jun 30 13:02:43 node105 kernel: [<c0403429>] ? common_interrupt+0x29/0x30
Jun 30 13:02:43 node105 kernel: [<c049b9dd>] ? __kmalloc+0xed/0x150
Jun 30 13:02:43 node105 kernel: [<c0605760>] ? dmam_coherent_release+0x0/0x80
Jun 30 13:02:43 node105 kernel: [<c0603f0f>] ? devres_alloc+0x1f/0x60
Jun 30 13:02:43 node105 kernel: [<c060593b>] ? dmam_alloc_coherent+0x2b/0x100
Jun 30 13:02:43 node105 kernel: [<c0603d65>] ? devm_kzalloc+0x55/0x70
Jun 30 13:02:43 node105 kernel: [<f81c5259>] ? ahci_port_start+0x59/0xc0 [ahci]
Jun 30 13:02:43 node105 kernel: [<f8186dad>] ? ata_host_start+0xbd/0x190 [libata]
Jun 30 13:02:43 node105 kernel: [<f818771b>] ? ata_host_activate+0x1b/0xe0 [libata]
Jun 30 13:02:43 node105 kernel: [<f81c6230>] ? ahci_interrupt+0x0/0x4f0 [ahci]
Jun 30 13:02:43 node105 kernel: [<f81c5e8b>] ? ahci_init_one+0x8fb/0xca0 [ahci]
Jun 30 13:02:43 node105 kernel: [<c0582d4e>] ? local_pci_probe+0xe/0x10
Jun 30 13:02:43 node105 kernel: [<c058358c>] ? pci_device_probe+0x5c/0x80
Jun 30 13:02:43 node105 kernel: [<c0601789>] ? driver_probe_device+0x69/0x140
Jun 30 13:02:43 node105 kernel: [<c0582ee4>] ? pci_match_device+0xa4/0xb0
Jun 30 13:02:43 node105 kernel: [<c06018d5>] ? __driver_attach+0x75/0x80
Jun 30 13:02:43 node105 kernel: [<c0600d43>] ? bus_for_each_dev+0x43/0x70
Jun 30 13:02:43 node105 kernel: [<c0582e10>] ? pci_device_shutdown+0x0/0x30
Jun 30 13:02:43 node105 kernel: [<c0601629>] ? driver_attach+0x19/0x20
Jun 30 13:02:43 node105 kernel: [<c0601860>] ? __driver_attach+0x0/0x80
Jun 30 13:02:43 node105 kernel: [<c0601352>] ? bus_add_driver+0xc2/0x280
Jun 30 13:02:43 node105 kernel: [<c0582e10>] ? pci_device_shutdown+0x0/0x30
Jun 30 13:02:43 node105 kernel: [<c0601bba>] ? driver_register+0x5a/0x130
Jun 30 13:02:43 node105 kernel: [<f81ce000>] ? ahci_init+0x0/0x19 [ahci]
Jun 30 13:02:43 node105 kernel: [<c05837c0>] ? __pci_register_driver+0x40/0xa0
Jun 30 13:02:43 node105 kernel: [<f81ce000>] ? ahci_init+0x0/0x19 [ahci]
Jun 30 13:02:43 node105 kernel: [<f81ce017>] ? ahci_init+0x17/0x19 [ahci]
Jun 30 13:02:43 node105 kernel: [<c040112a>] ? do_one_initcall+0x2a/0x150
Jun 30 13:02:43 node105 kernel: [<c046b548>] ? tracepoint_module_notify+0x28/0x30
Jun 30 13:02:43 node105 kernel: [<c0710e17>] ? notifier_call_chain+0x37/0x70
Jun 30 13:02:43 node105 kernel: [<c044702d>] ? __blocking_notifier_call_chain+0x4d/0x60
Jun 30 13:02:43 node105 kernel: [<c0455a13>] ? sys_init_module+0x83/0x1b0
Jun 30 13:02:43 node105 kernel: [<c049f9d0>] ? sys_close+0x60/0xa0
Jun 30 13:02:43 node105 kernel: [<c0402e45>] ? syscall_call+0x7/0xb
Jun 30 13:02:43 node105 kernel: handlers:
Jun 30 13:02:43 node105 kernel: [<c062e510>] (usb_hcd_irq+0x0/0x70)
Jun 30 13:02:43 node105 last message repeated 2 times
Jun 30 13:02:43 node105 kernel: Disabling IRQ #10
```

The odd thing is, despite the "Disabling IRQ #10" message, messages which followed soon after seem to still show the ahci devices on IRQ 10:

```
Jun 30 13:02:43 node105 kernel: ata1: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126100 irq 10
Jun 30 13:02:43 node105 kernel: ata2: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126180 irq 10
Jun 30 13:02:43 node105 kernel: ata3: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 10
Jun 30 13:02:43 node105 kernel: ata4: DUMMY
Jun 30 13:02:43 node105 kernel: ata5: SATA max UDMA/133 abar m2048@0xf3126000 port 0xf3126300 irq 10
```

Now over to the experts... where do I go from here?! Ta for any suggestions!

Sorry, the problem in the message above ("irq 10: nobody cared", etc.) was caused by booting into a kernel with acpi=off noapic nolapic. The problem with the "frozen" devices remains regardless of boot options... So I would still appreciate any advice on what to investigate next. We're trying to use these nodes to collect and process GB of hydrographic data (lidar) up here in Alaska, and until this is resolved, we're having to stop the nodes holding our data from doing any processing. As soon as we give those nodes any CPU work to do, we risk corrupting and losing our data. So this bug is a major pain in the buttock region.
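Even though the "nobody cared" splat above turned out to be a side effect of the acpi=off boot options, the underlying question of which handlers share a line is easy to script: /proc/interrupts lists every registered handler per IRQ. A sketch follows; the sample row is invented, modeled on the devices named in this report, and on the affected node you would read the real /proc/interrupts instead.

```shell
#!/bin/sh
# Sketch: print the /proc/interrupts row for one IRQ; the trailing
# fields name every driver sharing that line.
irq_line() {
    awk -v irq="$1:" '$1 == irq' "$2"
}

# Invented sample (real data comes from /proc/interrupts)
cat > /tmp/interrupts.sample <<'EOF'
 10:      12345   IO-APIC-level  ahci, uhci_hcd:usb3, ehci_hcd:usb1
EOF

irq_line 10 /tmp/interrupts.sample
```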
I've also tried updating the BIOS on one of these nodes to v1.19, released 5 or 6 days ago, but the problem persists... I do have a node to myself on which I am happy to test anything and everything that anyone suggests... thanks again...

Looking at the error message:

```
ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata1.00: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg 10B8B DevExch }
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:8c:f8:52:23/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
```

According to the SATA Error Register (SErr=0x4090000), the following errors are occurring between the controller and the device, resulting in the ATA bus error:

- PHYRdyChg - an internal PHY error; the communication link was lost.
- 10B8B - the host controller detected a decoding error.
- DevExch - the device presence status bit has changed.

This usually points to an electrical problem; I do realize you said it is reproducible on 6 different systems. I assume they are all using this same SATA drive, WDC WD5000AAKS-6. If possible, try to reproduce the problem using a different drive. Another good test would be turning off NCQ before trying to reproduce the problem:

```shell
# echo 1 > /sys/block/sdX/device/queue_depth
```

You could also try forcing the link down to 1.5 Gbps by including this option in /etc/modprobe.conf and rebuilding the initrd:

```
options libata force=1.5Gbps
```

Sorry, Wednesday was my day off, so I had to wait until yesterday to try the suggestions. Late on Tuesday, though, I went into the BIOS and thought I might have fixed the problem by disabling a whole bunch of devices, including two USB controllers the BIOS had put on IRQ 10, which seemed to be the one also in use by the SATA controller, at least when booting with noapic and nolapic. Anyway, disabling these devices did not help; the problem still occurred :( So yesterday I first tried David's suggestion to force SATA down with "options libata force=1.5Gbps". It didn't help.
But then... reverting to 3 Gbps and turning off NCQ: SUCCESS!!! I have now turned all the devices back on in the BIOS, reverted to 3 Gbps, and turned NCQ off. I've done 10 hours of stress testing on 2 nodes, and I can no longer reproduce this error! There is no doubt whatsoever that turning off NCQ has fixed the problem.

I also did some googling and found this page, which says to add WDC WD5000AAKS-0 to the NCQ blacklist: http://www.transcoding.org/cgi-bin/wiki?action=browse&id=Western_Digital_NCQ&revision=26 and this: http://bugzilla.kernel.org/show_bug.cgi?id=8627

So, apparently, Western Digital don't have a clue how to do NCQ... I think these drives should be permanently blacklisted in the kernel's libata.c; with NCQ on they are an absolute menace. Thanks again for the help, David, it is hugely appreciated. Please accept my apologies for posting in here, as it seems my issue was a bit different from the original poster's. If there is anything you would like me to try, in order to work out exactly what NCQ-related behaviour this drive gets wrong, so we can take the results to WD and tell them to fix it, please let me know.

Since we made significant changes to the error/recovery code in RHEL5 U3 and haven't seen the original problem, and John's problem (Comment #13) is specific to his Western Digital drive, I am closing this BZ. John, if you want to pursue the NCQ issue, please open a separate BZ. Ideally, we would need to reproduce on an upstream kernel and, if possible, blacklist the particular drive upstream and then backport to RHEL5.
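For anyone hitting this bug who wants to keep John's workaround in place across reboots before a kernel-side blacklist lands, one low-tech option is a boot-script fragment. This is a sketch only: the sd[ab] device glob matches the two-drive layout in this report and should be adjusted, and the queue_depth sysfs path is the one used earlier in the thread.

```shell
# Sketch for /etc/rc.d/rc.local: set queue depth to 1 at boot, which
# effectively disables NCQ on the mirror members. Adjust the glob to
# your own drive layout.
for qd in /sys/block/sd[ab]/device/queue_depth; do
    [ -w "$qd" ] && echo 1 > "$qd"
done
```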