Bug 157404
| Summary: | Loss of SATA ICH device hangs RAID1 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Igor <sviblovo> |
| Component: | kernel | Assignee: | Jeff Garzik <jgarzik> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Priority: | medium |
| Version: | 4.0 | CC: | peterm, spambox, tao, terry1 |
| Hardware: | i386 | OS: | Linux |
| Fixed In Version: | RHSA-2006-0575 | Doc Type: | Bug Fix |
| Last Closed: | 2006-08-10 21:05:38 UTC | Bug Blocks: | 181409 |

Description
Igor
2005-05-11 10:01:30 UTC
I ran the test with the latest kernel, kernel-2.6.9-5.0.5.EL. The result is the same.
kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
kernel: ata1: status=0xd0 { Busy }
kernel: SCSI error : <0 0 0 0> return code = 0x8000002
kernel: FMK Current sda: sense = 70 9d
kernel: ASC=f7 ASCQ=f7
kernel: end_request: I/O error, dev sda, sector 390716648
kernel: md: write_disk_sb failed for device sda5
kernel: md: errors occurred during superblock update, repeating
kernel: ATA: abnormal status 0xD0 on port 0x177
......................................
kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
kernel: ata1: status=0xd0 { Busy }
kernel: SCSI error : <0 0 0 0> return code = 0x8000002
kernel: FMK Current sda: sense = 70 9d
kernel: ASC=f7 ASCQ=f7
kernel: end_request: I/O error, dev sda, sector 9992269
kernel: md: write_disk_sb failed for device sda2
kernel: md: errors occurred during superblock update, repeating
kernel: ATA: abnormal status 0xD0 on port 0x177
..................................
I watched these messages for 10 minutes. Only the reset button helped.
Normally the watchdog program (ftp://ibiblio.org/pub/Linux/system/daemons/watchdog/)
runs on my server, but I unloaded it for this test; otherwise it would have reset
the server within a minute of the system locking up (at the beginning of the test).
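
For readers following the log above, here is a minimal sketch of how a status byte such as 0xd0 breaks down into individual bits. It assumes the classic ATA status register layout (later ATA revisions rename or obsolete some of the lower bits), and the function names are illustrative, not taken from the kernel. When the busy bit is set the remaining bits are not considered meaningful, which is presumably why the kernel prints only { Busy } for 0xd0.

```python
# Bit layout of the 8-bit ATA status register, using the classic names.
# Assumption: legacy naming; newer ATA revisions redefine bits 4, 2 and 1.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data
    (0x02, "IDX"),   # index
    (0x01, "ERR"),   # error
]


def decode_ata_status(status):
    """Return the names of all bits set in an 8-bit ATA status value."""
    if not 0 <= status <= 0xFF:
        raise ValueError("status must fit in one byte")
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


def is_busy(status):
    """True if the BSY bit is set; other bits should then be ignored."""
    return bool(status & 0x80)


if __name__ == "__main__":
    # 0xd0 is the value reported in the ata1 timeout messages.
    for value in (0xD0, 0x01):
        print(hex(value), decode_ata_status(value), "busy" if is_busy(value) else "not busy")
```

A status that stays at 0xd0 across repeated timeouts is consistent with a device that never leaves the busy state, which matches the hang described in this comment.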
Same problem here with RHEL4 U2 on an RX100 S3 server.
If I pull out a disk, all disk I/O requests stall, and at no stage does the OS drop
that SATA channel and continue running RAID 1 in degraded mode. The screen
fills up with error messages, as the OS never gives up trying:
Current sdb: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sdb, sector 152103232
ata2: error occurred, port reset
ata2: status=0x01 { Error }
ata2: error=0x01 { AddrMarkNotFound }
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 09 10 e9 41 00 00 06 00
If I come back an hour later and reconnect the hard drive, the system picks
up again where it left off. Strange.
This also appears to happen in Fedora Core 4 with the 2.6.14-1.1656_FC4smp kernel. In my case I have four 320G SATA disks. The data partition is made up of four physical 300G disk partitions in a RAID 5 configuration, and the system is on separate RAID 1 partitions. If I disconnect the SATA cable of one of the disks, I get kernel error messages but no response from the md RAID layer; /proc/mdstat lists no problems with the array. A really nasty thing happens if you try to access a file, for example "cat /data/text-file": the command delays for a while and then returns with no error but with no file contents displayed. This is VERY BAD! If I reset the system, the RAID layer notices that one disk is down, and the system and the data partitions are fine, although the array is in a degraded state, as expected.
Committed in stream U4 build 34.27. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html
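
The Fedora Core 4 comment above relies on /proc/mdstat to see whether md has marked a member as failed. A minimal sketch of automating that check follows. The sample text is hypothetical (it is not output from any system in this report), the function name is illustrative, and the regular expressions assume the usual `[total/active] [UU_]` status field and array names of the form `mdN`. Note that in the failure described here /proc/mdstat reported no problems, so a check like this would only confirm the symptom; it would not have detected the hang.

```python
import re

# Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
# Assumption: arrays are named md<number>.
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(\S+)")
# Member status field, e.g. "[2/1] [_U]" (2 slots, 1 active member).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Hypothetical sample for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid5]
md1 : active raid5 sdd3[3] sdc3[2] sdb3[1] sda3[0]
      937512960 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_flags} for arrays missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            total, active, flags = status.groups()
            if int(active) < int(total) or "_" in flags:
                degraded[current] = flags
            current = None  # status consumed; wait for the next header
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string. An empty result while I/O to a pulled disk is stalling would match the behaviour reported in this bug.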