kernel-2.6.9-5.0.3.EL, ata_piix driver This bug is like these: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127813 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=120505 But exist in RHEL4 too. I have discovered this bug three month ago, after installing fedora core 3 (kernel-smp-2.6.10-1.741_FC3.i686.rpm). Search in Bugzilla has shown presence of this bug in RHEL3, where it was successfuly closed. Then I have installed RHEL4 with kernel-2.6.9-5.0.3.EL expecting the good results, but nothing has changed. I yet did not run tests for last kernel-2.6.9-5.0.5.EL (it is hard for production server), but something prompts me, that results will be same. Has already passed three months, and I have once again looked in Bugzilla for this bug, and was surprised, that else nobody reported on it for RHEL4. My hardware: MB SuperMicro X6DVA-EG dual Xeon http://www.supermicro.com/products/motherboard/Xeon800/E7320/X6DVA-EG.cfm E7320 MCH , 6300ESB ICH - hance rapids (ICH5 like) http://www.intel.com/design/chipsets/embedded/6300esb.htm lspci show 00:1f.2 IDE interface: Intel Corp. 6300ESB SATA Storage Controller (rev 02) 2 hdd WD2000JD sata 200Gb Case with hotswap for 6 sata hdd I do not expect working hotswap, but it is necessary for me, that the system continued to function in case of failure of one hdd. If drive is pulled from case (simulating failure), it causes system locks. I copied messages from console manually so typing errors are possible. ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61 ata1: status=0xd0 {Busy} SCSI error: <0 0 0 0> return code = 0x8000002 EOM Current sda: sense = 70 5e ASC=82 ASCQ=f8 end_request: I/O eror, dev sda, sector xxxxxx md: write_disk_sb failed for device sda2 md: errors occurred during superblock update, repeating ATA: abnormal status 0xD0 on port 0x177 This messages repeated approximately once a minute. About 10 minutes i tried to get control above system, but unsuccessfully. Only pressing the reset button has helped. Reproducibility: every time Steps to Reproduce: 1. Use MB with 6300ESB ICH (may be ICH5/5R) with SATA hard drives. 2. Load EL4 w/ ATAPIIX module on SATA drives and configure it for software raid1. 3. Pull power/data cables from SATA drive Actual results: View timeouts/system lock Expected results: No system lock, only few error messages in logs and email from mdadm monitor. Additional info: /var/log/dmesg ........................ Feb 14 14:55:23 gw3 kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 Feb 14 14:55:23 gw3 kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx Feb 14 14:55:23 gw3 kernel: hda: CDU5211, ATAPI CD/DVD-ROM drive Feb 14 14:55:23 gw3 kernel: ide1: I/O resource 0x170-0x177 not free. Feb 14 14:55:23 gw3 kernel: ide1: ports already in use, skipping probe Feb 14 14:55:23 gw3 kernel: elevator: using anticipatory as default io schedulerFeb 14 14:55:23 gw3 kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 Feb 14 14:55:23 gw3 kernel: hda: ATAPI 52X CD-ROM drive, 120kB Cache Feb 14 14:55:23 gw3 kernel: Uniform CD-ROM driver Revision: 3.20 ........................ Feb 14 14:55:23 gw3 kernel: SCSI subsystem initialized Feb 14 14:55:23 gw3 kernel: ata_piix: combined mode detected Feb 14 14:55:23 gw3 kernel: ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 169 Feb 14 14:55:23 gw3 kernel: ata: 0x1f0 IDE port busy Feb 14 14:55:23 gw3 kernel: ata1: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xFC08 irq 15 Feb 14 14:55:23 gw3 kernel: ata1: dev 0 ATA, max UDMA/100, 390721968 sectors: lba48 Feb 14 14:55:23 gw3 kernel: ata1: dev 1 ATA, max UDMA/133, 390721968 sectors: lba48 Feb 14 14:55:23 gw3 kernel: ata1: dev 0 configured for UDMA/100 Feb 14 14:55:23 gw3 kernel: ata1: dev 1 configured for UDMA/100 Feb 14 14:55:23 gw3 kernel: scsi0 : ata_piix Feb 14 14:55:23 gw3 kernel: Vendor: ATA Model: WDC WD2000JD-00G Rev: 02.0 Feb 14 14:55:23 gw3 kernel: Type: Direct-Access ANSI SCSI revision: 05 Feb 14 14:55:23 gw3 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB) Feb 14 14:55:23 gw3 kernel: SCSI device sda: drive cache: write back Feb 14 14:55:23 gw3 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB) Feb 14 14:55:23 gw3 kernel: SCSI device sda: drive cache: write back Feb 14 14:55:23 gw3 kernel: sda: sda1 sda2 sda3 sda4 < sda5 > Feb 14 14:55:23 gw3 kernel: Attached scsi disk sda at scsi0, channel 0, id 0, lun 0 Feb 14 14:55:23 gw3 kernel: Vendor: ATA Model: WDC WD2000JD-00H Rev: 08.0 Feb 14 14:55:23 gw3 kernel: Type: Direct-Access ANSI SCSI revision: 05 Feb 14 14:55:23 gw3 kernel: SCSI device sdb: 390721968 512-byte 3 gw3 kernel: SCSI device sdb: drive cache: write back Feb 14 14:55:23 gw3 kernel: SCSI device sdb: 390721968 512-byte hdwr sectors (200050 MB) Feb 14 14:55:23 gw3 kernel: SCSI device sdb: drive cache: write back Feb 14 14:55:23 gw3 kernel: sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 > Feb 14 14:55:23 gw3 kernel: Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0 Feb 14 14:55:23 gw3 kernel: device-mapper: 4.3.0-ioctl (2004-09-30) initialised: dm-devel Feb 14 14:55:23 gw3 kernel: md: raid1 personality registered as nr 3 Feb 14 14:55:23 gw3 kernel: md: Autodetecting RAID arrays. Feb 14 14:55:23 gw3 kernel: md: autorun ... Feb 14 14:55:23 gw3 kernel: md: considering sdb5 ... Feb 14 14:55:23 gw3 kernel: md: adding sdb5 ... Feb 14 14:55:23 gw3 kernel: md: sdb3 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: sdb2 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: sdb1 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: adding sda5 ... Feb 14 14:55:23 gw3 kernel: md: sda3 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: sda2 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: sda1 has different UUID to sdb5 Feb 14 14:55:23 gw3 kernel: md: created md0 Feb 14 14:55:23 gw3 kernel: md: bind<sda5> Feb 14 14:55:23 gw3 kernel: md: bind<sdb5> Feb 14 14:55:23 gw3 kernel: md: running: <sdb5><sda5> Feb 14 14:55:24 gw3 kernel: raid1: raid set md0 active with 2 out of 2 mirrors Feb 14 14:55:24 gw3 kernel: md: considering sdb3 ... Feb 14 14:55:24 gw3 kernel: md: adding sdb3 ... Feb 14 14:55:24 gw3 kernel: md: sdb2 has different UUID to sdb3 Feb 14 14:55:24 gw3 kernel: md: sdb1 has different UUID to sdb3 Feb 14 14:55:24 gw3 kernel: md: adding sda3 ... Feb 14 14:55:24 gw3 kernel: md: sda2 has different UUID to sdb3 Feb 14 14:55:24 gw3 kernel: md: sda1 has different UUID to sdb3 Feb 14 14:55:24 gw3 kernel: md: created md2 Feb 14 14:55:24 gw3 kernel: md: bind<sda3> Feb 14 14:55:24 gw3 kernel: md: bind<sdb3> Feb 14 14:55:24 gw3 kernel: md: running: <sdb3><sda3> Feb 14 14:55:24 gw3 kernel: raid1: raid set md2 active with 2 out of 2 mirrors Feb 14 14:55:24 gw3 kernel: md: considering sdb2 ... Feb 14 14:55:24 gw3 kernel: md: adding sdb2 ... Feb 14 14:55:24 gw3 kernel: md: sdb1 has different UUID to sdb2 Feb 14 14:55:24 gw3 kernel: md: adding sda2 ... Feb 14 14:55:24 gw3 kernel: md: sda1 has different UUID to sdb2 Feb 14 14:55:24 gw3 kernel: md: created md4 Feb 14 14:55:24 gw3 kernel: md: bind<sda2> Feb 14 14:55:24 gw3 kernel: md: bind<sdb2> Feb 14 14:55:24 gw3 kernel: md: running: <sdb2><sda2> Feb 14 14:55:24 gw3 kernel: raid1: raid set md4 active with 2 out of 2 mirrors Feb 14 14:55:24 gw3 kernel: md: considering sdb1 ... Feb 14 14:55:24 gw3 kernel: md: adding sdb1 ... Feb 14 14:55:24 gw3 kernel: md: adding sda1 ... Feb 14 14:55:24 gw3 kernel: md: created md5 Feb 14 14:55:24 gw3 kernel: md: bind<sda1> Feb 14 14:55:24 gw3 kernel: md: bind<sdb1> Feb 14 14:55:24 gw3 kernel: md: running: <sdb1><sda1> Feb 14 14:55:24 gw3 kernel: raid1: raid set md5 active with 2 out of 2 mirrors Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE. Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays. Feb 14 14:55:24 gw3 kernel: md: autorun ... Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE. Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays. Feb 14 14:55:24 gw3 kernel: md: autorun ... Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE. Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays. Feb 14 14:55:24 gw3 kernel: md: autorun ... Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE. Feb 14 14:55:24 gw3 kernel: kjournald starting. Commit interval 5 seconds Feb 14 14:55:24 gw3 kernel: EXT3-fs: mounted filesystem with ordered data mode. ........................................... # cat /etc/mdadm.conf DEVICE partitions MAILADDR root ARRAY /dev/md4 super-minor=4 ARRAY /dev/md5 super-minor=5 ARRAY /dev/md0 super-minor=0 ARRAY /dev/md2 super-minor=2 # cat /etc/fstab /dev/md4 / ext3 defaults 1 1 /dev/md5 /boot ext3 defaults 1 2 none /dev/pts devpts gid=5,mode=620 0 0 none /dev/shm tmpfs defaults 0 0 none /proc proc defaults 0 0 none /sys sysfs defaults 0 0 /dev/vg1/lv_tmp /tmp ext3 defaults 0 2 /dev/vg1/lv_tmp2 /var/tmp ext3 defaults 0 2 /dev/vg1/lv_log /var/log ext3 defaults 0 2 /dev/md2 swap swap defaults 0 0 /dev/vg1/lv_archive /archive ext3 defaults 0 2 # cat /proc/mdstat Personalities : [raid1] md5 : active raid1 sdb1[1] sda1[0] 104320 blocks [2/2] [UU] md4 : active raid1 sdb2[1] sda2[0] 4891712 blocks [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 987904 blocks [2/2] [UU] md0 : active raid1 sdb5[1] sda5[0] 189374080 blocks [2/2] [UU] unused devices: <none> # mdadm -D /dev/md0 /dev/md0: Version : 00.90.01 Creation Time : Sat Jan 29 18:42:02 2005 Raid Level : raid1 Array Size : 189374080 (180.60 GiB 193.92 GB) Device Size : 189374080 (180.60 GiB 193.92 GB) Raid Devices : 2 Total Devices : 2 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sun Feb 20 21:56:43 2005 State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Number Major Minor RaidDevice State 0 8 5 0 active sync /dev/sda5 1 8 21 1 active sync /dev/sdb5 UUID : 478c1cfd:9d4db00f:7f41192f:b6922e8c Events : 0.145143 #
I done test with latest kernel-2.6.9-5.0.5.EL. Result same. kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61 kernel: ata1: status=0xd0 { Busy } kernel: SCSI error : <0 0 0 0> return code = 0x8000002 kernel: FMK Current sda: sense = 70 9d kernel: ASC=f7 ASCQ=f7 kernel: end_request: I/O error, dev sda, sector 390716648 kernel: md: write_disk_sb failed for device sda5 kernel: md: errors occurred during superblock update, repeating kernel: ATA: abnormal status 0xD0 on port 0x177 ...................................... kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61 kernel: ata1: status=0xd0 { Busy } kernel: SCSI error : <0 0 0 0> return code = 0x8000002 kernel: FMK Current sda: sense = 70 9d kernel: ASC=f7 ASCQ=f7 kernel: end_request: I/O error, dev sda, sector 9992269 kernel: md: write_disk_sb failed for device sda2 kernel: md: errors occurred during superblock update, repeating kernel: ATA: abnormal status 0xD0 on port 0x177 .................................. I watch this messages 10 minutes. Only reset button has helped. Usually program watchdog (ftp://ibiblio.org/pub/Linux/system/daemons/watchdog/) works on my server, but I have unloaded it, differently it would reset server in a minute after system lock (the beginning of testing).
Same problem here with RHEL4 u2 and RX100 S3 server. If I pull out a disk, all disk I/O requests stall, and at no stage the OS drops that SATA channel and continues running RAID 1 in degraded mode. The screen fills up with error messages, as the OS never gives up trying: Current sdb: sense key Medium Error Additional sense: Write error - auto reallocation failed end_request: I/O error, dev sdb, sector 152103232 ata2: error occurred, port reset ata2: status=0x01 { Error } ata2: error=0x01 { AddrMarkNotFound } scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 09 10 e9 41 00 00 06 00 If I come back an hour later and reconnect the harddrive, then the system picks up again where it left off. Strange.
This also appears to happen in Fedora Core 4 with the 2.6.14-1.1656_FC4smp kernel. In my case I have 4 320G SATA disks. I have a data partition made up of 4 physical 300G disk partitions in a RAID 5 configuration. The system is on separate RAID 1 partitions. If I disconnect one of the disks SATA cable I get kernel error messages but no response from the md Raid layer. /proc/mdstat lists no problems with the array. A really nasty thing happens if you try and access a file, for example "cat /data/text-file". The command delays for a while and then returns with no error but with no file contents displayed ... This is VERY BAD ! If I reset the system the RAID layer notices one disk is down and the system, together with data partions are fine although the array is in a degraded state as expected.
committed in stream U4 build 34.27. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html