Bug 157404

Summary:	Loss of SATA ICH device hangs RAID1
Product:	Red Hat Enterprise Linux 4	Reporter:	Igor <sviblovo>
Component:	kernel	Assignee:	Jeff Garzik <jgarzik>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.0	CC:	peterm, spambox, tao, terry1
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:	RHSA-2006-0575	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-08-10 21:05:38 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	181409

Description Igor 2005-05-11 10:01:30 UTC

kernel-2.6.9-5.0.3.EL, ata_piix driver

This bug is like these:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127813
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=120505
But exist in RHEL4 too.
                                                                                
I have discovered this bug three month ago, after installing fedora core 3
(kernel-smp-2.6.10-1.741_FC3.i686.rpm). Search in Bugzilla has shown presence of
this bug in RHEL3, where it was successfuly closed. Then I have installed RHEL4
with kernel-2.6.9-5.0.3.EL expecting the good results, but nothing has changed.
I yet did not run tests for last kernel-2.6.9-5.0.5.EL (it is hard for
production server), but something prompts me, that results will be same. Has
already passed three months, and I have once again looked in Bugzilla for this
bug, and was surprised, that else nobody reported on it for RHEL4.
                                                                                
My hardware:
MB SuperMicro X6DVA-EG dual Xeon
http://www.supermicro.com/products/motherboard/Xeon800/E7320/X6DVA-EG.cfm
E7320 MCH , 6300ESB ICH - hance rapids (ICH5 like)
http://www.intel.com/design/chipsets/embedded/6300esb.htm
 
lspci show
00:1f.2 IDE interface: Intel Corp. 6300ESB SATA Storage Controller (rev 02)
 
2 hdd WD2000JD sata 200Gb
 
Case with hotswap for 6 sata hdd
I do not expect working hotswap, but it is necessary for me, that the system
continued to function in case of failure of one hdd.
 
If drive is pulled from case (simulating failure), it causes system locks.
I copied messages from console manually so typing errors are possible.
  
ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
ata1: status=0xd0 {Busy}
SCSI error: <0 0 0 0> return code = 0x8000002
EOM   Current sda: sense = 70 5e
ASC=82 ASCQ=f8
end_request: I/O eror, dev sda, sector xxxxxx
md: write_disk_sb failed for device sda2
md: errors occurred during superblock update, repeating
ATA: abnormal status 0xD0 on port 0x177
  
This messages repeated approximately once a minute.
About 10 minutes i tried to get control above system, but unsuccessfully.
Only pressing the reset button has helped.


Reproducibility: every time

Steps to Reproduce:

1. Use MB with 6300ESB ICH (may be ICH5/5R) with SATA hard drives. 
2. Load EL4 w/ ATAPIIX module on SATA drives and configure it for software raid1. 
3. Pull power/data cables from SATA drive

 
Actual results: View timeouts/system lock

Expected results: No system lock, only few error messages in logs and
 email from mdadm monitor.

Additional info:

/var/log/dmesg
........................
Feb 14 14:55:23 gw3 kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Feb 14 14:55:23 gw3 kernel: ide: Assuming 33MHz system bus speed for PIO modes;
override with idebus=xx
Feb 14 14:55:23 gw3 kernel: hda: CDU5211, ATAPI CD/DVD-ROM drive
Feb 14 14:55:23 gw3 kernel: ide1: I/O resource 0x170-0x177 not free.
Feb 14 14:55:23 gw3 kernel: ide1: ports already in use, skipping probe
Feb 14 14:55:23 gw3 kernel: elevator: using anticipatory as default io
schedulerFeb 14 14:55:23 gw3 kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Feb 14 14:55:23 gw3 kernel: hda: ATAPI 52X CD-ROM drive, 120kB Cache
Feb 14 14:55:23 gw3 kernel: Uniform CD-ROM driver Revision: 3.20
........................
Feb 14 14:55:23 gw3 kernel: SCSI subsystem initialized
Feb 14 14:55:23 gw3 kernel: ata_piix: combined mode detected
Feb 14 14:55:23 gw3 kernel: ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18
(level, low) -> IRQ 169
Feb 14 14:55:23 gw3 kernel: ata: 0x1f0 IDE port busy
Feb 14 14:55:23 gw3 kernel: ata1: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma
0xFC08 irq 15
Feb 14 14:55:23 gw3 kernel: ata1: dev 0 ATA, max UDMA/100, 390721968 sectors: lba48
Feb 14 14:55:23 gw3 kernel: ata1: dev 1 ATA, max UDMA/133, 390721968 sectors: lba48
Feb 14 14:55:23 gw3 kernel: ata1: dev 0 configured for UDMA/100
Feb 14 14:55:23 gw3 kernel: ata1: dev 1 configured for UDMA/100
Feb 14 14:55:23 gw3 kernel: scsi0 : ata_piix
Feb 14 14:55:23 gw3 kernel:   Vendor: ATA       Model: WDC WD2000JD-00G  Rev: 02.0
Feb 14 14:55:23 gw3 kernel:   Type:   Direct-Access                      ANSI
SCSI revision: 05
Feb 14 14:55:23 gw3 kernel: SCSI device sda: 390721968 512-byte hdwr sectors
(200050 MB)
Feb 14 14:55:23 gw3 kernel: SCSI device sda: drive cache: write back
Feb 14 14:55:23 gw3 kernel: SCSI device sda: 390721968 512-byte hdwr sectors
(200050 MB)
Feb 14 14:55:23 gw3 kernel: SCSI device sda: drive cache: write back
Feb 14 14:55:23 gw3 kernel:  sda: sda1 sda2 sda3 sda4 < sda5 >
Feb 14 14:55:23 gw3 kernel: Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Feb 14 14:55:23 gw3 kernel:   Vendor: ATA       Model: WDC WD2000JD-00H  Rev: 08.0
Feb 14 14:55:23 gw3 kernel:   Type:   Direct-Access                      ANSI
SCSI revision: 05
Feb 14 14:55:23 gw3 kernel: SCSI device sdb: 390721968 512-byte 3 gw3 kernel:
SCSI device sdb: drive cache: write back
Feb 14 14:55:23 gw3 kernel: SCSI device sdb: 390721968 512-byte hdwr sectors
(200050 MB)
Feb 14 14:55:23 gw3 kernel: SCSI device sdb: drive cache: write back
Feb 14 14:55:23 gw3 kernel:  sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 >
Feb 14 14:55:23 gw3 kernel: Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Feb 14 14:55:23 gw3 kernel: device-mapper: 4.3.0-ioctl (2004-09-30) initialised:
dm-devel
Feb 14 14:55:23 gw3 kernel: md: raid1 personality registered as nr 3
Feb 14 14:55:23 gw3 kernel: md: Autodetecting RAID arrays.
Feb 14 14:55:23 gw3 kernel: md: autorun ...
Feb 14 14:55:23 gw3 kernel: md: considering sdb5 ...
Feb 14 14:55:23 gw3 kernel: md:  adding sdb5 ...
Feb 14 14:55:23 gw3 kernel: md: sdb3 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md: sdb2 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md: sdb1 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md:  adding sda5 ...
Feb 14 14:55:23 gw3 kernel: md: sda3 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md: sda2 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md: sda1 has different UUID to sdb5
Feb 14 14:55:23 gw3 kernel: md: created md0
Feb 14 14:55:23 gw3 kernel: md: bind<sda5>
Feb 14 14:55:23 gw3 kernel: md: bind<sdb5>
Feb 14 14:55:23 gw3 kernel: md: running: <sdb5><sda5>
Feb 14 14:55:24 gw3 kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Feb 14 14:55:24 gw3 kernel: md: considering sdb3 ...
Feb 14 14:55:24 gw3 kernel: md:  adding sdb3 ...
Feb 14 14:55:24 gw3 kernel: md: sdb2 has different UUID to sdb3
Feb 14 14:55:24 gw3 kernel: md: sdb1 has different UUID to sdb3
Feb 14 14:55:24 gw3 kernel: md:  adding sda3 ...
Feb 14 14:55:24 gw3 kernel: md: sda2 has different UUID to sdb3
Feb 14 14:55:24 gw3 kernel: md: sda1 has different UUID to sdb3
Feb 14 14:55:24 gw3 kernel: md: created md2
Feb 14 14:55:24 gw3 kernel: md: bind<sda3>
Feb 14 14:55:24 gw3 kernel: md: bind<sdb3>
Feb 14 14:55:24 gw3 kernel: md: running: <sdb3><sda3>
Feb 14 14:55:24 gw3 kernel: raid1: raid set md2 active with 2 out of 2 mirrors
Feb 14 14:55:24 gw3 kernel: md: considering sdb2 ...
Feb 14 14:55:24 gw3 kernel: md:  adding sdb2 ...
Feb 14 14:55:24 gw3 kernel: md: sdb1 has different UUID to sdb2
Feb 14 14:55:24 gw3 kernel: md:  adding sda2 ...
Feb 14 14:55:24 gw3 kernel: md: sda1 has different UUID to sdb2
Feb 14 14:55:24 gw3 kernel: md: created md4
Feb 14 14:55:24 gw3 kernel: md: bind<sda2>
Feb 14 14:55:24 gw3 kernel: md: bind<sdb2>
Feb 14 14:55:24 gw3 kernel: md: running: <sdb2><sda2>
Feb 14 14:55:24 gw3 kernel: raid1: raid set md4 active with 2 out of 2 mirrors
Feb 14 14:55:24 gw3 kernel: md: considering sdb1 ...
Feb 14 14:55:24 gw3 kernel: md:  adding sdb1 ...
Feb 14 14:55:24 gw3 kernel: md:  adding sda1 ...
Feb 14 14:55:24 gw3 kernel: md: created md5
Feb 14 14:55:24 gw3 kernel: md: bind<sda1>
Feb 14 14:55:24 gw3 kernel: md: bind<sdb1>
Feb 14 14:55:24 gw3 kernel: md: running: <sdb1><sda1>
Feb 14 14:55:24 gw3 kernel: raid1: raid set md5 active with 2 out of 2 mirrors
Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE.
Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays.
Feb 14 14:55:24 gw3 kernel: md: autorun ...
Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE.
Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays.
Feb 14 14:55:24 gw3 kernel: md: autorun ...
Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE.
Feb 14 14:55:24 gw3 kernel: md: Autodetecting RAID arrays.
Feb 14 14:55:24 gw3 kernel: md: autorun ...
Feb 14 14:55:24 gw3 kernel: md: ... autorun DONE.
Feb 14 14:55:24 gw3 kernel: kjournald starting.  Commit interval 5 seconds
Feb 14 14:55:24 gw3 kernel: EXT3-fs: mounted filesystem with ordered data mode.
...........................................




# cat /etc/mdadm.conf
DEVICE partitions
MAILADDR root
ARRAY /dev/md4 super-minor=4
ARRAY /dev/md5 super-minor=5
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md2 super-minor=2
                                                                               
                                                                                 
# cat /etc/fstab
/dev/md4                /                       ext3    defaults        1 1
/dev/md5                /boot                   ext3    defaults        1 2
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /dev/shm                tmpfs   defaults        0 0
none                    /proc                   proc    defaults        0 0
none                    /sys                    sysfs   defaults        0 0
/dev/vg1/lv_tmp         /tmp                    ext3    defaults        0 2
/dev/vg1/lv_tmp2        /var/tmp                ext3    defaults        0 2
/dev/vg1/lv_log         /var/log                ext3    defaults        0 2
/dev/md2                swap                    swap    defaults        0 0
/dev/vg1/lv_archive     /archive                ext3    defaults        0 2
 
# cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
                                                                               
                                                                                 
md4 : active raid1 sdb2[1] sda2[0]
      4891712 blocks [2/2] [UU]
                                                                               
                                                                                 
md2 : active raid1 sdb3[1] sda3[0]
      987904 blocks [2/2] [UU]
                                                                               
                                                                                 
md0 : active raid1 sdb5[1] sda5[0]
      189374080 blocks [2/2] [UU]
                                                                               
                                                                                 
unused devices: <none>
                                                                               
                                                                                 
                                                                               
                                                                                 
                                                                               
                                                                                 
# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat Jan 29 18:42:02 2005
     Raid Level : raid1
     Array Size : 189374080 (180.60 GiB 193.92 GB)
    Device Size : 189374080 (180.60 GiB 193.92 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
                                                                               
                                                                                 
    Update Time : Sun Feb 20 21:56:43 2005
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5
           UUID : 478c1cfd:9d4db00f:7f41192f:b6922e8c
         Events : 0.145143
#

Comment 1 Igor 2005-05-18 09:18:12 UTC


I done test with latest kernel-2.6.9-5.0.5.EL. Result same. 

kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
kernel: ata1: status=0xd0 { Busy }
kernel: SCSI error : <0 0 0 0> return code = 0x8000002
kernel: FMK Current sda: sense = 70 9d
kernel: ASC=f7 ASCQ=f7
kernel: end_request: I/O error, dev sda, sector 390716648
kernel: md: write_disk_sb failed for device sda5
kernel: md: errors occurred during superblock update, repeating
kernel: ATA: abnormal status 0xD0 on port 0x177

......................................

kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
kernel: ata1: status=0xd0 { Busy }
kernel: SCSI error : <0 0 0 0> return code = 0x8000002
kernel: FMK Current sda: sense = 70 9d
kernel: ASC=f7 ASCQ=f7
kernel: end_request: I/O error, dev sda, sector 9992269
kernel: md: write_disk_sb failed for device sda2
kernel: md: errors occurred during superblock update, repeating
kernel: ATA: abnormal status 0xD0 on port 0x177

..................................

I watch this messages 10 minutes. Only reset button has helped. 
Usually program watchdog (ftp://ibiblio.org/pub/Linux/system/daemons/watchdog/)
works on my server, but I have unloaded it, differently it would reset 
server in a minute after system lock (the beginning of testing).

Comment 8 Stephen 2006-01-20 15:45:17 UTC

Same problem here with RHEL4 u2 and RX100 S3 server.

If I pull out a disk, all disk I/O requests stall, and at no stage the OS drops
that SATA channel and continues running RAID 1 in degraded mode. The screen
fills up with error messages, as the OS never gives up trying:

Current sdb: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sdb, sector 152103232
ata2: error occurred, port reset
ata2: status=0x01 { Error }
ata2: error=0x01 { AddrMarkNotFound }
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 09 10 e9 41 00 00 06
00

If I come back an hour later and reconnect the harddrive, then the system picks
up again where it left off. Strange.

Comment 9 Terry Barnaby 2006-02-01 19:26:22 UTC

This also appears to happen in Fedora Core 4 with the 2.6.14-1.1656_FC4smp kernel.
In my case I have 4 320G SATA disks. I have a data partition made up of 4
physical 300G disk partitions in a RAID 5 configuration. The system is on
separate RAID 1 partitions.

If I disconnect one of the disks SATA cable I get kernel error messages but no
response from the md Raid layer. /proc/mdstat lists no problems with the array.
A really nasty thing happens if you try and access a file, for example "cat
/data/text-file". The command delays for a while and then returns with no error
but with no file contents displayed ... This is VERY BAD !

If I reset the system the RAID layer notices one disk is down and the system,
together with data partions are fine although the array is in a degraded state
as expected.

Comment 10 Jason Baron 2006-05-01 15:59:01 UTC

committed in stream U4 build 34.27. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 13 Red Hat Bugzilla 2006-08-10 21:05:53 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html