Bug 749036

Summary: Drive failure in raid10 array locks up system
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.7
Hardware: i686
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED WONTFIX
Reporter: Orion Poplawski <orion>
Assignee: Jes Sorensen <Jes.Sorensen>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: dledford, jarod, Jes.Sorensen
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2013-02-08 06:27:18 UTC

Description Orion Poplawski 2011-10-25 22:12:35 UTC
Description of problem:

I have a Dell PowerEdge 2650 with an Adaptec AHA-3960D / AIC-7899A U160/m controller connected to 5 drives, 4 of which are in a raid10 array.  Every time a drive has been kicked from the array (3-4 times so far), the machine locks up.  It will still respond a little, but anything that needs to access the disk and/or the VFS hangs.  No record of the md failure makes it into /var/log/messages.

Version-Release number of selected component (if applicable):
kernel-2.6.18-274.3.1.el5

Additional info:

# cat /proc/mdstat
Personalities : [raid10] [raid1] 
md0 : active raid1 sda1[0] sdb1[1]
      128384 blocks [2/2] [UU]
      
md1 : active raid10 sda2[4] sdd1[3] sdc1[2] sdb2[1]
      143106560 blocks 256K chunks 2 near-copies [4/3] [_UUU]
      [>....................]  recovery =  0.1% (76800/71553280) finish=797.3min speed=1492K/sec
      
unused devices: <none>
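
(For reference, the "[4/3] [_UUU]" above is the raid10 array running degraded on 3 of its 4 members while the re-added sda2 rebuilds into the empty slot.  A minimal sketch of how to get a fuller per-member view, using the device names from this report:)

# mdadm --detail /dev/md1
# mdadm --examine /dev/sda2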

Comment 1 Orion Poplawski 2011-10-27 16:08:52 UTC
Well, it made a liar of me.  Today it actually worked with 2.6.18-274.7.1.el5.

sd 0:0:0:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x1 0xfb 0xee 0xd8 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x8
Card was paused
ACCUM = 0x0, SINDEX = 0x10, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0xe
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] 
LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0] 
SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8] 
SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4] 
SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89] 
STACK: 0x0 0x164 0x10a 0x3
SCB count = 24
Kernel NEXTQSCB = 4
Card NEXTQSCB = 4
QINFIFO entries: 
Waiting Queue entries: 
Disconnected Queue entries: 8:6 
QOUTFIFO entries: 
Sequencer Free SCB List: 14 3 17 0 1 13 12 5 16 19 11 4 2 10 18 6 9 7 15 20 21 22 23 24 25 26 27 28 29 30 31 
Sequencer SCB Info: 
  0 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
  1 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
  2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff] 
  3 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
  4 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
  5 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
  6 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
  7 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
  8 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x6] 
  9 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 10 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 11 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 12 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
 13 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 14 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
 15 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 16 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff] 
 17 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
 18 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 19 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Pending list: 
  6 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0] 
Kernel Free SCB list: 16 9 7 0 1 12 18 10 8 17 3 19 5 11 2 14 15 23 13 22 21 20 

<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:0:0): Abort Tag Message Sent
(scsi0:A:0:0): SCB 6 - Abort Tag Completed.
Recovery SCB completes
Recovery code awake
aic7xxx_abort returns 0x2002
sd 0:0:0:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x1 0xfb 0xee 0x20 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x8
Card was paused
ACCUM = 0x0, SINDEX = 0x17, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0xb
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] 
LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0] 
SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8] 
SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4] 
SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89] 
STACK: 0xe2 0x164 0x10a 0x3
SCB count = 24
Kernel NEXTQSCB = 8
Card NEXTQSCB = 8
QINFIFO entries: 
Waiting Queue entries: 
Disconnected Queue entries: 0:14 
QOUTFIFO entries: 
Sequencer Free SCB List: 11 7 4 15 12 19 18 9 1 14 5 13 8 10 6 3 17 2 16 20 21 22 23 24 25 26 27 28 29 30 31 
Sequencer SCB Info: 
  0 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xe] 
  1 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
  2 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
  3 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
  4 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
  5 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
  6 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
  7 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
  8 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
  9 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff] 
 10 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
 11 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
 12 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 13 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff] 
 14 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff] 
 15 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff] 
 16 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 17 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 18 SCB_CONTROL[0xe2] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff] 
 19 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff] 
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff] 
Pending list: 
 14 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0] 
Kernel Free SCB list: 23 6 10 1 4 18 13 3 17 16 5 12 15 0 11 9 7 19 2 22 21 20 

<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:0:0): Abort Tag Message Sent
Timer Expired
Recovery code awake
aic7xxx_abort returns 0x2003
sd 0:0:0:0: Attempting to queue a TARGET RESET message
CDB: 0x28 0x0 0x1 0xfb 0xee 0x20 0x0 0x0 0x8 0x0
aic7xxx_dev_reset returns 0x2003
Recovery SCB completes
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: rejecting I/O to offline device
raid10: sda2: rescheduling sector 66061328
sd 0:0:0:0: rejecting I/O to offline device
raid10: Disk failure on sda2, disabling device. 
        Operation continuing on 3 devices
RAID10 conf printout:
 --- wd:3 rd:4
 disk 0, wo:1, o:0, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
 disk 2, wo:0, o:1, dev:sdc1
 disk 3, wo:0, o:1, dev:sdd1
RAID10 conf printout:
 --- wd:3 rd:4
 disk 1, wo:0, o:1, dev:sdb2
 disk 2, wo:0, o:1, dev:sdc1
 disk 3, wo:0, o:1, dev:sdd1
raid10: sdb2: redirecting sector 66061328 to another mirror
sd 0:0:0:0: rejecting I/O to offline device
raid1: sda1: rescheduling sector 1024
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
raid1: Disk failure on sda1, disabling device. 
        Operation continuing on 1 devices
raid1: sdb1: redirecting sector 1024 to another mirror
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sda1
 disk 1, wo:0, o:1, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sdb1

Comment 2 Jes Sorensen 2011-12-07 14:58:59 UTC
Orion,

I am curious whether the problem also happens if you soft-remove the drive,
as opposed to it getting kicked out by the controller.
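
For reference, a soft removal can be done from the md side with mdadm's
manage mode -- a minimal sketch, using the member names from your report:
--fail marks the member faulty with no controller involvement, --remove
takes it out of the array, and --add re-adds it later to start a rebuild.

# mdadm /dev/md1 --fail /dev/sda2
# mdadm /dev/md1 --remove /dev/sda2
# mdadm /dev/md1 --add /dev/sda2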

When you say things are hanging, does this only include tasks trying to
access data on the raid, or do you mean the entire system is locking
up?
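
One way to tell the two cases apart the next time it happens is to list the
tasks stuck in uninterruptible sleep (D state), e.g.:

# ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

If only tasks touching the raid show up there while everything else keeps
running, the lockup is confined to the I/O paths behind the failed device.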

I have been trying to reproduce this on another system running RHEL 5, but
with SATA drives.  So far I haven't been able to reproduce what you are
reporting, even when running heavy I/O to the raid while kicking a disk
offline.

Cheers,
Jes

Comment 3 Orion Poplawski 2011-12-07 16:10:08 UTC
Unfortunately, I can't really play around with this system as it is our primary server (email/ldap/nfs/you name it).  I have duplicate hardware, but not enough disks at the moment.  I'll try to set that up at some point soon.  I'm thinking that it must have something to do with the SCSI hardware/driver, otherwise a *lot* more people would be complaining :).  The system does not completely hang, but any task that needs disk I/O hangs.

Thank you for looking into it, but I suspect I'm going to have to debug/reproduce it myself.  Any suggestions for capturing good debug info the next time it happens?
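
One thing I could probably set up ahead of time is netconsole plus SysRq, so the kernel messages and task backtraces survive even when local disk I/O hangs -- something like the following (the IPs and MAC are placeholders for our log host):

# modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:11:22:33:44:55
# echo 1 > /proc/sys/kernel/sysrq
# echo t > /proc/sysrq-trigger

netconsole sends kernel messages over UDP to another machine, so the md and aic7xxx errors would survive even if /var/log/messages can't be written, and sysrq-t dumps backtraces of all tasks, which should show where the hung I/O tasks are blocked.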

Comment 4 Jes Sorensen 2011-12-15 13:56:15 UTC
Orion,

I completely understand - I suspected it was a vital system for you.

I agree with you that this is most likely a SCSI driver issue rather than
a RAID issue, but we lack hard evidence for that right now.

If you have a second system and are able to reproduce the problem on it in
a test scenario, it would be very valuable to see whether you can reproduce
the same lockups without the drives running in a RAID setup.
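
One way to do that would be to drive heavy I/O straight at one of the drives,
bypassing md, and then fail the drive at the controller while it runs --
something along the lines of (the device name is just an example, and writes
would of course be destructive, so stick to reads or use a scratch disk):

# dd if=/dev/sdX of=/dev/null bs=1M iflag=direct

If the aic7xxx error recovery alone stalls other I/O on the bus, that would
point at the SCSI layer rather than md.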

Comment 5 Jes Sorensen 2013-02-07 16:03:12 UTC
Orion,

We haven't gotten any further on this for well over a year now - do you still
consider it to be an issue, or can we close this bugzilla?

Thanks,
Jes

Comment 6 Orion Poplawski 2013-02-07 18:57:53 UTC
We're no longer running this hardware and OS version, so please close.  Thanks!

Comment 7 Jes Sorensen 2013-02-08 06:27:18 UTC
Thanks for the update!

Closing