Description of problem:
I have a Dell PowerEdge 2650 with an Adaptec AHA-3960D / AIC-7899A U160/m controller connected to 5 drives, 4 of which are in a raid10 array. Every time a drive has been kicked from the array (about 3-4 times so far), the machine locks up. It will respond a little, but anything that appears to need to access the disk and/or VFS will hang. No record of the md failure makes it into /var/log/messages.

Version-Release number of selected component (if applicable):
kernel-2.6.18-274.3.1.el5

Additional info:
# cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      128384 blocks [2/2] [UU]

md1 : active raid10 sda2[4] sdd1[3] sdc1[2] sdb2[1]
      143106560 blocks 256K chunks 2 near-copies [4/3] [_UUU]
      [>....................]  recovery =  0.1% (76800/71553280) finish=797.3min speed=1492K/sec

unused devices: <none>
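(Nothing reaches /var/log/messages once the hang starts, so if it helps, the md events could be reported off this box with mdadm's monitor mode next time. A minimal sketch - the mail address and log host below are only placeholders:

  # mdadm --monitor --scan --daemonise --mail=root@some-other-host

and/or forward kernel messages to a remote syslog host by adding a line such as "kern.* @loghost" to /etc/syslog.conf and restarting syslog.)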
Well, it made a liar of me. Today it actually worked with 2.6.18-274.7.1.el5.

sd 0:0:0:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x1 0xfb 0xee 0xd8 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x8
Card was paused
ACCUM = 0x0, SINDEX = 0x10, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0xe
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0]
SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8]
SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4]
SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89]
STACK: 0x0 0x164 0x10a 0x3
SCB count = 24
Kernel NEXTQSCB = 4
Card NEXTQSCB = 4
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 8:6
QOUTFIFO entries:
Sequencer Free SCB List: 14 3 17 0 1 13 12 5 16 19 11 4 2 10 18 6 9 7 15 20 21 22 23 24 25 26 27 28 29 30 31
Sequencer SCB Info:
  0 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  1 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
  2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
  3 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  4 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
  5 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
  6 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  7 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
  8 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x6]
  9 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 10 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 11 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 12 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
 13 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 14 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
 15 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 16 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
 17 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
 18 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 19 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Pending list:
  6 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
Kernel Free SCB list: 16 9 7 0 1 12 18 10 8 17 3 19 5 11 2 14 15 23 13 22 21 20
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:0:0): Abort Tag Message Sent
(scsi0:A:0:0): SCB 6 - Abort Tag Completed.
Recovery SCB completes
Recovery code awake
aic7xxx_abort returns 0x2002
sd 0:0:0:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x1 0xfb 0xee 0x20 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x8
Card was paused
ACCUM = 0x0, SINDEX = 0x17, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0xb
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0]
SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8]
SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4]
SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89]
STACK: 0xe2 0x164 0x10a 0x3
SCB count = 24
Kernel NEXTQSCB = 8
Card NEXTQSCB = 8
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 0:14
QOUTFIFO entries:
Sequencer Free SCB List: 11 7 4 15 12 19 18 9 1 14 5 13 8 10 6 3 17 2 16 20 21 22 23 24 25 26 27 28 29 30 31
Sequencer SCB Info:
  0 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xe]
  1 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
  2 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
  3 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
  4 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  5 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
  6 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
  7 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  8 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
  9 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
 10 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
 11 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
 12 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 13 SCB_CONTROL[0xe0] SCB_SCSIID[0x37] SCB_LUN[0x0] SCB_TAG[0xff]
 14 SCB_CONTROL[0xe0] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
 15 SCB_CONTROL[0xe0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
 16 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 17 SCB_CONTROL[0xe0] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 18 SCB_CONTROL[0xe2] SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
 19 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Pending list:
 14 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
Kernel Free SCB list: 23 6 10 1 4 18 13 3 17 16 5 12 15 0 11 9 7 19 2 22 21 20
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:0:0): Abort Tag Message Sent
Timer Expired
Recovery code awake
aic7xxx_abort returns 0x2003
sd 0:0:0:0: Attempting to queue a TARGET RESET message
CDB: 0x28 0x0 0x1 0xfb 0xee 0x20 0x0 0x0 0x8 0x0
aic7xxx_dev_reset returns 0x2003
Recovery SCB completes
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: rejecting I/O to offline device
raid10: sda2: rescheduling sector 66061328
sd 0:0:0:0: rejecting I/O to offline device
raid10: Disk failure on sda2, disabling device.
        Operation continuing on 3 devices
RAID10 conf printout:
 --- wd:3 rd:4
 disk 0, wo:1, o:0, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
 disk 2, wo:0, o:1, dev:sdc1
 disk 3, wo:0, o:1, dev:sdd1
RAID10 conf printout:
 --- wd:3 rd:4
 disk 1, wo:0, o:1, dev:sdb2
 disk 2, wo:0, o:1, dev:sdc1
 disk 3, wo:0, o:1, dev:sdd1
raid10: sdb2: redirecting sector 66061328 to another mirror
sd 0:0:0:0: rejecting I/O to offline device
raid1: sda1: rescheduling sector 1024
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
raid1: Disk failure on sda1, disabling device.
        Operation continuing on 1 devices
raid1: sdb1: redirecting sector 1024 to another mirror
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sda1
 disk 1, wo:0, o:1, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sdb1
Orion,

I am curious whether the problem also happens if you soft-remove the drive, as opposed to it getting kicked out by the controller. When you say things are hanging, do you mean only tasks trying to access data on the raid, or does the entire system lock up?

I have been trying to reproduce this on another system running RHEL5, but based on SATA drives. So far I haven't been able to see what you are reporting, even when running heavy I/O to the raid while kicking a disk offline.

Cheers,
Jes
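PS: By soft-remove I mean failing and removing a member through mdadm rather than waiting for the controller to drop it - roughly along these lines, with the array and device names only as an example:

  # mdadm /dev/md1 --fail /dev/sdd1     (mark the member faulty)
  # mdadm /dev/md1 --remove /dev/sdd1   (take it out of the array)
  # mdadm --detail /dev/md1             (confirm the array is degraded but still running)
  # mdadm /dev/md1 --add /dev/sdd1      (re-add it afterwards and let it resync)

If that sequence also wedges the box, md itself looks suspect; if only a controller-initiated kick does it, the aic7xxx error recovery path becomes the more likely culprit.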
Unfortunately, I can't really play around with this system as it is our primary server (email/ldap/nfs/you name it). I have duplicate hardware, but not enough disks at the moment. I'll try to set that up at some point soon.

I'm thinking that it must have something to do with the scsi hardware/driver, otherwise a *lot* more people would be complaining :). The system does not completely hang, but any task that needs disk I/O hangs.

Thank you for looking into it, but I suspect I'm going to have to debug/reproduce this myself. Any suggestions for trying to capture good debug info next time it happens?
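For what it's worth, my tentative plan is to try to grab a task dump over a serial console next time, since anything that touches the disks is useless at that point. Assuming sysrq is enabled and console=ttyS0,115200 is on the kernel command line, something like:

  # echo 1 > /proc/sys/kernel/sysrq
  # echo t > /proc/sysrq-trigger    (dumps the state of every task to the kernel log/console)

should at least show where the md and scsi threads are stuck.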
Orion,

I completely understand - I was expecting it was a vital system for you. I agree with you that this is most likely a scsi driver issue rather than a raid issue, but we lack hard evidence for that right now. If you have a second system and are able to reproduce the problem on it in a test scenario, it would be very valuable to see whether you can reproduce the same lockups without the drives running in a raid setup.
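One crude way to take md out of the picture would be to run sustained I/O against a scratch disk on the same controller and then take that disk away, e.g. (sde here is only a placeholder for a disk holding no data - the second command is destructive):

  # dd if=/dev/sde of=/dev/null bs=1M &
  # dd if=/dev/zero of=/dev/sde bs=1M &

then pull the disk, or offline it with something like "echo offline > /sys/block/sde/device/state", and watch whether the aic7xxx recovery alone is enough to hang I/O for the rest of the bus.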
Orion,

We haven't gotten any further on this for well over a year now - do you still consider it to be an issue, or can we close this bugzilla?

Thanks,
Jes
No longer running this hardware and OS version, so please close. Thanks!
Thanks for the update! Closing.