Bug 466371

Summary: HTS storage test on aic79xx fails, SCSI command aborts, device drops offline
Product: Red Hat Enterprise Linux 5
Reporter: Roderick Constance <rconstance>
Component: kernel-xen
Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 5.2
CC: bornman.richard, clalance, coughlan, drjones, jforbes, lersek, nhorman, pbonzini, revers, xen-maint
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-05-17 17:51:11 UTC
Bug Blocks: 514491
Attachments:
  /var/log/messages from failed aic79xx storage test (flags: none)
  Comment (flags: none)

Description Roderick Constance 2008-10-09 21:54:32 UTC
Created attachment 319952
/var/log/messages from failed aic79xx storage test

Description of problem:

While performing the HTS storage test on an Adaptec 29320 attached to an external storage device, the test fails: SCSI commands are aborted and the external device is taken offline. The failure appears to occur when the test starts mkfs.

The external device passes the HTS storage test with other, non-aic79xx-based SCSI HBAs.

Further investigation showed that a simple manual fdisk or mkfs.ext3 is enough to trigger this error.

This bug looks similar to Bug 458620. It happens with the latest released RHEL 5.2 kernel (2.6.18-92.1.13.el5xen). I have also tried 2.6.18-115.el5xen and 2.6.18-118.el5xen, the kernels mentioned in that Bugzilla report.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-92.1.13.el5

How reproducible:
Always

Steps to Reproduce:
1. Run fdisk (or mkfs.ext3) against the external device attached to the Adaptec 29320.
  
Actual results:
Excerpt from /var/log/messages.  See attachment for all messages during test.

Oct  6 15:56:25 eng158 hts/runtests[14440]: Beginning test run.
Oct  6 15:56:25 eng158 kudzu[14469]: obsolete kudzu ddcProbe called
Oct  6 15:56:26 eng158 hts/runtests[14440]: storage: begin
Oct  6 15:57:27 eng158 kernel: sd 13:0:4:0: Attempting to queue an ABORT message:CDB: 0x2a 0x0 0x0 0x38 0xc 0x10 0x0 0x4 0x0 0x0
Oct  6 15:57:27 eng158 kernel: scsi13: At time of recovery, card was not paused
Oct  6 15:57:27 eng158 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Oct  6 15:57:27 eng158 kernel: scsi13: Dumping Card State at program address 0x2 Mode 0x33
Oct  6 15:57:27 eng158 kernel: Card was paused
Oct  6 15:57:27 eng158 kernel: INTSTAT[0x0] SELOID[0x4] SELID[0x40] HS_MAILBOX[0x0] 
Oct  6 15:57:27 eng158 kernel: INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33] 
Oct  6 15:57:27 eng158 kernel: SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0] LASTPHASE[0x1] 
Oct  6 15:57:27 eng158 kernel: SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x0] 
Oct  6 15:57:27 eng158 kernel: SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x2] 
Oct  6 15:57:27 eng158 kernel: KERNEL_QFREEZE_COUNT[0x2] MK_MESSAGE_SCB[0xff00] 
Oct  6 15:57:27 eng158 kernel: MK_MESSAGE_SCSIID[0xff] SSTAT0[0x0] SSTAT1[0x8] 
Oct  6 15:57:27 eng158 kernel: SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0xc0] SIMODE1[0xa4] 
Oct  6 15:57:27 eng158 kernel: LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0] 
Oct  6 15:57:27 eng158 kernel: LQOSTAT1[0x0] LQOSTAT2[0xe1] 
Oct  6 15:57:27 eng158 kernel: 
Oct  6 15:57:27 eng158 kernel: SCB Count = 4 CMDS_PENDING = 4 LASTSCB 0x0 CURRSCB 0x1 NEXTSCB 0xff40
Oct  6 15:57:27 eng158 kernel: qinstart = 225 qinfifonext = 225
Oct  6 15:57:27 eng158 kernel: QINFIFO:
Oct  6 15:57:27 eng158 kernel: WAITING_TID_QUEUES:
Oct  6 15:57:27 eng158 kernel: Pending list:
Oct  6 15:57:27 eng158 kernel:   1 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x47] 
Oct  6 15:57:27 eng158 kernel:   0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x47] 
Oct  6 15:57:27 eng158 kernel:   2 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x47] 
Oct  6 15:57:27 eng158 kernel:   3 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x47] 
Oct  6 15:57:27 eng158 kernel: Total 4
Oct  6 15:57:27 eng158 kernel: Kernel Free SCB list: 
Oct  6 15:57:27 eng158 kernel: Sequencer Complete DMA-inprog list: 
Oct  6 15:57:27 eng158 kernel: Sequencer Complete list: 
Oct  6 15:57:27 eng158 kernel: Sequencer DMA-Up and Complete list: 
Oct  6 15:57:27 eng158 kernel: Sequencer On QFreeze and Complete list: 
Oct  6 15:57:27 eng158 kernel: 
Oct  6 15:57:27 eng158 kernel: 
Oct  6 15:57:27 eng158 kernel: scsi13: FIFO0 Free, LONGJMP == 0x826b, SCB 0x2
Oct  6 15:57:27 eng158 kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89] 
Oct  6 15:57:27 eng158 kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] 
Oct  6 15:57:27 eng158 kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 
Oct  6 15:57:27 eng158 kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] 
Oct  6 15:57:27 eng158 kernel: 
Oct  6 15:57:27 eng158 kernel: scsi13: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Oct  6 15:57:27 eng158 kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89] 
Oct  6 15:57:27 eng158 kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] 
Oct  6 15:57:27 eng158 kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 
Oct  6 15:57:27 eng158 kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] 
Oct  6 15:57:27 eng158 kernel: LQIN: 0x4 0x0 0x0 0x2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x8 0x0 0x0 0x0 0x0 0x0 0x0 
Oct  6 15:57:27 eng158 kernel: scsi13: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Oct  6 15:57:27 eng158 kernel: scsi13: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x2
Oct  6 15:57:27 eng158 kernel: scsi13: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0
Oct  6 15:57:27 eng158 kernel: SIMODE0[0xc] 
Oct  6 15:57:27 eng158 kernel: CCSCBCTL[0x4] 
Oct  6 15:57:27 eng158 kernel: scsi13: REG0 == 0x1, SINDEX = 0x14a, DINDEX = 0x10a
Oct  6 15:57:27 eng158 kernel: scsi13: SCBPTR == 0x1, SCB_NEXT == 0xff40, SCB_NEXT2 == 0xffff
Oct  6 15:57:27 eng158 kernel: CDB 2a 0 0 3c 8 10
Oct  6 15:57:27 eng158 kernel: STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Oct  6 15:57:27 eng158 kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Oct  6 15:57:27 eng158 kernel: (scsi13:A:4:0): Device is disconnected, re-queuing SCB
Oct  6 15:57:27 eng158 kernel: scsi13: Recovery code sleeping
Oct  6 15:57:27 eng158 kernel: scsi13: ILLEGAL_PHASE 0x80
Oct  6 15:57:27 eng158 kernel:  target13:0:4: FAST-160 SCSI 160.0 MB/s DT IU QAS RTI WRFLOW PCOMP (6.25 ns, offset 127)
Oct  6 15:57:27 eng158 kernel: scsi13: target 4 using 8bit transfers
Oct  6 15:57:27 eng158 kernel:  target13:0:4: asynchronous
Oct  6 15:57:27 eng158 kernel: scsi13: target 4 using asynchronous transfers
Oct  6 15:57:32 eng158 kernel: scsi13: Timer Expired (active 4)
Oct  6 15:57:32 eng158 kernel: Recovery code awake
Oct  6 15:57:32 eng158 kernel: scsi13: Command abort returning 0x2003
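The aborted command and the abort result can be decoded from the log above. A minimal Python sketch, assuming the standard SBC READ(10)/WRITE(10) CDB layout (opcode in byte 0, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 and 8) and the SCSI midlayer error-handler return codes defined in include/scsi/scsi.h for kernels of this era (0x2002 SUCCESS, 0x2003 FAILED). The helper names are illustrative only. Under these assumptions the aborted command is a WRITE(10) of 1024 blocks (512 KiB if the disk uses 512-byte sectors), which would plausibly fit a failure during mkfs, and the abort handler reported FAILED, after which the SCSI error handler normally escalates to resets and may take the device offline. The "CDB 2a 0 0 3c 8 10" line in the card-state dump shows only the first six bytes, so only the opcode and LBA can be recovered from it.

```python
# Opcodes that matter for this log (SBC): 0x28 READ(10), 0x2A WRITE(10).
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# SCSI midlayer error-handler return codes (include/scsi/scsi.h, 2.6.18 era).
EH_RESULTS = {
    0x2001: "NEEDS_RETRY",
    0x2002: "SUCCESS",
    0x2003: "FAILED",
}


def parse_cdb(text):
    """Turn '0x2a 0x0 ...' or '2a 0 0 ...' into a list of byte values."""
    # int(..., 16) accepts both '0x2a' and '2a'.
    return [int(tok, 16) for tok in text.split()]


def decode_rw10(cdb):
    """Decode a READ(10)/WRITE(10) CDB; tolerates a truncated dump."""
    info = {"opcode": OPCODES.get(cdb[0], hex(cdb[0]))}
    if len(cdb) >= 6:
        # Bytes 2..5: big-endian logical block address.
        info["lba"] = int.from_bytes(bytes(cdb[2:6]), "big")
    if len(cdb) >= 9:
        # Bytes 7..8: big-endian transfer length in logical blocks.
        info["blocks"] = int.from_bytes(bytes(cdb[7:9]), "big")
    return info


def abort_result(line):
    """Map the trailing hex code of 'Command abort returning 0x....' to a name."""
    code = int(line.rsplit(None, 1)[-1], 16)
    return EH_RESULTS.get(code, hex(code))


if __name__ == "__main__":
    print(decode_rw10(parse_cdb("0x2a 0x0 0x0 0x38 0xc 0x10 0x0 0x4 0x0 0x0")))
    print(decode_rw10(parse_cdb("2a 0 0 3c 8 10")))
    print(abort_result("scsi13: Command abort returning 0x2003"))
```

The decoder deliberately accepts short inputs so the same function works on both the full 10-byte CDB from the abort message and the truncated one from the card-state dump.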


Expected results:
Successful completion of the storage test, fdisk, or mkfs.

Additional info:
This bug blocks a Red Hat Hardware Certification entry.

Comment 5 Neil Horman 2009-03-11 20:39:55 UTC
I wouldn't think so. The other bug only triggered during a kdump, and IIRC it was caused by in-flight operations preventing a reset from completing.

By the way, you didn't cc me, you reassigned the bug to me.  I don't think you intended to do that, did you?  Sending it back your way.

Comment 6 Bill Burns 2009-03-12 10:32:57 UTC
Ok, thanks for the response, Neil.
CC'ing Tom, do we have this hardware available here to see if we can reproduce this?

Comment 8 Andrew Jones 2010-06-22 18:35:45 UTC
Just echoing Bill's question from comment 6 and adding the needinfo.

Comment 9 richard 2010-10-21 08:20:02 UTC
Created attachment 915161
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 11 Paolo Bonzini 2011-04-01 14:28:17 UTC
Richard, am I right to understand that the kernel fails to boot completely?  Also, can you confirm that you're using the Xen kernel (i.e. dom0) and can you check whether the same happens with the non-Xen kernel?

Thanks!

Comment 12 Laszlo Ersek 2011-05-17 17:51:11 UTC
We haven't gotten a response to the hardware needinfo in comment 8 in 9 months. We haven't received a response to the software needinfo in comment 11 in 1.5 months.

The Beaker query in comment 10 is not strict enough, I believe. I reserved two machines from that system list, and "dell-pe700-01.rhts.eng.bos.redhat.com" turned out not to have an aic79xx-driven SCSI controller; it has aacraid. (fdisk works perfectly under -262xen, BTW.)

I changed the query to "Devices/Driver contains aic79xx".

The only machine that was working, unused, and accessible to me was "intel-s3e8132-01.rhts.eng.bos.redhat.com". This one does have the controller in question (log from 2.6.18-262.el5xen x86_64):

scsi2 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs

  Vendor: SUPER     Model: GEM359 REV001     Rev: 1.09
  Type:   Processor                          ANSI SCSI revision: 02

scsi3 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs

However, no disks are attached to these controllers, so I can't actually exercise the driver.
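Whether a machine actually has disks behind an aic79xx controller can also be checked on the machine itself rather than through an inventory query. A minimal Python sketch, assuming the usual Linux sysfs layout: /sys/class/scsi_host/hostN/proc_name names the driver (assumed here to read "aic79xx"), and each SCSI device under /sys/bus/scsi/devices/H:C:T:L has a "type" attribute in which 0 means a direct-access disk. The function name aic79xx_disks is hypothetical, not an existing tool.

```python
import os
import re

# SCSI device directories in sysfs are named host:channel:target:lun.
HCTL = re.compile(r"^(\d+):(\d+):(\d+):(\d+)$")
TYPE_DISK = 0  # SCSI peripheral device type 0 = direct-access block device


def read_attr(path):
    """Return the stripped contents of a sysfs attribute file."""
    with open(path) as f:
        return f.read().strip()


def aic79xx_disks(sysfs="/sys", driver="aic79xx"):
    """Map host number -> disk HCTL names, for hosts whose proc_name is `driver`."""
    hosts = {}
    hosts_dir = os.path.join(sysfs, "class", "scsi_host")
    if os.path.isdir(hosts_dir):
        for name in sorted(os.listdir(hosts_dir)):
            attr = os.path.join(hosts_dir, name, "proc_name")
            if name.startswith("host") and os.path.isfile(attr):
                if read_attr(attr) == driver:
                    hosts[int(name[len("host"):])] = []
    dev_dir = os.path.join(sysfs, "bus", "scsi", "devices")
    if os.path.isdir(dev_dir):
        for name in sorted(os.listdir(dev_dir)):
            m = HCTL.match(name)
            # Skip host*/target* entries and devices on non-matching hosts.
            if m is None or int(m.group(1)) not in hosts:
                continue
            attr = os.path.join(dev_dir, name, "type")
            if os.path.isfile(attr) and int(read_attr(attr)) == TYPE_DISK:
                hosts[int(m.group(1))].append(name)
    return hosts


if __name__ == "__main__":
    for host, disks in sorted(aic79xx_disks().items()):
        print(f"host{host}: {len(disks)} disk(s) {disks}")
```

On a machine like the one described above, this would presumably report two aic79xx hosts with empty disk lists, since the SUPER GEM359 device identifies itself as a processor (type 3), not a disk.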

The reported kernels are old (the most recent is 2.6.18-118.el5xen). I think the bug is unlikely to return with recent updates. If it does, please feel free to reopen. Closing as INSUFFICIENT_DATA for now.