Bug 653706 - [Intel 6.1 Bug] scsi error handler bug encountered in ALUA failover scenario (mptsas)
[Intel 6.1 Bug] scsi error handler bug encountered in ALUA failover scenario ...
Status: CLOSED DUPLICATE of bug 642618
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.0
x86_64 Linux
high Severity high
: rc
: 6.1
Assigned To: Tomas Henzl
Red Hat Kernel QE team
:
Depends On:
Blocks: 600438 672278
  Show dependency treegraph
 
Reported: 2010-11-15 19:18 EST by joseph.r.gruher
Modified: 2011-02-03 08:11 EST (History)
12 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 672278 (view as bug list)
Environment:
Last Closed: 2011-02-03 08:11:45 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (144.47 KB, application/x-zip-compressed)
2010-11-15 19:18 EST, joseph.r.gruher
no flags Details
RHEL6 32bit SOS-report files (1.39 MB, application/octet-stream)
2010-11-24 18:54 EST, ilgu hong
no flags Details
RHEL6 32bit multipath information (multipath -ll/v4, dmsetup ls --tree) (15.89 KB, text/plain)
2010-11-24 18:56 EST, ilgu hong
no flags Details
RHEL6 64bit SOS-report files (1.34 MB, application/octet-stream)
2010-11-24 18:57 EST, ilgu hong
no flags Details
RHEL6 64bit multipath information (multipath -ll/v4, dmsetup ls --tree) (15.89 KB, text/plain)
2010-11-24 18:58 EST, ilgu hong
no flags Details
RHEL6 32bit serial console log with mptbase debug (0x8188) and scsi_logging (0x124) (115.47 KB, application/octet-stream)
2010-11-29 16:45 EST, ilgu hong
no flags Details
RHEL6 64bit serial console log with mptbase debug (0x8188) and scsi_logging (0x124) (60.96 KB, application/octet-stream)
2010-11-29 16:46 EST, ilgu hong
no flags Details
Avoid Error handling when device is deleting (582 bytes, patch)
2010-11-30 06:36 EST, kashyap
no flags Details | Diff
Test log with mptsas patch (195.12 KB, application/x-zip-compressed)
2010-12-01 14:16 EST, ilgu hong
no flags Details
4x.100 patch (13.22 KB, patch)
2010-12-02 09:55 EST, kashyap
no flags Details | Diff
4x.101 patch (2.10 KB, patch)
2010-12-02 09:56 EST, kashyap
no flags Details | Diff
RHEL6 variant of .100 patch (12.66 KB, patch)
2010-12-02 17:05 EST, Jason Unrein
no flags Details | Diff
RHEL6 variant of .101 patch (14.83 KB, patch)
2010-12-02 17:05 EST, Jason Unrein
no flags Details | Diff
failed test serial log for 100.patch (768.90 KB, application/octet-stream)
2010-12-03 19:00 EST, ilgu hong
no flags Details
failed test /var/log/message for RHEL6 64bit (1.01 MB, text/plain)
2010-12-03 19:01 EST, ilgu hong
no flags Details
failed test serial log for 101.patch with rhel6 64bit (129.62 KB, application/octet-stream)
2010-12-03 19:19 EST, ilgu hong
no flags Details
failed test /var/log/message for RHEL6 64bit with 101 patch (85.88 KB, text/plain)
2010-12-03 19:20 EST, ilgu hong
no flags Details

  None (edit)
Description joseph.r.gruher 2010-11-15 19:18:35 EST
Created attachment 460697 [details]
logs

In this test I set 3 Luns. One is for OS and two are for data.

Symptom :

It looks like RHEL 6 run error handler routine before recognize third LUN removed and it makes Host reset.  And Host reset remove survival path and then system lost all available path. 

See below for detail.


<<<<<<<<<<<<<< first lun is remove by failover

sg_remove: sg0
sg_device_destroy: sg0
sd 0:0:67:0: [sda] Synchronizing SCSI cache
sd 0:0:67:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:67:0: [sda] Unhandled error code
device-mapper: multipath: Failing path 8:0.


<<<<<<<<<<<<<< Second lun is removed by failover 

sg_remove: sg1
sg_device_destroy: sg1
device-mapper: multipath: Failing path 8:16.
device-mapper: multipath: Failing path 8:32.
scsi 0:0:67:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:67:0: [sda] CDB: Write(10): 2a 00 00 62 4d 58 00 00 08 00
end_request: I/O error, dev sda, sector 6442328
sd 0:0:67:1: [sdb] Synchronizing SCSI cache
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done

<<<<<<<<<<<<<< SCSI error handler routine is  called (target reset)

Error handler scsi_eh_0 waking up
sd 0:0:67:2: scsi_eh_prt_failet 59
scsi_eh_0: Target reset failed target: 59
scsi_eh_0: Sending target reset to target 60
scsi_eh_0: Target reset failed target: 60
scsi_eh_0: Sending target reset to target 61
scsi_eh_0: Target reset failed target: 61
scsi_eh_0: Sending target reset to target 62
scsi_eh_0: Target reset failed target: 62
scsi_eh_0: Sending target reset to target 63
scsi_eh_0: Target reset failed target: 63
scsi_eh_0: Sending target reset to target 64
scsi_eh_0: Target reset failed target: 64
scsi_eh_0: Sending target reset to target 65
scsi_eh_0: Target reset failed target: 65
scsi_eh_0: Sending target reset to target 66
scsi_eh_0: Target reset failed target: 66
scsi_eh_0: Sending target reset to target 67
scsi_eh_0: Target reset failed target: 67
scsi_eh_0: Sending BRST chan: 0

<<<<<<<<<<<<<<<<<< bus reset

scsi_try_bus_reset: Snd Bus RST
mptscsih: ioc0: attempting bus reset! (sc=c5938e00)
sd 0:0:67:2: [sdc] CDB: Read(10): 28 00 01 66 f9 90 00 00 08 00
mptscsih: ioc0: bus reset: SUCCESS (sc=c5938e00)

<<<<<<<<<<<<<<  HBA report DID_DISCONNECT , but it looks like scsi error handler routine does not care about it).

scsi_eh_done scmd: c5938e00 result: 10000  
 scsi_send_eh_cmnd: scmd: c5938e00, timeleft: 10000
scsi_send_eh_cmnd: scsi_eh_completed_normally 2003
scsi_eh_tur: scmd c5938e00 rtn 2003


<<<<<<<<<<<<<<<<<  Host reset

scsi_eh_0: Sending HRST
scsi_try_host_reset: Snd Host RST
mptscsih: ioc0: attempting host reset! (sc=c5938e00)
mptbase: ioc0: Initiating recovery
mptscsih: ioc0: host reset: SUCCESS (sc=c5938e00)
scsi_eh_done scmd: c5938e00 result: 10000
scsi_send_eh_cmnd: scmultipath: Failing path 8:80.
mptbase: ioc0: LogInfo(0x30030501sd 0:0:68:0: [sdd] Synchronizing SCSI cache
sd 0:0:68:0: [sdd] 0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501) cb_idx mptbase_reply
mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501) cb_idx mptbase_reply
mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501) cb_idx mptbase_reply
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 0 GiB)

<<<<<<<<<<<<<<<<<<<<  All paths are removed by Host reset and then new path added. 

 sdb:
sd 0:0:69:2: [sdc] Write Protect is off
sd 0:0:69:2: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sdc: unknown partition table
sd 0:0:69:0: [sda] Write Protect is off
sd 0:0:69:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 0:0:69:2: [sdc] Attached SCSI disk
 sda: sda1 sda2
 unknown partition table
sd 0:0:69:0: [sda] Attached SCSI disk
sd 0:0:69:1: [sdb] Attached SCSI disk
mptsas: ioc0: add expander: num_phys 25, sas_addr (0x5000423e2e4c52ff)
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1
sd 0:0:70:2: [sdi] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
sd 0:0:70:2: [sdi] Write Protect is off
sd 0:0:70:2: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 0:0:70:0: [sdg] Write Protect is off
 sdi:Dev sdi: unable to read RDB block 0
 unable to read partition table
sd 0:0:70:1: [sdh] Attached SCSI disk
sd 0:0:70:0: [sdg] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 0:0:70:2: [sdi] Attached SCSI disk
 sdg:Dev sdg: unable to read RDB block 0
 unable to read partition table
sd 0:0:70:0: [sdg] Attached SCSI disk

<<<<<<<<<<<<<<<<<<<<<  kernel panic 

INFO: task jbd2/dm-3-8:622 blocked for more than 120 seconds.
ace+0xa0/0xb0
 [<c059e462>] ? selinux_file_alloc_security+0x42/0xc0
 [<c052ada1>] ? path_walk+0x51/0xc0
 [<c052af29>] ? do_path_lookup+0x59/0x90
 [<c052bc69>] ? do_filp_open+0xc9/0xb10
 [<c04fa822>] ? handle_mm_fault+0x132/0x1d0
 [<c080f730>] ? do_page_fault+0x0/0x480
 [<c051b568>] ? do_sys_open+0x58/0x130
 [<c04a993c>] ? audit_syscall_entry+0x21c/0x240
 [<c051b6bc>] ? sys_open+0x2c/0x40
 [<c04099fb>] ? sysenter_do_call+0x12/0x28
PuTTYsg_remove: sg0
sg_device_destroy: sg0
sd 0:0:69:0: alua: Detached
sd 0:0:69:0: [sda] Synchronizing SCSI cache
sd 0:0:69:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sg_remove: sg1
sg_device_destroy: sg1
sd 0:0:69:1: alua: Detached
sd 0:0:69:1: [sdb] Synchronizing SCSI cache
sd 0:0:69:1: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sg_remove: sg2
sg_device_destroy: sg2
sd 0:0:69:2: alua: Detached
sd 0:0:69:2: [sdc] Synchronizing SCSI cache
sd 0:0:69:2: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi target0:0:69: mptsas: ioc0: delete device: fw_channel 0, fw_id 0, phy 12, sas_addr 0x5000155000020509
Comment 2 Tom Coughlan 2010-11-24 11:19:15 EST
I talked with Mike a little about this. The problem here is that the host reset ends up removing all the devices/paths (even the good ones). That is most likely an mpt driver problem, as opposed to the layers above.  

Andrius, please add the right contacts from LSI to this BZ.  

(In reply to comment #0)
> Created attachment 460697 [details]

> In this test I set 3 Luns. One is for OS and two are for data.

And there are two paths to each LUN?

(An sosreport would help us interpret the logs better.)

Has this problem been seen on other kernel/o.s. versions?
Comment 3 joseph.r.gruher 2010-11-24 13:17:01 EST
Yes, this is a dual storage controller system using ALUA, so there are two paths to each LUN, one active and one passive.  If the active path goes down the standby path should become active.

We can also produce the problem on Ubuntu 10.4 using 2.6.32 kernel.  We do not seem to see the problem so far on SLES11 SP1.  We never saw the problem on RH 5.x.

I'm not familiar with sosreport... if there is additional debug data we can provide, please provide some more details, and we'll do our best to gather it.
Comment 4 Tom Coughlan 2010-11-24 16:03:36 EST
(In reply to comment #3)

> I'm not familiar with sosreport... if there is additional debug data we can
> provide, please provide some more details, and we'll do our best to gather it.

man sosreport

If just generates a tarball with the output of a bunch of "show" commands, /var/log/messages, and so on.
Comment 5 ilgu hong 2010-11-24 17:08:26 EST
Attached log is get by "console redirect to serial" and It is equivalent with "/var/log/message".

Main reason we use "console redirect mechanism" is that system can not save error log (mpt/scsi) when error occured.  

We use SAN stroage for OS, and OS cannot access /var/log/message when this error occured. So it is impossible to save log /var/log/message in error case.


Please, check attached log.

Thanks.
Comment 6 ilgu hong 2010-11-24 17:19:26 EST
Hi,Tom

sorry, I'm confused about previous comment.

Whatever, In error case, system already is in hang status. So we cannot run sosreport.

We can only run soreport after reboot. Is that OK with you?

Thanks.
Comment 7 Tom Coughlan 2010-11-24 18:13:23 EST
(In reply to comment #6)
> Hi,Tom

> We can only run soreport after reboot. Is that OK with you?

Yes, that is what I had in mind - a view of the system in its "normal" state. This will give a bit of background on version numbers and how things are connected (multipath -v4 -ll) and such.
Comment 8 ilgu hong 2010-11-24 18:54:45 EST
Created attachment 462780 [details]
RHEL6 32bit SOS-report files

This RHEL6 32bit sosreport result.
Comment 9 ilgu hong 2010-11-24 18:56:25 EST
Created attachment 462781 [details]
RHEL6 32bit multipath information (multipath -ll/v4, dmsetup ls --tree)

This file save the result of (multipath -ll/ dmsetup ls --tree/ multpath -v4).
Comment 10 ilgu hong 2010-11-24 18:57:53 EST
Created attachment 462782 [details]
RHEL6 64bit SOS-report files

This file save reulst of sosrepot in RHEL6 64bit.
Comment 11 ilgu hong 2010-11-24 18:58:55 EST
Created attachment 462783 [details]
RHEL6 64bit multipath information (multipath -ll/v4, dmsetup ls --tree)

This file save result of multipath -ll / dmsetup ls --tree/ multipath -v4 in RHEL6 64bit.
Comment 12 ilgu hong 2010-11-24 19:00:32 EST
Hi, Tom

I update sosreport file and multpath information file for the both 32bit and 64bit in Normal status.

Thanks.
Comment 13 Tomas Henzl 2010-11-25 07:33:55 EST
Hi Kashyap,

per comment#2 it looks like the problem is in the mptsas driver. 
Please take a look at it.
Comment 14 kashyap 2010-11-29 07:23:18 EST
(In reply to comment #13)
> Hi Kashyap,
> 
> per comment#2 it looks like the problem is in the mptsas driver. 
> Please take a look at it.

I am going through the contents right now. Meanwhile can someone explain test scenario ? 
And yes please enable driver logging level to 0x8188.

~ Kashyap
Comment 15 ilgu hong 2010-11-29 13:17:20 EST
Hi, kashyap.

Intel multiplex has dual storage controller and it supports ALUA feature. So we supports dual path with alua feature.

In this test, system is configured with three LUNs- one is for OS(RHEL6) and others are for data. we create I/O for the data LUNs and then we test failover/failback by rebooting each storage contorller.
Comment 16 ilgu hong 2010-11-29 13:57:01 EST
Hi, Kashyap.

You want to set mpt_debug_level as 0x8188.

Is that right?
Comment 17 ilgu hong 2010-11-29 16:45:39 EST
Created attachment 463589 [details]
RHEL6 32bit serial console log with mptbase debug (0x8188) and scsi_logging (0x124)

This file get with setting - mpt debug level (0x8188) scsi logging level (0x124).
Comment 18 ilgu hong 2010-11-29 16:46:57 EST
Created attachment 463590 [details]
RHEL6 64bit serial console log with mptbase debug (0x8188) and scsi_logging (0x124)

This file get with seeting - mptbase debug level (0x8188) and scsi logging level(0x124).
Comment 19 ilgu hong 2010-11-29 16:47:46 EST
Hi, Kashyap.

I attached log file.
Thanks.
Comment 20 kashyap 2010-11-30 06:35:22 EST
ilgu hong, Your input helps me. I wanted exactly same logfiles.

Here is my analysis... [Search for "kashyap" ]

--
sg_remove: sg0
sg_device_destroy: sg0
device-mapper: multipath: Failing path 8:0.
sd 0:0:3:0: [sda] Synchronizing SCSI cache  <-- "Kashyap": sda is removed and 
                                               it is assigned to 0:0:3:0
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Exend_request: I/O error, dev sdb, sector 12887056
device-mapper: multipath: Failing path 8:16.
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptsas: ioc0: TaskMgmt completed: (mf = ffff8800778c8400, mr = ffff8800778c1720)
mptsas: ioc0: 	TaskMgmt completed: fw_channel = 0, fw_id = 0,
	task_type = 0x03, iocstatus = 0x0000 loginfo = 0x00000000,	 
        response_code = 0x00, term_cmnds = 10
mptsas: ioc0: TaskMgmt: completed (0 seconds)
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800782d8480)
Error handler scsi_eh_0 waking up
sd 0:0:3:0: scsi_eh_prt_fail_): 2a 00 01 d4 11 70 00 00 08 00
                              // "Kashyap": Task abort is for same sda device 
                                is started here.....(See it is 0:0:3:0 device)
mptscsih: ioc0: TaskMgmt request (mf=ffff8800778c8500)
mptscsih: ioc0: TaskMgmt: ctx2abort (0x00000000) task_type = 0x04, timeout = 10
mptbase: ioc0: MPT event:(17h) : SAS Broadcase Primative: phy=3 port=3 width=1 primative=0x04
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800786ca200)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptscsih: ioc0: TaskMgmt completed (mf=ffff8800778c8500, mr=ffff8800778c18b0)
mptscsih: ioc0: TaskMgmt fw_channel = 0, fw_id = 0, task_type = 0x04,
	iocstatus = 0x0000, loginfo = 0x00000000, response_code = 0x00,
	term_cmnds = 0
-



Error handling is started for device which is a deleting at the same time.. Due to race in EEH it is happening...To avoid this mptsas driver has callback "eh_timed_out" 
It has been done using below Bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=615866

But most of the time it happens only when Device Missing Delay timer is ON. (it means chances are high when DMD is on, but very rare we have seen similar root cause for device offline issue). In this case without DMD also we are seeing similar issue reported in bugzilla 615866.

Please try the attached patch.!

Thanks, Kashyap
Comment 21 kashyap 2010-11-30 06:36:45 EST
Created attachment 463699 [details]
Avoid Error handling when device is deleting
Comment 22 joseph.r.gruher 2010-11-30 13:44:39 EST
Hi Kashyap-

We are looking into adding the patch and retesting RHEL6 to see if this helps us.

I did want to note that we previously mentioned we also see this problem on Ubuntu 10.4 and in that case we are using the latest upstream and already have this patch.  We'll still try this patch in the context of RHEL6, I just wanted to share the concern that this may not be a full solution for the problem.  The comments below are from one of the developers working on that Ubuntu build:

The bad news is we already have this fix because we've been using the latest upstream, so it's something else.

diff -Naurp fusion_orig/mptsas.c fusion/mptsas.c
--- fusion_orig/mptsas.c	2010-12-01 03:37:24.102545034 +0530
+++ fusion/mptsas.c	2010-12-01 03:38:28.458582184 +0530
@@ -1925,7 +1925,8 @@ static enum blk_eh_timer_return mptsas_e
 	}
 
 	vdevice = sc->device->hostdata;
-	if (vdevice && vdevice->vtarget && vdevice->vtarget->inDMD) {
+	if (vdevice && vdevice->vtarget && (vdevice->vtarget->inDMD
+					|| vdevice->vtarget->deleted)) {
 		dtmprintk(ioc, printk(MYIOC_s_WARN_FMT ": %s: target removed "
 		    "or in device removal delay (sc=%p)\n",
 		    ioc->name, __func__, sc ));

and my mptsas tree:

        vdevice = sc->device->hostdata;
        if (vdevice && vdevice->vtarget && (vdevice->vtarget->inDMD
                || vdevice->vtarget->deleted)) {
                dtmprintk(ioc, printk(MYIOC_s_WARN_FMT ": %s: target removed "
                    "or in device removal delay (sc=%p)\n",
                    ioc->name, __func__, sc));
                rc = BLK_EH_RESET_TIMER;
                goto done;
        }

The concept here is to reset the timer on the scmd so that it stays out of the error handler long enough to be reconciled by the hotplug removal. Problem is, you can only reset the timer so many times, perhaps 3, no more than 5.
Comment 23 ilgu hong 2010-12-01 14:16:12 EST
Created attachment 464085 [details]
Test log with mptsas patch

This log include 3 files.

RHEL6-32.log is serail console log for 32bit host, and it looks fine with mptsas patch.

RHEL6-64.log is serial console log for 64bit host, and it shows new error symptom.

error-var-log-message-RHEL6-64.txt is /var/log/message for 64bit host, and I just added this file for the conveninet of debugging.
Comment 24 ilgu hong 2010-12-01 14:17:12 EST
Hi Kashyap, 

I checked your patch.

It looks like there is little improvement on MPTsas behavior but it also creates new symptoms.

I tested your patch with both 32bit and 64bit RHEL6. 

In 32bit RHEL6 host, works fine more than 23 iteration (each iteration crates 5 times failover/failback for each controller) and it still works fine.

But 65bit RHEL6 host, It looks like go to endless loop in the Mptsas driver/firmware. check below log.

In the below log, mptsas_eh_timed_out handler contineously called event though failback finished, and it looks like mptsas go to abnormal status and does not works.


I attached serial console log file and I also attached /var/log/message for the convenient for your debug.

In the RHEL6-64.log serial log, error symptom starts at the line 5665.
In the error-var-log-message-RHEL6-64.txt log, error symptom starts at the line 12585.

< Error symptom - serial log for RHEL6 -64>

<<======================== failover
 end_device-0:83:1: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 0, phy 13,sas_addr 0x5000155000020408
 phy-0:83:6470DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:83:0: [sda] CDB: Write(10): 2a 00 02 14 6f b0 00 00 30 00
end_request: I/O error, dev sda, sector 34893744
scsi 0:0:83:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 0, sc=ffff88003e4dec80, mf = ffff880077e8a580, idx=fb
scsi 0:0:83:0: [sda] Unhandled error code
scsi 0:0:83:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:83:0: [sda] CDB: Write(10): 2a 00 01 96 65 70 00 00 08 00
end_request: I/O error, dev sda, sector 26633584
scsi 0:0:83:0: [sda] Unhandled error code

<<=========================== sg1 removed
sg_remove: sg1
sg_device_destroy: sg1
scsi 0:0:83:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:83:0: [sda] CDB: Write(10): 2a 00 01 55 b4 90 00 00 08 00
end_request: I/O error, dev sda, sector 22394000
sd 0:0:83:1: [sdb] Synchronizing SCSI cachcode
sd 0:0:83:1: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 0:0:83:1: [sdb] CDB: Read(10): 28 00 00 01 ec 98 00 00 08 00
end_request: I/O error, dev sdb, sector 126104
device-mapper: multipath: Failing path 8:16.


<<=========================== mptsas eh timeout handler
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)

sd 0:0:85:0: alua: port group 01 state S supports touSnA
sd 0:0:85:1: alua: port group 01 state S supports touSnA
sd 0:0:85:1: alua: port group 01 switched to state A
sd 0:0:85:0: alua: port group 01 switched to state A
sd 0:0:85:0: alua: port group 01 state A supports touSnA
scsi 0:0:83:0: alua: Detached
sd 0:0:83:1: alua: Detached
sd 0:0:85:1: alua: port group 01 state A supports touSnA
device-mapper: multipath: Failing path 8:32.

<<============================ mptsas eh timeout handler
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)

<<============================ failback
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880004147980)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Added
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880004147740)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880004147d40)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880004147ec0)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=0 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880004147680)

<<============================ still see : mptsas eh timeout handler
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
INFO: task mpt/0:366 blocked for more than 120 seconds.
"echo 08/0x80 [ext4]
 [<ffffffffa0254249>] ext4_orphan_add+0xb9/0x1e0 [ext4]
 [<ffffffffa021d4e5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa0252a60>] ext4_setattr+0x240/0x390 [ext4]
 [<ffffffff81188318>] notify_change+0x168/0x340
 [<ffffffff8116ab84>] do_truncate+0x64/0xa0
 [<ffffffff8117d468>] do_filp_open+0x748/0xd40
 [<ffffffff81171b14>] ? cp_new_stat+0xe4/0x100
 [<ffffffff811892f2>] ? alloc_fd+0x92/0x160
 [<ffffffff81169929>] do_sys_open+0x69/0x140
 [<ffffffff81169a40>] sys_open+0x20/0x30
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88007b2a43c0)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Deleted
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880040a8e6c0)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003780a200)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077f64440)
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Deleted: id=0 channel=0
mptsas: ioc0: TaskMgmt request (mf=ffff880077e86f00)
mptsas: ioc0: TaskMgmt type=3 (sas device delete) fw_channel = 0 fw_id = 0)
mptbase: ioc0: clearing discovery_quiesce_io flag
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptsas: ioc0: TaskMgmt completed: (mf = ffff880077e86f00, mr = ffff880077e82030)
mptsas: ioc0: 	TaskMgmt completed: fw_channel = 0, fw_id = 0,
	task_type = 0x03, iocstatus = 0x0000 loginfo = 0x00000000,
	response_code = 0x00, term_cmnds = 0
mptsas: ioc0: TaskMgmt: completed (0 seconds)
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800137a4ec0)
INFO: task mpt/0:366 blocked for more than 120 seconds.
"echo 0toremove_wake_function+0x0/0x40
 [<ffffffff8120bf4b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ff3b6>] ? security_file_permission+0x16/0x20
 [<ffffffff8116c818>] vfs_write+0xb8/0x1a0
 [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116d251>] sys_write+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88007b2a4480)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Added
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880052ddacc0)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800782098c0)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88007b2a4180)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=0 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880078209740)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800438b6c80)
Comment 25 kashyap 2010-12-02 09:54:00 EST
(In reply to comment #22)
> Hi Kashyap-
> 
> We are looking into adding the patch and retesting RHEL6 to see if this helps
> us.
> 
> I did want to note that we previously mentioned we also see this problem on
> Ubuntu 10.4 and in that case we are using the latest upstream and already have
> this patch.  We'll still try this patch in the context of RHEL6, I just wanted
> to share the concern that this may not be a full solution for the problem.  The
> comments below are from one of the developers working on that Ubuntu build:
> 
> The bad news is we already have this fix because we've been using the latest
> upstream, so it's something else.
> 
> diff -Naurp fusion_orig/mptsas.c fusion/mptsas.c
> --- fusion_orig/mptsas.c 2010-12-01 03:37:24.102545034 +0530
> +++ fusion/mptsas.c 2010-12-01 03:38:28.458582184 +0530
> @@ -1925,7 +1925,8 @@ static enum blk_eh_timer_return mptsas_e
>   }
> 
>   vdevice = sc->device->hostdata;
> - if (vdevice && vdevice->vtarget && vdevice->vtarget->inDMD) {
> + if (vdevice && vdevice->vtarget && (vdevice->vtarget->inDMD
> +     || vdevice->vtarget->deleted)) {
>    dtmprintk(ioc, printk(MYIOC_s_WARN_FMT ": %s: target removed "
>        "or in device removal delay (sc=%p)\n",
>        ioc->name, __func__, sc ));
> 
> and my mptsas tree:
> 
>         vdevice = sc->device->hostdata;
>         if (vdevice && vdevice->vtarget && (vdevice->vtarget->inDMD
>                 || vdevice->vtarget->deleted)) {
>                 dtmprintk(ioc, printk(MYIOC_s_WARN_FMT ": %s: target removed "
>                     "or in device removal delay (sc=%p)\n",
>                     ioc->name, __func__, sc));
>                 rc = BLK_EH_RESET_TIMER;
>                 goto done;
>         }
> 
> The concept here is to reset the timer on the scmd so that it stays out of the
> error handler long enough to be reconciled by the hotplug removal. Problem is,
> you can only reset the timer so many times, perhaps 3, no more than 5.


I am about to look at new logs provided in this bugzilla, but in parallel can we have test result of another two patches ? I have already proposed those patches for Ubuntu 10.4.

I have attached both the patches.

~ Kashyap
Comment 26 kashyap 2010-12-02 09:55:12 EST
Created attachment 464262 [details]
4x.100 patch
Comment 27 kashyap 2010-12-02 09:56:06 EST
Created attachment 464263 [details]
4x.101 patch
Comment 28 ilgu hong 2010-12-02 13:39:28 EST
Hi Kashyap,

Those new patches are not made from RHEL6's message/funsion/ source tree.

Please, make a patch for RHEL6 and re-attached those patches.

Thanks.
Comment 29 Jason Unrein 2010-12-02 17:05:09 EST
Created attachment 464388 [details]
RHEL6 variant of .100 patch
Comment 30 Jason Unrein 2010-12-02 17:05:47 EST
Created attachment 464389 [details]
RHEL6 variant of .101 patch
Comment 31 ilgu hong 2010-12-03 19:00:15 EST
Created attachment 464685 [details]
failed test serial log for 100.patch

This file get with 100 path and 64bit RHEL6.
I enabled 0x8188 debug level for mptbase and 0x124 for scsi_mod.
Comment 32 ilgu hong 2010-12-03 19:01:34 EST
Created attachment 464686 [details]
failed test /var/log/message for RHEL6 64bit

This is system /var/log/message for 100.patch with RHEL6 64bit.
Comment 33 ilgu hong 2010-12-03 19:02:44 EST
Hi Kashyap-

I still see same problem with "100 patch" in the RHEL6 64bit.

One thing interesting is that it still occured only 64bit RHEL6 and it also occured almost same iteration.

Is there any problem related with 64bit supporting?

I attached log for this- 
see line 5792 at the RHEL6-64-100.log and see also line 12537 at the var_log_message_rhel6-64-100-patch.txt

Please, check below.



<<<<<< ======================== failover start

ioc0: TaskMgmt type=3 (sas device delete) fw_channel = 0 fw_id =nt=ffff8800520e99c0 timeout + 60HZ
sd 0:0:81:1: [sdb] Synchronizing SCSI cache
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8800520e99c0 timeout + 60HZ
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Exfw_event=0xffff88003902cb00)
sd 0:0:82:0: alua: port group 01 state A supports touSnA
sd 0:0:82:1: alua: port group 01 state A supports touSnA
sd 0:0:81:1: alua: Detached
sd 0:0:82:1: alua: port group 01 state A supports touSnA
scsi 0:0:81:0: alua: Detached
sd 0:0:82:0: alua: port group 01 state A supports touSnA
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00
sd 0:0:82:2: alua: port group 01 state A supports touSnA
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00

<<<<<< =========== mptsas_eh_timed_out called

mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00

<<<<<< ==========  failover

mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003902ca40)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Added
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003902ce00)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003902c740)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003902c2c0)

<<<<<< ============ another failover 
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=0 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff88003902c200)

<<<<<< =========== still see mptsas_eh_timed_out called with same sc( scsi command) 

mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
INFO: task mpt/0:347 blocked for more than 120 seconds.
"echo 075>] inode_setattr+0x35/0x170
 [<ffffffffa02529a6>] ext4_setattr+0x186/0x390 [ext4]
 [<ffffffff81188318>] notify_change+0x168/0x340
 [<ffffffff8116ab84>] do_truncate+0x64/0xa0
 [<ffffffff8117d468>] do_filp_open+0x748/0xd40
 [<ffffffff81171b14>] ? cp_new_stat+0xe4/0x100
 [<ffffffff811892f2>] ? alloc_fd+0x92/0x160
 [<ffffffff81169929>] do_sys_open+0x69/0x140
 [<ffffffff81169a40>] sys_open+0x20/0x30
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=3: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880033f509c0)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Deleted
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880033f50a80)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880033f50840)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075c93480)
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Deleted: id=1 channel=0
mptsas: ioc0: TaskMgmt request (mf=ffff880077a89900)
mptsas: ioc0: TaskMgmt type=3 (sas device delete) fw_channel = 0 fw_id = 1)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptsas: ioc0: TaskMgmt completed: (mf = ffff880077a89900, mr = ffff880077a800a0)
mptsas: ioc0: 	TaskMgmt completed: fw_channel = 0, fw_id = 1,
	task_type = 0x03, iocstatus = 0x0000 loginfo = 0x00000000,
	response_code = 0x00, term_cmnds = 0
mptsas: ioc0: TaskMgmt: completed (0 seconds)
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880012c2ba40)
INFO: task mpt/0:347 blocked for more than 120 seconds.
"echo 0/0xe0
 [<ffffffff8107112c>] ? do_setitimer+0x1ac/0x220
 [<ffffffff81180619>] sys_getdents+0x89/0xf0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff880031e1de80)
Comment 34 ilgu hong 2010-12-03 19:05:02 EST
Hi,all.

There is typo at the upper comment.

change "failover" to "failback"

"
<<<<<< =========== mptsas_eh_timed_out called

mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device
removal delay (sc=ffff880031e1de80)
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00

<<<<<< ==========  failover
"


<<<<<< =========== mptsas_eh_timed_out called

mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device
removal delay (sc=ffff880031e1de80)
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00
sd 0:0:81:2: [sdc] CDB: Test Unit Ready: 00 00 00 00 00 00

<<<<<< ==========  failback
Comment 35 ilgu hong 2010-12-03 19:19:24 EST
Created attachment 464688 [details]
failed test serial log for 101.patch with rhel6 64bit

This serial log get with 0x8188 mptbase debug level and 0x124 scsi_mod loggig level.
Comment 36 ilgu hong 2010-12-03 19:20:49 EST
Created attachment 464689 [details]
failed test /var/log/message for RHEL6 64bit  with 101 patch

/var/log/message for RHEL6 64bit host with 101 patch
Comment 37 ilgu hong 2010-12-03 19:28:44 EST
Hi Kashyap-

I see mptsas driver makes kernel panic and go to the mal status with "101 patch". It looks like more unstable than 101 patch.

I attached log files.

see line 1107 at the RHEL6-64-101.log and see line 979 at the var_log_message_rhel6_64_101-patch.txt.



see below:

<<<<<=================== failover

 end_device-0:7:1: mptsas: ioc0: removing ssp device: fw_choval delay (sc=ffff880005446dc0)
scsi 0:0:7:0: [sda] Unhandled error code
sg_remove: sg1
sg_device_destroy: sg1
scsi 0:0:7:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 0:0:7:0: [sda] CDB: Write(10): 2a 00 01 95 c2 60 00 00 10 00
end_request: I/O error, dev sda, sector 26591840
sd 0:0:7:1: [sdb] Synchronizing SCSI cache
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Exmoved or in device removal delay (sc=ffff8800261696c0)
sd 0:0:8:1: alua: port group 01 state A supports touSnA
scsi 0:0:7:0: alua: Detached
sd 0:0:8:0: alua: port group 01 state A supports touSnA
sd 0:0:7:1: alua: Detached
sd 0:0:8:1: alua: port group 01 state A supports touSnA
device-mapper: multipath: Failing path 8:32.

<<<<<==================  mptsas_eh_timed_out is called.

mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)


<<<<<=================== faiback
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98d80)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Added
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98180)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98540)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98f00)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=0 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98600)

<<<<<================ still see mptsas_eh_timed_out is called
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
INFO: task mpt/0:351 blocked for more than 120 seconds.

<<<<================= mptsas makes oops.
"echo 0fffffa01d15bb>] mptsas_expander_delete+0x12b/0x320 [mptsas]
 [<ffffffffa01d4b50>] ? mptsas_firmware_event_work+0x0/0xf9c [mptsas]
 [<ffffffffa01d571e>] mptsas_firmware_event_work+0xbce/0xf9c [mptsas]
 [<ffffffff814c8286>] ? thread_return+0x4e/0x778
 [<ffffffff81091f8e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa01d4b50>] ? mptsas_firmware_event_work+0x0/0xf9c [mptsas]
 [<ffffffff8108c610>] worker_thread+0x170/0x2a0
 [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091936>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
INFO: task mpt/0:351 blocked for more than 120 seconds.
"echo 0nt_fs_time+0x27/0x30
 [<ffffffff8112889d>] ? shmem_xattr_security_get+0x1d/0x30
 [<ffffffff8110e230>] __generic_file_aio_write+0x250/0x480
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811a339c>] blkdev_aio_write+0x3c/0xa0
 [<ffffffff8116c51a>] do_sync_write+0xfa/0x140
 [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120bf4b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ff3b6>] ? security_file_permission+0x16/0x20
 [<ffffffff8116c818>] vfs_write+0xb8/0x1a0
 [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8116d251>] sys_write+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=3: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2800)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Deleted
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2b00)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2d40)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2ec0)
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Deleted: id=1 channel=0
mptsas: ioc0: TaskMgmt request (mf=ffff88007750a200)
mptsas: ioc0: TaskMgmt type=3 (sas device delete) fw_channel = 0 fw_id = 1)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptsas: ioc0: TaskMgmt completed: (mf = ffff88007750a200, mr = ffff880077500c80)
mptsas: ioc0: 	TaskMgmt completed: fw_channel = 0, fw_id = 1,
	task_type = 0x03, iocstatus = 0x0000 loginfo = 0x00000000,
	response_code = 0x00, term_cmnds = 0
mptsas: ioc0: TaskMgmt: completed (0 seconds)
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2bc0)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=3: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880037644ec0)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Added
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2440)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880037644d40)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate 3.0 Gpbs
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2500)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Stop
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=1 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed22c0)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)
mptbase: ioc0: MPT event:(16h) : SAS Discovery: Start
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=2: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2e00)
mptbase: ioc0: MPT event:(1Bh) : Expander Status Change: Deleted
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880077ed2140)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=6: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800599cc080)
mptbase: ioc0: MPT event:(12h) : SAS PHY Link Status: Phy=13: Rate Unknown
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff8800599cc380)
mptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Deleted: id=0 channel=0
Comment 38 kashyap 2010-12-06 07:43:58 EST
ilgu hong, 

Your attached logs is having all prints which we wanted to debug so I will say logs are fine.
Now, When I analyzed both the logs for .100 and .101, symptoms of the issue is same. (though you have reported .101 is having kernel crash, it is not actually kernel crash, but mpt/0 is hang for more than 120 seconds)

1. Very first thing to mention here is, we have not hit the "SAME" issue this time as reported earlier that "Device goes OFFLINE"
   * I have not seen any single Task abort (Error handling was never kicked off).
It means whatever patches I have give (.100 and .101) were only active only if there is a Error handling kicked off.

To me it looks like even without the patch, also you may hit this new issue [ mpt/0 hang for more than 120 second ]

NOTE : Once mpt/0 is hanged we will not see any further device ADD/REMOVE to OS.

2. Why "mpt/0" Hangs ? I have very basic pointer on this.
    Once driver receive "Expander deleted" event as part of failover, It will remove all three associate LUN and later "expander" will be removed.. _BUT_ somehow expander_delete is not able to remove device and it hangs (see below stack dump)

---
ptbase: ioc0: MPT event:(0Fh) : SAS Device Status Change: Added: id=0 channel=0
mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880075f98600)
mptsas: ioc0: WARNING - : mptsas_eh_timed_out: target removed or in device removal delay (sc=ffff8800261696c0)

INFO: task mpt/0:351 blocked for more than 120 seconds.

"echo 0fffffa01d15bb>] mptsas_expander_delete+0x12b/0x320 [mptsas]
[<ffffffffa01d4b50>] ? mptsas_firmware_event_work+0x0/0xf9c [mptsas]
[<ffffffffa01d571e>] mptsas_firmware_event_work+0xbce/0xf9c [mptsas]
[<ffffffff814c8286>] ? thread_return+0x4e/0x778
[<ffffffff81091f8e>] ? prepare_to_wait+0x4e/0x80

----


I would suggest, if we can reproduce original issue without any of the (.100/.101) patch, and later we can verify if that issue goes away after applying .100/.101 patch, we can resolve it and as I have mentioned to target this issue(mpt/0 hangs), we can open another bugzilla and track it over there.

What is you suggestion ?

Thanks, Kashyap
Comment 39 Tom Coughlan 2010-12-21 14:44:56 EST
(In reply to comment #38)

> What is you suggestion ?

ilgu?
Comment 40 ilgu hong 2010-12-21 15:15:07 EST
I'm OK with kashyap's suggestion, but I wonder whether LSI/RedHat knows the root cause of previous problem.

Whatever, I have no right to decide closing/opening BT in RHEL.

Please, ask Joe.


HI, Joe.
What do you think about Kashyap's suggestion.

thanks.
Comment 41 joseph.r.gruher 2010-12-21 15:43:44 EST
If mpt/0 hang is a separate issue, it makes sense to track it in a separate bug.

As far as this bug, I don't think we've verified .100 or .101 fix the problem, so this needs to remain open.

Are these patches based on the 4.x branch?  Perhaps we would have more luck with patches against the 3.x branch (where the problem was originally observed).
Comment 42 RHEL Product and Program Management 2011-01-06 23:07:18 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 43 Tomas Henzl 2011-01-20 07:50:18 EST
Ilgu,
it look to so now that with the attachment 463699 [details] applied the very original issue is gone (comment #24). this without the .100 and .101 patch. Please correct me if I'm wrong.

For the other issue reported in comment #24 the mpt/0 hang, where we don't have a solution yet, please open a new bugzilla. You can use 'clone this bug'. I can do that, if you agree with my conclusion.
Comment 44 ilgu hong 2011-01-20 13:23:40 EST
Hi, Josep.

What do you think about upper comment?

Thansk.
Comment 45 joseph.r.gruher 2011-01-21 15:26:21 EST
That seems OK.  We can close this based on the patch and open a new bug to chase the mpt/0 hang we see with the patch applied.  Tomas, I'll take you up on your offer to clone the bug, if you don't mind doing that for us.

Thanks.
Comment 46 Tomas Henzl 2011-01-24 11:17:26 EST
(In reply to comment #45)
> That seems OK.  We can close this based on the patch and open a new bug to
> chase the mpt/0 hang we see with the patch applied.  Tomas, I'll take you up on
> your offer to clone the bug, if you don't mind doing that for us.

I'm going to do that, but there are some problems with the bugzilla settings right now. I've to ask for help an administrator, please be patient.
Comment 48 Tomas Henzl 2011-01-24 12:11:53 EST
(In reply to comment #45)
> That seems OK.  We can close this based on the patch and open a new bug to
> chase the mpt/0 hang we see with the patch applied.  Tomas, I'll take you up on
> your offer to clone the bug, if you don't mind doing that for us.
It's a little bit better, when the real reporter does this - I'm the reporter now in the new bug, but I hope that it is not so important.

(In reply to comment #46)
> I'm going to do that, but there are some problems with the bugzilla settings
> right now. I've to ask for help an administrator, please be patient.

Bugzilla now works for me - thanks - new bug is created - bz#672278
Comment 49 RHEL Product and Program Management 2011-02-01 00:40:09 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 50 RHEL Product and Program Management 2011-02-01 13:52:20 EST
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.
Comment 51 Tomas Henzl 2011-02-03 08:11:45 EST
The original issue is solved with the inDMD patch, this is planned for next release.
I'm closing this one now.

*** This bug has been marked as a duplicate of bug 642618 ***

Note You need to log in before you can comment on or make changes to this bug.