Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 466903

Summary:

mailbox timeouts observed on ds4700 luns.

Product:

Red Hat Enterprise MRG

Reporter:

IBM Bug Proxy <bugproxy>

Component:

realtime-kernel

Assignee:

Red Hat Real Time Maintenance <rt-maint>

Status:

CLOSED NOTABUG

QA Contact:

Severity:

high

Docs Contact:

Priority:

medium

Version:

1.0

CC:

bhu, rlary, williams

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-10-06 22:14:32 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
first pass at merging mikek's patch against current R2	none
Mike Kravitz' -rt patch for the R1-SR3 QLogic HBA driver	none
Patch to disable MSI	none

Description IBM Bug Proxy 2008-10-14 13:00:43 UTC

=Comment: #0=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Problem description:

Performance is widely changing with the #of blast threads and lun association
with controllers. 

Hardware Environment
   LS 21
   ds4k disks. 
   san boot

Blast completely successfully with no errors in the messages file.  

*===============================================================*
BLAST Ended on /SAN/test8 RC = 0 at  09/23/2008 12:01:09
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 38 minutes and 57 seconds
*===============================================================*
BLAST Ended on /SAN/test4 RC = 0 at  09/23/2008 12:01:59
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 39 minutes and 47 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/23/2008 12:55:55
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 33 minutes and 43 seconds
*===============================================================*
BLAST Ended on /SAN/test5 RC = 0 at  09/23/2008 13:34:09
*=== Run statistics ============================================*
| Elapsed time =  0 days, 2 hours, 11 minutes and 57 seconds
*===============================================================*

As you can see  /SAN/test7 and /SAN/test5 took much longer than the other two.
=Comment: #1=================================================
Venkateswarara Jujjuri <jvrao.com> - 
We are observing huge variation in the throughput.

Initial setup:

LUNs to the system are distributed among the two controllers
Even lun numbers preferred path through Controller A.
Odd lun numbers preferred path through controller B. 

Observation:
When the tests are run on the LUNs associated to one controller (preferred path
set to either controller A/B)
results are uniform. But when we start tests on set of LUNs that are not
associated with the same controller,
finishing time varies by huge margin.

Same IO test(blast) ran in the following scenarios:

Case-1:
Blast on /SAN/test5   finished in 20min

Case-2:
Started Blast on /SAN/test5 and /SAN/test7  
Both finished roughly in 30min

Case-3:
Started blast on /SAN/test4 /SAN/test5  /SAN/test7 and /SAN/test8

/SAN/test4 and /SAN/test5 finished in around 23 mins
/SAN/test7  took 1hour 7 mins
and 
/SAN/test5 took 1hour and 58mins
BLAST Ended on /SAN/test4 RC = 0 at  09/23/2008 16:42:20
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 22 minutes and 48 seconds
*===============================================================*
BLAST Ended on /SAN/test8 RC = 0 at  09/23/2008 16:42:35
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 23 minutes and 3 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/23/2008 17:27:30
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 7 minutes and 58 seconds
*===============================================================*
BLAST Ended on /SAN/test5 RC = 0 at  09/23/2008 18:17:49
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 58 minutes and 17 seconds
*===============================================================*

Case-4:
Same as Case-4 But I have moved the preferred path to Controller A for all Luns.
This took: roughly 1 hour 3 mins on all partitions.
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 1 minutes and 3 seconds
*===============================================================*
BLAST Ended on /SAN/test8 RC = 0 at  09/23/2008 20:48:28
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 3 minutes and 23 seconds
*===============================================================*
BLAST Ended on /SAN/test5 RC = 0 at  09/23/2008 20:48:47
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 3 minutes and 42 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/23/2008 20:50:20
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 5 minutes and 15 seconds
*===============================================================*
********************************************************

One would assume that the distributed configuration should have better results
than ..assigning all of them to 
same path. Any idea on why we are seeing this kind of behavior?

I have started tests by moving all LUNS to controller B and things look crazy.

*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 24 minutes and 20 seconds
*===============================================================*
BLAST Ended on /SAN/test4 RC = 0 at  09/23/2008 22:40:00
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 25 minutes and 28 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/23/2008 23:10:32
*=== Run statistics ============================================*
| Elapsed time =  0 days, 1 hours, 56 minutes and 0 seconds
*===============================================================*
BLAST Ended on /SAN/test8 RC = 0 at  09/24/2008 00:02:25
*=== Run statistics ============================================*
| Elapsed time =  0 days, 2 hours, 47 minutes and 53 seconds
*===============================================================*
=Comment: #3=================================================
Venkateswarara Jujjuri <jvrao.com> - 
DS4k is seeing various h/w issues. Path failover and page write error when no IO
is going on.

This morning observed lost enclosure path

Date/Time: 9/24/08 11:40:39 AM
Sequence number: 568
Event type: 282B
Description: Drive enclosure path redundancy lost
Event specific codes: 0/0/0
Event category: Internal
Component type: Enclosure
Component location: Enclosure 85
Logged by: Controller in slot B

May be all the issues are because of flaky controllers in ds4k. Calling service.
=Comment: #4=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Called 1800-IBM-SERV and opened a ticket # 33PLTK2
=Comment: #5=================================================
Venkateswarara Jujjuri <jvrao.com> - 
IBM-SERV escalated the issue to L3. Here is the ST conversation:

Brian H. Jackson: ah, ok, appears controllers had a panic too and still seeing
issues with host pathing problems too
4:56:04 PM: Brian H. Jackson: would like to send this to Product Engineering for
review, would tomorrow be ok for follow up
4:59:19 PM: Venkateswarara Jujjuri: that is absolutely fine
4:59:44 PM: Venkateswarara Jujjuri: will I receive the approx time..when the
person will be onsite?
5:00:22 PM: Brian H. Jackson: no, not sending an ssr onsite just yet, sending to
our level 3 support for further review, because of controller panics and caching
issues
5:00:51 PM: Brian H. Jackson: need them to review the case to for an action
plan, which may or may not include an ssr going on site
5:01:17 PM: Venkateswarara Jujjuri: So will I hear back form L3 tomorrow?
5:01:22 PM: Brian H. Jackson: yes
5:01:52 PM: Venkateswarara Jujjuri: ok. Holding on some release isues on this.
Thanks for helping out. Will wait for the further followup.
5:02:27 PM: Brian H. Jackson: ok, thanks for your patience, will talk to you
tomorrow 
=Comment: #7=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Getting delayed in getting response from L3. So I decided to do the run again.
Now all 4 blast threads appear to behave properly.

BLAST Ended on /SAN/test8 RC = 0 at  09/25/2008 10:51:01 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 18 minutes and 30 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/25/2008 10:52:00 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 19 minutes and 29 seconds
*===============================================================*
BLAST Ended on /SAN/test5 RC = 0 at  09/25/2008 10:52:42 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 20 minutes and 11 seconds
*===============================================================*
BLAST Ended on /SAN/test4 RC = 0 at  09/25/2008 10:52:50 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 20 minutes and 19 seconds
*===============================================================*

This is after the controller reboot (offline/online)
and elm3c29 reboot.

=Comment: #8=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Here is the next run
*****************************************************
BLAST Secondary Server running at 09/25/2008 13:27:42
BLAST server listening on port #39064
BLAST Configuration file blast.cfg being used.
BLAST does not support STAF on this platform.
  STAF Will not be used.
BLAST version  7.05.2979 Started in batch mode - Ctl-C to terminate
Test Loop 1 on device name /SAN/test7 at 09/25/2008 13:27:43
   Loop Seed ... 0x00008FF3
Test Loop 1 on device name /SAN/test4 at 09/25/2008 13:27:43
   Loop Seed ... 0x00008F1B
Test Loop 1 on device name /SAN/test5 at 09/25/2008 13:27:43
   Loop Seed ... 0x00008F77
Test Loop 1 on device name /SAN/test8 at 09/25/2008 13:27:43
   Loop Seed ... 0x0000DF51
BLAST Ended on /SAN/test8 RC = 0 at  09/25/2008 13:44:42 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 17 minutes and 0 seconds
*===============================================================*
BLAST Ended on /SAN/test7 RC = 0 at  09/25/2008 13:46:15 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 18 minutes and 33 seconds
*===============================================================*
BLAST Ended on /SAN/test5 RC = 0 at  09/25/2008 13:46:15 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 18 minutes and 33 seconds
*===============================================================*
BLAST Ended on /SAN/test4 RC = 0 at  09/25/2008 13:46:20 
*=== Run statistics ============================================*
| Elapsed time =  0 days, 0 hours, 18 minutes and 38 seconds
*===============================================================*
********************************************************
** Terminating BLAST on last test exit                **
********************************************************

=Comment: #9=================================================
Venkateswarara Jujjuri <jvrao.com> - 
After the reboot; No performance issues were observed.
But occasional IO error popping out.
This is very strange....on two counts.

1. multipathd is detecting some issue with a path and initiating path failover.
   This is strange because no errors were injected..and we SHOULD NOT see any 
   path failovers.

2. When the path is tried to failover, the mode select to achieve that also 
   failing. This indicates that there is some problem with ds4k.

3. Absolutely NO errors were logged to DS4k. Indicating that life is normal from
   its perspective.

4. My performance monitor which is running on DS4k storage manager aborted
saying that the device got reset.

Points 2 and 4 above indicate that the problem is with ds4k box. But no solid
proof yet.

I have collected all data and sent to the service person. L3 person from ds4k is
looking into the issue.

I have removed all the multipath stuff and started blast runs directly on the sd
devices. I will keep the defect posted with the latest.

Extract from /var/log/messages:

Sep 25 14:05:46 elm3c29 kernel: sd 0:0:0:4: [sde] Result: hostbyte=DID_BUS_BUSY
driverbyte=DRIVER_OK,SUGGEST_OK
Sep 25 14:05:46 elm3c29 kernel: end_request: I/O error, dev sde, sector 5540408  
Sep 25 14:05:46 elm3c29 kernel: device-mapper: multipath: Failing path 8:64. <<
Path failed.
Sep 25 14:05:46 elm3c29 multipathd: dm-13: add map (uevent)
Sep 25 14:05:46 elm3c29 multipathd: dm-13: devmap already registered
Sep 25 14:05:46 elm3c29 multipathd: 8:64: mark as failed
Sep 25 14:05:46 elm3c29 multipathd: mpath4: remaining active paths: 1 <<
detected that one more path is available for the failover.
Sep 25 14:05:46 elm3c29 kernel: sd 1:0:0:4: queueing MODE_SELECT command.<< Sent
mode select to change over the path.
Sep 25 14:05:46 elm3c29 kernel: sd 1:0:0:4: MODE_SELECT failed with sense 0x63f03.
06 -> Unit Attention
3F03 ->  INQUIRY DATA HAS CHANGED

We saw similar issue on 24th night. except that sense data is different
Sep 24 04:02:17 elm3c29 kernel: sd 0:0:0:2: MODE_SELECT failed with sense 0x62904.

2904->  DEVICE INTERNAL RESET
=Comment: #10=================================================
Venkateswarara Jujjuri <jvrao.com> - 
L3 came back and said; if we are not running LSI RDAC we are on our own.

collinsb.com - Bill Collins/Raleigh/IBM: if you are not loading the
engenio RDAC driver then you are on your own.

Moreover they suggest us to go through RPQ process to get the support for our
new driver. If IBM is not supporting it how can our customers use it??
=Comment: #11=================================================
Venkateswarara Jujjuri <jvrao.com> - 
This wk-end I ran the blast on Vanilla. It ran fine for a day. and saw an error..hosed up the
machine. More investigation reveled that .. SAN switch got reset and ports went offline.
But the interesting part of it is.. This is the same issue we have seen with -rt kernel without
scsi_dh_rdac driver. But with -rt kernel, somehow (??) san switch ports did not go offline..and
machine recovered.  Given the similarity of both these errors..I question if the problem caused
while running on -rt kernel also some how related to san switch???

Here is the error log:

On Vanilla kernel:
----------------
Sep 28 04:03:39 elm3c29 syslogd 1.4.1: restart.
Sep 28 07:59:11 elm3c29 kernel: sd 0:0:0:7: [sdk] Result: hostbyte=DID_BUS_BUSY
driverbyte=DRIVER_OK,SUGGEST_OK
Sep 28 07:59:11 elm3c29 kernel: end_request: I/O error, dev sdk, sector 18246176
Sep 28 07:59:11 elm3c29 kernel: device-mapper: multipath: Failing path 8:160.
Sep 28 07:59:11 elm3c29 kernel: device-mapper: multipath rdac: queueing MODE_SELECT command on 65:16
Sep 28 07:59:20 elm3c29 multipathd: mpath1: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath2: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath3: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath4: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath5: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath6: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath7: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath8: event checker started
Sep 28 07:59:20 elm3c29 multipathd: mpath9: event checker started
Sep 28 07:59:20 elm3c29 multipathd: path checkers start up
Sep 28 07:59:20 elm3c29 multipathd: 65:16: reinstated
Sep 28 07:59:20 elm3c29 multipathd: mpath7: switch to path group #1
Sep 28 07:59:20 elm3c29 multipathd: mpath7: switch to path group #1
Sep 28 07:59:25 elm3c29 multipathd: 8:160: mark as failed
Sep 28 07:59:25 elm3c29 kernel: device-mapper: multipath rdac: queueing MODE_SELECT command on 8:160
Sep 28 07:59:26 elm3c29 multipathd: mpath7: remaining active paths: 1
Sep 28 07:59:26 elm3c29 multipathd: dm-15: add map (uevent)
Sep 28 07:59:27 elm3c29 multipathd: dm-15: devmap already registered
Sep 28 07:59:27 elm3c29 multipathd: 8:160: reinstated
Sep 28 07:59:28 elm3c29 multipathd: mpath7: remaining active paths: 2
Sep 28 07:59:28 elm3c29 multipathd: mpath7: switch to path group #1
Sep 28 07:59:30 elm3c29 last message repeated 2 times
Sep 28 07:59:31 elm3c29 multipathd: dm-15: add map (uevent)
Sep 28 07:59:39 elm3c29 multipathd: dm-15: devmap already registered
Sep 28 07:59:39 elm3c29 multipathd: 8:160: reinstated
Sep 28 07:59:39 elm3c29 multipathd: 65:16: reinstated
Sep 28 21:05:42 elm3c29 kernel: usb 2-2: USB disconnect, address 2
Sep 28 21:06:17 elm3c29 kernel: usb 2-2: new full speed USB device using ohci_hcd and address 3
Sep 28 21:06:17 elm3c29 kernel: usb 2-2: configuration #1 chosen from 1 choice
--------------
---------------
----------

on -rt kernel:
---------------------
Sep 25 14:05:46 elm3c29 kernel: sd 0:0:0:4: [sde] Result: hostbyte=DID_BUS_BUSY
driverbyte=DRIVER_OK,SUGGEST_OK
Sep 25 14:05:46 elm3c29 kernel: end_request: I/O error, dev sde, sector 5540408
Sep 25 14:05:46 elm3c29 kernel: device-mapper: multipath: Failing path 8:64.
Sep 25 14:05:46 elm3c29 multipathd: dm-13: add map (uevent)
Sep 25 14:05:46 elm3c29 multipathd: dm-13: devmap already registered
Sep 25 14:05:46 elm3c29 multipathd: 8:64: mark as failed
Sep 25 14:05:46 elm3c29 multipathd: mpath4: remaining active paths: 1
Sep 25 14:05:46 elm3c29 kernel: sd 1:0:0:4: queueing MODE_SELECT command.
Sep 25 14:05:46 elm3c29 kernel: sd 1:0:0:4: MODE_SELECT failed with sense 0x63f03.
Sep 25 14:05:46 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540408
Sep 25 14:05:46 elm3c29 kernel: device-mapper: multipath: Failing path 8:224.
Sep 25 14:05:46 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540416
Sep 25 14:05:46 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540424
Sep 25 14:05:46 elm3c29 multipathd: dm-13: add map (uevent)
Sep 25 14:05:50 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540432
Sep 25 14:05:51 elm3c29 multipathd: dm-13: devmap already registered
Sep 25 14:05:53 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540440
Sep 25 14:05:54 elm3c29 multipathd: 8:224: mark as failed
Sep 25 14:05:55 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540448
Sep 25 14:05:56 elm3c29 multipathd: mpath4: remaining active paths: 0
Sep 25 14:05:58 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540456
Sep 25 14:05:58 elm3c29 multipathd: 8:64: reinstated
Sep 25 14:06:00 elm3c29 kernel: end_request: I/O error, dev sdo, sector 5540464
Sep 25 14:06:03 elm3c29 multipathd: mpath4: remaining active paths: 1
Sep 25 14:06:06 elm3c29 kernel: printk: 502 messages suppressed.
Sep 25 14:06:06 elm3c29 multipathd: 8:224: reinstated
Sep 25 14:06:07 elm3c29 kernel: Buffer I/O error on device dm-13, logical block 711311
Sep 25 14:06:08 elm3c29 multipathd: mpath4: remaining active paths: 2
Sep 25 14:06:09 elm3c29 kernel: lost page write due to I/O error on dm-13
Sep 25 14:06:11 elm3c29 multipathd: dm-13: add map (uevent)
Sep 25 14:06:14 elm3c29 kernel: Buffer I/O error on device dm-13, logical block 711396
Sep 25 14:06:15 elm3c29 multipathd: dm-13: devmap already registered
Sep 25 14:06:17 elm3c29 kernel: lost page write due to I/O error on dm-13
Sep 25 14:06:20 elm3c29 multipathd: dm-13: add map (uevent)
Sep 25 14:06:23 elm3c29 kernel: Buffer I/O error on device dm-13, logical block 711481
Sep 25 14:06:26 elm3c29 multipathd: dm-13: devmap already registered
Sep 25 14:06:26 elm3c29 kernel: lost page write due to I/O error on dm-13
...
..
Sep 25 21:56:14 elm3c29 kernel: usb 1-4.1.1: USB disconnect, address 6
Sep 25 21:56:14 elm3c29 kernel: hub 1-4.1:1.0: hub_port_status failed (err = -71)
Sep 25 21:56:14 elm3c29 kernel: printk: 573 messages suppressed.
Sep 25 21:56:14 elm3c29 kernel: hub 1-4.1:1.0: connect-debounce failed, port 1 disabled
Sep 25 21:56:14 elm3c29 kernel: hub 1-4.1:1.0: cannot disable port 1 (err = -71)
Sep 25 21:56:14 elm3c29 kernel: usb 1-4.1: USB disconnect, address 3
Sep 25 21:56:14 elm3c29 kernel: usb 1-4.1.2: USB disconnect, address 4
Sep 25 21:56:14 elm3c29 kernel: usb 1-4.1.2.3: USB disconnect, address 5
Sep 25 22:13:10 elm3c29 kernel: qla2xxx 0000:03:05.1: scsi(1:0:8): Abort command issued -- 1 2ae1b8
2002.

=Comment: #12=================================================
Venkateswarara Jujjuri <jvrao.com> - 
This defect is tracking 3 issues in parallel.
1. Performance 
2. Mailbox timeout.
3. Mode Select failure when an attempt is made to switch the path.

Investigation showed that #1 is not an issue. #3 may have fixed. So I will continue to focus on #2.

Why do I think # 1 is not an issue?

I have observed performance in various scenarios. 
We can see performance variations under following legitimate scenarios.

1. DS4700 is a shared disk subsystem...with two controllers. The logical disks
   carved on it are assigned to one of the controllers as its active/main path. 
   There could be many hosts that are connected to DS4700 and have LUNS assigned.
    All the hosts has to go through one of these controllers. If two hosts are running
    tests at the same time.. we have observed throughput difference from the host's view.

    In one specific case Blast on LUNs assigned to  controller-A finishing much faster 
   compared to LUNs assigned to controller-B. Investigation reveled that another
   Host is running tests on LUNS assigned to controller-B. This explains why the blast on 
   LUNs associated with controller-B are slow.

2. When no other machine in BC is using ds4k. I have not see any adverse performance 
    difference.

3. Blast cleans the test dir before starting to write/read. So, depending on the initial
    state of the test dir; for the first loop; some may take longer time..as it could have spent
    more time in deleting the content before starting to write.

What makes me think tha #3 might have got fixed?

On Thomas Phelan's advise we have changed the host type to LINUX and disabled AVT.
After this change, I have seen mailbox timeouts .. but did not see any mode select failure.

I will keep monitoring for this though.

Now the focus is on #2 (mailbox timeouts).
- As per chandra, scsi_dh_rdac does not sense or initiate path fail overs. Hence I am not focusing
on changes introduced by replacing dm_rdac with scsi_dh_rdac.

- Now we are looking closely at the at the adapter..may be f/w version??. 

=Comment: #13=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Ok; Looks like we have more than just a timeout problem. IO Failed... I will be
installing on different blade to try out....to avoid possible h/w problem.

Sep 30 22:55:15 elm3c29 kernel: sd 0:0:0:4: [sde] Result: hostbyte=DID_BUS_BUSY
driverbyte=DRIVER_OK,SUGGEST_OK
Sep 30 22:55:15 elm3c29 kernel: end_request: I/O error, dev sde, sector 2551152
Sep 30 22:55:15 elm3c29 kernel: device-mapper: multipath: Failing path 8:64.
Sep 30 22:55:15 elm3c29 multipathd: dm-7: add map (uevent)
Sep 30 22:55:15 elm3c29 multipathd: dm-7: devmap already registered
Sep 30 22:55:15 elm3c29 multipathd: 8:64: mark as failed
Sep 30 22:55:15 elm3c29 multipathd: mpath4: remaining active paths: 1
Sep 30 22:55:15 elm3c29 kernel: sd 1:0:0:4: queueing MODE_SELECT command.
Sep 30 22:55:15 elm3c29 kernel: sd 1:0:0:4: MODE_SELECT failed with sense
0x62904.
Sep 30 22:55:15 elm3c29 kernel: end_request: I/O error, dev sdo, sector 2551152
Sep 30 22:55:15 elm3c29 kernel: device-mapper: multipath: Failing path 8:224.
Sep 30 22:55:15 elm3c29 multipathd: dm-7: add map (uevent)
Sep 30 22:55:15 elm3c29 kernel: printk: 115 messages suppressed.
Sep 30 22:55:16 elm3c29 multipathd: dm-7: devmap already registered
Sep 30 22:55:20 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774523
Sep 30 22:55:21 elm3c29 multipathd: 8:224: mark as failed
Sep 30 22:55:21 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:23 elm3c29 multipathd: mpath4: remaining active paths: 0
Sep 30 22:55:24 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774524
Sep 30 22:55:24 elm3c29 multipathd: 8:64: reinstated
Sep 30 22:55:25 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:25 elm3c29 multipathd: mpath4: remaining active paths: 1
Sep 30 22:55:26 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774525
Sep 30 22:55:27 elm3c29 multipathd: dm-7: add map (uevent)
Sep 30 22:55:28 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:29 elm3c29 multipathd: dm-7: devmap already registered
Sep 30 22:55:30 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774526
Sep 30 22:55:31 elm3c29 multipathd: 8:224: reinstated
Sep 30 22:55:32 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:33 elm3c29 multipathd: mpath4: remaining active paths: 2
Sep 30 22:55:33 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774527
Sep 30 22:55:35 elm3c29 multipathd: dm-7: add map (uevent)
Sep 30 22:55:37 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:38 elm3c29 multipathd: dm-7: devmap already registered
Sep 30 22:55:39 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774528
Sep 30 22:55:41 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:41 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774529
Sep 30 22:55:41 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:42 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774530
Sep 30 22:55:42 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:43 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774531
Sep 30 22:55:43 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:43 elm3c29 kernel: Buffer I/O error on device dm-7, logical block
774532
Sep 30 22:55:43 elm3c29 kernel: lost page write due to I/O error on dm-7
Sep 30 22:55:44 elm3c29 kernel: Aborting journal on device dm-7.
Sep 30 22:55:45 elm3c29 kernel: ext3_abort called.
Sep 30 22:55:46 elm3c29 kernel: EXT3-fs error (device dm-7):
ext3_journal_start_sb: Detected aborted journal
Sep 30 22:55:46 elm3c29 kernel: Remounting filesystem read-only

====

jvrao wrote:
> I took out the #if flags and recompiled the kernel.  I got three timeouts so far... Looks like
once they are started.. they appear to come
> relatively quickly. Coincidental?? But all 3 timeouts are on command  0x54
>
> #define MBC_IOCB_COMMAND_A64            0x54    /* Execute IOCB command (64) */
>
> And twice on controller-1 and once on controller-0.   Here are the logs.
>
> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): **** MB Command Timeout for cmd 54 ****
> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): icontrol=8 jiffies=1011d2c7a
> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): *** mailbox[0] = 0x4000 ***
> Sep 30 21:06:54 elm3c29 kernel: Mailbox registers:
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 0 0x4000
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 1 0x0081
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 2 0x4dc6
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 3 0x9200
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 4 0x9400
> Sep 30 21:06:54 elm3c29 kernel: scsi(1): mbox 5 0x0000
> Sep 30 21:06:54 elm3c29 kernel: qla2xxx 0000:03:05.1: Mailbox command timeout occured. Issuing ISP
abort.
> Sep 30 21:06:54 elm3c29 kernel: qla2xxx 0000:03:05.1: Performing ISP error recovery - ha=
ffff81014ecd8610.
> Sep 30 21:06:54 elm3c29 kernel: qla2xxx 0000:03:05.1: LOOP UP detected (4 Gbps).
> Sep 30 21:06:55 elm3c29 kernel: qla2xxx 0000:03:05.1: scsi(1:0:1): Abort command issued -- 0
1d9eb4 2002.
> Sep 30 21:06:55 elm3c29 kernel: device-mapper: multipath: Failing path 8:160.
> Sep 30 21:06:55 elm3c29 multipathd: mpath1: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath2: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath3: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath4: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath5: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath6: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath7: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath8: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: mpath9: event checker started
> Sep 30 21:06:55 elm3c29 multipathd: path checkers start up
> Sep 30 21:06:55 elm3c29 multipathd: sdk: rdac checker reports path is down
> Sep 30 21:06:55 elm3c29 multipathd: checker failed path 8:160 in map mpath1
> Sep 30 21:06:55 elm3c29 multipathd: mpath1: remaining active paths: 1
> Sep 30 21:06:55 elm3c29 multipathd: dm-10: add map (uevent)
> Sep 30 21:06:57 elm3c29 multipathd: dm-10: devmap already registered
> Sep 30 21:07:00 elm3c29 multipathd: sdk: rdac checker reports path is down
> Sep 30 21:07:00 elm3c29 multipathd: 8:160: reinstated
> Sep 30 21:07:00 elm3c29 multipathd: mpath1: remaining active paths: 2
> Sep 30 21:07:00 elm3c29 multipathd: dm-10: add map (uevent)
> Sep 30 21:07:00 elm3c29 multipathd: dm-10: devmap already registered
>
>
> Sep 30 21:48:19 elm3c29 kernel: qla2x00_mailbox_command(1): **** MB Command Timeout for cmd 54 ****
> Sep 30 21:48:19 elm3c29 kernel: qla2x00_mailbox_command(1): icontrol=8 jiffies=101431787
> Sep 30 21:48:19 elm3c29 kernel: qla2x00_mailbox_command(1): *** mailbox[0] = 0x4000 ***
> Sep 30 21:48:19 elm3c29 kernel: Mailbox registers:
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 0 0x4000
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 1 0x0004
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 2 0x0000
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 3 0x4dd5
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 4 0x0001
> Sep 30 21:48:19 elm3c29 kernel: scsi(1): mbox 5 0x0000
> Sep 30 21:48:19 elm3c29 kernel: qla2xxx 0000:03:05.1: Mailbox command timeout occured. Issuing ISP
abort.
> Sep 30 21:48:19 elm3c29 kernel: qla2xxx 0000:03:05.1: Performing ISP error recovery - ha=
ffff81014ecd8610.
> Sep 30 21:48:19 elm3c29 kernel: qla2xxx 0000:03:05.1: LOOP UP detected (4 Gbps).
> Sep 30 21:48:20 elm3c29 kernel: qla2xxx 0000:03:05.1: scsi(1:0:0): Abort command issued -- 0
238a33 2002.
> Sep 30 21:48:20 elm3c29 multipathd: sdh: rdac checker reports path is down
> Sep 30 21:48:22 elm3c29 kernel: device-mapper: multipath: Failing path 8:112.
> Sep 30 21:48:25 elm3c29 multipathd: checker failed path 8:112 in map mpath0
> Sep 30 21:48:32 elm3c29 multipathd: mpath0: remaining active paths: 1
> Sep 30 21:48:36 elm3c29 multipathd: dm-0: add map (uevent)
> Sep 30 21:48:38 elm3c29 multipathd: dm-0: devmap already registered
> Sep 30 21:48:39 elm3c29 multipathd: sdh: rdac checker reports path is down
> Sep 30 21:48:40 elm3c29 multipathd: 8:112: reinstated
> Sep 30 21:48:42 elm3c29 multipathd: mpath0: remaining active paths: 2
> Sep 30 21:48:43 elm3c29 multipathd: dm-0: add map (uevent)
> Sep 30 21:48:43 elm3c29 multipathd: dm-0: devmap already registered
>
>
> Sep 30 21:59:00 elm3c29 kernel: qla2x00_mailbox_command(0): **** MB Command Timeout for cmd 54 ****
> Sep 30 21:59:00 elm3c29 kernel: qla2x00_mailbox_command(0): icontrol=8 jiffies=1014ce255
> Sep 30 21:59:00 elm3c29 kernel: qla2x00_mailbox_command(0): *** mailbox[0] = 0x4000 ***
> Sep 30 21:59:00 elm3c29 kernel: Mailbox registers:
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 0 0x4000
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 1 0x0081
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 2 0x4e5f
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 3 0x8200
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 4 0x9400
> Sep 30 21:59:00 elm3c29 kernel: scsi(0): mbox 5 0x0000
> Sep 30 21:59:00 elm3c29 kernel: qla2xxx 0000:03:05.0: Mailbox command timeout occured. Issuing ISP
abort.
> Sep 30 21:59:00 elm3c29 kernel: qla2xxx 0000:03:05.0: Performing ISP error recovery - ha=
ffff81014ecdc610.
> Sep 30 21:59:01 elm3c29 kernel: qla2xxx 0000:03:05.0: LOOP UP detected (4 Gbps).
> Sep 30 21:59:01 elm3c29 multipathd: sda: rdac checker reports path is down
> Sep 30 21:59:18 elm3c29 kernel: qla2xxx 0000:03:05.0: scsi(0:0:0): Abort command issued -- 0
220eb1 2002.
> Sep 30 22:00:00 elm3c29 multipathd: checker failed path 8:0 in map mpath0
> Sep 30 22:00:01 elm3c29 kernel: device-mapper: multipath: Failing path 8:0.
> Sep 30 22:00:02 elm3c29 multipathd: mpath0: remaining active paths: 1
> Sep 30 22:00:03 elm3c29 kernel: sd 1:0:0:0: queueing MODE_SELECT command.
> Sep 30 22:00:03 elm3c29 multipathd: dm-0: add map (uevent)
> Sep 30 22:00:05 elm3c29 kernel: sd 0:0:0:0: queueing MODE_SELECT command.
> Sep 30 22:00:06 elm3c29 multipathd: dm-0: devmap already registered
> Sep 30 22:00:08 elm3c29 multipathd: sda: rdac checker reports path is down
> Sep 30 22:00:10 elm3c29 multipathd: 8:0: reinstated
> Sep 30 22:00:12 elm3c29 multipathd: mpath0: remaining active paths: 2
> Sep 30 22:00:14 elm3c29 multipathd: mpath0: switch to path group #1
> Sep 30 22:00:15 elm3c29 multipathd: mpath0: switch to path group #1
> Sep 30 22:00:16 elm3c29 multipathd: dm-0: add map (uevent)
> Sep 30 22:00:17 elm3c29 multipathd: dm-0: devmap already registered
> Sep 30 22:00:18 elm3c29 multipathd: sda: rdac checker reports path is down
> Sep 30 22:00:19 elm3c29 multipathd: 8:0: reinstated
>
>
>
>
> mpath1 (3600a0b8000421a900000043048a1db0d) dm-10 IBM,1814      FAStT
> [size=5.0G][features=0][hwhandler=1 rdac]
> \_ round-robin 0 [prio=100][active]
> \_ 0:0:0:1 sdb 8:16  [active][ready]
> \_ round-robin 0 [prio=0][enabled]
> \_ 1:0:0:1 sdk 8:160 [active][ghost]
> mpath0 (3600a0b8000421b14000003824888dfd4) dm-0 IBM,1814      FAStT
> [size=70G][features=0][hwhandler=1 rdac]
> \_ round-robin 0 [prio=100][active]
> \_ 0:0:0:0 sda 8:0   [active][ready]
> \_ round-robin 0 [prio=0][enabled]
> \_ 1:0:0:0 sdh 8:112 [active][ghost]
>
> [root@elm3c29 ~]# df
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/mapper/mpath0p1  19840892  11442388   7374364  61% /
> /dev/mapper/mpath0p4  24011592  18030304   4761556  80% /test
> /dev/mapper/mpath0p3  23653464    176204  22256320   1% /home
> tmpfs                  1978832         0   1978832   0% /dev/shm
> /dev/mapper/mpath2p1     99122     45824     48180  49% /SAN/test2
> /dev/mapper/mpath3p1     99122     58274     35730  62% /SAN/test3
> /dev/mapper/mpath4     5160576   1240188   3658244  26% /SAN/test4
> /dev/mapper/mpath5     5160576   4075048    823384  84% /SAN/test5
> /dev/mapper/mpath6    20642428    176200  19417652   1% /SAN/test6
> /dev/mapper/mpath7    20642428   1274936  18318916   7% /SAN/test7
> /dev/mapper/mpath8    15481840    902092  13793316   7% /SAN/test8
> [root@elm3c29 ~]# dmsetup table
> mpath2: 0 204800 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:32 1000 round-robin 0 1 1 8:192 1000
> mpath1: 0 10485760 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:160 1000
> mpath0: 0 146800640 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:0 1000 round-robin 0 1 1 8:112 1000
> mpath9: 0 20971520 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:176 1000 round-robin 0 1 1 65:48 1000
> mpath8: 0 31457280 multipath 0 1 rdac 2 1 round-robin 0 1 1 65:32 1000 round-robin 0 1 1 8:144 1000
> mpath0p4: 0 48789405 linear 253:0 97996500
> mpath7: 0 41943040 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:128 1000 round-robin 0 1 1 65:16 1000
> mpath0p3: 0 48837600 linear 253:0 49158900
> mpath6: 0 41943040 multipath 0 1 rdac 2 1 round-robin 0 1 1 65:0 1000 round-robin 0 1 1 8:96 1000
> mpath0p2: 0 8193150 linear 253:0 40965750
> mpath5: 0 10485760 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:240 1000 round-robin 0 1 1 8:80 1000
> mpath0p1: 0 40965687 linear 253:0 63
> mpath4: 0 10485760 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:64 1000 round-robin 0 1 1 8:224 1000
> mpath1p1: 0 10477194 linear 253:10 62
> mpath2p1: 0 204750 linear 253:5 50
> mpath3: 0 204800 multipath 0 1 rdac 2 1 round-robin 0 1 1 8:208 1000 round-robin 0 1 1 8:48 1000
> mpath3p1: 0 204750 linear 253:8 50

=Comment: #14=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Two more mailbox timeouts overnight. Both are 0x54.
=Comment: #15=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Worked with Mikeand and Chandra. 
We discovered that the mailbox timeouts are the result of IO/inquiry timeout.
The command sent to the device is timing out; and to abort that adopter is
sending a
mailbox command. In turn the mailbox cmd is timing out...and is showing up in
the log file.

We got the latest RHEL5.3 qla2xxx driver, ported it onto R2-rt kernel. Booted
with this.
This has some issues with the bootup. but never the less it came up. I started
blast..
but even this driver exhibited the same issue.. mailbox timeouts.

Tonight we are running with more debug enabled...to collect additional data to
understand
the reason and patterns of the failure.

Plan for tomorrow:
I will be moving one of our blades to Mikeand/Chandra's blade and connect it to
their ds4700.
This experiment is aimed at avoiding any hardware/connectivity issues.
Tomorrow mostly I will be spending my time in moving/installing the blade and
test on it.
I am planning to move elm3c24.

=Comment: #16=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Moved elm3c24 onto MikeAnd's BC; installed. Having trouble in booting on SAN. Booted on Local disk.
Will start blast on sand disks shortly.

On elm3c29 we observed that every block of failure is started with Inquiry. Inquiry is mostly done
by the multipathd to figure out the health of each path. Just to test; I killed multipathd and
started the tests. Apparently it doesn't have much affect. Now also I see many CDB timeouts followed
by mailbox timeout in an effort to abort it.

=Comment: #17=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Tests failed on elm3c24, which is moved to MikeAnd's BC and is connected to
their DS4k .
This indicates that the problem is not too much tied to the external
hardware...and is related to reatltime OS.

We may have to spend more time on the interrupt mechanism on the -rt OS.

=Comment: #18=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Trying two experimental scenarios this morning:

1. Enforced poling to detect IO completion instead of depending on the interrupt handler.
2. Instrumented to dump the qlogic adapter f/w...when we are about to send out the
    abort mailbox command. This will tell us the view on the adapter.

These two experiments are to find out where the problem is.

If the adapter is receiving the IO completions but somehow driver/ -rt kernel is dropping the
interrupts? or the adapter itself is not receiving IO done? These experiment should shed more light.

=Comment: #19=================================================
John G. Stultz <johnstul.com> - 
Dumb question: is our hardware qla2100, qla2300, or qla2400 hardware? I'm just scanning the code
paths and I don't want to be looking at the wrong set.
=Comment: #20=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #19)
> Dumb question: is our hardware qla2100, qla2300, or qla2400 hardware? I'm just
> scanning the code paths and I don't want to be looking at the wrong set.
> 

qla2400
=Comment: #21=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Collected f/w dump; sent it over for decode.
=Comment: #22=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Interesting things in /var/log/messages:

Here is the sequence:

<JV> 41 IOCBs timed out.. I can see 41 entries in /var/log/messages like the one below:

Oct  6 20:02:05 elm3c24 kernel: sd 0:0:0:1: [sdb] CDB: Write(10): 2a 00 00 23 f3 a0 00 00 48 00
Oct  6 20:02:05 elm3c24 kernel: sd 0:0:0:1: [sdb] Done: TIMEOUT
Oct  6 20:02:05 elm3c24 kernel: sd 0:0:0:1: [sdb] Result: hostbyte=DID_OK
driverbyte=DRIVER_OK,SUGGEST_OK

<JV> Then eh kicked-in and tried to abort each of these 41s. But the mailbox command to abort itself
timed out.

Oct  6 20:02:05 elm3c24 kernel: qla2x00_mailbox_command(0): **** MB Command Timeout for cmd 54 ****
Oct  6 20:02:05 elm3c24 kernel: qla2xxx 0000:03:05.0: Mailbox command timeout occured. Issuing ISP
abort.

<JV> Then I see 41 entries from eh reporting:

Oct  6 20:02:05 elm3c24 kernel: scsi_eh_done scmd: ffff81014d815240 result: 0
Oct  6 20:02:05 elm3c24 kernel: scsi_send_eh_cmnd: scmd: ffff81014d815240, timeleft: 9999
Oct  6 20:02:05 elm3c24 kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002

<JV> 
Where the ret code 2002 means SUCCESS.  So IOCB finished successfully but we(driver/kernel) have not
received an interrupt indicating that? Mike, does this indicate
problem with the interrupt deliver/handler?

=Comment: #23=================================================
Darren V. Hart <dvhltc.com> - 
(In reply to comment #22)

> <JV> 
> Where the ret code 2002 means SUCCESS.  So IOCB finished successfully but
> we(driver/kernel) have not received an interrupt indicating that? Mike, does
> this indicate
> problem with the interrupt deliver/handler?
> 

While this is happening, are there any high usage RT threads running?

=Comment: #24=================================================
Venkateswarara Jujjuri <jvrao.com> - 

> Where the ret code 2002 means SUCCESS.  So IOCB finished successfully but
> we(driver/kernel) have not received an interrupt indicating that? Mike, does
> this indicate
> problem with the interrupt deliver/handler?
> 

This success is for the TUR we sent out as part of eh handler. Nothing to do
with the actual IOCB that timedout.

=Comment: #25=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #23)
> (In reply to comment #22)
> 
> > <JV> 
> > Where the ret code 2002 means SUCCESS.  So IOCB finished successfully but
> > we(driver/kernel) have not received an interrupt indicating that? Mike, does
> > this indicate
> > problem with the interrupt deliver/handler?
> > 
> 
> While this is happening, are there any high usage RT threads running?
> 

Nope. I don't think so. No load on the machine except the blast threads.. which is not running at rt
prio.
=Comment: #26=================================================
John G. Stultz <johnstul.com> - 
So do we have logs for a failure at a sufficient debug level to see whats going on before the
following message is tripped?

> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): **** MB Command Timeout for cmd 54 ****
> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): icontrol=8 jiffies=1011d2c7a
> Sep 30 21:06:54 elm3c29 kernel: qla2x00_mailbox_command(1): *** mailbox[0] = 0x4000 ***

Specifically, what I'm wondering, is in qla2x00_mailbox_command, we use the ha->mbx_intr_sem (a
compat_sem which gave us grief earlier) to block for the mbox completion interrupt.

At that point, either the isr triggers, hopefully sets ha->flags.mbox_int and then up's the
mbx_intr_sem, OR for some reason no interrupt occurs, or we don't follow the right code path. This
would result in the qla2x00_mbx_sem_timeout() function to finally trip and up the mbx_intr_sem,
allowing the qla2x00_mailbox_command function to continue, note the mbox_int value isn't set, and
start the timeout recovery process.

Now, if the second case is what's happening, we should be able to set the debug level high enough
(or just remove the DEBUG11 conditionals) to see if the qla2x00_sem_timeout is actually entered or
if we're going through a different timeout code path.

So do we have any logs that might provide this info? Or do we need a patch to test this theory?
=Comment: #27=================================================
John G. Stultz <johnstul.com> - 
From discussion at lunch, I was working on getting a kernel with the PREEMPT_RT disabled (leaving
IRQ threading enabled) to see if this issue is related to specifically IRQ threading or possibly the
sleeping spinlock changes. Unfortunately editing the config system is more difficult then I recall,
so I've been unable to get it working today. I'll spend more time on this tomorrow to try to
generate a test kernel.
=Comment: #28=================================================
TIMOTHY R. CHAVEZ <tinytim.com> - 
(In reply to comment #27)
> From discussion at lunch, I was working on getting a kernel with the PREEMPT_RT
> disabled (leaving IRQ threading enabled) to see if this issue is related to
> specifically IRQ threading or possibly the sleeping spinlock changes.
> Unfortunately editing the config system is more difficult then I recall, so
> I've been unable to get it working today. I'll spend more time on this tomorrow
> to try to generate a test kernel.
> 

I don't recall seeing this problem with R1-SR3 and I haven't looked at this bug closely enough, but
have you taken a look at the changes Mike Kravitz made to the QLogic driver for R1-SR3?  Maybe you
can get some ideas there?  Remember, Mike had resolved the first problem we had with the mis-cast. 
Perhaps he's caught this problem as well.  I've attached his original patch that applies to the
R1-SR3 QLogic Driver (qla2xxx-8.01.07.15)
=Comment: #29=================================================
TIMOTHY R. CHAVEZ <tinytim.com> - 

Mike Kravitz' -rt patch for the R1-SR3 QLogic HBA driver

=Comment: #30=================================================
John G. Stultz <johnstul.com> - 
Good call Tim!

There are quite a bit more adjustments for compat sems. Might be good to try to set those up in a
similar way. Also the different down_timeout function might directly affect this issue.

I'll try to make the same adjustments to the R2 kernel and we can try to reproduce.

JV: How long does this issue take to reproduce?

=Comment: #31=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #30)

> JV: How long does this issue take to reproduce?
> 
It takes anywhere between 2-10 hours. 

We have 2 boxes. So, John if you have a port of this patch onto R2, we can surely try it on one box.

=Comment: #32=================================================
John G. Stultz <johnstul.com> - 

first pass at merging mikek's patch against current R2

Here's the SVN patch adding most of the changes from MikeK's patch. Def a first run, but builds and
we can use it for testing.
=Comment: #33=================================================
John G. Stultz <johnstul.com> - 
Built kernel with MikeK's patch:
http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-2.6.24.7-75ibmrt2.8mk.x86_64.rpm
http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-devel-2.6.24.7-75ibmrt2.8mk.x86_64.rpm

=Comment: #34=================================================
Michael S. Anderson <andmike.ibm.com> - 
Well it would be good to ensure things are correct for compat sems the mailbox
timeouts are a secondary failure resulting from the abort call for timed out IO
commands. 

Based on debug output from last night showing that there are completions on the
response queue and the interrupt bit appears set in the qla status register I
think it would be a good to determine if using the msi based interrupts is
having an effect. For a test run we could either disable msi for the whole
system using "nomsi" or we could patch the qla driver to not request msi.

=Comment: #35=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #34)
> For a test run we could either disable msi for the whole
> system using "nomsi" or we could patch the qla driver to not request msi.
> 

I will build qla driver by disabling msi and start the tests elm3c24.
=Comment: #36=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #33)
> Built kernel with MikeK's patch:
> http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-2.6.24.7-75ibmrt2.8mk.x86_64.rpm
> http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-devel-2.6.24.7-75ibmrt2.8mk.x86_64.rpm
> 

John, I will start tests on elm3c29 with this kernel. 
Initial analysis doesn't suggest that this patch may fix our problem. But surely worth a try.:)
We never know. :)  Thanks for generating these RPMs.  I will start tests shortly.

Brief analysis the patch:
1. Converting mbx_cmd_sem  to compat_semaphore. 
     Mike's original patch converted *all* semaphores to compat_semaphores. 
    But, even with John's patch we will have one semaphore (vport_sem) left as-is.

   The current problem manifestation starts with IO timeouts followed by mbox timeout.
   and this semaphore (mbox_cmd_sem) has nothing to do with IO timeouts.

2. Changing init_MUTEX_LOCKED(&ha->mbx_intr_sem); to sema_init(&ha->mbx_intr_sem, 0);

    Quick look at the source code suggests that init_MUTEX_LOCKED() is a macro which
    breaks down into sema_init(). 

3. qla2x00_down_timeout() changes.
    Did not find much difference between msleep_interruptible() and the code that replaced it.

=Comment: #37=================================================
John G. Stultz <johnstul.com> - 
(In reply to comment #36)
> Brief analysis the patch:
> 1. Converting mbx_cmd_sem  to compat_semaphore. 
>      Mike's original patch converted *all* semaphores to compat_semaphores. 
>     But, even with John's patch we will have one semaphore (vport_sem) left
> as-is.
> 
>    The current problem manifestation starts with IO timeouts followed by mbox
> timeout.
>    and this semaphore (mbox_cmd_sem) has nothing to do with IO timeouts.

You're probably right on this point, since the cmd_sem is really only used in the mailbox_command
function. And yes, the vport_sem should be looked at too. I was initially concerned that we could
get some sort of double up situation that would result in only one interrupt being processed. But
looking more at the code, I'm not sure I still see that as a possibility.

> 2. Changing init_MUTEX_LOCKED(&ha->mbx_intr_sem); to
> sema_init(&ha->mbx_intr_sem, 0);
> 
>     Quick look at the source code suggests that init_MUTEX_LOCKED() is a macro
> which
>     breaks down into sema_init(). 

Yea, I didn't think it would be different, but I preserved mike's changes just in case. But your
analysis is correct.

> 3. qla2x00_down_timeout() changes.
>     Did not find much difference between msleep_interruptible() and the code
> that replaced it.

Yep, again, same as my last comment.

=Comment: #38=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #33)
> Built kernel with MikeK's patch:
> http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-2.6.24.7-75ibmrt2.8mk.x86_64.rpm
> http://kernel.beaverton.ibm.com/jtcltc/misc/kernel-rt-devel-2.6.24.7-75ibmrt2.8mk.x86_64.rpm
> 

Hard luck. Observed mailbox timeouts few mins into the blast run.

[root@elm3c29 blast]# uname -a
Linux elm3c29 2.6.24.7-75ibmrt2.8mk #1 SMP PREEMPT RT Wed Oct 8 13:03:12 EDT 2008 x86_64 x86_64
x86_64 GNU/Linux

Starting blast:

BLAST version  7.05.2979 Started in batch mode - Ctl-C to terminate 
Test Loop 1 on device name /SAN/test4 at 10/08/2008 12:29:05

/var/log/messages output:

Oct  8 12:34:53 elm3c29 kernel: qla2xxx 0000:03:05.1: Mailbox command timeout occured. Issuing ISP
abort.
Oct  8 12:34:53 elm3c29 kernel: qla2xxx 0000:03:05.1: Performing ISP error recovery - ha=
ffff81014ecdc610.
Oct  8 12:34:54 elm3c29 kernel: qla2xxx 0000:03:05.1: LOOP UP detected (4 Gbps).
Oct  8 12:34:54 elm3c29 kernel: qla2xxx 0000:03:05.1: scsi(1:0:0): Abort command issued -- 0 e112 2002.

=Comment: #39=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #26)

> Specifically, what I'm wondering, is in qla2x00_mailbox_command, we use the
> ha->mbx_intr_sem (a compat_sem which gave us grief earlier) to block for the
> mbox completion interrupt.
> 
> At that point, either the isr triggers, hopefully sets ha->flags.mbox_int and
> then up's the mbx_intr_sem, OR for some reason no interrupt occurs, or we don't
> follow the right code path. This would result in the qla2x00_mbx_sem_timeout()
> function to finally trip and up the mbx_intr_sem, allowing the
> qla2x00_mailbox_command function to continue, note the mbox_int value isn't
> set, and start the timeout recovery process.
> 
> Now, if the second case is what's happening, we should be able to set the debug
> level high enough (or just remove the DEBUG11 conditionals) to see if the
> qla2x00_sem_timeout is actually entered or if we're going through a different
> timeout code path.

I am pretty sure that we are going through the qla2x00_mbx_sem_timeout().
Because, In the earlier bug this was the the routine where we got stuck causing the system hang.
Dump and stack trace shows it.

We fixed this timeout function to compat_semaphore and then the timeout is triggered moving the
qla2x00_mailbox_command() forward.

But just to put a clear message in  /var/log/messages I am removing DEBUG11() condition in this
routine...on elm3c24.
=Comment: #40=================================================
John G. Stultz <johnstul.com> - 
Since JV is testing the non-msi path, I figured I'd take a look at the two irq paths (msi and
non-msi) at the driver level for differences and here's what I came up with:

--- irq.list	2008-10-08 14:58:25.000000000 -0700
+++ msi.list	2008-10-08 14:58:18.000000000 -0700
@@ -1,27 +1,20 @@
-irqreturn_t
-qla24xx_intr_handler(int irq, void *dev_id)
+static irqreturn_t
+qla24xx_msix_default(int irq, void *dev_id)
 {
         scsi_qla_host_t *ha;
-        struct device_reg_24xx __iomem *reg; 
+        struct device_reg_24xx __iomem *reg;
         int             status;
         unsigned long   flags;
-        unsigned long   iter;
         uint32_t        stat;
         uint32_t        hccr;
         uint16_t        mb[4];

-        ha = (scsi_qla_host_t *) dev_id; 
-        if (!ha) {
-                printk(KERN_INFO
-                    "%s(): NULL host pointer\n", __func__);
-                return IRQ_NONE;
-        }
-
+        ha = dev_id;
         reg = &ha->iobase->isp24;
         status = 0;

         spin_lock_irqsave(&ha->hardware_lock, flags);
-        for (iter = 50; iter--; ) {
+        do {
                 stat = RD_REG_DWORD(&reg->host_status);
                 if (stat & HSRX_RISC_PAUSED) {
                         if (pci_channel_offline(ha->pdev))
@@ -66,8 +59,7 @@
                         break;
                 }
                 WRT_REG_DWORD(&reg->hccr, HCCRX_CLR_RISC_INT);
-                RD_REG_DWORD_RELAXED(&reg->hccr);
-        }
+        } while (0);
         spin_unlock_irqrestore(&ha->hardware_lock, flags);

         if (test_bit(MBX_INTR_WAIT, &ha->mbx_cmd_flags) &&

Fairly minimal changes. Although it does look like the MSI version could theoretically get stuck in
an infinite while loop (while the non-msi version only will do 50 iterations). Not sure if that's
connected to the issue here or not.

Outside of these differences,  there are more MSI vs regular irq differences at the hardirq stub
level, but I need to look into that some more.
=Comment: #41=================================================
Michael S. Anderson <andmike.ibm.com> - 
The qla is using msi but not the msix extension. I thought the same thing when I quickly looked at
the code, but when you look through the qla2x00_request_irqs function along with the pci ids of the
card we are using you can see we only enable msi.
=Comment: #42=================================================
Venkateswarara Jujjuri <jvrao.com> - 
Qlogic pointed us to the latest firmware, and asked use to try it out.

I have downloaded the new firmware from :
ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin_mid

Rebuilt the kernel with this new qlogic f/w..started the test.
But, this did not help  mailbox timeouts showed up in 3 hours of blast run.

=Comment: #43=================================================
TIMOTHY R. CHAVEZ <tinytim.com> - 
(In reply to comment #42)
> Qlogic pointed us to the latest firmware, and asked use to try it out.
> 
> I have downloaded the new firmware from :
> ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin_mid
> 
> Rebuilt the kernel with this new qlogic f/w..started the test.
> But, this did not help  mailbox timeouts showed up in 3 hours of blast run.
> 

Have you tried running blast against the R1-SR3 version?  If this issue does not appear, perhaps you
can start bisecting between it and the latest driver you're testing with?
=Comment: #44=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #43)
> Have you tried running blast against the R1-SR3 version?  If this issue does
> not appear, perhaps you can start bisecting between it and the latest driver
> you're testing with?
> 
Tim, No I did not get a chance to try blast on R1-SR3. 
But the interrupt change (from MIS to APIC) appears to work. 
Driver with this change is running for more than 15 hours on 3c24 without any timeouts. 
I will update the defect with more details in my next comment.

=Comment: #45=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #44)
> (In reply to comment #43)
> > Have you tried running blast against the R1-SR3 version?  If this issue does
> > not appear, perhaps you can start bisecting between it and the latest driver
> > you're testing with?
> > 
> Tim, No I did not get a chance to try blast on R1-SR3. 

studying R1-SR3 code revels that .. it doesn't have MIS interrupts... and defaults to APIC interrupts.

This vindicates our latest change/experiment of moving away from MIS-EDGE to APIC interrupts on R2.

=Comment: #46=================================================
John G. Stultz <johnstul.com> - 
JV has committed a workaround for this issue (basically just disables msi interrupts for the qla2xxx
driver). I'm working to build a developer release build so we can do wider testing.
=Comment: #47=================================================
Venkateswarara Jujjuri <jvrao.com> - 
(In reply to comment #46)
> JV has committed a workaround for this issue (basically just disables msi
> interrupts for the qla2xxx driver). I'm working to build a developer release
> build so we can do wider testing.
> 

There are couple of other ways of disable MSI interrupts on the entire system.

1. Provide pci=nomsi on the boot flags. This disables MSI interrupts at the system level.
2. Change configuration perameters to disable CONFIG_PCI_MSI in the .config file.

=Comment: #48=================================================
Venkateswarara Jujjuri <jvrao.com> - 

Patch to disable MSI

This is the patch we used to work around MSI in R2.

Comment 1 IBM Bug Proxy 2008-10-14 13:00:51 UTC

Created attachment 320295 [details]
first pass at merging mikek&apos;s patch against current R2

Comment 2 IBM Bug Proxy 2008-10-14 13:00:57 UTC

Created attachment 320296 [details]
Mike Kravitz&apos; -rt patch for the R1-SR3 QLogic HBA driver

Comment 3 IBM Bug Proxy 2008-10-14 13:01:03 UTC

Created attachment 320297 [details]
Patch to disable MSI

Comment 4 IBM Bug Proxy 2009-07-31 13:11:46 UTC

------- Comment From sripathik.com 2009-07-31 09:06 EDT-------
Closing.

Comment 5 Clark Williams 2009-10-06 22:14:32 UTC

closing