Bug 676914 - [Intel 6.2 Bug] System unresponsive after multipath failover/failback test (mptsas)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.2
Assigned To: Tomas Henzl
QA Contact: Red Hat Kernel QE team
Keywords: TestOnly
Depends On:
Blocks: 670196
Reported: 2011-02-11 15:08 EST by joseph.r.gruher
Modified: 2013-03-19 13:35 EDT
CC List: 11 users

Doc Type: Bug Fix
Last Closed: 2011-04-27 13:44:19 EDT


Attachments
logs (452.21 KB, application/x-zip-compressed)
2011-02-11 15:08 EST, joseph.r.gruher
logs of run with RHEL6.1 snapshot 2 x64 (100.23 KB, application/x-zip-compressed)
2011-04-10 15:13 EDT, joseph.r.gruher
Use this .203 debug patch (in git format) (3.05 KB, patch)
2011-04-14 02:25 EDT, kashyap
root cause analysis (36.50 KB, application/msword)
2011-04-15 12:52 EDT, joseph.r.gruher
logs with rhel6.1 snap 3 x64, 3.4.18.203, phase 18 firmware (1.30.00) (1.21 MB, application/x-zip-compressed)
2011-04-22 14:42 EDT, joseph.r.gruher

Description joseph.r.gruher 2011-02-11 15:08:37 EST
Created attachment 478304 [details]
logs

Description of problem:
When running dual storage controller multipath testing, the system occasionally fails to discover the new path when a failed controller comes back online.  If the surviving path is then removed when the other controller fails, no paths are left and the system fails.

Version-Release number of selected component (if applicable):
RHEL6 GM.

Added patch per here:
https://bugzilla.redhat.com/show_bug.cgi?id=652024

Added 3.4.18 driver per here:
https://bugzilla.redhat.com/show_bug.cgi?id=653706
https://bugzilla.redhat.com/show_bug.cgi?id=672278

How reproducible:
Install RHEL6 to a 30 GB LUN.  Apply the patches and driver and configure multipath.  Run Iozone IO and the automated test script.  The script will reset one storage controller, wait for it to come online, sleep three minutes, reset the other controller, wait for it to come online, sleep three minutes, and repeat (a sketch of the loop follows below).  The system must run for a while before this failure occurs; I've seen it take a couple hundred resets before the problem hits.
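
For reference, a minimal sketch of that failover/failback loop (this is not the actual test script; reset_controller is a hypothetical placeholder for whatever vendor-specific mechanism resets a storage controller in this setup, and the Iozone IO runs separately against the multipathed LUN):

#!/bin/bash
# Hedged sketch of the controller reset loop described above.

reset_controller() {
    # Placeholder: reset controller "$1" and block until it reports online again.
    echo "resetting controller $1 (placeholder)"
}

while true; do
    for ctrl in A B; do
        reset_controller "$ctrl"   # reset one controller, wait for it to come online
        sleep 180                  # then sleep three minutes before the next reset
    done
done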

Actual results:
The system fails when a storage controller comes back online but its path is not added; when the other controller is then reset, no paths are left.

Expected results:
The path should reliably be added when a storage controller comes back online.

Additional info:
Attaching full console logs from testing and also a log with failure analysis from Ilgu at Promise (CC'd).
Comment 2 RHEL Product and Program Management 2011-02-11 15:29:57 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 3 Ondrej Vasik 2011-02-11 18:15:55 EST
basesystem is just a metapackage with no content; you probably meant kernel ... reassigning.
Comment 4 Russell Doty 2011-02-12 14:19:50 EST
All of the proposed patches are included in other bugs, and are in ON_QA.

Can this bug now be considered a TESTONLY bug?
Comment 5 joseph.r.gruher 2011-02-12 18:25:24 EST
I believe the patches loaded from other bugs are all planned for RHEL6.1.  They are needed to create a multipath test scenario where we don't run into known issues.  I'm not sure what TESTONLY signifies but we consider this a legitimate bug needing root cause and resolution.
Comment 6 Russell Doty 2011-03-03 10:34:06 EST
Added TestOnly keyword to indicate that no code changes are required for this bug, but it needs to be tested once the RHEL 6.1 beta is available.

Preliminary testing on the RHEL 6.1 Alpha (now available) would be helpful.
Comment 7 Russell Doty 2011-03-27 10:37:44 EDT
Does this problem still exist in the RHEL 6.1 Beta?
Comment 8 joseph.r.gruher 2011-04-08 14:26:41 EDT
Setting up testing with 6.1 snapshot 2 today.  Will run over the weekend and post logs.
Comment 9 joseph.r.gruher 2011-04-10 15:13:52 EDT
Created attachment 491103 [details]
logs of run with RHEL6.1 snapshot 2 x64

Ran two overnight failover/failback-with-IO test runs this weekend.  Both failed with the system unresponsive and scrolling errors of this form on the console:

mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ

Attaching logs from one run as an example.  I think this is a separate issue from this bug.  The logs could be checked for the "fails to discover added path" problem.  If present, it still exists in RHEL6.1 snapshot 2; if not present, it may just be this other issue blocking us from running long enough to reproduce it.
Comment 11 joseph.r.gruher 2011-04-12 17:49:38 EDT
We are considering whether this issue could be the cause of the failure in the latest logs (comment #9).  Just a theory at this point.

http://66.135.57.166/lists/linux-fsdevel/msg42068.html

I am going to try to reproduce and trigger a dump with NMI for further analysis.
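
In case it helps anyone reproducing this, one common way to have an externally injected NMI (front-panel NMI button or a BMC diagnostic interrupt) produce a dump, assuming kdump is already configured; this is a general sketch, not the exact procedure used here:

# Panic (and therefore kdump) on an unknown NMI.
echo 1 > /proc/sys/kernel/unknown_nmi_panic

# To make the setting persistent across reboots:
echo "kernel.unknown_nmi_panic = 1" >> /etc/sysctl.conf
sysctl -p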
Comment 12 Tomas Henzl 2011-04-13 12:01:40 EDT
(In reply to comment #9)
> Created attachment 491103 [details]
> logs of run with RHEL6.1 snapshot 2 x64
> 
> Ran two overnight failover/failback with IO test runs this weekend.  Both
> failed with system unresponsive and scrolling errors of this form on the
> console:
> 
> mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
> mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
> mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
> 
> Attaching logs from one run as an example.  I think this is a separate issue
> from this bug.  Logs could be checked for the "fails to discover added path"
> problem.  If present, it still exists in RHEL6.1 snapshot 2, but it not present
> it may just be this other issue blocking us from running long enough to
> reproduce it.

The first logs seem to come from 2.6.32-71, while the log from comment #9 is based on kernel 2.6.32-128, right? In between, some patches were added to mptsas, for example the inDMD patch. This could explain the differences; if you agree, I'll mark the original logs as obsolete.

In the new log from comment #9 there are some errors:
Buffer I/O error on device sdd, logical block 0
sd 0:0:1:1: [sde] Device not ready
sd 0:0:1:1: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:1:1: [sde] Sense Key : Not Ready [current] 
sd 0:0:1:1: [sde] Add. Sense: Logical unit not accessible, target port in standby state
sd 0:0:1:1: [sde] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
It looks like this^ is in the logs even before the test started? Is it possible that it comes from a damaged disk?
Comment 13 joseph.r.gruher 2011-04-13 13:31:20 EDT
I'm fine with obsoleting the original logs and focusing on the new issue.  If we can resolve the new issue and then find the original issue still exists we can always circle back to it.

The errors you mention come from the standby halves of the multipath paths.  They are mostly inaccessible while in standby (they cannot complete IOs or most SCSI commands).  Linux attempts to probe them at boot time, which results in these errors.
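
A quick way to see the path states (an illustrative sketch only; exact output depends on the multipath and array versions) is multipath -ll, where the standby legs typically show as "ghost" rather than "active ready":

# List all multipath maps and the state of each path; standby (ALUA) legs
# normally appear as "ghost", so probe errors against them at boot are expected.
multipath -ll

# Hypothetical example output (placeholder WWID, vendor, and device names):
#   mpatha (36000000000000000) dm-2 <vendor>,<product>
#   |-+- policy='round-robin 0' prio=50 status=active
#   | `- 0:0:0:1 sdb 8:16 active ready running
#   `-+- policy='round-robin 0' prio=1  status=enabled
#     `- 0:0:1:1 sde 8:64 active ghost running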
Comment 14 kashyap 2011-04-14 02:25:30 EDT
Created attachment 491967 [details]
Use this .203 debug patch (in git format)

Please run the test again using the .203 debug driver patch and also make sure kdump is enabled.
I need the vmcore file to debug further.

1. Change "crashkernel=auto" to "crashkernel=512M" in /boot/grub/menu.lst.
2. Check that kdump is running: "service kdump status".
It must report that kdump is operational; if it is not, start it with
"service kdump start".
3. When the issue is hit, please crash the kernel using the SysRq key as below.

NOTE: make sure SysRq is enabled on your system:
Run "echo 1 > /proc/sys/kernel/sysrq"
Press Alt + SysRq + "c" ("c" is the character C on the keyboard).

Please attach the vmcore file and the console redirect log to debug further.
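
Consolidating those steps as one hedged sketch (RHEL 6 defaults assumed; adjust the crashkernel size and grub path for the system under test):

# 1. Reserve crash-kernel memory: change crashkernel=auto to crashkernel=512M
#    on the kernel line in /boot/grub/menu.lst, then reboot.
sed -i 's/crashkernel=auto/crashkernel=512M/' /boot/grub/menu.lst

# 2. Make sure the kdump service is operational.
service kdump status || service kdump start

# 3. Enable SysRq so the hung kernel can be crashed on demand.
echo 1 > /proc/sys/kernel/sysrq

# When the hang is hit, press Alt+SysRq+c on the console, or, if a shell is
# still responsive, trigger the crash from it:
# echo c > /proc/sysrq-trigger
# The vmcore is written under /var/crash/ by default; attach it here along
# with the console redirect log.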

Thanks, Kashyap
Comment 15 joseph.r.gruher 2011-04-15 11:30:58 EDT
I have a test system running per Kashyap's guidance.  Previously the system had failed within a few hours; this one has so far run overnight and is stubbornly refusing to fail.  We'll keep running it through the weekend and update when we have the debug data.
Comment 16 joseph.r.gruher 2011-04-15 12:52:37 EDT
Created attachment 492434 [details]
root cause analysis

Attaching failure analysis on RHEL6.0 from Kashyap. Behavior seems the same on RHEL6.1 snap 2 so I think it is reasonable to assume the same root cause applies but I will also keep trying to get a dump from RHEL6.1 for confirmation. Need feedback from Red Hat on root cause analysis.
Comment 17 Tomas Henzl 2011-04-18 05:10:34 EDT
(In reply to comment #16)
> Attaching failure analysis on RHEL6.0 from Kashyap. Behavior seems the same on
> RHEL6.1 snap 2 so I think it is reasonable to assume the same root cause
> applies but I will also keep trying to get a dump from RHEL6.1 for
> confirmation. Need feedback from Red Hat on root cause analysis.
I probably need to read it once again to understand this well. From paragraph 9 it looks like letting the driver respond with BLK_EH_NOT_HANDLED every time means longer error handling, but no problems with multipath?
Comment 18 kashyap 2011-04-18 07:22:41 EDT
(In reply to comment #17)
> (In reply to comment #16)
> > Attaching failure analysis on RHEL6.0 from Kashyap. Behavior seems the same on
> > RHEL6.1 snap 2 so I think it is reasonable to assume the same root cause
> > applies but I will also keep trying to get a dump from RHEL6.1 for
> > confirmation. Need feedback from Red Hat on root cause analysis.
> I probably need to read it once again, to understand this well. From paragraph
> 9. it looks like letting the driver respond with BLK_EH_NOT_HANDLED every time
> means a longer error handling, but no problems with multipath?

A little history behind why LSI introduced the "mptsas_eh_timed_out" callback.

The main reason to add "mptsas_eh_timed_out" is that the multipathing driver was using blk_abort_queue() (which is really not a healthy mechanism for interacting with an LLD).
Some LLDs have their own error-handling mechanism, and the eh_timed_out callback is useful to hook into those. mptfusion has no error-handling code of its own; it depends purely on the SCSI midlayer for error handling.

mptsas added "mptsas_eh_timed_out" mainly to keep devices from going offline:
without it, the midlayer would run error handling for those commands, the TUR for the device would eventually time out (because no actual device is attached there), and later the device would be taken offline by the SCSI midlayer (SML).

Using this function (in its current form), the driver escapes SML error handling by returning EH_RESET_TIMER.
Though this is not an accurate return value, it is worth having this code.

I am not really happy with the multipath driver's call to blk_abort_queue(). When I did a very basic Google search, I found a lot of other problems associated with the blk_abort_queue() call from the multipath driver. Eventually that call was removed from dm-mpath.c. Please refer to the link below.

https://lkml.org/lkml/2011/2/15/603

associated git log :
-----------------------
commit 09c9d4c9b6a2b5909ae3c6265e4cd3820b636863
Author: Mike Snitzer <snitzer@redhat.com>
Date: Thu Jan 13 19:59:46 2011 +0000

    dm mpath: disable blk_abort_queue

    Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2
      dm: Call blk_abort_queue on failed paths

    Multipath began to use blk_abort_queue() to allow for
    lower latency path deactivation. This was found to
    cause list corruption:

       the cmd gets blk_abort_queued/timedout run on it and the scsi eh
       somehow is able to complete and run scsi_queue_insert while
       scsi_request_fn is still trying to process the request.

       https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html
-----------------------

In my opinion, adding commit "09c9d4c9b6a2b5909ae3c6265e4cd3820b636863" to the respective distro kernel would be the solution for this issue too.

Thanks, Kashyap
Comment 19 Tomas Henzl 2011-04-18 10:43:11 EDT
(In reply to comment #18) 
> In my opinion, Adding commit "09c9d4c9b6a2b5909ae3c6265e4cd3820b636863" in
> respective Distro will be the solution for this issue too.

The patch "dm mpath: disable blk_abort_queue" seems to be added to our -92 kernel. The first logs come from -71 kernel, while the last report is from a 2.6.32-128 kernel - this means there must be another reason for this issue.
Comment 20 kashyap 2011-04-18 12:05:40 EDT
(In reply to comment #19)
> (In reply to comment #18) 
> > In my opinion, Adding commit "09c9d4c9b6a2b5909ae3c6265e4cd3820b636863" in
> > respective Distro will be the solution for this issue too.
> 
> The patch "dm mpath: disable blk_abort_queue" seems to be added to our -92
> kernel. The first logs come from -71 kernel, while the last report is from a
> 2.6.32-128 kernel - this means there must be another reason for this issue.

Tomas, Thanks for the input.

Joe, as per Tomas, RHEL6.1 should not suffer from that same issue.
So what I suggest here is: let's run the test on RHEL6.1 and share the vmcore file so that we can identify the root cause on RHEL6.1.
Comment 21 Russell Doty 2011-04-20 17:07:38 EDT
This bug has missed Snapshot 5, the last 6.1 snapshot. Moving to 6.2.
Comment 22 joseph.r.gruher 2011-04-22 14:42:16 EDT
Created attachment 494287 [details]
logs with rhel6.1 snap 3 x64, 3.4.18.203, phase 18 firmware (1.30.00)

I've been able to run three blades with RHEL6.1 snap 3 x64 for about four days (about 1000 system failovers, or 3000 if you count the blades individually) with IO without any problems.  All days used the 3.4.18.203 driver.  Two days used the phase 18 firmware and two days used the phase 20 firmware.  I'm going to put the 3.4.18 driver back in and run through the weekend.  If that survives, I think we could probably call this resolved, if Kashyap agrees.

I'm attaching some logs just for reference.
Comment 23 joseph.r.gruher 2011-04-27 13:44:19 EDT
Using RHEL6.1 x64 snapshot 3 beta I'm unable to create any further failures.

I did two days testing with constant IO and FO/FB activity with each of these:
-3.4.18.203 and phase 20 firmware with debug enabled
-3.4.18.203 and phase 18 firmware with debug enabled
-3.4.18 and phase 18 firmware with normal output (this should be the "normal" configuration)

I think this can be closed based on changes included in RHEL6.1.
Comment 24 Tomas Henzl 2011-04-28 07:28:40 EDT
(In reply to comment #23)
> Using RHEL6.1 x64 snapshot 3 beta I'm unable to create any further failures.
> 
> I did two days testing with constant IO and FO/FB activity with each of these:
> -3.4.18.203 and phase 20 firmware with debug enabled
> -3.4.18.203 and phase 18 firmware with debug enabled
> -3.4.18 and phase 18 firmware with normal output (this should be the "normal"
> configuration)
> 
> I think this can be closed based on changes included in RHEL6.1.

Does that mean your latest test used a newer RHEL6.1 version or a newer firmware version, or is something else different?
Comment 25 joseph.r.gruher 2011-04-28 12:33:57 EDT
Hi Tom-

Can you clarify the question?

My newest testing used the configurations described in comment 23... RHEL6.1 x64 snapshot 3 beta, phase 18 and phase 20 SAS controller firmware, 3.4.18 and 3.4.18.203 SAS drivers, and also debug output both enabled and disabled.

Which earlier testing did you want to compare against?
Comment 26 Tomas Henzl 2011-04-29 05:59:05 EDT
(In reply to comment #25)
> Hi Tom-
> 
> Can you clarify the question?

Sorry, I mixed up the versions. There is a failed test log from snapshot 2, and your latest report is based on snapshot 3; this can explain the difference.

Thanks, Tomas

> 
> My newest testing used the configurations described in comment 23... RHEL6.1
> x64 snapshot 3 beta, phase 18 and phase 20 SAS controller firmware, 3.4.18 and
> 3.4.18.203 SAS drivers, and also debug output both enabled and disabled.
> 
> Which earlier testing did you want to compare against?
Comment 27 joseph.r.gruher 2011-05-02 14:37:36 EDT
Hi Tom-

In the snapshot 2 testing we assumed we had hit the failure condition due to the output as below:

mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ

However, the system was never unresponsive, and per Kashyap the print does not necessarily indicate any failure condition; it is just a bad print that will be cleaned up in the next driver release (3.4.19).

We moved the subsequent round of testing to snapshot 3 purely to try to stay up to date with the latest RHEL6.1 release.  In extensive further testing on that release we were never able to reproduce the actual original RHEL6.0 failure condition of an unresponsive system.  So we conclude the system-unresponsive issue is fixed in RHEL6.1, possibly by the patch Kashyap indicated, or possibly through some of the many other changes since 6.0.  The snapshot 2 result is basically a false positive: we thought we had hit a failure but were mistaken, and it should be ignored.  Hope that clarifies the situation.
Comment 28 Tomas Henzl 2011-05-03 08:16:29 EDT
(In reply to comment #27)
Hi Joseph,
thanks for the explanation; I think it's definitely OK to leave this closed.

Thanks, Tomas
