| Summary: | [Intel 6.2 Bug] System unresponsive after multipath failover/failback test (mptsas) |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | kernel |
| Version: | 6.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Target Milestone: | rc |
| Target Release: | 6.2 |
| Reporter: | joseph.r.gruher <joseph.r.gruher> |
| Assignee: | Tomas Henzl <thenzl> |
| QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| CC: | coughlan, craig, hui.xiao, ilgu.hong, jane.lv, jvillalo, jwilleford, kashyap.desai, keve.a.gabbert, luyu, rdoty |
| Keywords: | TestOnly |
| Doc Type: | Bug Fix |
| Bug Blocks: | 670196 |
| Last Closed: | 2011-04-27 17:44:19 UTC |
Description
joseph.r.gruher, 2011-02-11 20:08:37 UTC

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

basesystem is just a metapackage with no content; you probably meant kernel ... reassigning.

All of the proposed patches are included in other bugs and are in ON_QA. Can this bug now be considered a TESTONLY bug?

I believe the patches loaded from other bugs are all planned for RHEL 6.1. They are needed to create a multipath test scenario where we don't run into known issues. I'm not sure what TESTONLY signifies, but we consider this a legitimate bug needing root cause and resolution.

Added the TestOnly keyword to indicate that no code changes are required for this bug, but it needs to be tested once the RHEL 6.1 beta is available. Preliminary testing on the RHEL 6.1 Alpha (now available) would be helpful.

Does this problem still exist in the RHEL 6.1 Beta?

Setting up testing with 6.1 snapshot 2 today. Will run over the weekend and post logs.

Created attachment 491103 [details]
logs of run with RHEL6.1 snapshot 2 x64
Ran two overnight failover/failback-with-IO test runs this weekend. Both failed with the system unresponsive and scrolling errors of this form on the console:
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ
Attaching logs from one run as an example. I think this is a separate issue from this bug. The logs could be checked for the "fails to discover added path" problem. If present, it still exists in RHEL6.1 snapshot 2; if not present, it may just be this other issue blocking us from running long enough to reproduce it.
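A quick way to scan the attached logs for both signatures is a simple grep over the captured console/syslog files (a sketch only; the file names are hypothetical, and the search strings come from the console output above and the earlier "fails to discover added path" report):

```sh
# Count the qcmd timeout warnings seen on the console during the run.
grep -c "qcmd: SCpnt=.*timeout" messages console.log

# Look for evidence of the original problem: paths that are never
# rediscovered after failback.
grep -iE "fail(ed|s) to discover|add(ed)? path" messages multipathd.log
```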
We are considering whether this issue could be the cause of the failure in the latest logs (comment #9). Just a theory at this point: http://66.135.57.166/lists/linux-fsdevel/msg42068.html. I am going to try to reproduce and trigger a dump with NMI for further analysis.

(In reply to comment #9)
The first logs seem to come from 2.6.32-71, while the log from comment #9 is based on kernel 2.6.32-128, right? In between, some patches were added to mptsas, for example the inDMD patch. This could explain the differences - if you agree, I'll mark the original logs as obsolete. In the new log from comment #9 there are some errors:

Buffer I/O error on device sdd, logical block 0
sd 0:0:1:1: [sde] Device not ready
sd 0:0:1:1: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:1:1: [sde] Sense Key : Not Ready [current]
sd 0:0:1:1: [sde] Add. Sense: Logical unit not accessible, target port in standby state
sd 0:0:1:1: [sde] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00

It looks like this is in the logs even before the test started. Is it possible that it comes from a damaged disk?

I'm fine with obsoleting the original logs and focusing on the new issue. If we can resolve the new issue and then find the original issue still exists, we can always circle back to it. The errors you mention come from the standby halves of the multipath paths. They are mostly inaccessible while in standby (they cannot complete I/Os or most SCSI commands); Linux attempts to probe them at boot time and that results in these errors.

Created attachment 491967 [details]
Use this .203 debug patch (in git format)
Please run the test again using the .203 debug driver patch, and also make sure kdump is enabled.
I need a vmcore file to debug further.

1. Change "crashkernel=auto" to "crashkernel=512M" in /boot/grub/menu.lst.
2. Check that kdump is running: "service kdump status".
   It must report "Kdump is operational"; if it is not operational, start it with "service kdump start".
3. When the issue is hit, crash the kernel using the SysRq key as below (see the consolidated sketch after these steps).
   NOTE: make sure SysRq is enabled on your system:
   Run "echo 1 > /proc/sys/kernel/sysrq"
   Then press Alt + SysRq + "c" ("c" is the character C on the keyboard).

Please attach the vmcore file, plus the console redirect log, for further debugging.
Thanks, Kashyap
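A consolidated sketch of the steps above, for convenience (the crashkernel size and SysRq method follow Kashyap's instructions; writing to /proc/sysrq-trigger is an equivalent way to send the "c" key from a shell):

```sh
# 1. Reserve memory for the capture kernel: in /boot/grub/menu.lst change
#    "crashkernel=auto" to "crashkernel=512M", then reboot.

# 2. Make sure the kdump service is running.
service kdump status     # should report "Kdump is operational"
service kdump start      # start it if it is not

# 3. When the hang is hit, force a crash dump so a vmcore is written.
echo 1 > /proc/sys/kernel/sysrq     # enable SysRq
echo c > /proc/sysrq-trigger        # same effect as pressing Alt+SysRq+c
```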
I have a test system running per Kashyap's guidance. Previously the system had failed within a few hours; this one has so far run overnight and is stubbornly refusing to fail. We'll keep running it through the weekend and update when we have the debug data.

Created attachment 492434 [details]
root cause analysis
Attaching failure analysis on RHEL6.0 from Kashyap. Behavior seems the same on RHEL6.1 snap 2, so I think it is reasonable to assume the same root cause applies, but I will also keep trying to get a dump from RHEL6.1 for confirmation. Need feedback from Red Hat on the root cause analysis.
(In reply to comment #16) I probably need to read it once again to understand this well. From paragraph 9 it looks like letting the driver respond with BLK_EH_NOT_HANDLED every time means longer error handling, but no problems with multipath?

(In reply to comment #17) A little history: LSI introduced the "mptsas_eh_timed_out" callback mainly because the multipathing driver was using blk_abort_queue(), which is really not a healthy mechanism for interacting with an LLD. Some LLDs have their own error handling mechanism, and the eh_timed_out callback is useful for hooking into those. mptfusion has no error handling code of its own; it depends purely on the SCSI midlayer for error handling. mptsas added "mptsas_eh_timed_out" just to keep devices from being taken offline: without it, the midlayer would run error handling for those commands, and the TUR (Test Unit Ready) for the device would eventually time out because no actual device is attached there; the device would then be taken offline by the SCSI midlayer. Using this function (in its current form) the driver escapes SML error handling by returning BLK_EH_RESET_TIMER. Though this is not an accurate return value, it is worth having this code.

I am not really happy with the multipath driver's blk_abort_queue() call. When I did a very basic search on Google, I found lots of other problems associated with the blk_abort_queue() call from the multipath driver. Eventually that call was removed from dm-mpath.c. Please refer to the link below:
https://lkml.org/lkml/2011/2/15/603

Associated git log:
-----------------------
commit 09c9d4c9b6a2b5909ae3c6265e4cd3820b636863
Author: Mike Snitzer <snitzer>
Date: Thu Jan 13 19:59:46 2011 +0000

dm mpath: disable blk_abort_queue

Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2
dm: Call blk_abort_queue on failed paths

Multipath began to use blk_abort_queue() to allow for lower latency path deactivation. This was found to cause list corruption: the cmd gets blk_abort_queued/timedout run on it and the scsi eh somehow is able to complete and run scsi_queue_insert while scsi_request_fn is still trying to process the request.

https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html
-----------------------

In my opinion, adding commit "09c9d4c9b6a2b5909ae3c6265e4cd3820b636863" to the respective distro will be the solution for this issue too.
Thanks, Kashyap

(In reply to comment #18)
> In my opinion, adding commit "09c9d4c9b6a2b5909ae3c6265e4cd3820b636863" to the respective distro will be the solution for this issue too.

The patch "dm mpath: disable blk_abort_queue" seems to have been added to our -92 kernel. The first logs come from the -71 kernel, while the last report is from a 2.6.32-128 kernel - this means there must be another reason for this issue.

(In reply to comment #19) Tomas, thanks for the input. Joe, per Tomas, RHEL6.1 should not suffer from the same issue. So my suggestion is: let's run the test on RHEL6.1 and share the vmcore file so that we can identify the root cause on RHEL6.1.

This bug has missed Snapshot 5, the last 6.1 snapshot. Moving to 6.2.

Created attachment 494287 [details]
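For reference, one way to double-check whether that dm-mpath revert is present in a given kernel is to search the installed package changelog or the upstream git history (a sketch; the exact kernel package name and changelog wording may differ between builds):

```sh
# Installed RHEL kernel: search the RPM changelog for the dm-mpath revert.
rpm -q --changelog kernel | grep -i "blk_abort_queue"

# Upstream git tree: show the revert commit and what it changed in dm-mpath.
git show --stat 09c9d4c9b6a2b5909ae3c6265e4cd3820b636863 -- drivers/md/dm-mpath.c
```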
logs with rhel6.1 snap 3 x64, 3.4.18.203, phase 18 firmware (1.30.00)
I've been able to run three blades of RHEL6.1 snap 3 x64 for about four days (about 1000 system failovers, or 3000 if you count the blades individually) with IO without any problems. All days used the 3.4.18.203 driver. Two days used the phase 18 firmware and two days used the phase 20 firmware. I'm going to put the 3.4.18 driver back in and run through the weekend. If that survives, I think we could probably call this resolved, if Kashyap agrees.
I'm attaching some logs just for reference.
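For anyone reproducing this outside the Intel blade setup, a purely software-level loop can approximate the failover/failback-with-IO exercise described in these comments (a sketch only - it is not the harness used here; the map and path names are hypothetical and should come from "multipath -ll"):

```sh
#!/bin/sh
# Approximate a failover/failback cycle by failing and reinstating one leg
# of a multipath map while I/O runs against the map device.
MAP=mpathb       # multipath map under test (hypothetical)
LEG=sdc          # one path (leg) of that map (hypothetical)

while true; do
    multipathd -k"fail path $LEG"        # fail over away from this leg
    sleep 60
    multipathd -k"reinstate path $LEG"   # fail back
    sleep 60
    multipath -ll "$MAP"                 # confirm both legs are usable again
done
```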
Using the RHEL6.1 x64 snapshot 3 beta I'm unable to create any further failures. I did two days of testing with constant IO and FO/FB activity with each of these:
- 3.4.18.203 and phase 20 firmware with debug enabled
- 3.4.18.203 and phase 18 firmware with debug enabled
- 3.4.18 and phase 18 firmware with normal output (this should be the "normal" configuration)

I think this can be closed based on changes included in RHEL6.1.

(In reply to comment #23) Does it mean you used a newer RHEL6.1 version or a newer firmware version in your latest test, or is something else different?

Hi Tom - can you clarify the question? My newest testing used the configurations described in comment 23: RHEL6.1 x64 snapshot 3 beta, phase 18 and phase 20 SAS controller firmware, 3.4.18 and 3.4.18.203 SAS drivers, and also debug output both enabled and disabled. Which earlier testing did you want to compare against?

(In reply to comment #25) Sorry, I mixed up the versions. There is a failed test log from snapshot 2 and your latest report is based on snapshot 3 - this can explain the difference. Thanks, Tomas

Hi Tom - in the snapshot 2 testing we assumed we had hit the failure condition due to output like this:

mptscsih: ioc0: WARNING - qcmd: SCpnt=ffff8802699e1ac0 timeout + 60HZ

However, the system was never unresponsive, and per Kashyap the print does not necessarily indicate any failure condition; it is just a bad print that will be cleaned up in the next driver release (3.4.19). We moved the subsequent round of testing to snapshot 3 purely to try to stay up to date with the latest RHEL6.1 release. In extensive further testing on that release we were never able to produce the actual original RHEL6.0 failure condition of an unresponsive system. So we conclude the system-unresponsive issue is fixed in RHEL6.1, possibly by the patch Kashyap indicated, or possibly through some of the many other changes from 6.0. The snapshot 2 result is basically a false positive where we thought we had hit a failure but were mistaken, and it should be ignored. Hope that clarifies the situation.

(In reply to comment #27) Hi Joseph, thanks for the explanation; I think it's definitely OK to leave it closed. Thanks, Tomas