Bug 646513
Summary: | HP_GETHOSTINFO ioctl always causes mpt controller reset
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | David Jeffery <djeffery>
Component: | kernel
Assignee: | Tomas Henzl <thenzl>
Status: | CLOSED ERRATA
QA Contact: | Gris Ge <fge>
Severity: | urgent
Docs Contact: |
Priority: | urgent
Version: | 5.5
CC: | cww, dhoward, fge, jpirko, kashyap.desai, nhorman, plyons, qcai, tgraf
Target Milestone: | rc
Keywords: | Regression, ZStream
Target Release: | ---
Hardware: | i686
OS: | Linux
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: |
A call to the HP_GETHOSTINFO ioctl (I/O Control) in the mptctl module could result in the MPT (Message Passing Technology) fusion driver being reset due to erroneous detection of completed ioctl commands. With this update, the message context sent to the mptctl module is stored (previously, it was zeroed). When an ioctl command completes, the saved message context is used to recognize the completion of the message, thus resolving the faulty detection.
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2011-07-21 10:00:32 UTC
Type: | ---
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 684128, 707606
Attachments: |

Description
David Jeffery
2010-10-25 14:28:38 UTC
Created attachment 455550 [details]
Save and restore the MsgContext around the memset
The patch saves the MsgContext value before the memset call, and then restores it after the memset completes. The code watching for the message's completion will then watch for the correct context ID instead of watching for 0.
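
The save-and-restore idea can be illustrated with a small stand-alone sketch. This is not the attached patch or the real mptctl code: `struct request_frame`, `reply_matches()`, `setup_buggy()`, `setup_fixed()` and the value `0x17` are hypothetical names and values chosen only to show why clearing the frame with memset loses the context ID, and how saving it first keeps the completion check working.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a request frame; not the real MPT layout. */
struct request_frame {
    uint8_t  function;
    uint32_t msg_context;   /* ID assigned when the frame was allocated */
    uint8_t  payload[32];
};

/* Hypothetical completion check: the waiter is only notified when the
 * reply carries the same context ID that is stored in the request. */
static int reply_matches(const struct request_frame *req, uint32_t reply_context)
{
    return req->msg_context == reply_context;
}

/* Buggy setup: memset clears the whole frame, including msg_context,
 * so the waiter ends up watching for context 0. */
static void setup_buggy(struct request_frame *req)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->function = 0x17;   /* arbitrary illustrative value */
}

/* Fixed setup: save the context, clear the frame, restore the context. */
static void setup_fixed(struct request_frame *req)
{
    uint32_t msgcontext;

    assert(req != NULL);
    msgcontext = req->msg_context;   /* save before the memset */
    memset(req, 0, sizeof(*req));
    req->msg_context = msgcontext;   /* restore after the memset */
    req->function = 0x17;            /* arbitrary illustrative value */
}
```

In this model, a frame allocated with some non-zero context and prepared with `setup_buggy()` no longer matches a reply carrying that context, which corresponds to the waiter never being notified and eventually timing out. The same frame prepared with `setup_fixed()` matches the reply.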

(In reply to comment #1)
> The patch saves the MsgContext value before the memset call, and then restores
> it after the memset completes. The code watching for the message's completion
> will then watch for the correct context ID instead of watching for 0.

David, thanks for the analysis and the patch. Are you going to post it upstream?

Hi Kashyap, is this solution OK for you? If yes, would you take care of posting it upstream?
-tomash

I reviewed the patch. The logic is understood.

Can we do one optimization?

+ IstwiRWRequest->MsgContext = msgcontext;
IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
IstwiRWRequest->MsgContext = mpi_hdr->MsgContext; <--We can remove this.

Do you agree? I will post this patch upstream.

Created attachment 480432 [details]
Save and restore the MsgContext around the memset - new version

(In reply to comment #5)
> I reviewed the patch. The logic is understood.
>
> Can we do one optimization?
>
> + IstwiRWRequest->MsgContext = msgcontext;
> IstwiRWRequest->Function = MPI_FUNCTION_TOOLBOX;
> IstwiRWRequest->Tool = MPI_TOOLBOX_ISTWI_READ_WRITE_TOOL;
> IstwiRWRequest->MsgContext = mpi_hdr->MsgContext; <--We can remove this.
>
> Do you agree? I will post this patch upstream.

I agree, and I have asked David for a review. I think we can remove other lines too, as mpi_hdr is no longer used.
Kashyap, please go ahead and post upstream whichever version you want and add me to cc; I'll then continue with the needed steps here.

Posted for internal review - a corresponding upstream post - http://www.spinics.net/lists/linux-scsi/msg50852.html

Patch(es) available in kernel-2.6.18-248.el5
Detailed testing feedback is always welcomed.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A call to the HP_GETHOSTINFO ioctl (I/O Control) in the mptctl module could result in the MPT (Message Passing Technology) fusion driver being reset due to erroneous detection of completed ioctl commands. With this update, the message context sent to the mptctl module is stored (previously, it was zeroed). When an ioctl command completes, the saved message context is used to recognize the completion of the message, thus resolving the faulty detection.

I have tried to make an ioctl call to /dev/mptctl, but it failed with the error:
mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’ token
Is there any sample code for testing HP_GETHOSTINFO?

Created attachment 501903 [details]
ioctl HP_GETHOSTINFO test code.
Test code attached.
On the -194 kernel, we get these dmesg messages:
===
mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed
mptbase: ioc0: Initiating recovery
===
which indicate a controller reset.
On the -264 kernel, no reset is performed.
VERIFY this bug.

(In reply to comment #17)
> I have tried to make an ioctl call to /dev/mptctl, but it failed with the error:
> mptctl.h:104: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’
> token

This was a compiler error?

> Is there any sample code for testing HP_GETHOSTINFO?

Sorry, I wanted to verify it on an mptsas box, so I'm now too late with my test code.

Tomas,
Thanks. I already solved the problem behind that error by pulling all the related macros out of mptctl.h.
My test code is attached; I think comment #18 is sufficient to verify this bug.
If you have time, could you explain why I got the kernel warning "mptctl: ioc0: WARNING - mptctl_hp_hostinfo: failed"?

(In reply to comment #21)
> If you have time, could you explain why I got the kernel warning "mptctl: ioc0:
> WARNING - mptctl_hp_hostinfo: failed"?

This indicates that a timeout happened in mptctl_hp_hostinfo - it's what the patch is about - (from comment #0)
> message to the adapter restores The message's context ID. Thus, the message's
> completion doesn't notify mptctl_hp_hostinfo(). When mptctl_hp_hostinfo then
> times out, it calls mptctl_timeout_expired() which then resets the adapter.

Btw. the fw and some other information are read fine regardless of this timeout - it's used only to fill the rsvd field.

(In reply to comment #22)
> (In reply to comment #21)
> Btw. the fw and some other information are read fine regardless of this
> timeout - it's used only to fill the rsvd field.

Yes. The test code confirmed that by reading hp_host_info_t. But the timeout warning is still there after patching. Is that by design or my test code's fault? (Sorry, I don't have time to check mptctl_hp_hostinfo().)

(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > Btw. the fw and some other information are read fine regardless of this
> > timeout - it's used only to fill the rsvd field.
> Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> warning is still there after patching. Is that by design or my test code's
> fault?

That shouldn't be there - I'm sure it wasn't in my tests. With the patch the program responded immediately; without it, it took several seconds and the message appeared in dmesg.
I'll try it with your test program and will let you know.

(In reply to comment #24)
> > Yes. The test code confirmed that by reading hp_host_info_t. But the timeout
> > warning is still there after patching. Is that by design or my test code's
> > fault?
> That shouldn't be there - I'm sure it wasn't in my tests. With the patch the
> program responded immediately; without it, it took several seconds and the
> message appeared in dmesg.
> I'll try it with your test program and will let you know.

I'm testing with your code right now - it returns immediately and there are no messages in dmesg.
Kernel 2.6.18-264; a timeout occurs with 2.6.18-238
...
host_no 0
fw_version 01.27.00.00
serial_number

I will re-test on the server hp-ml310g5-01.rhts.eng.bos.redhat.com (currently used by someone else; I am waiting in the queue).
Will let you know.

Like you said, the test code returns right after it is executed.
======
host_no 0
fw_version 01.23.34.00
serial_number P62190AGKUZ0AG
======
No timeout error on 2.6.18-264.el5 x86_64
OK to verify.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-1065.html